Problem: in a KMeans job, two ways of handling each line inside the map operator give different results, and the first one mysteriously produces many extra columns. Why?
1. The data is in test1.csv
2. The full code is as follows:
import org.apache.spark.mllib.linalg.Vectors

val rawData = sc.textFile("E:\\test1.csv")
println("----11122221-----")
rawData.foreach(println)

val labelsAndData = rawData.map { line =>
  // Variant 1: call toString on the Array[String] returned by split
  val label = line.split(',').toString
  println("label:...." + label)
  val vector = Vectors.dense(label.map(_.toDouble).toArray)
  println("vector11111:......" + vector)
  (label, vector)

  /**
   * Or, written this way:
   */
  // Variant 2: keep the Array[String] and convert each field
  val label2 = line.split(',')
  val aa2 = label2.map(_.toDouble)
  val vector2 = Vectors.dense(label2.map(_.toDouble))
  println("vector22222:...." + vector2)
  (label2, vector2)
}
labelsAndData.foreach(println)

val data = labelsAndData.values
println("---------******---------")
println("data:" + data)
data.foreach(println)

val dataAsArray = data.map(_.toArray)
println("dataAsArray:" + dataAsArray)
dataAsArray.foreach(println)

// Element-wise sum over all rows
val sums = dataAsArray.reduce(
  (a, b) => a.zip(b).map(t => t._1 + t._2)
)
for (ele <- sums) println(ele)
println("number of sums: " + sums.length)
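For reference, the difference between the two variants can be isolated without Spark (a minimal sketch; the `@...` hash suffix in the array's toString will vary per run). Calling toString on an Array[String] does not join the fields; it returns the JVM's default object representation, and mapping `_.toDouble` over that String iterates its characters, producing one code-point value per character — which is a plausible source of the extra columns:

```scala
// Minimal non-Spark sketch isolating the two parsing variants.
object SplitVariants extends App {
  val line = "1.0,2.0,3.0"

  // Variant 1: Array.toString yields something like "[Ljava.lang.String;@1b6d3586",
  // NOT the joined field values.
  val label: String = line.split(',').toString
  // Mapping _.toDouble over a String converts each Char to its numeric
  // code point -- one value per character of the toString text, not 3 fields.
  val charCodes: Array[Double] = label.map(_.toDouble).toArray
  println(charCodes.length)

  // Variant 2: keep the Array[String] and convert each field.
  val fields: Array[String] = line.split(',')
  val values: Array[Double] = fields.map(_.toDouble)
  println(values.mkString(","))   // 1.0,2.0,3.0
}
```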
The output is as follows:
17/06/01 13:29:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60540 (size: 9.8 KB, free: 1132.5 MB)
17/06/01 13:29:48 INFO SparkContext: Created broadcast 0 from textFile at zip.scala:72
----11122221-----
17/06/01 13:29:48 INFO FileInputFormat: Total input paths to process : 1
17/06/01 13:29:48 INFO SparkContext: Starting job: foreach at zip.scala:74
17/06/01 13:29:48 INFO DAGScheduler: Got job 0 (foreach at zip.scala:74) with 1 output partitions
17/06/01 13:29:48 INFO DAGScheduler: Final stage: ResultStage 0 (foreach at zip.scala:74)
17/06/01 13:29:48 INFO DAGScheduler: Parents of final stage: List()
17/06/01 13:29:48 INFO DAGScheduler: Missing parents: List()
17/06/01 13:29:48 INFO DAGScheduler: Submitting ResultStage 0 (E:\test1.csv MapPartitionsRDD[1] at textFile at zip.scala:72), which has no missing parents
17/06/01 13:29:48 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 120.5 KB)
17/06/01 13:29:48 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1799.0 B, free 122.2 KB)
17/06/01 13:29:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60540 (size: 1799.0 B, free: 1132.5 MB)
17/06/01 13:29:48 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/06/01 13:29:48 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (E:\test1.csv MapPartitionsRDD[1] at textFile at zip.scala:72)
17/06/01 13:29:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/06/01 13:29:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2120 bytes)
17/06/01 13:29:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/06/01 13:29:48 INFO HadoopRDD: Input split: file:/E:/test1.csv:0+21
17/06/01 13:29:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/06/01 13:29:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/06/01 13:29:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/06/01