Help: KMeans algorithm — problem converting rows to Vectors

The problem: while preparing data for KMeans, I convert each line to a Vector inside a map operator. I tried two approaches, and they give different results — the first one mysteriously produces many extra columns. Why?

1. The data file is test1.csv


2. The full code:

import org.apache.spark.mllib.linalg.Vectors

val rawData = sc.textFile("E:\\test1.csv")
println("----11122221-----")
rawData.foreach(println)
val labelsAndData = rawData.map { line =>

  // Approach 1: call toString on the Array[String] returned by split
  val label = line.split(',').toString
  println("label:...." + label)
  val vector = Vectors.dense(label.map(_.toDouble).toArray)
  println("vector11111:......" + vector)
  (label, vector) // note: this tuple is discarded — only the last expression of the block is returned

  /**
    * Or, written this way instead
    */
  // Approach 2: keep the Array[String] and convert each field to Double
  val label2 = line.split(',')
  val aa2 = label2.map(_.toDouble)
  val vector2 = Vectors.dense(aa2)
  println("vector22222:...." + vector2)
  (label2, vector2)
}


labelsAndData.foreach(println)
val data = labelsAndData.values
println("---------******---------")
println("data:" + data)
data.foreach(println)

val dataAsArray = data.map(_.toArray)
println("dataAsArray:" + dataAsArray)
dataAsArray.foreach(println)
// element-wise sum over all rows
val sums = dataAsArray.reduce(
  (a, b) => a.zip(b).map(t => t._1 + t._2)
)
for (ele <- sums) println(ele)
println("sums count:" + sums.length)
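As a side check, the element-wise column sum performed by the reduce above can be reproduced in plain Scala without Spark. The sample rows below are made up for illustration:

```scala
object ZipSumDemo {
  def main(args: Array[String]): Unit = {
    // three "rows", each already converted to Array[Double]
    val rows = Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
    // same pattern as the reduce over dataAsArray:
    // pair up corresponding columns and add them
    val sums = rows.reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
    println(sums.mkString(","))  // 9.0,12.0
  }
}
```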

The output is:

17/06/01 13:29:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60540 (size: 9.8 KB, free: 1132.5 MB)
17/06/01 13:29:48 INFO SparkContext: Created broadcast 0 from textFile at zip.scala:72
----11122221-----
17/06/01 13:29:48 INFO FileInputFormat: Total input paths to process : 1
17/06/01 13:29:48 INFO SparkContext: Starting job: foreach at zip.scala:74
17/06/01 13:29:48 INFO DAGScheduler: Got job 0 (foreach at zip.scala:74) with 1 output partitions
17/06/01 13:29:48 INFO DAGScheduler: Final stage: ResultStage 0 (foreach at zip.scala:74)
17/06/01 13:29:48 INFO DAGScheduler: Parents of final stage: List()
17/06/01 13:29:48 INFO DAGScheduler: Missing parents: List()
17/06/01 13:29:48 INFO DAGScheduler: Submitting ResultStage 0 (E:\test1.csv MapPartitionsRDD[1] at textFile at zip.scala:72), which has no missing parents
17/06/01 13:29:48 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 120.5 KB)
17/06/01 13:29:48 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1799.0 B, free 122.2 KB)
17/06/01 13:29:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60540 (size: 1799.0 B, free: 1132.5 MB)
17/06/01 13:29:48 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/06/01 13:29:48 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (E:\test1.csv MapPartitionsRDD[1] at textFile at zip.scala:72)
17/06/01 13:29:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/06/01 13:29:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2120 bytes)
17/06/01 13:29:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/06/01 13:29:48 INFO HadoopRDD: Input split: file:/E:/test1.csv:0+21
17/06/01 13:29:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/06/01 13:29:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/06/01 13:29:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/06/01
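For what it's worth, the difference between the two approaches can be reproduced in plain Scala without Spark. Calling toString on an Array does not join its elements; it falls back to Object.toString and returns a JVM object description like "[Ljava.lang.String;@1b6d3586", and mapping _.toDouble over that String maps over its characters, which may explain the extra columns. The CSV line below is made up:

```scala
object SplitToStringDemo {
  def main(args: Array[String]): Unit = {
    val line = "1.0,2.0,3.0" // hypothetical CSV line

    // Approach 1: toString on an Array[String] does NOT join the fields;
    // it yields something like "[Ljava.lang.String;@1b6d3586"
    val label = line.split(',').toString
    println(label)
    // Mapping _.toDouble over a String maps over its *characters*:
    // each Char becomes its numeric code, one "column" per character
    val chars = label.map(_.toDouble).toArray
    println(chars.length) // length of the toString text, not 3

    // Approach 2: keep the Array[String] and convert each field
    val fields = line.split(',').map(_.toDouble)
    println(fields.length) // 3
  }
}
```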