(I keep forgetting these things... time to write them down.)
1. In local[*] mode, the app never shows up on the spark://xxxx:7077 master page, because local mode runs everything in a single JVM and never registers with the standalone master.
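For contrast, the two ways of starting the shell (host xxxx stands in for a real master, as above; --master is the standard spark-shell option):

spark-shell --master "local[*]"          # everything runs in one JVM; nothing registers with the master
spark-shell --master spark://xxxx:7077   # the app registers and shows up on the master's page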
2. With spark-shell against a cluster, the output of rdd.foreach(println) ends up in the app's executor logs, not on the driver console (a silly note, I know...).
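A one-liner to see it happen (sc is the SparkContext that spark-shell provides):

scala> sc.parallelize(1 to 100).foreach(println)   // the closure runs on the executors,
                                                   // so the output is split across their stdout logs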
3. With spark-shell on a cluster, to see output on the driver you need collect(), but that can cause an OutOfMemoryError on the driver; use take() instead.
For example: rdd.take(1000).foreach(println) // grab just 1000 elements
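Putting the two side by side (the dataset here is made up just to illustrate the risk):

scala> val big = sc.parallelize(1 to 10000000)
scala> // big.collect().foreach(println)  // pulls the whole dataset into driver memory; can OOM
scala> big.take(1000).foreach(println)    // pulls only the first 1000 elements to the driver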
4. filter usage, tested in local[4] mode (the table format used in the blog is a bit rigid...)
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

scala> val data1=Array(1,2,3,4,5,6,7,8,9,10,11,12)
data1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

scala> def getEven(x:Int) = { if (x%2 == 1) false else true }
getEven: (x: Int)Boolean

scala> var rdd1=sc.parallelize(data1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> rdd1.filter(getEven).foreach(println)
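The same filter is more often written inline with an anonymous function, equivalent to getEven above:

scala> rdd1.filter(_ % 2 == 0).foreach(println)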
5. A ParallelCollectionRDD is made up of ParallelCollectionPartitions
scala> val data1=Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
data1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

scala> var rdd1=sc.parallelize(data1, 4)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> rdd1.partitions
res0: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@691, org.apache.spark.rdd.ParallelCollectionPartition@692, org.apache.spark.rdd.ParallelCollectionPartition@693, org.apache.spark.rdd.ParallelCollectionPartition@694)
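To see which elements went to which ParallelCollectionPartition, glom() collapses each partition into an array; continuing the session above in local[4], the 12 elements should come back sliced evenly across the 4 partitions:

scala> rdd1.glom().collect()
res1: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9), Array(10, 11, 12))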