Personal Notes on Learning RDDs

(I keep forgetting things... time to write them down.)

1. In local[*] mode, the app does not show up on the standalone master's page (spark://xxxx:7077), because everything runs in a single local JVM and the application never registers with that master.
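A minimal sketch of the two launch commands involved (xxxx stands for the master host, as in the note above). In local[*] mode the shell never connects to the standalone master, so that master's web UI has nothing to display:

# runs driver and executors in one local JVM; no standalone master involved
bin/spark-shell --master local[*]

# registers with the standalone master, so the app appears on its web UI
bin/spark-shell --master spark://xxxx:7077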

2. With spark-shell in cluster mode, the output of rdd.foreach(println) ends up in the application's executor logs, not on the driver console. (A rather silly note, I know...)
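On a standalone cluster those logs can usually be found under each worker's work directory (a sketch of the layout; the app and executor IDs are placeholders), or through the executor's stdout link on the master's web UI:

$SPARK_HOME/work/<app-id>/<executor-id>/stdout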

3. With spark-shell in cluster mode, seeing output on the driver requires collect(), but collecting a large RDD can drive the driver out of memory; use take() instead.

For example: rdd.take(1000).foreach(println) // just fetch 1000 elements~
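As a sketch of the options (rdd stands for any large RDD; toLocalIterator, not mentioned above, is a third option worth knowing):

scala> rdd.collect().foreach(println)        // materializes the whole RDD on the driver; risks OOM
scala> rdd.take(1000).foreach(println)       // brings back only the first 1000 elements
scala> rdd.toLocalIterator.foreach(println)  // streams one partition at a time; the driver holds at most one partition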

4. Usage of filter, tested in local[4] mode. (The table-style usage listing from the blog feels a bit rigid...)

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
scala> val data1=Array(1,2,3,4,5,6,7,8,9,10,11,12)
data1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

scala> def getEven(x:Int) = { if (x%2 == 1) false else true }
getEven: (x: Int)Boolean

scala> var rdd1=sc.parallelize(data1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> rdd1.filter(getEven).foreach(println)
2
10
12
4
6
8
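Note the out-of-order output: in local[4] the partitions are processed in parallel, and each task's println reaches the console whenever that task happens to run. To see the surviving elements in their original order on the driver, collect first (a small sketch reusing rdd1 and getEven from above):

scala> rdd1.filter(getEven).collect().foreach(println)  // collect() keeps partition order, so this prints 2, 4, 6, 8, 10, 12 in order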

5. A ParallelCollectionRDD is made up of ParallelCollectionPartitions

scala> val data1=Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
data1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

scala> var rdd1=sc.parallelize(data1, 4)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> rdd1.partitions
res0: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@691, org.apache.spark.rdd.ParallelCollectionPartition@692, org.apache.spark.rdd.ParallelCollectionPartition@693, org.apache.spark.rdd.ParallelCollectionPartition@694)
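To see which elements went into which partition, glom() turns each partition into an array (a quick sketch continuing the session above; the split of 12 elements over 4 partitions is shown as an expected-output comment):

scala> rdd1.glom().collect()
// expected: Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9), Array(10, 11, 12))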
(To be continued... and continued... +1)

