Learning Simple Spark Operators

License: CC 4.0 BY-SA

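foreach: visits every row of the RDD; the callback returns nothing.
map: called once per row; the callback returns exactly one object: public Object call(String arg0) throws Exception
flatMap: called once per row; the callback returns zero or more objects: public Iterable<Object> call(String arg0) throws Exception (this is the Spark 1.x signature; since Spark 2.0 the callback returns an Iterator instead of an Iterable)
mapToPair: called once per row; the callback returns one key/value pair: public Tuple2<K2, V2> call(String arg0) throws Exception
flatMapToPair: called once per row; the callback returns zero or more key/value pairs: public Iterable<Tuple2<K2, V2>> call(String arg0) throws Exception
The callbacks must never return null, otherwise the resulting RDD cannot be used; in the flat variants, return an empty collection when there is nothing to emit.
The flat variants can split one row into multiple rows.
mapPartitions: the input of map is each element of the RDD, which in most cases can be thought of as one row; the input of mapPartitions is each partition of the RDD, passed to the callback as an Iterator<>.
filter: produces a new RDD that contains only the elements for which the callback returns true; the callback returns a boolean.
aggregateByKey: the type of the returned value does not have to match the type of the values in the RDD.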
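
The difference between the one-to-one operators and the flat variants can be tried out without a cluster. The sketch below is only an analogy under stated assumptions: it uses java.util.stream and plain lists instead of Spark, a List<String> stands in for an RDD of rows, and the class and method names (OperatorSketch, lengths, words, nonEmpty, rowsPerPartition) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy, NOT Spark code: a List<String> stands in for an RDD of rows.
class OperatorSketch {

    // map: exactly one output object per input row.
    static List<Integer> lengths(List<String> rows) {
        return rows.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: zero or more output objects per input row.
    // An empty row yields an empty stream, never null.
    static List<String> words(List<String> rows) {
        return rows.stream()
                .flatMap(row -> row.isEmpty()
                        ? Stream.<String>empty()
                        : Arrays.stream(row.split(" ")))
                .collect(Collectors.toList());
    }

    // filter: keeps only the rows for which the predicate returns true.
    static List<String> nonEmpty(List<String> rows) {
        return rows.stream().filter(row -> !row.isEmpty()).collect(Collectors.toList());
    }

    // mapPartitions: the callback runs once per partition and receives an
    // Iterator over all rows of that partition (here it just counts them).
    static List<Integer> rowsPerPartition(List<List<String>> partitions) {
        List<Integer> counts = new ArrayList<>();
        for (List<String> partition : partitions) {
            Iterator<String> it = partition.iterator();
            int n = 0;
            while (it.hasNext()) {
                it.next();
                n++;
            }
            counts.add(n);
        }
        return counts;
    }
}
```

For the rows "a b", "" and "c", lengths returns one value per row, while words drops the empty row entirely and splits the first row into two elements.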

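For requirements of the form "group by, then take the maximum, minimum, or average", first call rdd.groupByKey() to obtain a JavaPairRDD<String, Iterable<String>>, then call rdd.mapToPair with the callback
public Tuple2<String, Integer> call(Tuple2<String, Iterable<String>> arg0) throws Exception
and compute the result over the Iterable<String> inside that callback.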
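
The two steps can be sketched in plain Java. This is not Spark code: a Map<String, List<String>> plays the role of the JavaPairRDD<String, Iterable<String>>, each String[] {key, value} plays the role of an input pair, and the names GroupThenMaxSketch, groupByKey and maxPerKey are assumptions chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogy, NOT Spark code.
class GroupThenMaxSketch {

    // Step 1, analogous to groupByKey(): collect all values that share a key.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // Step 2, analogous to the mapToPair callback: reduce the values of each
    // key to a single Integer, here the maximum.
    static Map<String, Integer> maxPerKey(Map<String, List<String>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            int max = Integer.MIN_VALUE;
            for (String value : entry.getValue()) {
                max = Math.max(max, Integer.parseInt(value));
            }
            result.put(entry.getKey(), max);
        }
        return result;
    }
}
```

Swapping Math.max for Math.min, or for a sum divided by the group size, covers the minimum and average cases in the same way.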

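Callback method of reduceByKey:
public Record call(Record arg0, Record arg1) throws Exception {
    return record; // the Record that merges arg0 and arg1
}
For a given key, the first time this method runs, arg0 and arg1 are the first and second records; on every later call, arg0 is the return value of the previous call and arg1 is the next unprocessed record. With more than one partition, the same method is also used to merge the partial results of the partitions.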

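To take a maximum or minimum, the first Record can serve as the buffer; for an average and similar statistics, add extra fields to the Record object (such as a running sum and a count) to implement the business logic.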
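
The calling pattern described above can be imitated with Map.merge, which stores the first value of a key as is and afterwards calls the function with (previous result, next value). The following is a minimal plain-Java sketch, not Spark code; the Record class with its sum and count fields and the method averagePerKey are hypothetical and only illustrate the "extra fields" idea for an average.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogy, NOT Spark code.
class ReduceByKeySketch {

    // Hypothetical Record: carries a running sum and a count so that an
    // average can be derived once the reduce has finished.
    static final class Record {
        final long sum;
        final long count;

        Record(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Counterpart of the reduceByKey callback: merges two Records into one.
    static Record call(Record arg0, Record arg1) {
        return new Record(arg0.sum + arg1.sum, arg0.count + arg1.count);
    }

    static Map<String, Double> averagePerKey(String[] keys, long[] values) {
        Map<String, Record> reduced = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            // The first record of a key is stored unchanged; after that,
            // call(previous result, next record) is invoked.
            reduced.merge(keys[i], new Record(values[i], 1), ReduceByKeySketch::call);
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, Record> entry : reduced.entrySet()) {
            Record r = entry.getValue();
            averages.put(entry.getKey(), (double) r.sum / r.count);
        }
        return averages;
    }
}
```

Because the callback must combine two values of the same type, the average itself cannot be the reduced value; carrying sum and count and dividing at the end keeps the merge associative.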

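aggregate
Aggregation without a key.
First callback class: new Function2<ZeroValueBuffer, Record, ZeroValueBuffer>()
with the callback method public ZeroValueBuffer call(ZeroValueBuffer arg0, Record record)
On the first call the first parameter holds the initial value (the first argument passed to aggregate); from the second call on, it is the return value of the previous call. The second parameter is the current record.
Second callback class: new Function2<ZeroValueBuffer, ZeroValueBuffer, ZeroValueBuffer>()
with the callback method public ZeroValueBuffer call(ZeroValueBuffer arg0, ZeroValueBuffer arg1)
With setMaster("local") this method was called only once: the first parameter is the first argument passed to aggregate, and the second parameter is the return value of the first callback. In general it runs once per partition, merging that partition's result into the overall result.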
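
The interplay of the two callbacks can be modeled in plain Java. The sketch below is an assumption-laden illustration, not Spark code: partitions are plain lists of integers, ZeroValueBuffer is a hypothetical buffer with sum and count fields, and combOpCalls is a counter added only to make visible how often the second callback runs.

```java
import java.util.List;

// Plain-Java analogy, NOT Spark code.
class AggregateSketch {

    // Hypothetical buffer type, named after the ZeroValueBuffer in the text.
    static final class ZeroValueBuffer {
        final long sum;
        final long count;

        ZeroValueBuffer(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Number of combOp invocations during the most recent aggregate() call.
    static int combOpCalls = 0;

    // First callback: folds one record into the buffer.
    static ZeroValueBuffer seqOp(ZeroValueBuffer buffer, int record) {
        return new ZeroValueBuffer(buffer.sum + record, buffer.count + 1);
    }

    // Second callback: merges two buffers.
    static ZeroValueBuffer combOp(ZeroValueBuffer a, ZeroValueBuffer b) {
        combOpCalls++;
        return new ZeroValueBuffer(a.sum + b.sum, a.count + b.count);
    }

    static ZeroValueBuffer aggregate(List<List<Integer>> partitions, ZeroValueBuffer zero) {
        combOpCalls = 0;
        ZeroValueBuffer total = zero;
        for (List<Integer> partition : partitions) {
            // Every partition starts again from the zero value.
            ZeroValueBuffer partial = zero;
            for (int record : partition) {
                partial = seqOp(partial, record);
            }
            // Merge the partition result into the running total.
            total = combOp(total, partial);
        }
        return total;
    }
}
```

With a single partition, combOp runs exactly once and receives the zero value and the result of the first callback, which matches the observation made with setMaster("local"). Because the zero value takes part in every partition and in the final merge, it should be neutral for the operation (here 0 for both sum and count).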