Source code of the filter operator
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
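As the source shows, filter is a thin wrapper over a per-partition iterator filter: sc.clean makes the predicate serializable for shipping to executors, each partition's iterator is filtered lazily, and preservesPartitioning = true keeps the parent partitioner, since dropping elements cannot invalidate it. To make that concrete, here is a minimal sketch that rebuilds the same even-number filter with mapPartitions instead; the class name is hypothetical, and it assumes Spark 2.x, where FlatMapFunction returns an Iterator (in 1.x it returned an Iterable). Unlike the real implementation, it buffers the kept elements rather than filtering lazily.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FilterViaMapPartitions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("FilterViaMapPartitions").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<Integer> numberRDD = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Equivalent of numberRDD.filter(n -> n % 2 == 0), written the way
        // MapPartitionsRDD applies it: one iterator-level pass per partition.
        JavaRDD<Integer> evens = numberRDD.mapPartitions(iter -> {
            List<Integer> kept = new ArrayList<>();
            while (iter.hasNext()) {
                Integer n = iter.next();
                if (n % 2 == 0) { // predicate returns true -> keep the element
                    kept.add(n);
                }
            }
            return kept.iterator();
        });

        System.out.println(evens.collect()); // [2, 4]
        sc.close();
    }
}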
Java code demo
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

public class FilterOperator {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("FilterOperator").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> numberRDD = sc.parallelize(numbers);
        /**
         * Return a new dataset formed by selecting those elements of the source
         * on which func returns true.
         *
         * filter keeps the elements for which the predicate returns true and
         * drops the ones for which it returns false.
         */
        JavaRDD<Integer> results = numberRDD.filter(new Function<Integer, Boolean>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Boolean call(Integer number) throws Exception {
                return number % 2 == 0; // keep even numbers only
            }
        });
        results.foreach(new VoidFunction<Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Integer result) throws Exception {
                System.out.println(result);
            }
        });
        sc.close();
    }
}
Output
2
4

This article has walked through the source of Spark's filter operator and how it works, and the Java demo showed how to use it to select elements from an RDD, in this case keeping only the even numbers. filter is one of Spark's core transformations and a staple of data cleaning and preprocessing.
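To illustrate that last point, below is a minimal data-cleaning sketch, assuming Java 8+ lambda syntax; the class name and input path are hypothetical and not part of the original demo. It applies filter twice to drop blank lines and comment lines from a text file.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FilterForCleaning {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("FilterForCleaning").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path; replace with a real file.
        JavaRDD<String> lines = sc.textFile("data/app.log");

        // Data cleaning with filter: keep only non-blank, non-comment lines.
        JavaRDD<String> cleaned = lines
                .filter(line -> !line.trim().isEmpty()) // drop blank lines
                .filter(line -> !line.startsWith("#")); // drop comment lines

        System.out.println("kept " + cleaned.count() + " lines");
        sc.close();
    }
}

Chaining several small filter calls like this keeps each predicate readable, and because filter is a lazy, narrow transformation, Spark still evaluates them in a single pass over each partition.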