Spark operator notes

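This post walks through several key Spark operators: aggregate, used to compute the sum and count of elements; mapToPair, which turns data into key-value pairs; a custom partitioner class, MyPartitioner, that implements a specific partitioning strategy; and mapPartitions, which runs complex per-partition work such as the SCH calculation on SequenceForCalc.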

**************************************
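1. aggregate takes a zero value, a seqOp and a combOp. In the seqOp's Function2<U, T, U>, the first type parameter is the accumulator (the type of the zero value), the second is the RDD element type, and the third is the return type; the first and third may be a custom type.
For example, to return both the sum of the elements and their count, pass an initialized custom object as the zero value.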
// seqOp adds one element's distance to the partition-local total;
// combOp adds the per-partition totals together.
Double distenceAccount = sequencesRDD.aggregate(0.0, new Function2<Double, SequenceForCalc, Double>() {
    @Override
    public Double call(Double v1, SequenceForCalc v2) throws Exception {
        return v1 + v2.getDistanceOfSequence();
    }
}, new Function2<Double, Double, Double>() {
    @Override
    public Double call(Double v1, Double v2) throws Exception {
        return v1 + v2;
    }
});
// Average distance that each of the numSlices partitions should receive.
double distencePerPartition = distenceAccount / numSlices;
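
The note above about returning the sum and the count together can be sketched in plain Java, without a Spark dependency. SumCount, AggregateSketch and the list-of-lists "partitions" are hypothetical names made up for this illustration; the sketch only mimics how aggregate applies the seqOp inside each partition and the combOp across partitions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical accumulator holding a running sum and an element count. */
class SumCount implements java.io.Serializable {
    double sum;
    long count;

    SumCount(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }
}

class AggregateSketch {
    // seqOp: fold one element (here a plain distance) into a partition-local accumulator.
    static final BiFunction<SumCount, Double, SumCount> SEQ_OP = (acc, distance) -> {
        acc.sum += distance;
        acc.count += 1;
        return acc;
    };

    // combOp: merge the accumulators produced by two partitions.
    static final BiFunction<SumCount, SumCount, SumCount> COMB_OP = (a, b) -> {
        a.sum += b.sum;
        a.count += b.count;
        return a;
    };

    /** Mimics aggregate(zero, seqOp, combOp) over a list of simulated partitions. */
    static SumCount aggregate(List<List<Double>> partitions) {
        SumCount total = new SumCount(0.0, 0L);
        for (List<Double> partition : partitions) {
            // Each partition starts from its own copy of the zero value.
            SumCount local = new SumCount(0.0, 0L);
            for (Double d : partition) {
                local = SEQ_OP.apply(local, d);
            }
            total = COMB_OP.apply(total, local);
        }
        return total;
    }

    public static void main(String[] args) {
        SumCount r = aggregate(Arrays.asList(
                Arrays.asList(1.5, 2.5),
                Arrays.asList(4.0)));
        System.out.println("sum=" + r.sum + ", count=" + r.count);
    }
}
```

With Spark, the same two functions would be written as a Function2<SumCount, SequenceForCalc, SumCount> and a Function2<SumCount, SumCount, SumCount> and passed to sequencesRDD.aggregate together with new SumCount(0.0, 0L). The accumulator class needs to be Serializable because it is shipped between the driver and the executors.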
************************************************************
2. mapToPair: each element becomes the key of a pair whose value is 1
JavaPairRDD<SequenceForCalc, Integer> rddKeyVal =
        sequencesRDD.mapToPair(new PairFunction<SequenceForCalc, SequenceForCalc, Integer>() {
            @Override
            public Tuple2<SequenceForCalc, Integer> call(SequenceForCalc sequenceForCalc) {
                return new Tuple2<>(sequenceForCalc, 1);
            }
        });
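
Because the pair RDD is keyed by the SequenceForCalc object itself, and it is later sorted with sortByKey(false) without an explicit Comparator, the key class presumably has to define its own ordering. The source does not show SequenceForCalc, so the sketch below uses a hypothetical stand-in, Seq, ordered by distance, and imitates "pair with 1, then sort by key descending" on an ordinary list. PairSketch and toSortedPairs are likewise invented names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for SequenceForCalc: only the distance matters here. */
class Seq implements Comparable<Seq>, java.io.Serializable {
    final double distance;

    Seq(double distance) {
        this.distance = distance;
    }

    @Override
    public int compareTo(Seq other) {
        // Natural ordering: ascending by distance.
        return Double.compare(distance, other.distance);
    }
}

class PairSketch {
    /** Pairs each element with 1, then sorts the pairs by key, largest first. */
    static List<Map.Entry<Seq, Integer>> toSortedPairs(List<Seq> seqs) {
        List<Map.Entry<Seq, Integer>> pairs = new ArrayList<>();
        for (Seq s : seqs) {
            pairs.add(new SimpleEntry<>(s, 1));
        }
        // Reverse the natural key ordering to get a descending sort.
        pairs.sort(Map.Entry.<Seq, Integer>comparingByKey().reversed());
        return pairs;
    }
}
```

Sorting largest-first matters for the partitioning step that follows: placing the longest sequences first gives the greedy balancing the most room to even out the partitions.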


**********************************************
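3. Custom partitioning. In the application this came from, the data had to be sorted first and then processed as a whole.
MyPartitioner is the custom partitioner class, and numSlices is the number of partitions.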


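// Sort by key in descending order and gather everything into one partition,
// then redistribute with the custom partitioner.
JavaPairRDD<SequenceForCalc, Integer> rdd4Repartition = rddKeyVal.sortByKey(false).coalesce(1);
JavaPairRDD<SequenceForCalc, Integer> rdd4Calculate = rdd4Repartition.partitionBy(new MyPartitioner(numSlices));

// MyPartitioner.java
import org.apache.spark.Partitioner;

public class MyPartitioner extends Partitioner {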
    private int partitionNum;
    double[] partitionDisLst;


    public MyPartitioner(int num) {
        this.partitionNum = num;
        iniPartitionLst();
    }


    private void iniPartitionLst(){
        partitionDisLst = new double[partitionNum];
        for (int i = 0; i < partitionNum; i++) {
            partitionDisLst[i] = 0;
        }
    }


    @Override
    public int numPartitions(){
        return partitionNum;
    }


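    @Override
    public int getPartition(Object obj){
        //System.out.println("enter into getPartition");
        SequenceForCalc sequence = (SequenceForCalc)obj;
        // Scan the partitions for the one with the smallest accumulated sequence
        // length, add the current sequence's length to it, and return its index.
        // Note: the result depends on call order rather than on the key alone, so
        // this presumably relies on one task seeing every record (hence coalesce(1)).
        int iLenMin = 0;
        for(int i = 1; i < partitionDisLst.length; i++){
            if (partitionDisLst[i] < partitionDisLst[iLenMin]) {
                iLenMin = i;
            }
        }
        partitionDisLst[iLenMin] += sequence.getDistanceOfSequence();
        //System.out.println("repar len = " + sequence.getDistanceOfSequence() + " iLenMin = " + iLenMin);
        return iLenMin;
    }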


    @Override
    public boolean equals(Object obj) {
        if (null == obj) {
            return false;
        }
        if (obj == this) {
            return true;
        }
        if (!(obj instanceof MyPartitioner)) {
            return false;
        }
        MyPartitioner myPartitioner = (MyPartitioner)obj;
        if (this.partitionNum == myPartitioner.partitionNum) {
            return true;
        }
        else {
            return false;
        }
    }


    @Override
    public int hashCode(){
        return this.partitionNum;
    }
}
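
The balancing rule inside getPartition can be tried out on its own. The sketch below is a plain-Java imitation with no Spark dependency: GreedyBalancer is a hypothetical name, and bare doubles stand in for sequence distances. It shows why the input is sorted in descending order first: each item goes to the partition that is currently least loaded, so placing the large items first leaves the small ones to fill the gaps.

```java
import java.util.Arrays;

class GreedyBalancer {
    private final double[] loads;

    GreedyBalancer(int numPartitions) {
        loads = new double[numPartitions]; // Java zero-initializes the array
    }

    /** Picks the least-loaded partition (lowest index on ties) and adds the distance to it. */
    int assign(double distance) {
        int min = 0;
        for (int i = 1; i < loads.length; i++) {
            if (loads[i] < loads[min]) {
                min = i;
            }
        }
        loads[min] += distance;
        return min;
    }

    /** Returns a copy of the accumulated load per partition. */
    double[] loads() {
        return loads.clone();
    }

    public static void main(String[] args) {
        GreedyBalancer balancer = new GreedyBalancer(2);
        double[] sortedDesc = {9, 7, 6, 5, 4}; // already sorted largest-first
        for (double d : sortedDesc) {
            System.out.println(d + " -> partition " + balancer.assign(d));
        }
        System.out.println(Arrays.toString(balancer.loads()));
    }
}
```

One design caveat: a Spark Partitioner is normally expected to return the same partition for the same key every time, while this rule keeps state between calls. It seems to work in the code above only because coalesce(1) routes all records through a single task in sorted order; without that step, each task would balance with its own private copy of the loads.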
********************************************************
4. mapPartitions
JavaRDD<Tuple3<String, Integer, String>> rdd = rdd4Calculate
        .mapPartitions(new FlatMapFunction<Iterator<Tuple2<SequenceForCalc, Integer>>, Tuple3<String, Integer, String>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterator<Tuple3<String, Integer, String>> call(Iterator<Tuple2<SequenceForCalc, Integer>> item)
                    throws Exception {
                // One result tuple (info, result code, error text) per sequence in this partition.
                List<Tuple3<String, Integer, String>> set = new LinkedList<>();
                try {
                    while (item.hasNext()) {
                        SequenceForCalc sequenceForCalc = item.next()._1;
                        // ******* (code omitted in the original; info, result, sw and pw
                        // are presumably declared here)
                        try {
                            info = String.format("%s-%d", sequenceForCalc.isLinks() ? "link" : "lane",
                                    sequenceForCalc.getGroupID());
                            pw = new PrintWriter(sw);
                            LOG.info("sch calculator: " + info);
                            sequenceForCalc.CalcSCH(oracleConnForSCHCalc);
                        } catch (Exception ex) {
                            LOG.error(ex.getMessage(), ex);
                            // pw is still unset if String.format itself failed.
                            if (null != pw) {
                                pw.print(ex.getMessage());
                            }
                            result = CALC_FAILURE;
                        } finally {
                            if (null != pw) {
                                pw.close();
                            }
                        }
                        set.add(new Tuple3<>(info, result, sw.toString()));
                    }
                    return set.iterator();
                } catch (Exception ex) {
                    throw new Exception(ex);
                } finally {
                    // The connection is shared by the whole partition and closed once at the end.
                    if (null != oracleConnForSCHCalc) {
                        oracleConnForSCHCalc.close();
                    }
                }
            }
        });


// Print the SCH calculation results
List<Tuple3<String, Integer, String>> results = rdd.collect();
List<Tuple3<String, Integer, String>> failure = new LinkedList<>();
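
The shape of the mapPartitions function above (one shared resource per partition, per-record error handling, cleanup at the end) can be reduced to a small plain-Java sketch. FakeConnection and PartitionSketch are hypothetical stand-ins, not part of the original code or of Spark; strings stand in for the sequences, and the open/close counters exist only to make the once-per-partition behaviour visible.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a costly resource such as a database connection. */
class FakeConnection implements AutoCloseable {
    static int opened = 0;
    static int closed = 0;

    FakeConnection() {
        opened++;
    }

    String calc(String item) {
        if (item.isEmpty()) {
            throw new IllegalArgumentException("empty item");
        }
        return item + ":ok";
    }

    @Override
    public void close() {
        closed++;
    }
}

class PartitionSketch {
    /** Has the shape of a mapPartitions function: iterator in, iterator out. */
    static Iterator<String> processPartition(Iterator<String> items) {
        List<String> results = new ArrayList<>();
        // One connection for the whole partition, closed even if something fails.
        try (FakeConnection conn = new FakeConnection()) {
            while (items.hasNext()) {
                String item = items.next();
                try {
                    results.add(conn.calc(item));
                } catch (RuntimeException ex) {
                    // Record the failure for this item and keep processing the rest.
                    results.add(item + ":failed");
                }
            }
        }
        return results.iterator();
    }
}
```

Opening the resource once per partition rather than once per record is the main reason to reach for mapPartitions instead of map. The trade-off in this style is that results are buffered in a list, so a whole partition's output is held in memory before the iterator is returned.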
*************************************************************************