Joins in Beam

++Beam version: 2.3++


Beam's core API does not provide a join operator, but a separate extension library does. Add the following to your pom.xml to use it:

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-extensions-join-library</artifactId>
    <version>2.3.0</version>
</dependency>

Usage example:

PCollection<KV<String, String>> leftPcollection = ...
PCollection<KV<String, Long>> rightPcollection = ...

PCollection<KV<String, KV<String, Long>>> joinedPcollection =
  Join.innerJoin(leftPcollection, rightPcollection);
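Besides innerJoin, the same extension library also provides outer-join variants. These take an explicit "null value" that stands in for the missing side of the pair, since the join result cannot carry nulls. A sketch of the left-outer variant with the same types as above (assuming the 2.3.0 API of the join library):

```java
// Left outer join: every key in leftPcollection is kept; keys with no
// match in rightPcollection get the supplied null value (0L here).
PCollection<KV<String, KV<String, Long>>> leftJoined =
    Join.leftOuterJoin(leftPcollection, rightPcollection, 0L);
```

rightOuterJoin and fullOuterJoin follow the same pattern, with null values supplied for whichever side may be missing.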

Reference page:
https://beam.apache.org/documentation/sdks/java-extensions/

Test code:

/**
 * Join test in Beam
 * Beam version: 2.3
 *
 * @author maqy
 * @date 2018.09.22
 */

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.joinlibrary.Join;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.*;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class JoinTest {
    // Prints each KV element of a PCollection and passes it through unchanged
    static class PrintString extends DoFn<KV<String, String>, KV<String, String>> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            System.out.println("c.element:" + c.element());
            c.output(c.element());
        }
    }

    // Splits each line on the comma into a key and a value
    static class SetValue extends DoFn<String, KV<String, String>> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            System.out.println("c.element:" + c.element());
            String[] temps = c.element().split(",");
            KV<String, String> kv = KV.of(temps[0], temps[1]);
            c.output(kv);
        }
    }

    // Preprocessing: convert each line of the form "a,b" into KV<a, b>
    public static class Preprocess extends PTransform<PCollection<String>, PCollection<KV<String, String>>> {
        @Override
        public PCollection<KV<String, String>> expand(PCollection<String> lines) {
            return lines.apply(ParDo.of(new SetValue()));
        }
    }

    // Formats a joined KV pair as a line of text for file output
    public static class FormatAsTextFn extends SimpleFunction<KV<String, KV<String, String>>, String> {
        @Override
        public String apply(KV<String, KV<String, String>> input) {
            return "key:" + input.getKey() + "   value:" + input.getValue();
        }
    }

    public interface JoinTestOptions extends PipelineOptions, StreamingOptions {
        /**
         * Set this option to choose a different input file or glob.
         */
        @Description("Path of the file to read from")
        @Default.String("/home/maqy/Documents/beam_samples/output/test.txt")
        String getInputFile();
        void setInputFile(String value);

        /**
         * Set this required option to specify where to write the output.
         */
        @Description("Path of the file to write to")
        @Validation.Required
        @Default.String("/home/maqy/文档/beam_samples/output/GroupbyTest")
        String getOutput();
        void setOutput(String value);
    }

    public static void main(String[] args) {
        JoinTestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(JoinTestOptions.class);
        options.setStreaming(true); // explicitly selects streaming (or batch) mode
        Pipeline p = Pipeline.create(options);

        // First input file
        PCollection<String> lines1 = p.apply(TextIO.read().from("/home/maqy/桌面/output/BeamJoin1"));
        // Second input file
        PCollection<String> lines2 = p.apply(TextIO.read().from("/home/maqy/桌面/output/BeamJoin2"));
        // Preprocessing: convert each line of the form "a,b" into KV<a, b>
        PCollection<KV<String, String>> leftPcollection = lines1.apply(new Preprocess());
        PCollection<KV<String, String>> rightPcollection = lines2.apply(new Preprocess());

        PCollection<KV<String, KV<String, String>>> joinedPcollection =
                Join.innerJoin(leftPcollection, rightPcollection);

        joinedPcollection.apply(MapElements.via(new FormatAsTextFn()))
                .apply("WriteCounts", TextIO.write().to("/home/maqy/桌面/output/BeamJoinout1"));

        p.run().waitUntilFinish();
    }
}

The part before the comma is used as the key.

Contents of leftPcollection:

a,b
c,d
ma,li
wo shi shui,maqy

Contents of rightPcollection:

a,mmmm
c,d
wo shi shui,lix

Join result:

key:a   value:KV{b, mmmm}
key:wo shi shui   value:KV{maqy, lix}
key:c   value:KV{d, d}
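The inner-join semantics shown above (only keys present in both inputs survive, with the left and right values paired as KV{left, right}) can be checked outside Beam with a minimal self-contained sketch over the same sample data. Note that the order of Beam's distributed output is not deterministic; the class and method names here are illustrative, not part of the Beam API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InnerJoinDemo {
    // Inner join over two maps: emit only keys present on both sides,
    // pairing the left and right values like Beam's KV{left, right}.
    static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            if (right.containsKey(e.getKey())) {
                out.add("key:" + e.getKey() + "   value:KV{"
                        + e.getValue() + ", " + right.get(e.getKey()) + "}");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> left = new LinkedHashMap<>();
        left.put("a", "b");
        left.put("c", "d");
        left.put("ma", "li");
        left.put("wo shi shui", "maqy");

        Map<String, String> right = new LinkedHashMap<>();
        right.put("a", "mmmm");
        right.put("c", "d");
        right.put("wo shi shui", "lix");

        // "ma" appears only on the left, so it is dropped by the inner join
        innerJoin(left, right).forEach(System.out::println);
    }
}
```

Running this prints the same three joined rows as the Beam pipeline, just in insertion order.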