Apache Flink Simple, Progressive Tutorial (Part 2): Stream Processing with the DataStream API: Ingesting wikiedits Data

This article walks through using Flink to read the Wikipedia IRC channel and compute the number of bytes each user edits. We first scaffold a Maven project, then write the Flink code, and finally submit the job to a Flink cluster, writing the results to Kafka and monitoring the job through the Flink dashboard.


1. Introduction

Wikipedia provides an IRC channel that logs every edit made to the wiki. We will read this channel in Flink and compute the number of bytes each user edits within a given time window. Implementing this in Flink is simple and convenient, and working through it should deepen your understanding of Flink and prepare you to write more complex, professional applications.

2. Quickly Scaffolding the Maven Project

We use the Flink Maven archetype to create the project structure. Run the following command:

$ mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-java \
    -DarchetypeVersion=1.8.0 \
    -DgroupId=wiki-edits \
    -DartifactId=wiki-edits \
    -Dversion=0.1 \
    -Dpackage=wikiedits \
    -DinteractiveMode=false

Inspect the generated directory structure:

$ tree wiki-edits
wiki-edits/
├── pom.xml
└── src
    └── main
        ├── java
        │   └── wikiedits
        │       ├── BatchJob.java
        │       └── StreamingJob.java
        └── resources
            └── log4j.properties

Add the dependencies to pom.xml:

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-wikiedits_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>
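
The dependencies above reference a ${flink.version} property. The pom.xml generated by the archetype normally defines it already; if yours does not, a minimal sketch of the property block (assuming Flink 1.8.0, the archetype version used above) is:

<properties>
    <flink.version>1.8.0</flink.version>
</properties>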

 

3. Writing the Flink Code

Start your IDE and create the file src/main/java/wikiedits/WikipediaAnalysis.java:

package wikiedits;

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;

public class WikipediaAnalysis {

  public static void main(String[] args) throws Exception {


    // A Flink program starts by creating a StreamExecutionEnvironment
    // (or an ExecutionEnvironment for batch jobs), which is used to set
    // execution parameters and to create sources that read from external systems.
    StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
    // Add the Wikipedia edits source
    DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

    // keyBy groups events belonging to the same user name together
    KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
      .keyBy(new KeySelector<WikipediaEditEvent, String>() {
        @Override
        public String getKey(WikipediaEditEvent event) {
          return event.getUser();
        }
      });
    // Aggregate the byte diffs accumulated within each 5-second window
    DataStream<Tuple2<String, Long>> result = keyedEdits
      .timeWindow(Time.seconds(5))
      .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
          acc.f0 = event.getUser();
          acc.f1 += event.getByteDiff();
          return acc;
        }
      });

    result.print();
    // Execute the job
    see.execute();
  }
}
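
Note that fold on windows is deprecated as of Flink 1.8. The job above still runs, but an equivalent sketch using the non-deprecated AggregateFunction API (same keyed stream, same window, same output type; this is an alternative, not part of the original tutorial) looks like the following. It additionally requires import org.apache.flink.api.common.functions.AggregateFunction;:

DataStream<Tuple2<String, Long>> result = keyedEdits
  .timeWindow(Time.seconds(5))
  .aggregate(new AggregateFunction<WikipediaEditEvent, Tuple2<String, Long>, Tuple2<String, Long>>() {
    @Override
    public Tuple2<String, Long> createAccumulator() {
      // the accumulator holds (user, running byte count)
      return new Tuple2<>("", 0L);
    }

    @Override
    public Tuple2<String, Long> add(WikipediaEditEvent event, Tuple2<String, Long> acc) {
      // fold one edit event into the accumulator
      return new Tuple2<>(event.getUser(), acc.f1 + event.getByteDiff());
    }

    @Override
    public Tuple2<String, Long> getResult(Tuple2<String, Long> acc) {
      return acc;
    }

    @Override
    public Tuple2<String, Long> merge(Tuple2<String, Long> a, Tuple2<String, Long> b) {
      // only needed when window panes are merged, e.g. with session windows
      return new Tuple2<>(a.f0, a.f1 + b.f1);
    }
  });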

Run this example with Maven, either from your IDE or on the command line:

$ mvn clean package  # build the project
$ mvn exec:java -Dexec.mainClass=wikiedits.WikipediaAnalysis  # run the main class

The output should look similar to the following (the leading "n>" is the index of the parallel subtask that emitted the record):

1> (Fenix down,114)
6> (AnomieBOT,155)
8> (BD2412bot,-3690)
7> (IgnorantArmies,49)
3> (Ckh3111,69)
5> (Slade360,0)
7> (Narutolovehinata5,2195)
6> (Vuyisa2001,79)
4> (Ms Sarah Welch,269)
4> (KasparBot,-245)

 

4. Submitting to a Flink Cluster and Writing the Results to Kafka

1. Continuing with the project above, add the Kafka connector dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

2. In the WikipediaAnalysis class, add the following code (it can replace result.print()):

result
    .map(new MapFunction<Tuple2<String,Long>, String>() {
        @Override
        public String map(Tuple2<String, Long> tuple) {
            return tuple.toString();
        }
    })
    .addSink(new FlinkKafkaProducer011<>("localhost:9092", "wiki-result", new SimpleStringSchema()));
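
As a variation (a sketch, not part of the original tutorial), FlinkKafkaProducer011 can also be constructed with explicit producer Properties instead of the broker-list string, which is convenient when you need to tune Kafka client settings. This additionally requires import java.util.Properties;:

// Configure the Kafka producer explicitly instead of using the broker-list shorthand
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");

result
    .map(new MapFunction<Tuple2<String, Long>, String>() {
        @Override
        public String map(Tuple2<String, Long> tuple) {
            return tuple.toString();
        }
    })
    .addSink(new FlinkKafkaProducer011<>("wiki-result", new SimpleStringSchema(), props));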

3. Import the required classes:

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.functions.MapFunction;

4. Build the project with Maven; the resulting jar will be in the target subfolder: target/wiki-edits-0.1.jar

$ mvn clean package

5. Create the Kafka topic (the name must match the topic the producer writes to, wiki-result):

$ cd my/kafka/directory
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wiki-result
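
You can verify that the topic exists (assuming the same ZooKeeper-based Kafka tooling as above):

$ bin/kafka-topics.sh --list --zookeeper localhost:2181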

6. Start Flink and submit the jar to the cluster:

$ cd my/flink/directory
$ bin/start-cluster.sh

$ bin/flink run -c wikiedits.WikipediaAnalysis path/to/wiki-edits-0.1.jar

A healthy run produces log output like the following:

03/08/2016 15:09:27 Job execution switched to status RUNNING.
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to SCHEDULED
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to DEPLOYING
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to SCHEDULED
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to DEPLOYING
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to RUNNING
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to RUNNING
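
You can also confirm from the command line that the job is running, using the standard Flink CLI:

$ bin/flink list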

7. Inspect the output with a Kafka console consumer:

$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wiki-result
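
On newer Kafka releases, where the --zookeeper flag has been removed, the equivalent command (assuming the broker listens on localhost:9092) is:

$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wiki-result --from-beginning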

You can also monitor the job through the Flink dashboard running at http://localhost:8081.
Reference: https://www.t9vg.com/archives/469
