In this guide we will start from scratch and go from setting up a Flink project to running a streaming analysis program on a Flink cluster.
Wikipedia provides an IRC channel where all edits to the wiki are logged. We are going to read this channel in Flink and count the number of bytes that each user edits within a given window of time. This is easy enough to implement in a few minutes using Flink, but it will give you a good foundation from which to start building more complex analysis programs on your own.
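To make the goal concrete before any Flink setup, here is a minimal plain-Java sketch of the same computation: summing the edited bytes per user inside fixed, non-overlapping windows. It does not use Flink at all. The class WindowedByteCount, the Edit holder, the sample data, and the 5-second window size are all assumptions made up for illustration, and the sketch buckets events by a timestamp carried on each event, which is a simplification of how a streaming engine assigns windows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowedByteCount {

    /** Hypothetical stand-in for one edit event: who edited, when, and by how many bytes. */
    public static final class Edit {
        public final String user;
        public final long timestampMillis;
        public final long byteDiff;

        public Edit(String user, long timestampMillis, long byteDiff) {
            this.user = user;
            this.timestampMillis = timestampMillis;
            this.byteDiff = byteDiff;
        }
    }

    /**
     * Groups edits into fixed windows of windowMillis and sums byteDiff per user
     * inside each window. Outer key: window start time; inner key: user name.
     */
    public static Map<Long, Map<String, Long>> bytesPerUserPerWindow(List<Edit> edits, long windowMillis) {
        Map<Long, Map<String, Long>> result = new TreeMap<>();
        for (Edit e : edits) {
            // Round the timestamp down to the start of the window it falls into.
            long windowStart = e.timestampMillis - Math.floorMod(e.timestampMillis, windowMillis);
            result.computeIfAbsent(windowStart, k -> new TreeMap<>())
                  .merge(e.user, e.byteDiff, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit("alice", 1_000L, 120L));
        edits.add(new Edit("bob", 2_500L, -30L));
        edits.add(new Edit("alice", 4_900L, 80L));
        edits.add(new Edit("alice", 5_100L, 10L));

        // prints {0={alice=200, bob=-30}, 5000={alice=10}}
        System.out.println(bytesPerUserPerWindow(edits, 5_000L));
    }
}
```

The first three edits fall into the window starting at 0 ms and the last one into the window starting at 5000 ms, so alice appears once per window with a separate total.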
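Environment: Windows 7, JDK 1.8, apache-maven-3.6.0
1. Create the Maven project
cd into Maven's bin folder:
D:\Java1.8\apache-maven-3.6.0\bin
Enter the command below. Paste it into Notepad first and join it onto a single line, so that no stray line breaks end up between the parameters: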
mvn archetype:generate -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-java -DarchetypeVersion=1.8.0 -DgroupId=wiki-edits -DartifactId=wiki-edits -Dversion=0.1 -Dpackage=wikiedits -DinteractiveMode=false
The step circled in red in the screenshot takes a long time to finish.
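2. Edit the project files
In Eclipse, use File > Import to import an existing Maven project, and select the wiki-edits project you generated with Maven above.
The resulting project layout is as follows:
Replace the dependencies section of pom.xml with the code below.
Note that this is a replacement, not an addition. I kept appending the new dependency, or keeping the logging dependencies, and the program never ran successfully. The official instructions say: "As a last step we need to add the Flink Wikipedia connector as a dependency so that we can use it in our program. Edit the dependencies section of the pom.xml so that it looks like this:"
The key phrase is "so that it looks like this".
So it is a straight replacement. My fundamentals are weak, so for now I can only copy the official version without being able to adapt it. As for why the auto-generated dependencies have to be deleted and replaced rather than simply appended to, I can only ask any expert who reads this to leave a comment explaining it.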
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-wikiedits_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>
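Then rename BatchJob.java to WikipediaAnalysis.java directly in Eclipse. As for why the rename should be done in Eclipse: my guess is that deleting the file and adding a new WikipediaAnalysis file leaves some references not cleanly removed. In any case I tried it both ways, and renaming is the one that ended up without errors. The final result is as follows.
Replace the BatchJob code with the following WikipediaAnalysis code: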
package wikiedits;

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;

public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: reads edit events from the Wikipedia IRC channel.
        DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

        // Key the stream by user name so that each user is aggregated separately.
        KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
            .keyBy(new KeySelector<WikipediaEditEvent, String>() {
                @Override
                public String getKey(WikipediaEditEvent event) {
                    return event.getUser();
                }
            });

        // For every 5-second window, sum the byte difference of each user's edits.
        DataStream<Tuple2<String, Long>> result = keyedEdits
            .timeWindow(Time.seconds(5))
            .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
                    acc.f0 = event.getUser();
                    acc.f1 += event.getByteDiff();
                    return acc;
                }
            });

        // Print each (user, bytes) pair and start the job.
        result.print();
        see.execute();
    }
}
Done.
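The fold call in the program above starts from the initial value ("", 0L) and passes each event of one user's window through the function in turn, carrying the accumulator along. A minimal plain-Java sketch of that accumulation pattern, with no Flink involved, might look like the following. FoldSketch, Edit, UserBytes, fold, and sumWindow are hypothetical names invented for this illustration, and the list simply stands in for the events of one user in one window.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

public class FoldSketch {

    /** Hypothetical stand-in for WikipediaEditEvent: only the two fields the fold reads. */
    public static final class Edit {
        public final String user;
        public final long byteDiff;

        public Edit(String user, long byteDiff) {
            this.user = user;
            this.byteDiff = byteDiff;
        }
    }

    /** Hypothetical mutable pair playing the role of Tuple2<String, Long>. */
    public static final class UserBytes {
        public String user;
        public long bytes;

        public UserBytes(String user, long bytes) {
            this.user = user;
            this.bytes = bytes;
        }
    }

    /** Left fold: start from initial, then apply step to the accumulator and each item in order. */
    public static <A, T> A fold(A initial, List<T> items, BiFunction<A, T, A> step) {
        A acc = initial;
        for (T item : items) {
            acc = step.apply(acc, item);
        }
        return acc;
    }

    /** Same accumulation logic as the tutorial's FoldFunction, applied to one user's window. */
    public static UserBytes sumWindow(List<Edit> window) {
        return fold(new UserBytes("", 0L), window, (acc, edit) -> {
            acc.user = edit.user;       // remember whose edits these are
            acc.bytes += edit.byteDiff; // add this edit's byte difference
            return acc;
        });
    }

    public static void main(String[] args) {
        UserBytes result = sumWindow(Arrays.asList(
            new Edit("alice", 120L), new Edit("alice", 80L), new Edit("alice", 10L)));
        // prints alice -> 210
        System.out.println(result.user + " -> " + result.bytes);
    }
}
```

The empty user name in the initial value is only a placeholder; it is overwritten by the first event that reaches the accumulator.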
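3. Run
1) Prerequisite: the Flink service must already be running.
In cmd, enter D:\Java1.8\flink-1.6.2\bin\start-cluster.bat to start Flink.
2) Run the program
Still in cmd, change into the project directory:
Then build and run it as follows: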
mvn clean package
Once it reports BUILD SUCCESS, enter
mvn exec:java -Dexec.mainClass=wikiedits.WikipediaAnalysis
and that's it.
References:
https://ci.apache.org/projects/flink/flink-docs-stable/tutorials/datastream_api.html
https://ci.apache.org/projects/flink/flink-docs-release-1.4/quickstart/java_api_quickstart.html
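The project code is packaged for download:
The dependency packages are included as well. If you download it, just unzip it into D:\Java1.8\apache-maven-3.6.0\bin and go straight to section 3 (Run) above.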