Preface
Stream computing may not come up often in day-to-day work. It is mostly used to compute metrics such as PV and UV over some period, and it is very common in risk control, for example counting the total number of orders a user places from the same region within one day to decide whether that user is anomalous. It also shows up in big-data processing, such as transforming the logs generated over a period of time and loading them into a storage DB to serve query reports. Why learn Flink? Recently I ran into some real-time computing performance problems, and I also did not really understand how real-time computing is implemented under the hood, so I am taking the currently popular open-source engine Flink as the subject of study and digging into its internals step by step to explore how real-time computing works.
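To make the risk-control example above concrete, here is a minimal sketch of how a per-user, per-region daily order count could be expressed with Flink's DataStream API. The in-memory order stream, the field layout (userId, region, orderCount) and the class name are assumptions made purely for illustration; in a real job the events would come from a message source such as Kafka, and the window semantics (processing time by default here) would need to match the business requirement.

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class OrderCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Hypothetical order events: (userId, region, orderCount)
        DataStream<Tuple3<String, String, Integer>> orders = env.fromElements(
                Tuple3.of("user-1", "hangzhou", 1),
                Tuple3.of("user-1", "hangzhou", 1),
                Tuple3.of("user-2", "beijing", 1));
        // Key by (userId, region) and sum the order count over a one-day tumbling window
        orders.keyBy(0, 1)
                .timeWindow(Time.days(1))
                .sum(2)
                .print();
        env.execute("per-user per-region daily order count");
    }
}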
The First Program
First, import the Maven dependencies; the main ones are as follows:
<properties>
    <blink.version>1.5.1</blink.version>
    <scala.binary.version>2.11</scala.binary.version>
    <blink-streaming.version>1.5.1</blink-streaming.version>
    <log4j.version>1.2.17</log4j.version>
    <slf4j-log4j.version>1.7.9</slf4j-log4j.version>
</properties>
<dependencies>
    <dependency>
        <groupId>com.alibaba.blink</groupId>
        <artifactId>flink-core</artifactId>
        <version>${blink.version}</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba.blink</groupId>
        <artifactId>flink-clients_${scala.binary.version}</artifactId>
        <version>${blink.version}</version>
    </dependency>
    <!-- blink stream java -->
    <dependency>
        <groupId>com.alibaba.blink</groupId>
        <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
        <version>${blink-streaming.version}</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba.blink</groupId>
        <artifactId>flink-test-utils-junit</artifactId>
        <version>${blink.version}</version>
        <scope>test</scope>
    </dependency>
    <!-- logging framework -->
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>${log4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>${slf4j-log4j.version}</version>
    </dependency>
</dependencies>
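One note on the logging dependencies: with slf4j-log4j12 on the classpath, log output only appears if a log4j.properties file is also on the classpath (typically under src/main/resources). A minimal console configuration, shown here only as an assumed starting point, could look like this:

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n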
The dependencies introduced here are fairly minimal: only the core Flink packages plus a logging framework. Next, let's use the Flink API to write the first Hello World program. I am using the official Flink WordCount demo; the code is as follows:
package com.alibaba.security.blink;
import com.alibaba.security.blink.util.WordCountData;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.util.Collector;
public class WordCount {

    private final ParameterTool params;
    private final ExecutionEnvironment env;

    public WordCount(String[] args) {
        // parse command-line arguments and register them as global job parameters
        // so they are accessible from the runtime configuration
        this.params = ParameterTool.fromArgs(args);
        this.env = ExecutionEnvironment.createLocalEnvironment();
        env.getConfig().setGlobalJobParameters(params);
    }

    public static void main(String[] args) throws Exception {
        WordCount wordCount = new WordCount(args);
        DataSet<String> dataSet = wordCount.getDataSetFromCommandLine();
        wordCount.executeFrom(dataSet);
    }

    private DataSet<String> getDataSetFromCommandLine() {
        DataSet<String> text;
        if (params.has("input")) {
            // read the text file passed via --input
            text = env.readTextFile(params.get("input"));
        } else {
            // fall back to the built-in sample text
            System.out.println("Executing WordCount example with default input data set.");
            System.out.println("Use --input to specify file input.");
            text = WordCountData.getDefaultTextLineDataSet(env);
        }
        return text;
    }

    private void executeFrom(DataSet<String> text) throws Exception {
        DataSet<Tuple2<String, Integer>> counts =