I. Flink Basics
At its core, Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can run both batch and stream processing programs.
II. Environment
Scala, Flink, Kafka, Hadoop
III. Main Code
1. Initialize the Flink streaming execution environment
// Initialize the Flink streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Set the parallelism
env.setParallelism(1);
2. Create the data source
// Create the data source
Properties properties = new Properties();
// Kafka broker address
properties.setProperty("bootstrap.servers", "node01:9092");
// ZooKeeper address
properties.setProperty("zookeeper.connect", "node01:2181");
// Consumer group id
properties.setProperty("group.id", "test");
// key.deserializer and value.deserializer specify how the key and value are deserialized
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// When reading from Kafka, a consumption strategy should be specified; if none is given, auto.offset.reset applies:
// earliest: if a partition has a committed offset, consume from that offset;
//           if there is no committed offset, consume from the beginning;
// latest:   if a partition has a committed offset, consume from that offset;
//           if there is no committed offset, consume only newly produced data in that partition;
// none:     if every partition of the topic has a committed offset, consume from those offsets;
//           if any partition has no committed offset, throw an exception
properties.setProperty("auto.offset.reset", "earliest");
// enable.auto.commit defaults to true, i.e. offsets are committed automatically
properties.setProperty("enable.auto.commit", "true");
3. Define the Kafka consumer instance
// Define the Kafka consumer instance; when reading from Kafka, a String schema (SimpleStringSchema) is provided by default
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("test", new SimpleStringSchema(), properties);
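Besides auto.offset.reset, the consumer also exposes explicit start-position methods that override the committed group offsets; a short sketch:
// Explicitly choose where to start reading (this overrides the committed group offsets)
kafkaConsumer.setStartFromEarliest();        // read the topic from the beginning
//kafkaConsumer.setStartFromLatest();        // or read only newly arriving records
//kafkaConsumer.setStartFromGroupOffsets();  // default: resume from the committed group offsets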
4. Process the data
Common Flink data stream types:
(1) DataStream: transformations on a DataStream are applied record by record, e.g. map(), flatMap(), filter()
DataStream<String> stream = env.addSource(kafkaConsumer);
stream.map(new ParseAndWriteData()).name("map-parseAndWriteData");
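For completeness, a small sketch of the other record-at-a-time transformations listed above (Collector is org.apache.flink.util.Collector, Types is org.apache.flink.api.common.typeinfo.Types; the comma-splitting logic is purely illustrative):
// filter(): keep only non-empty records
DataStream<String> nonEmpty = stream.filter(s -> s != null && !s.isEmpty());
// flatMap(): emit zero or more records per input record, e.g. split a line into fields
DataStream<String> fields = nonEmpty
        .flatMap((String line, Collector<String> out) -> {
            for (String field : line.split(",")) {
                out.collect(field);
            }
        })
        .returns(Types.STRING); // type hint needed because the lambda's output type is erased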
(2) WindowedStream & AllWindowedStream:
AllWindowedStream<Tuple2<JsonObject, List<SourceData>>, TimeWindow> sourcedatalist
        = env.addSource(kafkaConsumer)
        .flatMap(new FilebeatFlatMap()).name("get filebeat message")
        .flatMap(new FieldsFlatMap()).name("get fields")
        .timeWindowAll(Time.seconds(10))
        .trigger(new CountTriggerWithTimeout<>(1000, TimeCharacteristic.ProcessingTime));
// timeWindowAll: tumbling 10-second windows; the custom trigger emits the window early once 1000 records have arrived, otherwise when the window's time is up
5. Print the output
stream.print();
6. Launch the job
env.execute();
IV. Complete Code: Incremental Program (Partial)
Main dependencies
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
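Since the job also writes to HBase (sections IV and V), the HBase client is needed on the classpath as well; the version below is only an example and should match your cluster:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <!-- example version; use the one matching your HBase cluster -->
    <version>1.2.6</version>
</dependency>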
Flink2HbaseDriver.class
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;
public class Flink2HbaseDriver {
    public static void main(String[] args) throws Exception {
        // TODO 1) Initialize the Flink streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // TODO 2) Create the data source
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "node01:9092");
        properties.setProperty("zookeeper.connect", "node01:2181");
        properties.setProperty("group.id", "test");
        properties.setProperty("flink.partition-discovery.interval-millis", "30000");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "earliest");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer<String> consumer1 = new FlinkKafkaConsumer<>("topic-name", new SimpleStringSchema(), properties);
        //consumer1.setStartFromEarliest();
        DataStreamSource<String> stream = env.addSource(consumer1);
        stream.map(new ParseAndWriteData()).name("map-parseAndWriteData");
        stream.print();
        env.execute();
    }
}
ParseAndWriteData.class
import org.apache.flink.api.common.functions.RichMapFunction;

public class ParseAndWriteData extends RichMapFunction<String, String> {
    @Override
    public String map(String s) {
        // Only treat the record as usable when it is neither null nor empty
        if (s == null || s.isEmpty() || s.equals("[]")) {
            System.out.println("s is null or empty");
        } else {
            System.out.println("s: " + s);
        }
        return "1";
    }
}
V. Complete Code: Base-Table Incremental Program
FlinkHbaseDriver.class
public class FlinkHbaseDriver {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // TODO 2) Create the data source
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "node01:9092");
        properties.setProperty("zookeeper.connect", "node01:2181");
        properties.setProperty("group.id", "test");
        properties.setProperty("flink.partition-discovery.interval-millis", "30000");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "earliest");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer<String> consumer1 = new FlinkKafkaConsumer<>("topic-name", new SimpleStringSchema(), properties);
        AllWindowedStream<Tuple2<JsonObject, List<SourceData>>, TimeWindow> sourcedatalist
                = env.addSource(consumer1)
                .flatMap(new FilebeatFlatMap()).name("get filebeat message")
                .flatMap(new FieldsFlatMap()).name("get fields")
                .timeWindowAll(Time.seconds(10))
                .trigger(new CountTriggerWithTimeout<>(1000, TimeCharacteristic.ProcessingTime));
        // hbase_ip / hbase_port: HBase connection info (definition omitted in the original)
        sourcedatalist.apply(new JSONALLWindow())
                .addSink(new HbaseSink(hbase_ip, hbase_port)).name("put hbase");
        env.execute();
    }
}
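HbaseSink is a custom sink whose implementation is not shown here. Below is a minimal, simplified sketch of what such a sink could look like: it writes plain String records, and the table name "demo_table", column family "cf", and column "data" are assumptions for illustration (the real sink receives the output of JSONALLWindow rather than String):
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Simplified sketch: writes each incoming String as one cell into a fixed HBase table
public class HbaseSink extends RichSinkFunction<String> {

    private final String zkQuorum;   // HBase ZooKeeper quorum, e.g. "node01"
    private final String zkPort;     // ZooKeeper client port, e.g. "2181"
    private transient Connection connection;
    private transient Table table;

    public HbaseSink(String zkQuorum, String zkPort) {
        this.zkQuorum = zkQuorum;
        this.zkPort = zkPort;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Build the HBase connection once per parallel sink instance
        org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", zkQuorum);
        conf.set("hbase.zookeeper.property.clientPort", zkPort);
        connection = ConnectionFactory.createConnection(conf);
        table = connection.getTable(TableName.valueOf("demo_table")); // table name is an assumption
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        // Row key derived from the record itself here; real code would build a proper key
        Put put = new Put(Bytes.toBytes(String.valueOf(value.hashCode())));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("data"), Bytes.toBytes(value));
        table.put(put);
    }

    @Override
    public void close() throws Exception {
        if (table != null) {
            table.close();
        }
        if (connection != null) {
            connection.close();
        }
    }
}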
FilebeatFlatMap.class
public class FilebeatFlatMap implements FlatMapFunction<String, Tuple2<String, List<SourceData>>> {
    @Override
    public void flatMap(String s, Collector<Tuple2<String, List<SourceData>>> collector) {
        System.out.println(s);
        ......
        collector.collect(new Tuple2<>(tablename, list));
    }
}
FieldsFlatMap.class
public class FieldsFlatMap implements FlatMapFunction<Tuple2<String,List<SourceData>>,Tuple2<JsonObject,List<SourceData>>>
{
...
}
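CountTriggerWithTimeout is likewise a custom class whose source is not shown. A minimal sketch of a trigger with the behaviour described above (fire and purge the window once maxCount records have arrived, or when the window's end time is reached) could look as follows; the state handling and timer choices are assumptions:
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Sketch of a "count or timeout" trigger: fire when maxCount elements arrive,
// or when the window's end timestamp is reached.
public class CountTriggerWithTimeout<T> extends Trigger<T, TimeWindow> {

    private final long maxCount;
    private final TimeCharacteristic timeCharacteristic;

    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", new Sum(), LongSerializer.INSTANCE);

    public CountTriggerWithTimeout(long maxCount, TimeCharacteristic timeCharacteristic) {
        this.maxCount = maxCount;
        this.timeCharacteristic = timeCharacteristic;
    }

    @Override
    public TriggerResult onElement(T element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        // Register a timer for the window end so the window also fires on timeout
        if (timeCharacteristic == TimeCharacteristic.ProcessingTime) {
            ctx.registerProcessingTimeTimer(window.maxTimestamp());
        } else {
            ctx.registerEventTimeTimer(window.maxTimestamp());
        }
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE_AND_PURGE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return timeCharacteristic == TimeCharacteristic.ProcessingTime
                ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return timeCharacteristic == TimeCharacteristic.EventTime
                ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
        if (timeCharacteristic == TimeCharacteristic.ProcessingTime) {
            ctx.deleteProcessingTimeTimer(window.maxTimestamp());
        } else {
            ctx.deleteEventTimeTimer(window.maxTimestamp());
        }
    }

    // ReduceFunction used to keep a running element count in partitioned state
    private static class Sum implements ReduceFunction<Long> {
        @Override
        public Long reduce(Long a, Long b) {
            return a + b;
        }
    }
}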
VI. Why a Flink Program Shows No Data
1. Check whether the Kafka topic actually contains data;
2. The consumer group (group.id) may already have consumed the topic, so switch to a new group.id;
3. Check the Flink program itself to see whether the records are being dropped by a filter condition;
And that's a wrap~~~~~ (0 o 0)