Data Sink
Data sinks consume the records of a DataStream and write them out to external systems, for example files, sockets, NoSQL stores, relational databases, or message queues. Flink ships with a number of predefined sinks, and users can also implement their own data sinks by extending SinkFunction or RichSinkFunction.
File Based
- writeAsText() / writeAsCsv(…) / writeUsingOutputFormat(), processing semantics: at-least-once (a writeAsText() sketch follows the example below)
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val output = new CsvOutputFormat[Tuple2[String, Int]](new Path("file:///D:/fink-results"))
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.map(t=> new Tuple2(t._1,t._2)) //convert to Flink's Java Tuple2, as required by CsvOutputFormat
.writeUsingOutputFormat(output)
fsEnv.execute("FlinkWordCountsQuickStart")
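writeAsText() and writeAsCsv(…) take an output path directly instead of an OutputFormat. A minimal sketch of the same pipeline writing plain text; the output path is illustrative:
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data and write it as plain text (at-least-once; path is illustrative)
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.writeAsText("file:///D:/flink-text-results")
fsEnv.execute("FlinkWordCountsQuickStart")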
- Bucketing File Sink, processing semantics: exactly-once (see the checkpointing note after the example below)
Dependencies
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-filesystem_2.11</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.9.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.9.2</version>
</dependency>
Example code
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val bucketSink = new BucketingSink[String]("hdfs://Spark:9000/bucketSink")
bucketSink.setBucketer(new DateTimeBucketer("yyyy-MM-dd-HH", ZoneId.of("Asia/Shanghai"))) //one bucket directory per hour
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.map(t=>t._1+"\t"+t._2)
.addSink(bucketSink)
fsEnv.execute("FlinkWordCountsQuickStart")
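Note that the exactly-once guarantee of BucketingSink depends on checkpointing: pending part files are only finalized once a checkpoint completes, so checkpointing should be enabled when the environment is set up. A minimal sketch; the 5-second interval is an illustrative value:
//Enable checkpointing so the BucketingSink can commit its pending files (interval is illustrative)
fsEnv.enableCheckpointing(5000)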
print() / printToErr()
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print("error")
fsEnv.execute("FlinkWordCountsQuickStart")
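printToErr() behaves the same way but writes to standard error instead of standard output; the optional string argument of both methods is just a prefix attached to each printed record.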
Custom Sink
class UserDefineSink extends RichSinkFunction[(String,Int)]{
//open() is called once per parallel sink instance, e.g. to set up a connection
override def open(parameters: Configuration): Unit = {
println("open connection")
}
//invoke() is called for every record that reaches the sink
override def invoke(value: (String, Int)): Unit = {
println("insert record: " + value)
}
//close() is called when the sink shuts down
override def close(): Unit = {
println("close connection")
}
}
object FlinkUserDefineSink {
def main(args: Array[String]): Unit = {
// 1. Create the StreamExecutionEnvironment
val flinkEnv = StreamExecutionEnvironment.getExecutionEnvironment
// Use the user-defined data source
val dataStream : DataStream[String] = flinkEnv.addSource[String](
new UserDefineDataSource
)
dataStream
.flatMap(_.split("\\s+"))
.map((_, 1))
.keyBy(0)
.sum(1)
// Use the user-defined sink
.addSink(new UserDefineSink)
// Execute the job
flinkEnv.execute("FlinkWordCount")
}
}
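For quick experiments a full sink class is not required: assuming the Scala DataStream API's addSink overload that takes a function literal, a throwaway sink can be sketched as:
//The function literal is wrapped in a SinkFunction by the Scala API
dataStream
.flatMap(_.split("\\s+"))
.map((_, 1))
.keyBy(0)
.sum(1)
.addSink(value => println("sink: " + value))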
Redis Sink
Reference: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/
Dependencies
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
Example code
object FlinkRedisSink {
def main(args: Array[String]): Unit = {
// 1. Create the StreamExecutionEnvironment
val flinkEnv = StreamExecutionEnvironment.getExecutionEnvironment
val flinkJedis = new FlinkJedisPoolConfig.Builder().setHost("Spark").setPort(6379).build()
// 2. Create the DataStream
val dataStream : DataStream[String] = flinkEnv.socketTextStream("Spark", 6666)
// 3. Transform the data
dataStream
.flatMap(_.split("\\s+"))
.map((_, 1))
.keyBy(0)
.sum(1)
// Store the results in Redis
.addSink(new RedisSink(flinkJedis, new UserDefineRedisMapper))
// Execute the job
flinkEnv.execute("FlinkWordCount")
}
}
class UserDefineRedisMapper extends RedisMapper[(String,Int)]{
//Write the results with HSET; "word-count" is the name of the Redis hash
override def getCommandDescription: RedisCommandDescription = {
new RedisCommandDescription(RedisCommand.HSET, "word-count")
}
//The word becomes the hash field
override def getKeyFromData(t: (String, Int)): String = {
t._1
}
//The count becomes the hash value
override def getValueFromData(t: (String, Int)): String = {
t._2.toString
}
}
If Redis cannot be reached after installation, disable protected mode in redis.conf: protected-mode no
Kafka Sink
Dependencies
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>1.8.1</version>
</dependency>
Example code
object FlinkKafkaSink {
def main(args: Array[String]): Unit = {
// 1. Create the StreamExecutionEnvironment
val flinkEnv = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Create the DataStream
val dataStream : DataStream[String] = flinkEnv.socketTextStream("Spark", 6666)
val prop = new Properties()
prop.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "Spark:9092")
//Overriding the serializers is not recommended; the Flink Kafka producer uses ByteArraySerializer internally
prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer])
prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer])
prop.put(ProducerConfig.RETRIES_CONFIG, "3")
prop.put(ProducerConfig.ACKS_CONFIG, "-1")
prop.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
prop.put(ProducerConfig.BATCH_SIZE_CONFIG, "100")
prop.put(ProducerConfig.LINGER_MS_CONFIG, "500")
// 3. Transform the data
dataStream
.flatMap(_.split("\\s+"))
.map((_, 1))
.keyBy(0)
.sum(1)
// Use the Kafka sink ("flink" is the default target topic)
.addSink(new FlinkKafkaProducer[(String, Int)]("flink", new UserDefineKafkaSchema, prop))
// Execute the job
flinkEnv.execute("FlinkWordCount")
}
}
Custom serialization schema
class UserDefineKafkaSchema extends KeyedSerializationSchema[(String, Int)]{
//Bytes used as the Kafka record key
override def serializeKey(t: (String, Int)): Array[Byte] = {
t._1.getBytes()
}
//Bytes used as the Kafka record value
override def serializeValue(t: (String, Int)): Array[Byte] = {
t._2.toString.getBytes()
}
//The topic returned here overrides the default topic passed to the producer; null keeps the default
override def getTargetTopic(t: (String, Int)): String = {
"flink"
}
}
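The universal Kafka connector can also be asked for stronger delivery guarantees than the default at-least-once. A minimal sketch, assuming the FlinkKafkaProducer constructor that additionally takes a FlinkKafkaProducer.Semantic (exactly-once also requires checkpointing to be enabled and suitably sized Kafka transaction timeouts), replacing the addSink call in the example above:
//Request exactly-once delivery instead of the default AT_LEAST_ONCE
.addSink(new FlinkKafkaProducer[(String, Int)](
"flink",
new UserDefineKafkaSchema,
prop,
FlinkKafkaProducer.Semantic.EXACTLY_ONCE))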