Using Flink DataStream

License: CC 4.0 BY-SA

Tags: #flink

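This article takes an in-depth look at Flink stream processing, covering the core stages of data sources, data transformations, and data sinks. It explains how to use the DataStream API, including custom SourceFunction, ParallelSourceFunction, and RichParallelSourceFunction implementations, and shows how to read from files, sockets, and collections and how to consume Kafka data. It also provides example code such as MySQLSource, ScalikeJDBCMySQLSource, FlinkKafkaConsumer, and writing data to Kafka, MySQL, and Redis.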

This chapter introduces common uses of the Flink DataStream API, organized around three topics: DataSources, operate (transformations), and DataSinks.

DataSources

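Introduction to SourceFunction

A source is attached to a job through env.addSource, which takes a SourceFunction; SourceFunction itself extends the Function interface. SourceFunction is the top-level interface for all streaming data sources, and custom sources can be implemented on top of it. Flink offers three options:

SourceFunction: does not support parallel execution (parallelism is fixed at 1); not used much in practice
ParallelSourceFunction: supports parallelism greater than 1
RichParallelSourceFunction: recommended for production use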

Stream Sources

File-based

Mainly used to read data from files:

readTextFile
readFile
Note that these APIs are unlikely to be used in production streaming scenarios.
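As a minimal sketch of the file-based source, the snippet below reads a text file line by line with readTextFile. The object name FileSourceApp, the helper lines, and the path data/access.log are placeholders rather than names from the original project, and the Scala DataStream API of a Flink 1.x release is assumed.

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical demo object; not part of the original project
object FileSourceApp {

  // Builds a stream with one String element per line of the file
  def lines(env: StreamExecutionEnvironment, path: String): DataStream[String] =
    env.readTextFile(path)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Placeholder path; point it at a real file or directory
    lines(env, "data/access.log").print()
    env.execute("FileSourceApp")
  }
}
```

With readTextFile the file is read once, so the job finishes when the input is exhausted; this is one reason the file-based sources are mostly useful for local experiments.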

Socket-based

Reads data arriving on a socket; use socketTextStream.
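A socket source can be sketched as follows. The host localhost, the port 9999, and the object name SocketSourceApp are assumptions chosen for illustration; any reachable text socket would work.

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical demo object; host and port are placeholders
object SocketSourceApp {

  // Each newline-terminated line received on the socket becomes one element
  def lines(env: StreamExecutionEnvironment, host: String, port: Int): DataStream[String] =
    env.socketTextStream(host, port)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    lines(env, "localhost", 9999).print()
    env.execute("SocketSourceApp")
  }
}
```

For a quick local test, a tool such as `nc -lk 9999` can act as the server side; the connection is only opened once the job is executed.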

Collection-based

Generally used to fabricate data for testing; see the code in CollectionSourceApp

fromCollection(Seq)
fromCollection(Iterator)
fromElements(elements: _*)
fromParallelCollection(SplittableIterator) (rarely used)
generateSequence(from, to) (rarely used)
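CollectionSourceApp itself is not shown in this section, so the following is only a sketch of what such a test program might look like. It uses plain String and Int elements instead of the project's own types, and the object name CollectionSourceSketch and its helper methods are hypothetical.

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical stand-in for the CollectionSourceApp mentioned in the text
object CollectionSourceSketch {

  // fromCollection(Seq): turns an in-memory sequence into a stream
  def fromSeq(env: StreamExecutionEnvironment): DataStream[String] =
    env.fromCollection(List("ruozedata.com", "zhibo8.cc", "dongqiudi.com"))

  // fromElements(elements: _*): varargs variant for a handful of values
  def fromValues(env: StreamExecutionEnvironment): DataStream[Int] =
    env.fromElements(1, 2, 3)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    fromSeq(env).print()
    fromValues(env).map(_ * 10).print()
    env.execute("CollectionSourceSketch")
  }
}
```

The wildcard import of org.apache.flink.streaming.api.scala._ matters here: it supplies the implicit TypeInformation that these methods require.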

Custom

A new source function can be added with addSource; for example, addSource(new FlinkKafkaConsumer08<>(…)) reads data from Kafka.
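The sketch below shows one way to wire a Kafka consumer into addSource. It swaps in the universal FlinkKafkaConsumer (from the flink-connector-kafka dependency of a Flink 1.x release) rather than the version-specific FlinkKafkaConsumer08 named above. The broker address, group id, topic name access-topic, and object name KafkaSourceSketch are all assumptions.

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

// Hypothetical demo object; broker, group id and topic are placeholders
object KafkaSourceSketch {

  // Builds a consumer that deserializes each Kafka record value as a String
  def consumer(topic: String): FlinkKafkaConsumer[String] = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // assumed broker address
    props.setProperty("group.id", "flink-demo")              // assumed consumer group
    new FlinkKafkaConsumer[String](topic, new SimpleStringSchema(), props)
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.addSource(consumer("access-topic")).print()
    env.execute("KafkaSourceSketch")
  }
}
```

The Kafka consumer is itself a parallel source, so its parallelism can be raised, typically up to the number of partitions of the topic.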

Basic usage of SourceFunction

SourceFunction & ParallelSourceFunction

Example code

AccessSourceFunction.scala:

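import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.util.Random

/**
  * Custom SourceFunction; its parallelism can only be 1.
  * For a custom ParallelSourceFunction, change `extends SourceFunction` to
  * `extends ParallelSourceFunction`; the rest of the code stays the same.
  *
  * Access is a case class defined elsewhere in the project; it is built here
  * from a timestamp, a domain, and a random integer.
  *
  * @Author: huhu
  * @Date: 2020-03-07 21:15
  */
class AccessSourceFunction extends SourceFunction[Access] {

  // Written by cancel() from another thread, hence @volatile
  @volatile var running = true

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val random = new Random()
    val domains = Array("ruozedata.com", "zhibo8.cc", "dongqiudi.com")

    // Simulate data generation: emit 10 records every 5 seconds
    while (running) {
      val timestamp = System.currentTimeMillis()
      // foreach rather than map: only the side effect is needed
      1.to(10).foreach { x =>
        ctx.collect(Access(timestamp, domains(random.nextInt(domains.length)), random.nextInt(1000 + x)))
      }
      Thread.sleep(5000)
    }
  }

  override def cancel(): Unit = {
    running = false
  }
}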

SourceFunctionApp.scala:

import org.apache.flink.streaming.api.scala._

/**
  * @Author: huhu
  * @Date: 2020-03-07 21:22
  */
object SourceFunctionApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A plain SourceFunction only supports parallelism 1; a larger value raises an error
    env.addSource(new AccessSourceFunction).setParallelism(1).print()
    // A RichParallelSourceFunction may run with parallelism greater than 1
    env.addSource(new AccessRichParallelSourceFunction).setParallelism(3).print()

    env.execute(this.getClass.getSimpleName)
  }
}

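Detailed explanation

If the parallelism of a plain SourceFunction is set to a value greater than 1, Flink raises an error when the job is built.
The source code of DataStreamSource shows why: setParallelism checks the requested parallelism and rejects any value other than 1 for a non-parallel source.

For ParallelSourceFunction, simply change extends SourceFunction to extends ParallelSourceFunction; the rest of the code needs no changes.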

RichParallelSourceFunction

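Unlike SourceFunction and ParallelSourceFunction, RichParallelSourceFunction extends the abstract class AbstractRichFunction, so it has lifecycle methods.
See the code in AccessRichParallelSourceFunction, which overrides open and close; open runs once per task, that is, once for each parallel instance of the source.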
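AccessRichParallelSourceFunction is not listed in this section, so the following is a minimal sketch of what it might look like, not the author's actual code. The field names of the Access case class, the one-second emit interval, and the isRunning helper are assumptions, and the open(Configuration) signature of Flink 1.x is assumed.

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

import scala.util.Random

// Assumed shape of the Access record; the real definition is not shown in the text
case class Access(time: Long, domain: String, traffic: Int)

class AccessRichParallelSourceFunction extends RichParallelSourceFunction[Access] {

  // Written by cancel() from another thread, hence @volatile
  @volatile private var running = true

  // Hypothetical helper, added only to make the flag observable
  def isRunning: Boolean = running

  // Called once per parallel instance before run(); a good place to open connections
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    println(s"open, subtask index = ${getRuntimeContext.getIndexOfThisSubtask}")
  }

  // Called once per parallel instance when the source shuts down
  override def close(): Unit = {
    super.close()
    println("close")
  }

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val random = new Random()
    val domains = Array("ruozedata.com", "zhibo8.cc", "dongqiudi.com")
    while (running) {
      ctx.collect(Access(System.currentTimeMillis(), domains(random.nextInt(domains.length)), random.nextInt(1000)))
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    running = false
  }
}
```

Run with setParallelism(3), a source like this should print the open message three times, once per subtask, which matches the "one task, one call" behaviour described above.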

Implementing MySQLSource

Approach

Implement MySQLSource with plain JDBC; see the code in MySQLSource
Implement MySQLSource with ScalikeJDBC, which is somewhat more elegant; see the code in ScalikeJDBCMySQLSource

Example code

MySQLSource.scala:

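import java.sql.{Connection, PreparedStatement}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

class MySQLSource extends RichSourceFunction[Student] {

  // A `_` default initializer requires an explicit type annotation;
  // without a declared type it cannot be used
  var connection: Connection = _
  var pstmt: PreparedStatement = _

  /**
    * Open the connection in open()
    * @param parameters
    */
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)

    connection = MySQLUtils.getConnection()
    pstmt = connection.prepareStatement("select * from student")
  }

  /**
    * Release the connection
    */
  override def close(): Unit = {
    super.close()

    MySQLUtils.release(connection, pstmt)
  }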

  override def run(ctx: SourceFunction.SourceContext[Student