Flink DataStream Data Sources and Parallelism

  • A Flink program executes in the following five steps:
    1. Obtain an execution environment;
    2. Load/create the initial data;
    3. Specify transformations on this data;
    4. Specify where to put the results of your computations;
    5. Trigger the program execution.
  • This post focuses on data sources (step 2); a minimal skeleton covering all five steps is sketched below.
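The skeleton below is a minimal sketch of those five steps; the socket host/port and the word-count transformations are placeholder choices for illustration, not part of the original post:

import org.apache.flink.streaming.api.scala._

object FiveStepsApp {

  def main(args: Array[String]): Unit = {
    // 1. Obtain an execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // 2. Load/create the initial data
    val source = env.socketTextStream("localhost", 9999)
    // 3. Specify transformations on this data
    val counts = source
      .flatMap(_.split(","))
      .map((_, 1))
      .keyBy(_._1)
      .sum(1)
    // 4. Specify where to put the results of your computations
    counts.print()
    // 5. Trigger the program execution
    env.execute(this.getClass.getSimpleName)
  }

}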

1. Flink Built-in Data Sources

1.1 Reading a File
package com.xk.bigdata.flink.datastream.datasource.buildin

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

/**
 * Flink built-in file-reading data source.
 */
object FileDataSourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the environment-wide default parallelism
    env.setParallelism(4)
    val fileStream = env.readTextFile("data/wc.txt")
    // Print the parallelism of fileStream
    println(fileStream.parallelism)

    val mapStream = fileStream.map(x => x)
    // Print the parallelism of mapStream
    println(mapStream.parallelism)

    env.execute(this.getClass.getSimpleName)
  }

}
  • Output:
4
4
  • This shows that Flink's built-in file source can read data with multiple parallel subtasks (4 here).
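  • Parallelism can also be overridden per operator instead of environment-wide. A minimal sketch (the toUpperCase step is just an illustrative placeholder):

val upperStream = fileStream
  .map(_.toUpperCase)
  .setParallelism(2)
// This map runs with 2 subtasks while the environment default stays 4
println(upperStream.parallelism)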
1.2 Reading Custom Data
  • The following APIs read user-defined Lists, Seqs, and other iterable data:
    • fromElements
    • fromCollection
    • fromParallelCollection
  • Calling the fromElements API
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Set the environment-wide default parallelism
env.setParallelism(4)
val dataStream = env.fromElements("spark", 1L, 2D, '1')
println(dataStream.parallelism)
val mapStream = dataStream.map(x => x)
println(mapStream.parallelism)
  • Note that fromElements accepts arguments of several different types in the same call (here String, Long, Double, and Char).
  • Output:
1
4
  • From the fromElements source code documentation:
Note that this operation will result in a non-parallel data source, i.e. a data source with a parallelism of one.
  • So fromElements only supports a parallelism of 1; forcing a higher parallelism on it throws an exception:
Exception in thread "main" java.lang.IllegalArgumentException: The parallelism of non parallel operator must be 1.
	at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:139)
	at org.apache.flink.api.common.operators.util.OperatorValidationUtils.validateParallelism(OperatorValidationUtils.java:38)
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:85)
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:36)
	at org.apache.flink.streaming.api.scala.DataStream.setParallelism(DataStream.scala:131)
	at com.xk.bigdata.flink.datastream.datasource.buildin.CollectionDataSource$.main(CollectionDataSource.scala:16)
	at com.xk.bigdata.flink.datastream.datasource.buildin.CollectionDataSource.main(CollectionDataSource.scala)
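  • A minimal snippet that reproduces this error (a hypothetical reduction, not the original CollectionDataSource code referenced in the trace):

val env = StreamExecutionEnvironment.getExecutionEnvironment
// fromElements builds a non-parallel source, so forcing parallelism 2 fails
val dataStream = env.fromElements("spark", "hadoop").setParallelism(2)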
  • Calling the fromCollection API
val dataStream2 = env.fromCollection(List("spark,hadoop", 1L))
println(dataStream2.parallelism)
val mapStream2 = dataStream2.map(x => x)
println(mapStream2.parallelism)
  • Output:
1
4
  • From the fromCollection source code documentation:
Note that this operation will result in a non-parallel data source, i.e. a data source with a parallelism of one.
  • fromCollection likewise only supports a parallelism of 1.
  • Calling the fromParallelCollection API
  • Looking at the fromParallelCollection source code, the parameter type must be a SplittableIterator. SplittableIterator is an abstract class with only two built-in implementations: LongValueSequenceIterator and NumberSequenceIterator.
  /**
   * Creates a DataStream from the given [[SplittableIterator]].
   */
  def fromParallelCollection[T: TypeInformation] (data: SplittableIterator[T]):
      DataStream[T] = {
    val typeInfo = implicitly[TypeInformation[T]]
    asScalaStream(javaEnv.fromParallelCollection(data, typeInfo))
  }
val dataStream3 = env.fromParallelCollection(new LongValueSequenceIterator(1L, 10L))
println(dataStream3.parallelism)
  • Because the iterator is splittable, fromParallelCollection produces a genuinely parallel source, so dataStream3 reports the environment parallelism (4 here) rather than being pinned to 1.
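  • For plain Long sequences there is also the generateSequence convenience method, which (in the versions I have checked) builds the same kind of parallel source from a NumberSequenceIterator, so it is not pinned to parallelism 1 either:

// Expected to report the environment parallelism (4 here)
val seqStream = env.generateSequence(1L, 10L)
println(seqStream.parallelism)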