1. Concepts
Today's mainstream compute engines (Tez, Spark, Flink) all build a DAG and then trigger its execution. Source programming for Flink's DataSet and DataStream APIs is very similar; only the APIs being called differ. See the official Source programming guide for details.
2. DataStream Source Programming
The DataStream API is used for developing stream-processing jobs. Listed below are several simple Source programming APIs.
2.1 Reading data from a socket to create a stream
Omitted here; see Flink Quick Start Part 1 (Introduction and WordCount Programming).
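For completeness, here is a minimal socket-source sketch, assuming a socket server is listening on localhost:9999 (for example one started with nc -lk 9999); the host, port and helper name are illustrative:
def fromSocket(env: StreamExecutionEnvironment): Unit = {
  // Read newline-delimited text from the socket as an unbounded stream
  val lines: DataStream[String] = env.socketTextStream("localhost", 9999)
  lines.print().setParallelism(1)
}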
2.2 Reading data from a text file (or directory) to create a stream
def fromTextFile(env: StreamExecutionEnvironment): Unit = {
  // Read a single file
  // val texts = env.readTextFile("data/input/source/text/hello.txt")
  // Read a directory
  val texts = env.readTextFile("data/input/source/text")
  texts.print().setParallelism(1)
}
2.3 Reading data from a collection to create a stream
This API is very convenient for testing.
def fromCollection(env: StreamExecutionEnvironment): Unit = {
  val nums: DataStream[Int] = env.fromCollection(1 to 10)
  nums.map(_ + 1).print().setParallelism(1)
}
2.4 Test code
package com.wsk.flink.source

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._

/**
 * DataStream Source programming
 */
object StreamSourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // fromCollection(env)
    fromTextFile(env)
    env.execute("SourceApp")
  }

  /**
   * Create a DataStream from a collection
   *
   * @param env the streaming execution environment
   */
  def fromCollection(env: StreamExecutionEnvironment): Unit = {
    val nums: DataStream[Int] = env.fromCollection(1 to 10)
    nums.map(_ + 1).print().setParallelism(1)
  }

  /**
   * Create a DataStream from a text file (or directory)
   *
   * @param env the streaming execution environment
   */
  def fromTextFile(env: StreamExecutionEnvironment): Unit = {
    // Read a single file
    // val texts = env.readTextFile("data/input/source/text/hello.txt")
    // Read a directory
    val texts = env.readTextFile("data/input/source/text")
    texts.print().setParallelism(1)
  }
}
3. DataSet Source Programming
The DataSet API is used for developing batch jobs: all of the data is read in at once. Listed below are several simple Source programming APIs.
3.1 Reading data from a CSV file to create a DataSet
def fromCsv(env: ExecutionEnvironment): Unit = {
  val dataSet = env.readCsvFile[Teacher]("data/input/source/csv/test.csv",
    ignoreFirstLine = true,
    pojoFields = Array("name", "age"))
  dataSet.print()
}
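For reference, a test.csv compatible with the Teacher(name, age) case class used here (defined in the test code below) could look like the following; the header row is skipped because ignoreFirstLine = true, and the data rows are purely illustrative:
name,age
Tom,35
Jerry,40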
3.2 Reading data recursively from nested directories to create a DataSet
def fromRecursive(env: ExecutionEnvironment): Unit = {
  val conf = new Configuration()
  conf.setBoolean("recursive.file.enumeration", true)
  val dataSet = env.readCsvFile[Teacher]("data/input/source/recursive",
    ignoreFirstLine = true,
    pojoFields = Array("name", "age")).withParameters(conf)
  dataSet.print()
}
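The recursive.file.enumeration flag is not specific to CSV input; the same Configuration can be attached to a plain text source as well. A minimal sketch reusing the directory above (the helper name fromRecursiveText is illustrative):
def fromRecursiveText(env: ExecutionEnvironment): Unit = {
  val conf = new Configuration()
  conf.setBoolean("recursive.file.enumeration", true)
  // Without the flag, only files directly under the directory are read;
  // with it, files in nested subdirectories are enumerated as well.
  val lines = env.readTextFile("data/input/source/recursive").withParameters(conf)
  lines.print()
}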
3.3 Reading compressed files
By default, Flink automatically decompresses files in several common compression formats when reading them, but note that a compressed file itself cannot be read or processed in parallel.
def fromComCompressFile(env: ExecutionEnvironment): Unit = {
  val dataSet = env.readTextFile("data/input/source/CompressFile")
  dataSet.print()
}
3.4 Test code
package com.wsk.flink.source

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

/**
 * Batch Source programming
 */
object BatchSourceApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // fromCsv(env)
    // fromRecursive(env)
    fromComCompressFile(env)
  }

  /**
   * Create a DataSet from a CSV file
   *
   * @param env the batch execution environment
   */
  def fromCsv(env: ExecutionEnvironment): Unit = {
    val dataSet = env.readCsvFile[Teacher]("data/input/source/csv/test.csv",
      ignoreFirstLine = true,
      pojoFields = Array("name", "age"))
    dataSet.print()
  }

  /**
   * Read data recursively from nested directories
   *
   * @param env the batch execution environment
   */
  def fromRecursive(env: ExecutionEnvironment): Unit = {
    val conf = new Configuration()
    conf.setBoolean("recursive.file.enumeration", true)
    val dataSet = env.readCsvFile[Teacher]("data/input/source/recursive",
      ignoreFirstLine = true,
      pojoFields = Array("name", "age")).withParameters(conf)
    dataSet.print()
  }

  /**
   * Read compressed files
   *
   * @param env the batch execution environment
   */
  def fromComCompressFile(env: ExecutionEnvironment): Unit = {
    val dataSet = env.readTextFile("data/input/source/CompressFile")
    dataSet.print()
  }

  case class Teacher(name: String, age: Int)
}