Flink 程序开发步骤
1:获得一个执行环境
2:加载/创建 初始化数据
3:指定操作数据的transaction算子
4:指定把计算好的数据放在哪
5:调用execute()触发执行程序
注意:Flink程序是延迟计算的,只有最后调用execute()方法的时候才会真正触发执行程序。
延迟计算好处:你可以开发复杂的程序,但是Flink可以将复杂的程序转成一个Plan,将Plan作为一个整体单元执行!
一、Flink StreamingWindowWordCount
需求分析:
手工通过socket实时产生一些单词,使用flink实时接收数据,对指定时间窗口内(例如:2秒)的数据进行聚合统计,并且把时间窗口内计算的结果打印出来
执行:
- 在hadoop100上执行 nc -l 9000
- 在本机启动idea中的代码
执行代码:
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
/**
* 滑动窗口计算
*
* 每一秒统计最近2秒内的数据,打印到控制台
*
* */
object SockWindowCountScala {
def main(args: Array[String]): Unit = {
//获取socket端口号
val port: Int = try{
ParameterTool.fromArgs(args).getInt("port")
}catch{
case e:Exception=>{
System.err.println("No port set use default port 9000")
}
9000
}
//获取运行环境
val env:StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//获取运行环境
val text = env.socketTextStream("qyl01",port,'\n')
//解析数据(把数据打平),分组,窗口计算,并且聚合求sum
//必须导入隐式转换,否则会报错
import org.apache.flink.api.scala._
val windowcounts = text.flatMap(line => line.split("\\s"))
.map(w => wordwithcount(w, 1))
.keyBy("word")
.timeWindow(Time.seconds(2), Time.seconds(1))
.sum("count")
//打印到控制台
windowcounts.print().setParallelism(1)
//执行任务
env.execute("scoket window count")
}
case class wordwithcount(word:String,count:Long)
}
二、Flink BatchWordCount-scala
需求分析:
将本地文件中的单词进行统计
代码实现:
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
object batchWordCountScala {
def main(args: Array[String]): Unit = {
val input = "D:\\data\\file"
val output = "D:\\data\\result"
val env = ExecutionEnvironment.getExecutionEnvironment
val text: DataSet[String] = env.readTextFile(input)
import org.apache.flink.api.scala._
val counts = text.flatMap(_.toLowerCase.split("\\s"))
.filter(_.nonEmpty)
.map((_, 1))
.groupBy(0)
.sum(1)
counts.writeAsCsv(output,"\n"," ").setParallelism(1)
env.execute("bath word count")
}
}
三、Flink Streaming和Batch的区别
流处理Streaming:
StreamExecutionEnvironment
DataStreaming
批处理Batch:
ExecutionEnvironment
DataSet
四、Flink依赖
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>flink</groupId>
<artifactId>finkexample</artifactId>
<version>1.0-SNAPSHOT</version>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.11.0</scala.version>
</properties>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>1.6.1</version>
<!-- provided在这表示此依赖只在代码编译的时候使用,运行和打包的时候不使用 -->
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.6.1</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
<version>1.6.1</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>1.6.1</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.11_2.11</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.11.0.3</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>