Flink standalone: source code analysis of client-side job submission

画画的老顽童

Article link: https://blog.youkuaiyun.com/m0_46449152/article/details/120798275

Entry point

CliFrontend.main -> cli.parseParameters -> ACTION_RUN: run(params) -> executeProgram -> invokeInteractiveModeForExecution
 -> callMainMethod() {
   // look up the user program's static main(String[]) via reflection and invoke it
   mainMethod = entryClass.getMethod("main", String[].class);
   mainMethod.invoke(null, (Object) args);
 }
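
The reflective call in callMainMethod can be tried out in isolation. The following is a minimal sketch using only the JDK; `ReflectiveMainDemo` and `UserProgram` are hypothetical names that stand in for the client and for a user job class such as SocketWindowWordCount, and the error handling is an assumption rather than what Flink does.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class ReflectiveMainDemo {

    /** Hypothetical stand-in for a user job class such as SocketWindowWordCount. */
    public static class UserProgram {
        public static String lastArgs = ""; // records what main received

        public static void main(String[] args) {
            lastArgs = String.join(",", args);
        }
    }

    /** Looks up public static main(String[]) on entryClass and invokes it. */
    public static void callMainMethod(Class<?> entryClass, String[] args) {
        try {
            Method mainMethod = entryClass.getMethod("main", String[].class);
            if (!Modifier.isStatic(mainMethod.getModifiers())) {
                throw new IllegalArgumentException("main must be static");
            }
            // The cast to Object keeps the array from being spread over the varargs parameter.
            mainMethod.invoke(null, (Object) args);
        } catch (ReflectiveOperationException e) {
            // Assumption for this sketch: wrap reflection failures in an unchecked exception.
            throw new IllegalStateException("could not invoke main", e);
        }
    }

    public static void main(String[] args) {
        callMainMethod(UserProgram.class, new String[] {"--port", "9000"});
        System.out.println(UserProgram.lastArgs); // prints "--port,9000"
    }
}
```

Passing `null` as the receiver works because `main` is static; the user program then runs inside the client JVM, which is why the rest of this trace continues in SocketWindowWordCount.main.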
 --> SocketWindowWordCount.main() {
		/*************************************************
		 * TODO
		 *  Note: parse the host and port
		 */
		// the host and the port to connect to
		final String hostname;
		final int port;
		try {
			final ParameterTool params = ParameterTool.fromArgs(args);
			hostname = params.has("hostname") ? params.get("hostname") : "localhost";
			port = params.getInt("port");
		} catch(Exception e) {
			// "port" is mandatory; the upstream example prints a usage message to System.err before returning
			return;
		}

		/*************************************************
		 * TODO
		 *  Note: obtain the StreamExecutionEnvironment.
		 *  It is not the same thing as the SparkContext in Spark.
		 */
		// get the execution environment
		final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

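		/*************************************************
		 * TODO
		 *  Note: load the data source to obtain the data abstraction: DataStream.
		 *  In the end this only creates a DataStreamSource object and sets the StreamSource operator
		 *  (a StreamOperator that wraps the SourceFunction) and the StreamExecutionEnvironment on it.
		 *  DataStreamSource is a subclass of DataStream.
		 *  -
		 *  Main kinds of stream objects:
		 *  	DataStreamSource	stream data source
		 *  	DataStreamSink		stream data sink
		 *  	KeyedStream			stream partitioned by key
		 *  	DataStream			plain data stream
		 *  -
		 *  How the function-related concepts relate:
		 *  	Function			the user logic passed in as an argument
		 *  	Operator			the abstraction used in the graph
		 *  	Transformation		a logical operation on a stream
		 *	 In the end: Function ---> Operator ---> Transformation
		 */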
		// get input data by connecting to the socket
		DataStream<String> text = env.socketTextStream(hostname, port, "\n");

		// parse the data, group it, window it, and aggregate the counts
		DataStream<WordWithCount> windowCounts = text

			// TODO Note: turn the operator into a Transformation and add it to the env's transformations list
			.flatMap(new FlatMapFunction<String, WordWithCount>() {
				@Override
				public void flatMap(String value, Collector<WordWithCount> out) {
					for(String word : value.split("\\s")) {
						out.collect(new WordWithCount(word, 1L));
					}
				}
			})

			// TODO Note: this again creates a DataStream (a KeyedStream)
			.keyBy(value -> value.word)
			.timeWindow(Time.seconds(5))

			// TODO Note: sum the counts of records that share the same word within each window
			.reduce(new ReduceFunction<WordWithCount>() {
				@Override
				public WordWithCount reduce(WordWithCount a, WordWithCount b) {
					return new WordWithCount(a.word, a.count + b.count);
				}
			});

		// print the results with a single thread, rather than in parallel
		windowCounts.print().setParallelism(1);

		/*************************************************
		 * TODO
		 *  Note: submit the job for execution
		 */
		env.execute("Socket Window WordCount");
 }
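--> StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamExecutionEnvironment is the execution entry point of a Flink application. It provides several important mechanisms:
1. It offers readTextFile(), socketTextStream(), createInput(), addSource() and similar methods for connecting to data sources.
2. It offers setParallelism() to set the parallelism of the program.
3. It holds an ExecutionConfig object, which manages the configuration that controls how the job executes.
   It also holds a Configuration object for other settings.
4. It holds a member variable List<Transformation<?>> transformations, which stores the Transformations produced from the job's operators. Stitching these Transformations together in their logical order yields the StreamGraph (Transformation -> StreamOperator -> StreamNode).
5. It offers execute() to submit the job for execution. execute(String jobName) first builds a StreamGraph and then passes it to the execute(StreamGraph) overload.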

--> env.socketTextStream -> addSource() {
		/*************************************************
		 * TODO
		 *  Note: resolve the output data type
		 */
		TypeInformation<OUT> resolvedTypeInfo = getTypeInfo(function, sourceName, SourceFunction.class, typeInfo);
		// TODO Note: check whether the source is parallel
		boolean isParallel = function instanceof ParallelSourceFunction;
		clean(function);
		/*************************************************
		 * TODO
		 *  Note: build the source operator.
		 *  StreamSource is a StreamOperator that wraps the SourceFunction; it is not itself a SourceFunction.
		 */
		final StreamSource<OUT, ?> sourceOperator = new StreamSource<>(function);
		/*************************************************
		 * TODO
		 *  Note: return a DataStreamSource.
		 *  There are four stream abstractions to keep in mind:
		 *  1. DataStream
		 *  2. KeyedStream
		 *  3. DataStreamSource
		 *  4. DataStreamSink
		 */
		return new DataStreamSource<>(this, resolvedTypeInfo, sourceOperator, isParallel, sourceName);
}
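
The relationship between a source function and the operator that carries it is one of wrapping: the operator holds the user function and delegates to it. The sketch below models that with plain JDK types. Every type in it (`MiniSourceFunction`, `MiniParallelSourceFunction`, `MiniStreamSource`) is a hypothetical toy, not a Flink class, and it ignores checkpointing, watermarks and everything else a real source operator has to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SourceWrappingDemo {

    /** Toy stand-in for SourceFunction: user code that emits records. */
    public interface MiniSourceFunction {
        void run(Consumer<String> out);
    }

    /** Toy marker interface, playing the role of ParallelSourceFunction. */
    public interface MiniParallelSourceFunction extends MiniSourceFunction {}

    /** Toy stand-in for StreamSource: an operator that holds the user function and drives it. */
    public static final class MiniStreamSource {
        private final MiniSourceFunction userFunction;

        public MiniStreamSource(MiniSourceFunction userFunction) {
            this.userFunction = userFunction;
        }

        public void run(Consumer<String> out) {
            userFunction.run(out); // the operator delegates to the wrapped function
        }
    }

    /** Mirrors the two decisions in addSource: parallel or not, then wrap the function. */
    public static List<String> demo() {
        MiniSourceFunction function = out -> {
            out.accept("hello");
            out.accept("flink");
        };
        boolean isParallel = function instanceof MiniParallelSourceFunction;
        MiniStreamSource sourceOperator = new MiniStreamSource(function);

        List<String> collected = new ArrayList<>();
        collected.add("parallel=" + isParallel);
        sourceOperator.run(collected::add);
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [parallel=false, hello, flink]
    }
}
```

The `instanceof` check against a marker interface is the same idea as the isParallel line above: the function's type, not a flag, decides whether the source may run with parallelism greater than one.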
--> text.flatMap (turns the operator into a Transformation and adds it to the env's transformations list) {
		/*************************************************
		 * TODO
		 *  Note: use reflection to determine the operator's output type
		 */
		// arguments are omitted in this excerpt
		TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes();

		/*************************************************
		 * TODO
		 *  Note: what an operator call really does is build a Transformation from the operator and add it
		 *  to the transformations list in the env. Later, when the StreamGraph is generated, each
		 *  Transformation is turned into an operator node.
		 *  -
		 *  flatMap ultimately builds and returns a DataStream (a SingleOutputStreamOperator) and adds the
		 *  Transformation to the transformations list, where it waits until submission to be built into the StreamGraph.
		 */
		return flatMap(flatMapper, outType);
		
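		--> flatMap() {
				/*************************************************
				 * TODO
				 *  Note: Flink turns every operator call into a transformation on a stream
				 *  and registers it with the execution environment so that the StreamGraph can be generated.
				 *  -
				 *  Step one: the UDF defined in user code is treated as its base type and handed to the
				 *  StreamFlatMap operator for further wrapping.
				 *  Every Transformation that carries an operator (such as OneInputTransformation) corresponds to one StreamOperator.
				 *  -
				 *  The core idea of Flink stream processing is that records are passed one by one from the
				 *  input stream through a chain of operators and finally handed to the output stream.
				 *  -
				 *  StreamFlatMap is a StreamOperator that wraps the Function; it is not itself a Function.
				 *  -
				 *  StreamFlatMap = StreamOperator
				 *  flatMapper = Function
				 *  -
				 *  Finally, transform is called to turn the StreamFlatMap operator into a Transformation,
				 *  which ends up in the StreamExecutionEnvironment's List<Transformation<?>> transformations.
				 */
				return transform("Flat Map", outputType, new StreamFlatMap<>(clean(flatMapper)));
		}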
		---> doTransform() {
				// read the output type of the input Transform to coax out errors about MissingTypeInfo
				transformation.getOutputType();

				/*************************************************
				 * TODO
				 *  Note: build a OneInputTransformation.
				 *  flatMap takes exactly one input, so the operator is wrapped in a OneInputTransformation.
				 */
				OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(this.transformation, operatorName, operatorFactory, outTypeInfo, environment.getParallelism());

				/*************************************************
				 * TODO
				 *  Note: build a SingleOutputStreamOperator
				 */
				@SuppressWarnings({"unchecked", "rawtypes"})
				SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);

				/*************************************************
				 * TODO Key point
				 *  Note: register the transformation with the execution environment so that the StreamGraph
				 *  can be generated. Despite its name, addOperator adds the Transformation to the
				 *  transformations list. When generate is called later, the StreamGraph structure is built from it.
				 */
				getExecutionEnvironment().addOperator(resultTransform);

				/*************************************************
				 * TODO
				 *  Note:
				 *  SingleOutputStreamOperator is also a subclass of DataStream, so a new DataStream is returned.
				 * 	Calling another operator on that new DataStream produces another Transformation,
				 * 	which is again added to the StreamExecutionEnvironment's transformations.
				 */
				return returnStream;
		}
}
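
The build, register, return pattern in doTransform can be modelled in a few lines. This is a toy sketch under stated assumptions: `MiniEnv`, `MiniStream` and `MiniTransformation` are hypothetical classes invented for illustration, they carry only a name and an input link, and they leave out types, parallelism and operator factories.

```java
import java.util.ArrayList;
import java.util.List;

public class MiniTransformDemo {

    /** Toy stand-in for Transformation: an operator name plus a link to its input. */
    public static final class MiniTransformation {
        final String name;
        final MiniTransformation input; // null for a source

        MiniTransformation(String name, MiniTransformation input) {
            this.name = name;
            this.input = input;
        }
    }

    /** Toy stand-in for the environment: it only collects transformations. */
    public static final class MiniEnv {
        final List<MiniTransformation> transformations = new ArrayList<>();

        void addOperator(MiniTransformation transformation) {
            transformations.add(transformation);
        }
    }

    /** Toy stand-in for DataStream: an environment plus the transformation that produces it. */
    public static final class MiniStream {
        final MiniEnv env;
        final MiniTransformation transformation;

        MiniStream(MiniEnv env, MiniTransformation transformation) {
            this.env = env;
            this.transformation = transformation;
        }

        /** Same three steps as doTransform: build, register, return a new stream. */
        MiniStream transform(String operatorName) {
            MiniTransformation result = new MiniTransformation(operatorName, this.transformation);
            env.addOperator(result);
            return new MiniStream(env, result);
        }
    }

    public static List<String> demo() {
        MiniEnv env = new MiniEnv();
        // In this sketch the source is not registered; it is reachable through the input link.
        MiniStream source = new MiniStream(env, new MiniTransformation("Source", null));
        source.transform("Flat Map").transform("Reduce");

        List<String> lines = new ArrayList<>();
        for (MiniTransformation t : env.transformations) {
            lines.add(t.name + " <- " + t.input.name);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Flat Map <- Source, Reduce <- Flat Map]
    }
}
```

Nothing is executed while the pipeline is being declared: each call only appends a node description, and the input links are what later allow a graph to be reconstructed.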
-> env.execute (submit for execution)

StreamGraph

env.execute() -> StreamGraph sg = getStreamGraph(jobName); ->  getStreamGraphGenerator()
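
To make the Transformation -> StreamNode step concrete, here is a toy sketch of how a generator can walk the registered transformations, visit inputs first, and emit nodes and edges. It is an illustration only: `MiniGraphGenerator` and `MiniTransformation` are hypothetical names, ids are simply assigned in visiting order, and the real generator deals with much more than this (partitioning and operator chaining hints, among other things).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniStreamGraphDemo {

    /** Toy stand-in for Transformation: a name plus a link to its input. */
    public static final class MiniTransformation {
        final String name;
        final MiniTransformation input; // null for a source

        MiniTransformation(String name, MiniTransformation input) {
            this.name = name;
            this.input = input;
        }
    }

    /** Walks transformations depth-first, so every input gets its node before its consumer. */
    public static final class MiniGraphGenerator {
        final Map<MiniTransformation, Integer> alreadyTransformed = new HashMap<>();
        final List<String> nodes = new ArrayList<>();
        final List<String> edges = new ArrayList<>();

        int transform(MiniTransformation t) {
            Integer known = alreadyTransformed.get(t);
            if (known != null) {
                return known; // each transformation becomes a node only once
            }
            int inputId = (t.input == null) ? -1 : transform(t.input);
            int id = alreadyTransformed.size() + 1;
            alreadyTransformed.put(t, id);
            nodes.add(id + ":" + t.name);
            if (inputId != -1) {
                edges.add(inputId + "->" + id);
            }
            return id;
        }
    }

    public static List<String> demo() {
        MiniTransformation source = new MiniTransformation("Socket Stream", null);
        MiniTransformation flatMap = new MiniTransformation("Flat Map", source);
        MiniTransformation reduce = new MiniTransformation("Window Reduce", flatMap);
        MiniTransformation sink = new MiniTransformation("Print", reduce);

        // Assumption of this sketch: only non-source transformations were registered.
        List<MiniTransformation> registered = Arrays.asList(flatMap, reduce, sink);

        MiniGraphGenerator generator = new MiniGraphGenerator();
        for (MiniTransformation t : registered) {
            generator.transform(t);
        }
        List<String> result = new ArrayList<>(generator.nodes);
        result.addAll(generator.edges);
        return result;
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

The `alreadyTransformed` map is the important design choice: a transformation that is reachable along several paths is converted once, and later visits reuse its id.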
