Flink Source Code Walkthrough (3): The ExecutionGraph

Article link: https://blog.youkuaiyun.com/Stray_Lambs/article/details/124597132

The ExecutionGraph

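In Flink, the ExecutionGraph is the central data structure that coordinates the distributed execution of a data flow. It keeps a representation of every parallel task, every intermediate stream, and the communication between them. The StreamGraph and the JobGraph are both generated on the Flink client, whereas the ExecutionGraph, the core graph used by the runtime scheduling layer, is generated on the server side, in the JobManager.

In terms of the actual transformation, the ExecutionGraph only reworks each node of the JobGraph and leaves the overall topology unchanged. The main changes are:

It introduces parallelism, which turns the graph into a structure that can actually be scheduled;
It creates the ExecutionJobVertex and ExecutionVertex that correspond to each JobVertex, and the IntermediateResult and IntermediateResultPartition that correspond to each IntermediateDataSet; parallel execution is implemented through these classes.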
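
The expansion described above can be pictured with a small model. This is only a sketch: SketchJobVertex and SketchExecutionJobVertex are hypothetical stand-ins invented for illustration, not Flink's classes, and the real ExecutionJobVertex carries far more state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a JobGraph vertex; not Flink's JobVertex class.
final class SketchJobVertex {
    final String name;
    final int parallelism;
    final int producedDataSets; // how many IntermediateDataSets the vertex produces

    SketchJobVertex(String name, int parallelism, int producedDataSets) {
        this.name = name;
        this.parallelism = parallelism;
        this.producedDataSets = producedDataSets;
    }
}

// Hypothetical stand-in for ExecutionJobVertex: one per job vertex, holding one
// subtask per unit of parallelism and one partition per subtask per data set.
final class SketchExecutionJobVertex {
    final List<String> executionVertices = new ArrayList<>();
    final List<String> resultPartitions = new ArrayList<>();

    SketchExecutionJobVertex(SketchJobVertex vertex) {
        // One parallel subtask per unit of parallelism.
        for (int subtask = 0; subtask < vertex.parallelism; subtask++) {
            executionVertices.add(vertex.name + "[" + subtask + "]");
        }
        // Each produced data set is split into one partition per subtask.
        for (int dataSet = 0; dataSet < vertex.producedDataSets; dataSet++) {
            for (int subtask = 0; subtask < vertex.parallelism; subtask++) {
                resultPartitions.add(
                        vertex.name + ".result" + dataSet + ".partition" + subtask);
            }
        }
    }
}
```

With this model, a vertex of parallelism 4 that produces one data set expands into 4 subtasks and 4 partitions, while the vertex itself stays a single node in the topology.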

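Core objects of the ExecutionGraph

1. ExecutionJobVertex

An ExecutionJobVertex corresponds one-to-one with a JobVertex in the JobGraph. It represents one vertex from the JobGraph during execution and holds the aggregated state of all its parallel subtasks. Each ExecutionJobVertex has as many ExecutionVertex instances as its parallelism.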

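2. ExecutionVertex

An ExecutionVertex represents one parallel subtask of an ExecutionJobVertex. Its inputs are ExecutionEdges and its outputs are IntermediateResultPartitions. An ExecutionVertex is identified by its ExecutionJobVertex together with the index of the parallel subtask.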

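3. IntermediateResult

An IntermediateResult corresponds one-to-one with an IntermediateDataSet in the JobGraph. One IntermediateResult contains multiple IntermediateResultPartitions; their number equals the parallelism of the producing operator.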

4. IntermediateResultPartition

An IntermediateResultPartition represents one output partition (intermediate result) of an ExecutionVertex. Its producer is an ExecutionVertex and its consumers are a number of ExecutionEdges.

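5. ExecutionEdge

An ExecutionEdge represents an input of an ExecutionVertex. Its source is an IntermediateResultPartition and its target is an ExecutionVertex; each edge has exactly one source and exactly one target.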
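
To make the wiring concrete, here is a minimal sketch of how a partition, its consumers, and the edges between them relate. SketchVertex, SketchPartition, and SketchEdge are hypothetical names chosen for this illustration and are not Flink's classes; the sketch only mirrors the relationships stated above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an ExecutionVertex: it only records its input edges.
final class SketchVertex {
    final String id;
    final List<SketchEdge> inputs = new ArrayList<>();

    SketchVertex(String id) {
        this.id = id;
    }
}

// Hypothetical stand-in for an IntermediateResultPartition:
// one producer, any number of consuming edges.
final class SketchPartition {
    final SketchVertex producer;
    final List<SketchEdge> consumers = new ArrayList<>();

    SketchPartition(SketchVertex producer) {
        this.producer = producer;
    }
}

// Hypothetical stand-in for an ExecutionEdge:
// exactly one source partition and exactly one target vertex.
final class SketchEdge {
    final SketchPartition source;
    final SketchVertex target;

    private SketchEdge(SketchPartition source, SketchVertex target) {
        this.source = source;
        this.target = target;
    }

    // Creates the edge and registers it on both ends.
    static SketchEdge connect(SketchPartition source, SketchVertex target) {
        SketchEdge edge = new SketchEdge(source, target);
        source.consumers.add(edge);
        target.inputs.add(edge);
        return edge;
    }
}
```

A partition that feeds two downstream subtasks therefore ends up with two edges in its consumer list, while each of those edges still points back to that single partition.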

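6. Execution

An ExecutionVertex can be executed multiple times (for recovery, recomputation, or reconfiguration). An Execution tracks the state of one execution of that vertex and the resources it uses.

To handle failures, or when some data has to be recomputed because it is no longer available when a later operation requests it, an ExecutionVertex may have multiple executions. Each execution is identified by an ExecutionAttemptID. All messages between the JobManager and the TaskManager about task deployment and task status updates use the ExecutionAttemptID to address the message receiver.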
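
The idea of routing status updates by attempt ID can be sketched as follows. SketchExecution and SketchRetryableVertex are hypothetical names, a plain UUID stands in for ExecutionAttemptID, and the state strings are illustrative; none of this is Flink's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for one Execution (a single attempt of a subtask).
final class SketchExecution {
    final UUID attemptId = UUID.randomUUID(); // stands in for ExecutionAttemptID
    final int attemptNumber;
    String state = "CREATED";

    SketchExecution(int attemptNumber) {
        this.attemptNumber = attemptNumber;
    }
}

// Hypothetical stand-in for a vertex that can be executed several times.
final class SketchRetryableVertex {
    private final Map<UUID, SketchExecution> attemptsById = new HashMap<>();
    private SketchExecution current;
    private int nextAttemptNumber = 0;

    // Starts a fresh attempt, for example after a failure.
    SketchExecution startNewAttempt() {
        current = new SketchExecution(nextAttemptNumber++);
        attemptsById.put(current.attemptId, current);
        return current;
    }

    // Applies a status update to the attempt it is addressed to.
    // Returns false when no attempt with that ID is known.
    boolean updateState(UUID attemptId, String newState) {
        SketchExecution target = attemptsById.get(attemptId);
        if (target == null) {
            return false;
        }
        target.state = newState;
        return true;
    }

    SketchExecution currentAttempt() {
        return current;
    }
}
```

Because every update carries the attempt ID, a late message from a failed attempt cannot be mistaken for a message about the attempt that replaced it.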

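How the ExecutionGraph is generated

The ExecutionGraph is a fine-grained data structure that describes the parallelized JobGraph as an execution graph; it creates and maintains components such as the CheckpointCoordinator, TaskTracker, ExecutionVertex, ExecutionEdge, and IntermediateResult. The ExecutionGraph is only one part of Flink's core data structures, though. Before the SchedulerNG in the JM (JobManager) builds the ExecutionGraph, Flink performs a series of operations inside the Container allocated by the RM (ResourceManager): interaction with the Yarn cluster, registration through the Akka-based RpcServer communication model (Flink's wrapper around actors), ZK (ZooKeeper)-based high-availability leader election, and so on. The core management and scheduling components involved include the ResourceManager, Dispatcher, JobManager, and Scheduler.

This chapter touches on quite a few frameworks. While walking through the construction of the ExecutionGraph, it covers in turn Flink's interaction with Yarn, the Akka-based communication model, and the ZK-based HA services.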

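The previous chapter covered how the JobGraph is built. Once the JobGraph has been created, Flink creates a YarnJobClusterExecutor to submit the job to the cluster. In the StreamExecutionEnvironment class, public JobExecutionResult execute(StreamGraph streamGraph) throws Exception calls executeAsync through the statement final JobClient jobClient = executeAsync(streamGraph).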

public JobClient executeAsync(StreamGraph streamGraph) throws Exception {
		checkNotNull(streamGraph, "StreamGraph cannot be null.");
		checkNotNull(configuration.get(DeploymentOptions.TARGET), "No execution.target specified in your configuration file.");

		final PipelineExecutorFactory executorFactory =
			executorServiceLoader.getExecutorFactory(configuration);

		checkNotNull(
			executorFactory,
			"Cannot find compatible factory for specified execution.target (=%s)",
			configuration.get(DeploymentOptions.TARGET));
        // The execute call here is where the creation of the ExecutionGraph begins
		CompletableFuture<JobClient> jobClientFuture = executorFactory
			.getExecutor(configuration)
			.execute(streamGraph, configuration, userClassloader);

		try {
			JobClient jobClient = jobClientFuture.get();
			jobListeners.forEach(jobListener -> jobListener.onJobSubmitted(jobClient, null));
			return jobClient;
		} catch (ExecutionException executionException) {
			final Throwable strippedException = ExceptionUtils.stripExecutionException(executionException);
			jobListeners.forEach(jobListener -> jobListener.onJobSubmitted(null, strippedException));

			throw new FlinkException(
				String.format("Failed to execute job '%s'.", streamGraph.getJobName()),
				strippedException);
		}
	}
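
The control flow of executeAsync (wait on a future, notify every listener of success or failure, then return or rethrow) can be isolated in a small sketch. SketchJobListener and SketchSubmitter are hypothetical names, a String job ID stands in for JobClient, and an unchecked IllegalStateException replaces FlinkException so the example is self-contained; this is an illustration of the pattern, not Flink's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Hypothetical listener: exactly one of jobId / failure is non-null.
interface SketchJobListener {
    void onJobSubmitted(String jobId, Throwable failure);
}

final class SketchSubmitter {
    private final List<SketchJobListener> listeners = new ArrayList<>();

    void register(SketchJobListener listener) {
        listeners.add(listener);
    }

    // Blocks until the submission future completes, then notifies listeners.
    String submit(CompletableFuture<String> jobIdFuture) {
        try {
            String jobId = jobIdFuture.get();
            listeners.forEach(l -> l.onJobSubmitted(jobId, null));
            return jobId;
        } catch (ExecutionException e) {
            // Unwrap so listeners and callers see the real cause.
            Throwable cause = e.getCause() != null ? e.getCause() : e;
            listeners.forEach(l -> l.onJobSubmitted(null, cause));
            throw new IllegalStateException("Failed to execute job.", cause);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for callers further up the stack.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for submission.", e);
        }
    }
}
```

Unwrapping the ExecutionException before notifying listeners plays the same role as the ExceptionUtils.stripExecutionException call in the listing: listeners receive the underlying failure rather than the wrapper added by the future.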

This calls the execute method of the AbstractJobClusterExecutor class.

public CompletableFuture<JobClient> execute(@Nonnull final Pipeline pipeline, @Nonnull final Configuration configuration, @Nonnull final ClassLoader userCodeClassloader) throws Exception {
        // Generate the JobGraph
		final JobGraph jobGraph = PipelineExecutorUtils.getJobGraph(pipeline, configuration);

		try (final ClusterDescriptor<ClusterID> clusterDescriptor = clusterClientFactory.createClusterDescriptor(configuration)) {
			final ExecutionConfigAccessor configAccessor = ExecutionConfigAccessor.fromConfiguration(configuration);

			final ClusterSpecification clusterSpecification = clusterClientFactory.getClusterSpecification(configuration);
            // Start deploying the job to the cluster
			final ClusterClientProvider<ClusterID> clusterClientProvider = clusterDescriptor
					.deployJobCluster(clusterSpecification, jobGraph, configAccessor.getDetachedMode());
			LOG.info("Job has been submitted with JobID " + jobGraph.getJobID());

			return CompletableFuture.completedFuture(
					new ClusterClientJobClientAdapter<>(clusterClientProvider, jobGraph.getJobID(), userCodeClassloader));
		}
	}

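The rough steps are: create a YarnClient; check the environment, for example whether the Yarn queue exists; register the resources to request, such as memory and CPU; initialize the storage location (HDFS) for the Jar and other files; perform Kerberos authentication; and so on. All requested resources and actions are put into Yarn's ApplicationSubmissionContext, and the job is finally submitted to the Yarn cluster through yarnClient.submitApplication(appContext).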

private ApplicationReport startAppMaster(
			Configuration configuration,
			String applicationName,
			String yarnClusterEntrypoint,
			JobGraph jobGraph,
			YarnClient yarnClient,
			YarnClientApplication yarnApplication,
			ClusterSpecification clusterSpecification) throws Exception {
 
		// ------------------ Initialize the file systems -------------------------
...
 
LOG.info("Submitting application master " + appId);
		yarnClient.submitApplication(appContext);
...
}

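When the Yarn cluster receives the client's request, it runs a series of resource checks: whether there is enough free memory and CPU to satisfy what the user (Flink) requested, whether the user has permission, whether a free Container is available, whether the specified Yarn queue exists, and so on. If these conditions are met, the RM (Yarn's ResourceManager) is told to allocate a Container on an idle NM (NodeManager) to launch Flink's AM (ApplicationMaster). Once the Container has been allocated, it loads the Jar files uploaded by the user from HDFS (the default) and, after loading them reflectively, calls the main method of the entry class specified in deployInternal. Only this last step is Flink's own code; everything before it happens inside the Yarn cluster.

Here we specify Per-Job mode: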

/**
	 * The class to start the application master with. This class runs the main
	 * method in case of the job cluster.
	 */
	protected String getYarnJobClusterEntrypoint() {
		return YarnJobClusterEntrypoint.class.getName();
	}
 
 
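	// After the Flink client has submitted the job to the ResourceManager,
	// the ResourceManager validates the resources, picks a NodeManager with free capacity, allocates a container there as Flink's AppMaster, and then the main function below is called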
	public static void main(String[] args) {
		// startup checks and logging
		....
 
		try {
			YarnEntrypointUtils.logYarnEnvironmentInformation(env, LOG);
		} catch (IOException e) {
			LOG.warn("Could not log YARN environment information.", e);
		}
 
		Configuration configuration = YarnEntrypointUtils.loadConfiguration(workingDirectory, env);
 
		YarnJobClusterEntrypoint yarnJobClusterEntrypoint = new YarnJobClusterEntrypoint(configuration);
 
		// Prepare to start the Flink cluster
		ClusterEntrypoint.runClusterEntrypoint(yarnJobClusterEntrypoint);
	}

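ClusterEntrypoint is the main entry class of Flink on Yarn inside the AM. It mainly does two things:

First, it starts the external services it depends on, for example RpcService (Akka), HAService, and BlobService.

Second, it starts its own RM and Dispatcher. The RM here is the resource controller that Flink maintains internally (for example, Flink Slots are requested through a two-phase commit, which is analyzed later), while the Dispatcher is mainly responsible for creating the JobManager.

Both the RM and the Dispatcher can be configured for HA (each is a LeaderContender).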
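
A rough sketch of what being a leader contender means: a component registers with an election service and starts its real work only when leadership is granted. The interface below is loosely modeled on the shape of Flink's LeaderContender (grant and revoke callbacks), but SketchLeaderContender, SketchDispatcher, and SketchElectionService are hypothetical names, and the toy "first registered wins" election is an assumption that replaces the ZooKeeper-based election.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Simplified contender contract; the real interface has more methods.
interface SketchLeaderContender {
    void grantLeadership(UUID leaderSessionId);
    void revokeLeadership();
}

// Hypothetical component whose core logic runs only while it is the leader.
final class SketchDispatcher implements SketchLeaderContender {
    final List<String> log = new ArrayList<>();
    private UUID currentSession;

    @Override
    public void grantLeadership(UUID leaderSessionId) {
        currentSession = leaderSessionId;
        log.add("started"); // core logic would start here
    }

    @Override
    public void revokeLeadership() {
        currentSession = null;
        log.add("stopped");
    }

    boolean isLeader() {
        return currentSession != null;
    }
}

// Toy election: the earliest registered contender is the leader.
final class SketchElectionService {
    private final List<SketchLeaderContender> contenders = new ArrayList<>();
    private SketchLeaderContender leader;

    void register(SketchLeaderContender contender) {
        contenders.add(contender);
        if (leader == null) {
            leader = contender;
            contender.grantLeadership(UUID.randomUUID());
        }
    }

    // Simulates the current leader losing leadership; the next contender takes over.
    void leaderLost() {
        if (leader == null) {
            return;
        }
        leader.revokeLeadership();
        contenders.remove(leader);
        leader = contenders.isEmpty() ? null : contenders.get(0);
        if (leader != null) {
            leader.grantLeadership(UUID.randomUUID());
        }
    }
}
```

Registering two SketchDispatcher instances leaves the first one active and the second on standby; calling leaderLost() hands leadership to the standby, which is the moment its core logic would start.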

private void runCluster(Configuration configuration, PluginManager pluginManager) throws Exception {
		synchronized (lock) {
 
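			// Create the RpcService (a wrapper around Akka's ActorSystem, used to create and maintain the actors of the Flink services started later)
			// As in early versions of Spark, Flink's control messaging uses Akka's asynchronous, event-driven actor model, while data exchange uses Netty, a framework well suited to high-concurrency asynchronous IO
			// Start the HaService (in HA mode, ZK stores the metadata shared between the active and standby masters), the blobServer that handles large-file transfer, and so on
			// Use the Curator framework (a ZK client) to create the Flink directory in ZK and start a thread that watches that directory for data changes
			// ZK-based HA for components such as the Dispatcher and JobManager also uses Curator's LeaderLatch for leader election, and their core logic is entered from the code path that runs once leadership is granted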
			initializeServices(configuration, pluginManager);
 
			// write host information into configuration