Flink Source Code Reading (3): JobManager

abcdef324

License: CC 4.0 BY-SA

Category: Flink Source Code Study

Article link: https://blog.youkuaiyun.com/abcdef324/article/details/89472344

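This article introduces the JobManager in Flink, which acts as the coordinator responsible for task scheduling, checkpoint coordination, and failure recovery. In a high-availability setup there are multiple JobManagers, one of which is the leader. The startup flow begins at StandaloneSessionClusterEntrypoint, which initializes the DispatcherResourceManagerComponent. The Dispatcher is the entry point that receives job submissions, and its submitJob method handles them.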

Introduction

According to the official documentation, the JobManager mainly acts as a coordinator.

The JobManagers (also called masters) coordinate the distributed execution. They schedule tasks, coordinate checkpoints, coordinate recovery on failures, etc.
There is always at least one Job Manager. A high-availability setup will have multiple JobManagers, one of which one is always the leader, and the others are standby.
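
To make the leader/standby idea concrete, here is a minimal sketch in which several candidates compete for a single leadership slot and a standby takes over once the leader releases it. This is only an illustration under stated assumptions: `LeaderLatch` and the ids `jm-1` and `jm-2` are hypothetical names, and the in-memory slot is a stand-in for whatever external coordination service a real high-availability setup uses. It is not how Flink implements leader election.

```java
// Hypothetical in-memory leadership slot; not a Flink class.
class LeaderLatch {
    private String leader; // null means there is currently no leader

    // Returns true if the candidate holds leadership after the call.
    synchronized boolean tryAcquire(String candidateId) {
        if (leader == null) {
            leader = candidateId;
            return true;
        }
        return leader.equals(candidateId);
    }

    // Called when the current leader fails or steps down.
    synchronized void release(String candidateId) {
        if (candidateId.equals(leader)) {
            leader = null;
        }
    }

    synchronized String currentLeader() {
        return leader;
    }
}

class LeaderStandbySketch {
    public static void main(String[] args) {
        LeaderLatch latch = new LeaderLatch();
        System.out.println("jm-1 is leader: " + latch.tryAcquire("jm-1")); // true
        System.out.println("jm-2 is leader: " + latch.tryAcquire("jm-2")); // false, stays standby
        latch.release("jm-1"); // simulate the leader going away
        System.out.println("jm-2 is leader: " + latch.tryAcquire("jm-2")); // true, standby takes over
    }
}
```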

Startup

The entry point is org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.

// StandaloneSessionClusterEntrypoint class
public static void main(String[] args) {

	// ... some initialization (elided)

	StandaloneSessionClusterEntrypoint entrypoint = new StandaloneSessionClusterEntrypoint(configuration);

	ClusterEntrypoint.runClusterEntrypoint(entrypoint);
}

Next, runCluster initializes a number of services and then uses the factory pattern to create the DispatcherResourceManagerComponent.

// ClusterEntrypoint class
private void runCluster(Configuration configuration) throws Exception {
		synchronized (lock) {
			initializeServices(configuration);

			// write host information into configuration
			configuration.setString(JobManagerOptions.ADDRESS, commonRpcService.getAddress());
			configuration.setInteger(JobManagerOptions.PORT, commonRpcService.getPort());

			final DispatcherResourceManagerComponentFactory<?> dispatcherResourceManagerComponentFactory = createDispatcherResourceManagerComponentFactory(configuration);

			clusterComponent = dispatcherResourceManagerComponentFactory.create(
				configuration,
				commonRpcService,
				haServices,
				blobServer,
				heartbeatServices,
				metricRegistry,
				archivedExecutionGraphStore,
				new AkkaQueryServiceRetriever(
					metricQueryServiceActorSystem,
					Time.milliseconds(configuration.getLong(WebOptions.TIMEOUT))),
				this);

			clusterComponent.getShutDownFuture().whenComplete(
				(ApplicationStatus applicationStatus, Throwable throwable) -> {
					if (throwable != null) {
						shutDownAsync(
							ApplicationStatus.UNKNOWN,
							ExceptionUtils.stringifyException(throwable),
							false);
					} else {
						// This is the general shutdown path. If a separate more specific shutdown was
						// already triggered, this will do nothing
						shutDownAsync(
							applicationStatus,
							null,
							true);
					}
				});
		}
	}
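
The callback registered on the shutdown future above branches on whether the future completed exceptionally. The sketch below isolates that branching with plain `CompletableFuture`. It is a simplified illustration: the `Status` enum is a hypothetical stand-in for Flink's `ApplicationStatus`, and it uses `handle` to return the resolved status, whereas the excerpt uses `whenComplete` and calls `shutDownAsync` as a side effect.

```java
import java.util.concurrent.CompletableFuture;

class ShutdownFutureSketch {
    // Hypothetical stand-in for Flink's ApplicationStatus.
    enum Status { SUCCEEDED, FAILED, UNKNOWN }

    // Same branching as the callback in the excerpt:
    // an exceptional completion maps to UNKNOWN, otherwise the reported status is passed through.
    static CompletableFuture<Status> resolveShutdownStatus(CompletableFuture<Status> shutDownFuture) {
        return shutDownFuture.handle(
            (status, throwable) -> throwable != null ? Status.UNKNOWN : status);
    }

    public static void main(String[] args) {
        CompletableFuture<Status> normal = new CompletableFuture<>();
        CompletableFuture<Status> broken = new CompletableFuture<>();

        CompletableFuture<Status> resolvedNormal = resolveShutdownStatus(normal);
        CompletableFuture<Status> resolvedBroken = resolveShutdownStatus(broken);

        normal.complete(Status.SUCCEEDED);
        broken.completeExceptionally(new IllegalStateException("component failed"));

        System.out.println(resolvedNormal.join()); // prints SUCCEEDED
        System.out.println(resolvedBroken.join()); // prints UNKNOWN
    }
}
```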

The create method deserves the most attention: it starts the REST endpoint (WebMonitorEndpoint), the ResourceManager, and the Dispatcher.

	@Override
	public DispatcherResourceManagerComponent<T> create(
			Configuration configuration,
			RpcService rpcService,
			HighAvailabilityServices highAvailabilityServices,
			BlobServer blobServer,
			HeartbeatServices heartbeatServices,
			MetricRegistry metricRegistry,
			ArchivedExecutionGraphStore archivedExecutionGraphStore,
			MetricQueryServiceRetriever metricQueryServiceRetriever,
			FatalErrorHandler fatalErrorHandler) throws Exception {

		LeaderRetrievalService dispatcherLeaderRetrievalService = null;
		LeaderRetrievalService resourceManagerRetrievalService = null;
		WebMonitorEndpoint<U> webMonitorEndpoint = null;
		ResourceManager<?> resourceManager = null;
		JobManagerMetricGroup jobManagerMetricGroup = null;
		T dispatcher = null;

		try {
			dispatcherLeaderRetrievalService = highAvailabilityServices.getDispatcherLeaderRetriever();

			resourceManagerRetrievalService = highAvailabilityServices.getResourceManagerLeaderRetriever();

			final LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever = new RpcGatewayRetriever<>(
				rpcService,
				DispatcherGateway.class,
				DispatcherId::fromUuid,
				10,
				Time.milliseconds(50L));

			final LeaderGatewayRetriever<ResourceManagerGateway> resourceManagerGatewayRetriever = new RpcGatewayRetriever<>(
				rpcService,
				ResourceManagerGateway.class,
				ResourceManagerId::fromUuid,
				10,
				Time.milliseconds(50L));

			webMonitorEndpoint = restEndpointFactory.createRestEndpoint(
				configuration,
				dispatcherGatewayRetriever,
				resourceManagerGatewayRetriever,
				blobServer,
				WebMonitorEndpoint.createExecutorService(
					configuration.getInteger(RestOptions.SERVER_NUM_THREADS),
					configuration.getInteger(RestOptions.SERVER_THREAD_PRIORITY),
					"DispatcherRestEndpoint"),
				metricQueryServiceRetriever,
				highAvailabilityServices.getWebMonitorLeaderElectionService(),
				fatalErrorHandler);

			log.debug("Starting Dispatcher REST endpoint.");
			webMonitorEndpoint.start();

			jobManagerMetricGroup = MetricUtils.instantiateJobManagerMetricGroup(
				metricRegistry,
				rpcService.getAddress(),
				ConfigurationUtils.getSystemResourceMetricsProbingInterval(configuration));

			resourceManager = resourceManagerFactory.createResourceManager(
				configuration,
				ResourceID.generate(),
				rpcService,
				highAvailabilityServices,
				heartbeatServices,
				metricRegistry,
				fatalErrorHandler,
				new ClusterInformation(rpcService.getAddress(), blobServer.getPort()),
				webMonitorEndpoint.getRestBaseUrl(),
				jobManagerMetricGroup);

			final HistoryServerArchivist historyServerArchivist = HistoryServerArchivist.createHistoryServerArchivist(configuration, webMonitorEndpoint);

			dispatcher = dispatcherFactory.createDispatcher(
				configuration,
				rpcService,
				highAvailabilityServices,
				resourceManager.getSelfGateway(ResourceManagerGateway.class),
				blobServer,
				heartbeatServices,
				jobManagerMetricGroup,
				metricRegistry.getMetricQueryServicePath(),
				archivedExecutionGraphStore,
				fatalErrorHandler,
				webMonitorEndpoint.getRestBaseUrl(),
				historyServerArchivist);

			log.debug("Starting ResourceManager.");
			resourceManager.start();
			resourceManagerRetrievalService.start(resourceManagerGatewayRetriever);

			log.debug("Starting Dispatcher.");
			dispatcher.start();
			dispatcherLeaderRetrievalService.start(dispatcherGatewayRetriever);

			return createDispatcherResourceManagerComponent(
				dispatcher,
				resourceManager,
				dispatcherLeaderRetrievalService,
				resourceManagerRetrievalService,
				webMonitorEndpoint,
				jobManagerMetricGroup);

		} catch (Exception exception) {
			// ... exception handling logic (elided)
		}
	}
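
The overall shape of create, a factory method that builds several services, starts them in a fixed order, and returns them bundled as one component, can be sketched as follows. Every name here (`Startable`, `NamedService`, `ClusterComponentSketch`, `ClusterComponentFactorySketch`) is hypothetical and heavily simplified. The sketch only mirrors the ordering visible in the excerpt: the REST endpoint is started first, then the ResourceManager, then the Dispatcher. It does not attempt to reproduce the elided exception handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal service interface; Flink's real components are RPC endpoints.
interface Startable {
    void start();
}

// A service that only records its own start in a shared log.
class NamedService implements Startable {
    private final String name;
    private final List<String> startLog;

    NamedService(String name, List<String> startLog) {
        this.name = name;
        this.startLog = startLog;
    }

    @Override
    public void start() {
        startLog.add(name);
    }
}

// Bundles the started services, similar in spirit to DispatcherResourceManagerComponent.
class ClusterComponentSketch {
    final Startable restEndpoint;
    final Startable resourceManager;
    final Startable dispatcher;

    ClusterComponentSketch(Startable restEndpoint, Startable resourceManager, Startable dispatcher) {
        this.restEndpoint = restEndpoint;
        this.resourceManager = resourceManager;
        this.dispatcher = dispatcher;
    }
}

class ClusterComponentFactorySketch {
    // Creates and starts the services in the same order as the excerpt.
    static ClusterComponentSketch create(List<String> startLog) {
        Startable restEndpoint = new NamedService("rest-endpoint", startLog);
        restEndpoint.start();

        Startable resourceManager = new NamedService("resource-manager", startLog);
        Startable dispatcher = new NamedService("dispatcher", startLog);

        resourceManager.start();
        dispatcher.start();

        return new ClusterComponentSketch(restEndpoint, resourceManager, dispatcher);
    }

    public static void main(String[] args) {
        List<String> startLog = new ArrayList<>();
        create(startLog);
        System.out.println(startLog); // prints [rest-endpoint, resource-manager, dispatcher]
    }
}
```

Hiding construction behind a factory keeps the caller (runCluster in the excerpt) independent of which concrete ResourceManager or Dispatcher variant is being wired up.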

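The Dispatcher class is the entry point through which clients submit jobs, specifically through its submitJob method. The Javadoc of Dispatcher describes it as follows: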

Base class for the Dispatcher component. The Dispatcher component is responsible for receiving job submissions, persisting them, spawning JobManagers to execute the jobs and to recover them in case of a master failure. Furthermore, it knows about the state of the Flink session cluster.
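
The responsibilities listed in that description (receive a submission, persist it, spawn something to run it) can be illustrated with a toy model. This is a sketch under loud assumptions: `ToyDispatcher`, its maps, and the string-based job representation are all hypothetical, and the real submitJob operates on Flink's own job types and is considerably more involved. The only point is the ordering of persist first, then run, plus rejection of duplicate submissions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical toy model; not Flink's Dispatcher.
class ToyDispatcher {
    // Stand-in for a persistent store of submitted jobs.
    private final Map<String, String> persistedJobs = new LinkedHashMap<>();
    // Jobs for which a (pretend) per-job manager has been spawned.
    private final Map<String, String> runningJobs = new LinkedHashMap<>();

    synchronized CompletableFuture<String> submitJob(String jobId, String jobGraph) {
        if (persistedJobs.containsKey(jobId)) {
            // Duplicate submissions are rejected through a failed future.
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                new IllegalStateException("Job already submitted: " + jobId));
            return failed;
        }
        persistedJobs.put(jobId, jobGraph); // 1. persist the submission
        runningJobs.put(jobId, "RUNNING");  // 2. spawn a runner for the job
        return CompletableFuture.completedFuture("ACK:" + jobId);
    }

    synchronized boolean isRunning(String jobId) {
        return runningJobs.containsKey(jobId);
    }
}

class ToyDispatcherDemo {
    public static void main(String[] args) {
        ToyDispatcher dispatcher = new ToyDispatcher();
        System.out.println(dispatcher.submitJob("job-1", "graph-a").join()); // prints ACK:job-1
        System.out.println(dispatcher.isRunning("job-1"));                   // prints true
        System.out.println(
            dispatcher.submitJob("job-1", "graph-a").isCompletedExceptionally()); // prints true
    }
}
```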