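1. Introduction
Apache Flink manages the compute resources of a distributed cluster through two kinds of runtime JVM processes.
- The JobManager process handles distributed job management, such as task scheduling, checkpointing, and failure recovery. In a high-availability (HA) deployment there are multiple JobManagers: one leader and several standbys. The JobManager is the master in Flink's master-slave architecture.
- The TaskManager process executes task threads (that is, subtasks) and buffers and exchanges data streams. The TaskManager is the slave (worker) in Flink's master-slave architecture.
The Task.java class represents one operator subtask executed on a TaskManager. These operator subtasks run independently of one another in different threads, on different physical machines, or in different containers.
Each operator subtask is run by a dedicated thread.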
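The "one dedicated thread per subtask" model can be illustrated with a minimal sketch in plain Java. This is not Flink code: SubtaskThreadDemo, DemoSubtask, and runSubtasks are hypothetical names introduced only for illustration, and the "work" is reduced to recording which thread ran the subtask.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SubtaskThreadDemo {

    // Hypothetical stand-in for an operator subtask; NOT Flink's Task class.
    static class DemoSubtask implements Runnable {
        private final Set<String> executedOn;

        DemoSubtask(Set<String> executedOn) {
            this.executedOn = executedOn;
        }

        @Override
        public void run() {
            // A real subtask would process records here; the sketch only
            // records the name of the thread that executed it.
            executedOn.add(Thread.currentThread().getName());
        }
    }

    // Starts one dedicated thread per subtask, waits for all of them, and
    // returns the names of the threads that executed a subtask.
    static Set<String> runSubtasks(int parallelism) {
        Set<String> executedOn = ConcurrentHashMap.newKeySet();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            Thread t = new Thread(new DemoSubtask(executedOn), "demo-subtask-" + i);
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for subtasks", e);
        }
        return executedOn;
    }

    public static void main(String[] args) {
        System.out.println("threads used: " + runSubtasks(4));
    }
}
```

Because every subtask gets its own uniquely named thread, the returned set contains exactly one entry per subtask.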
2. Code analysis
The org.apache.flink.runtime.taskmanager.Task class is 1645 lines long in Flink 1.8, which makes it a very lengthy class. It implements the Runnable, TaskActions, and CheckpointListener interfaces.
public interface TaskActions {
    /**
     * Check the execution state of the execution producing a result partition.
     *
     * @param jobId ID of the job the partition belongs to.
     * @param intermediateDataSetId ID of the parent intermediate data set.
     * @param resultPartitionId ID of the result partition to check. This
     * identifies the producing execution and partition.
     */
    void triggerPartitionProducerStateCheck(
        JobID jobId,
        IntermediateDataSetID intermediateDataSetId,
        ResultPartitionID resultPartitionId);
    /**
     * Fail the owning task with the given throwable.
     *
     * @param cause of the failure
     */
    void failExternally(Throwable cause);
}
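The TaskActions interface defines the actions that can be performed on a Task. It currently contains two methods:
- triggerPartitionProducerStateCheck: checks the execution state of the execution that produces a result partition
- failExternally: fails the current Task with the given Throwable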
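To make the failExternally contract concrete, here is a minimal sketch of how an implementer might react to the call. It is an assumption-laden illustration, not Flink's implementation: DemoTaskActions is a simplified, hypothetical copy of the interface (triggerPartitionProducerStateCheck is left out because it needs Flink types), and DemoTask and DemoState are invented names.

```java
import java.util.concurrent.atomic.AtomicReference;

class FailExternallyDemo {

    // Simplified, hypothetical copy of the callback; the real TaskActions
    // also declares triggerPartitionProducerStateCheck.
    interface DemoTaskActions {
        void failExternally(Throwable cause);
    }

    enum DemoState { RUNNING, FAILED }

    // Hypothetical task that moves from RUNNING to FAILED exactly once.
    static class DemoTask implements DemoTaskActions {
        private final AtomicReference<DemoState> state =
            new AtomicReference<>(DemoState.RUNNING);
        private volatile Throwable failureCause;

        @Override
        public void failExternally(Throwable cause) {
            // Only the first failure performs the transition; later calls
            // find the state already FAILED and are ignored.
            if (state.compareAndSet(DemoState.RUNNING, DemoState.FAILED)) {
                failureCause = cause;
            }
        }

        DemoState getState() {
            return state.get();
        }

        Throwable getFailureCause() {
            return failureCause;
        }
    }

    public static void main(String[] args) {
        DemoTask task = new DemoTask();
        task.failExternally(new RuntimeException("lost connection"));
        System.out.println(task.getState() + ": " + task.getFailureCause().getMessage());
    }
}
```

The compare-and-set keeps the transition safe when several threads report a failure at the same time, which is the typical situation for a callback that other components invoke from outside the task thread.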
public interface CheckpointListener {
    /**
     * This method is called as a notification once a distributed checkpoint has been completed.
     *
     * Note that any exception during this method will not cause the checkpoint to
     * fail any more.
     *
     * @param checkpointId The ID of the checkpoint that has been completed.
     * @throws Exception
     */
    void notifyCheckpointComplete(long checkpointId) throws Exception;
}
The CheckpointListener interface defines the notification logic that runs after a checkpoint has completed.
2.1 Constructor
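One common use of such a notification is to finalize work that was staged while a checkpoint was in flight. The sketch below shows that idea under stated assumptions: DemoCheckpointListener is a hypothetical local copy of the interface, and BufferingCommitter, stage, and the string payloads are invented for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class CheckpointListenerDemo {

    // Hypothetical local copy of the callback shown above.
    interface DemoCheckpointListener {
        void notifyCheckpointComplete(long checkpointId) throws Exception;
    }

    // Hypothetical component that stages data per checkpoint and commits it
    // only after the checkpoint is reported as complete.
    static class BufferingCommitter implements DemoCheckpointListener {
        private final TreeMap<Long, String> pending = new TreeMap<>();
        private final List<String> committed = new ArrayList<>();

        void stage(long checkpointId, String data) {
            pending.put(checkpointId, data);
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            // headMap(id, true) is a live view of all entries with key <= id,
            // so clearing the view also removes them from 'pending'.
            NavigableMap<Long, String> done = pending.headMap(checkpointId, true);
            committed.addAll(done.values());
            done.clear();
        }

        List<String> getCommitted() {
            return committed;
        }

        int pendingCount() {
            return pending.size();
        }
    }

    public static void main(String[] args) {
        BufferingCommitter committer = new BufferingCommitter();
        committer.stage(1L, "a");
        committer.stage(2L, "b");
        committer.stage(3L, "c");
        committer.notifyCheckpointComplete(2L);
        System.out.println(committer.getCommitted() + ", pending=" + committer.pendingCount());
    }
}
```

Committing everything up to and including the notified ID, rather than only the exact ID, is a design choice of this sketch: it tolerates a notification for an earlier checkpoint never arriving.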
public Task(
        JobInformation jobInformation,
        TaskInformation taskInformation,
        ExecutionAttemptID executionAttemptID,
        AllocationID slotAllocationId,
        int subtaskIndex,
        int attemptNumber,
        Collection<ResultPartitionDeploymentDescriptor> resultPartitionDeploymentDescriptors,
        Collection<InputGateDeploymentDescriptor> inputGateDeploymentDescriptors,
        int targetSlotNumber,
        MemoryManager memManager,
        IOManager ioManager,
        NetworkEnvironment networkEnvironment,
        BroadcastVariableManager bcVarManager,
        TaskStateManager taskStateManager,
        TaskManagerActions taskManagerActions,
        InputSplitProvider inputSplitProvider,
        CheckpointResponder checkpointResponder,
        GlobalAggregateManager aggregateManager,
        BlobCacheService blobService,
        LibraryCacheManager libraryCache,
        FileCache fileCache,
        TaskManagerRuntimeInfo taskManagerConfig,
        @Nonnull TaskMetricGroup metricGroup,
        ResultPartitionConsumableNotifier resultPartitionConsumableNotifier,
        PartitionProducerStateChecker partitionProducerStateChecker,
        Executor executor) {
    Preconditions.checkNotNull(jobInformation);
    Preconditions.checkNotNull(taskInformation);
    Preconditions.checkArgument(0 <= subtaskIndex, "The subtask index must be positive.");
    Preconditions.checkArgument(0 <= attemptNumber, "The attempt number must be positive.");
    Preconditions.checkArgument(0 <= targetSlotNumber, "The target slot number must be positive.");
    this.taskInfo = new TaskInfo(
        taskInformation.getTaskName(),
        taskInformation.getMaxNumberOfSubtaks(),
        subtaskIndex,
        taskInformation.getNumberOfSubtasks(),
        attemptNumber,
        String.valueOf(slotAllocationId));
    this.jobId = jobInformation.getJobId();
    this.vertexId = taskInformation.getJobVertexId();
    this.executionId = Preconditions.checkNotNull(executionAttemptID);
    this.allocationId = Preconditions.checkNotNull(slotAllocationId);
    this.taskNameWithSubtask = taskInfo.getTaskNameWithSubtasks();
    this.jobConfiguration = jobInformation.getJobConfiguration();
    this.taskConfiguration = taskInformation.getTaskConfiguration();
    this.requiredJarFiles = jobInformation.getRequiredJarFileBlobKeys();
    this.requiredClasspaths = jobInformation.getRequiredClasspathURLs();
    this.nameOfInvokableClass = taskInformation.getInvokableClassName();
    this.serializedExecutionConfig = jobInformation.getSerializedExecutionConfig();
    Configuration tmConfig = taskManagerConfig.getConfiguration();
    this.taskCancellationInterval = tmConfig.getLong(TaskManagerOptions.TASK_CANCELLATION_INTERVAL);
    this.taskCancellationTimeout = tmConfig.getLong(TaskManagerOptions.TASK_CANCELLATION_TIMEOUT);
    this.memoryManager = Preconditions.checkNotNull(memManager);
    this.ioManager = Preconditions.checkNotNull(ioManager);
    this.broadcastVariableManager = Preconditions.checkNotNull(bcVarManager);
    this.taskStateManager = Preconditions.checkNotNull(taskStateManager);
    this.accumulatorRegistry = new AccumulatorRegistry(jobId, executionId);
    this.inputSplitProvider = Preconditions.checkNotNull(inputSplitProvider);
    this.checkpointResponder = Preconditions.checkNotNull(checkpointResponder);
    this.aggregateManager = Preconditions.checkNotNull(aggregateManager