This series does not spend much time on the YARN source itself; the focus is on how the MapReduce computation model actually runs, whether on YARN or on MR1.
Some of the material here overlaps with the YARN source-code series; reading the two side by side should make both clearer.
The analysis is based on the Hadoop 2.6.5 source.
Let's start from the very beginning. Suppose we submit a simple job whose main method looks like this:
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length < 2) {
System.err.println("Usage: wordcount <in> [<in>...] <out>");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(
otherArgs[otherArgs.length - 1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Apart from the first line, which creates a Configuration, and a few lines of argument checking, nearly every remaining statement goes through Job. That class is clearly central, so let's look at selected parts of it:
static {
ConfigUtil.loadResources();
}
The static block that loads resources matters a lot:
public static void loadResources() {
addDeprecatedKeys();
Configuration.addDefaultResource("mapred-default.xml");
Configuration.addDefaultResource("mapred-site.xml");
Configuration.addDefaultResource("yarn-default.xml");
Configuration.addDefaultResource("yarn-site.xml");
}
The first call, addDeprecatedKeys, exists for backward compatibility: it registers configuration keys that are deprecated but still recognized, so we won't dwell on it.
After that, four configuration files are registered, and these are only the default resources:
/**
* Add a default resource. Resources are loaded in the order of the
* resources added.
*
* @param name
* file name. File should be present in the classpath.
*/
public static synchronized void addDefaultResource(String name) {
if (!defaultResources.contains(name)) {
defaultResources.add(name);
for (Configuration conf : REGISTRY.keySet()) {
if (conf.loadDefaults) {
conf.reloadConfiguration();
}
}
}
}
This is a static synchronized method, and the ordering is worth noting: class initialization runs the static block before any constructor, and loadDefaults is true by default.
So the very first thing the system does is register these four configuration files as defaults; defaultResources itself is a CopyOnWriteArrayList.
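A minimal, hedged demo of the effect (the class name is made up; it assumes no site file overrides mapreduce.framework.name, so the value from mapred-default.xml is printed):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DefaultResourceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Merely touching the Job class runs its static block, which registers
        // mapred-default.xml / mapred-site.xml / yarn-default.xml / yarn-site.xml.
        Job job = Job.getInstance(conf, "default-resource-demo");
        // Without a site-side override, mapred-default.xml supplies "local" here.
        System.out.println(job.getConfiguration().get("mapreduce.framework.name"));
    }
}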
Now the reloadConfiguration method:
public synchronized void reloadConfiguration() {
properties = null; // trigger reload
finalParameters.clear(); // clear site-limits
}
This reload wipes out everything that has been loaded so far; the properties are re-read lazily the next time they are needed. Now back to the Job constructor our driver actually calls:
public Job(Configuration conf, String jobName) throws IOException {
this(conf);
setJobName(jobName);
}
This is the constructor our driver invokes; setJobName is trivial, so focus on the first call, this(conf):
public Job(Configuration conf) throws IOException {
this(new JobConf(conf));
}
Let's unpack the JobConf construction:
public JobConf(Class exampleClass) {
setJarByClass(exampleClass);
checkAndWarnDeprecation();
}
public void setJarByClass(Class cls) {
String jar = ClassUtil.findContainingJar(cls);
if (jar != null) {
setJar(jar);
}
}
For the conf we pass in there is no jar to pick up, so setJarByClass does nothing of interest here and we can skip it; checkAndWarnDeprecation is just a few checks, also skipped.
One layer further out there is another constructor; let's open it:
Job(JobConf conf) throws IOException {
super(conf, null);
// propagate existing user credentials to job
this.credentials.mergeAll(this.ugi.getCredentials());
this.cluster = null;
}
Naturally, we should look at the superclass constructor:
public JobContextImpl(Configuration conf, JobID jobId) {
if (conf instanceof JobConf) {
this.conf = (JobConf) conf;
} else {
this.conf = new JobConf(conf);
}
this.jobId = jobId;
this.credentials = this.conf.getCredentials();
try {
this.ugi = UserGroupInformation.getCurrentUser();
} catch (IOException e) {
throw new RuntimeException(e);
}
}
So here it is: the conf we pass in is already wrapped as a JobConf. At this point we have a Job. The next few setter calls in the driver (setJarByClass through setOutputValueClass) merely populate the job's configuration, and the input/output path lines near the end do essentially the same, so let's jump straight to waitForCompletion:
public boolean waitForCompletion(boolean verbose) throws IOException,
InterruptedException, ClassNotFoundException {
if (state == JobState.DEFINE) {
submit();
}
// ... progress monitoring (monitorAndPrintJob) and the final return of isSuccessful() are elided
}
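As an aside, waitForCompletion(true) is essentially submit() plus progress monitoring. A hedged sketch of the non-blocking alternative, continuing the driver above:
job.submit();                         // returns right after the job is handed off
while (!job.isComplete()) {           // poll the job status through the cluster client
    Thread.sleep(5000);
}
System.exit(job.isSuccessful() ? 0 : 1);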
By default state == JobState.DEFINE, so submit() is executed:
public void submit() throws IOException, InterruptedException,
ClassNotFoundException {
ensureState(JobState.DEFINE);
setUseNewAPI();
connect();
final JobSubmitter submitter = getJobSubmitter(cluster.getFileSystem(),
cluster.getClient());
status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
public JobStatus run() throws IOException, InterruptedException,
ClassNotFoundException {
return submitter.submitJobInternal(Job.this, cluster);
}
});
state = JobState.RUNNING;
LOG.info("The url to track the job: " + getTrackingURL());
}
Let's take it piece by piece:
private void ensureState(JobState state) throws IllegalStateException {
if (state != this.state) {
throw new IllegalStateException("Job in state " + this.state
+ " instead of " + state);
}
if (state == JobState.RUNNING && cluster == null) {
throw new IllegalStateException("Job in state " + this.state
+ ", but it isn't attached to any job tracker!");
}
}
Nothing remarkable here, just a few extra validations. Next comes setUseNewAPI, whose first line is:
int numReduces = conf.getNumReduceTasks();
public int getNumReduceTasks() {
return getInt(JobContext.NUM_REDUCES, 1);
}
Notice that the very first thing pinned down is the number of reduce tasks; mapreduce.job.reduces defaults to 1, which keeps the final aggregation simple. It can be overridden in the driver, as sketched below.
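Continuing the WordCount driver, a hedged one-liner (the value 4 is arbitrary):
job.setNumReduceTasks(4);   // same effect as setting mapreduce.job.reduces=4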
setUseNewAPI();
The rest of the method looks busier, but it only decides whether the new API is used; we do use it here, so we won't go through it.
connect();
This method is short but extremely important:
private synchronized void connect() throws IOException,
InterruptedException, ClassNotFoundException {
if (cluster == null) {
cluster = ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
public Cluster run() throws IOException, InterruptedException,
ClassNotFoundException {
return new Cluster(getConfiguration());
}
});
}
}
ugi was already initialized earlier, in the JobContextImpl constructor:
/**
* User and group information for Hadoop. This class wraps around a JAAS Subject
* and provides methods to determine the user's username and groups. It supports
* both the Windows, Unix and Kerberos login modules.
*/
@InterfaceAudience.LimitedPrivate({ "HDFS", "MapReduce", "HBase", "Hive",
"Oozie" })
@InterfaceStability.Evolving
public class UserGroupInformation
The Javadoc says it all: UserGroupInformation records the user and group information Hadoop runs with, wrapping a JAAS Subject; we won't analyze it further here.
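A small, hedged illustration of what ugi provides (the reported name depends on the local login module or Kerberos setup):
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

final UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
System.out.println(ugi.getShortUserName());      // e.g. the OS login user
// Work that must carry this user's credentials is wrapped in doAs,
// exactly as Job.connect() does below.
ugi.doAs(new PrivilegedExceptionAction<Void>() {
    public Void run() throws Exception {
        return null;                             // privileged work goes here
    }
});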
Now look at the Cluster constructor. getConfiguration() simply returns the JobConf we built, and the single-argument Cluster(Configuration) constructor delegates here with a null jobTrackAddr:
public Cluster(InetSocketAddress jobTrackAddr, Configuration conf)
throws IOException {
this.conf = conf;
this.ugi = UserGroupInformation.getCurrentUser();
initialize(jobTrackAddr, conf);
}
The third statement, initialize, is the one to watch:
private void initialize(InetSocketAddress jobTrackAddr, Configuration conf)
throws IOException {
synchronized (frameworkLoader) {
for (ClientProtocolProvider provider : frameworkLoader) {
LOG.debug("Trying ClientProtocolProvider : "
+ provider.getClass().getName());
ClientProtocol clientProtocol = null;
try {
if (jobTrackAddr == null) {
clientProtocol = provider.create(conf);
} else {
clientProtocol = provider.create(jobTrackAddr, conf);
}
if (clientProtocol != null) {
clientProtocolProvider = provider;
client = clientProtocol;
LOG.debug("Picked " + provider.getClass().getName()
+ " as the ClientProtocolProvider");
break;
} else {
LOG.debug("Cannot pick "
+ provider.getClass().getName()
+ " as the ClientProtocolProvider - returned null protocol");
}
} catch (Exception e) {
LOG.info("Failed to use " + provider.getClass().getName()
+ " due to error: " + e.getMessage());
}
}
}
if (null == clientProtocolProvider || null == client) {
throw new IOException(
"Cannot initialize Cluster. Please check your configuration for "
+ MRConfig.FRAMEWORK_NAME
+ " and the correspond server addresses.");
}
}
The frameworkLoader used here is a static member:
private static ServiceLoader<ClientProtocolProvider> frameworkLoader = ServiceLoader
.load(ClientProtocolProvider.class);
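A hedged sketch of what that discovery amounts to (which providers actually appear depends on the jars on the classpath):
import java.util.ServiceLoader;
import org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider;

ServiceLoader<ClientProtocolProvider> loader =
        ServiceLoader.load(ClientProtocolProvider.class);
for (ClientProtocolProvider provider : loader) {
    // Typically LocalClientProtocolProvider and YarnClientProtocolProvider,
    // registered under META-INF/services in the MapReduce client jars.
    System.out.println(provider.getClass().getName());
}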
We won't trace ServiceLoader's internals; in essence it discovers every ClientProtocolProvider implementation registered on the classpath. Since jobTrackAddr is null for our submission, create(conf) is called on each provider:
clientProtocol = provider.create(conf);
public ClientProtocol create(Configuration conf) throws IOException {
String framework = conf.get(MRConfig.FRAMEWORK_NAME,
MRConfig.LOCAL_FRAMEWORK_NAME);
if (!MRConfig.LOCAL_FRAMEWORK_NAME.equals(framework)) {
return null;
}
conf.setInt(JobContext.NUM_MAPS, 1);
return new LocalJobRunner(conf);
}
This create belongs to LocalClientProtocolProvider: unless mapreduce.framework.name is "local" it returns null and the loop moves on to the next provider. Its counterpart, YarnClientProtocolProvider, returns a YARNRunner only when the framework name is "yarn". So to run on YARN we must make sure of this setting:
mapreduce.framework.name=yarn
Otherwise the job is submitted locally through LocalJobRunner.
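In driver code the override would look like this hedged snippet (normally it lives in mapred-site.xml rather than in code):
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "yarn");   // picks YarnClientProtocolProvider / YARNRunner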
With that settled, we clearly need to look at how YARNRunner is initialized:
public YARNRunner(Configuration conf) {
this(conf, new ResourceMgrDelegate(new YarnConfiguration(conf)));
}
public YARNRunner(Configuration conf, ResourceMgrDelegate resMgrDelegate,
ClientCache clientCache) {
this.conf = conf;
try {
this.resMgrDelegate = resMgrDelegate;
this.clientCache = clientCache;
this.defaultFileContext = FileContext.getFileContext(this.conf);
} catch (UnsupportedFileSystemException ufe) {
throw new RuntimeException("Error in instantiating YarnClient", ufe);
}
}
Note that this is not a cast: the JobConf we pass in is wrapped in a new YarnConfiguration. We won't go through YarnConfiguration itself; it mostly defines YARN-specific configuration keys.
public ResourceMgrDelegate(YarnConfiguration conf) {
super(ResourceMgrDelegate.class.getName());
this.conf = conf;
this.client = YarnClient.createYarnClient();
init(conf);
start();
}
This constructor is worth a careful look: it ends up establishing the RPC client that talks to the RM and hands it back for YARNRunner to use. Since RPC is involved, and some commonly used address settings come into play here, let's follow it:
@Public
public static YarnClient createYarnClient() {
YarnClient client = new YarnClientImpl();
return client;
}
The client created here is in fact a YarnClientImpl. Its construction is unremarkable; what matters are serviceInit and serviceStart. A useful habit when reading YARN code: every service class has these two methods, which handle initialization and startup uniformly.
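A hedged, standalone sketch of that lifecycle using YarnClient (it assumes a reachable RM configured through yarn-site.xml):
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(new YarnConfiguration());   // serviceInit(): absorb the configuration
yarnClient.start();                         // serviceStart(): build the RPC proxy to the RM
try {
    System.out.println(yarnClient.getApplications().size());   // a call that goes over that proxy
} finally {
    yarnClient.stop();                      // serviceStop(): tear everything down
}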
We'll skip serviceInit; serviceStart is the key part:
rmClient = ClientRMProxy.createRMProxy(getConfig(),
ApplicationClientProtocol.class);
ApplicationClientProtocol is the protocol the client side uses to talk to the RM, and it sits on top of Hadoop RPC. This is exactly where that RPC connection is established, so let's see how the RM address and the related settings are resolved.
getConfig() comes from AbstractService and just returns the current configuration:
public static <T> T createRMProxy(final Configuration configuration,
final Class<T> protocol) throws IOException {
return createRMProxy(configuration, protocol, INSTANCE);
}
Note INSTANCE, which is statically initialized:
private static final ClientRMProxy INSTANCE = new ClientRMProxy();
Moving on:
/**
* Create a proxy for the specified protocol. For non-HA, this is a direct
* connection to the ResourceManager address. When HA is enabled, the proxy
* handles the failover between the ResourceManagers as well.
*/
@Private
protected static <T> T createRMProxy(final Configuration configuration,
final Class<T> protocol, RMProxy instance) throws IOException {
YarnConfiguration conf = (configuration instanceof YarnConfiguration) ? (YarnConfiguration) configuration
: new YarnConfiguration(configuration);
RetryPolicy retryPolicy = createRetryPolicy(conf);
if (HAUtil.isHAEnabled(conf)) {
RMFailoverProxyProvider<T> provider = instance
.createRMFailoverProxyProvider(conf, protocol);
return (T) RetryProxy.create(protocol, provider, retryPolicy);
} else {
InetSocketAddress rmAddress = instance.getRMAddress(conf, protocol);
LOG.info("Connecting to ResourceManager at " + rmAddress);
T proxy = RMProxy.<T> getProxy(conf, protocol, rmAddress);
return (T) RetryProxy.create(protocol, proxy, retryPolicy);
}
}
This is the key piece. Our conf is already a YarnConfiguration, so no surprises there. We'll stick to the non-HA case and go straight into the else branch:
protected InetSocketAddress getRMAddress(YarnConfiguration conf,
Class<?> protocol) throws IOException {
if (protocol == ApplicationClientProtocol.class) {
return conf.getSocketAddr(YarnConfiguration.RM_ADDRESS,
YarnConfiguration.DEFAULT_RM_ADDRESS,
YarnConfiguration.DEFAULT_RM_PORT);
}
// ... branches for the other RM-facing protocols are elided
}
Here is where our configuration is actually consumed: the relevant YarnConfiguration keys determine the address used to set up the RPC connection:
public static final String RM_PREFIX = "yarn.resourcemanager.";
public static final String RM_ADDRESS = RM_PREFIX + "address";
public static final int DEFAULT_RM_PORT = 8032;
public static final String DEFAULT_RM_ADDRESS = "0.0.0.0:"
+ DEFAULT_RM_PORT;
So by default the client connects to port 8032 on the RM to establish the RPC connection. If yarn.resourcemanager.address is left at its default, jobs can effectively only be submitted from the RM machine itself, because 0.0.0.0 resolves to the local host.
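A hedged example of pointing a client at a remote RM (the host name is made up; normally this comes from yarn-site.xml):
Configuration conf = new Configuration();
conf.set(YarnConfiguration.RM_ADDRESS, "rm-host.example.com:8032");   // yarn.resourcemanager.address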
At this point the RPC connection to the RM exists. The rmClient that holds it is a field of YarnClientImpl, which is held by ResourceMgrDelegate, which in turn is a field of YARNRunner. So the picture is clear: most of YARNRunner's interaction with the RM goes through this one RPC connection.
Back in submit(), the next statement builds the JobSubmitter that will carry out submitJobInternal:
final JobSubmitter submitter = getJobSubmitter(cluster.getFileSystem(),
cluster.getClient());
/** Only for mocking via unit tests. */
@Private
public JobSubmitter getJobSubmitter(FileSystem fs,
ClientProtocol submitClient) throws IOException {
return new JobSubmitter(fs, submitClient);
}
When the Cluster was created it already set up client, the RPC proxy to the RM, along with a handle to the file system; JobSubmitter simply holds both, and the work that follows goes through them.
Now let's walk through the logic of submitJobInternal:
checkSpecs(job);
First, the output specification check:
private void checkSpecs(Job job) throws ClassNotFoundException,
InterruptedException, IOException {
JobConf jConf = (JobConf) job.getConfiguration();
// Check the output specification
if (jConf.getNumReduceTasks() == 0 ? jConf.getUseNewMapper() : jConf
.getUseNewReducer()) {
org.apache.hadoop.mapreduce.OutputFormat<?, ?> output = ReflectionUtils
.newInstance(job.getOutputFormatClass(),
job.getConfiguration());
output.checkOutputSpecs(job);
} else {
jConf.getOutputFormat().checkOutputSpecs(jtFs, jConf);
}
}
The logic is simple and we won't analyze it in full here; as the name suggests, it mainly verifies the output directory (we come back to it below).
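As a hedged aside, driver-side convenience rather than part of the submission path: a common pattern is to clear a stale output directory before submitting, because this check would otherwise abort the job. Continuing the WordCount driver (FileSystem here is org.apache.hadoop.fs.FileSystem):
Path out = new Path(otherArgs[otherArgs.length - 1]);
FileSystem fs = out.getFileSystem(conf);
if (fs.exists(out)) {
    fs.delete(out, true);                  // recursive delete so checkOutputSpecs passes
}
FileOutputFormat.setOutputPath(job, out);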
The next statement is a bit more involved:
Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
This static call locates the staging directory for this job submission:
/**
* Initializes the staging directory and returns the path. It also keeps
* track of all necessary ownership & permissions
*
* @param cluster
* @param conf
*/
public static Path getStagingDir(Cluster cluster, Configuration conf)
Internally it delegates to cluster.getStagingAreaDir(), which on YARN lands in ResourceMgrDelegate:
public String getStagingAreaDir() throws IOException, InterruptedException {
// Path path = new Path(MRJobConstants.JOB_SUBMIT_DIR);
String user = UserGroupInformation.getCurrentUser().getShortUserName();
Path path = MRApps.getStagingAreaDir(conf, user);
LOG.debug("getStagingAreaDir: dir=" + path);
return path.toString();
}
This is one of the few calls in the YARN path that does not need to talk to the RM: it resolves the staging directory for this job, and getStagingDir then makes sure that directory exists with the right ownership and permissions.
Let's come back to checkSpecs (quoted in full above) and say a little more about it.
First, the configured OutputFormat is instantiated via reflection; for file output, checkOutputSpecs is the one in FileOutputFormat:
public void checkOutputSpecs(JobContext job)
throws FileAlreadyExistsException, IOException {
// Ensure that the output directory is set and not already there
Path outDir = getOutputPath(job);
if (outDir == null) {
throw new InvalidJobConfException("Output directory not set.");
}
// get delegation token for outDir's file system
TokenCache.obtainTokensForNamenodes(job.getCredentials(),
new Path[] { outDir }, job.getConfiguration());
if (outDir.getFileSystem(job.getConfiguration()).exists(outDir)) {
throw new FileAlreadyExistsException("Output directory " + outDir
+ " already exists");
}
}
Suppose the output path we submit is hdfs://host:port/output, and focus on the last check:
outDir.getFileSystem(job.getConfiguration()).exists(outDir)
First, getFileSystem:
/** Return the FileSystem that owns this Path. */
public FileSystem getFileSystem(Configuration conf) throws IOException {
return FileSystem.get(this.toUri(), conf);
}
This method is defined on Path and finds the FileSystem that owns the path, via the static FileSystem.get; toUri() simply turns the Path into a URI, in effect the same path carrying the hdfs scheme.
Inside FileSystem.get(URI, Configuration) there is this branch worth a look:
if (scheme != null && authority == null) { // no authority
URI defaultUri = getDefaultUri(conf);
if (scheme.equals(defaultUri.getScheme()) // if scheme matches
// default
&& defaultUri.getAuthority() != null) { // & default has
// authority
return get(defaultUri, conf); // return default
}
}
It fetches the default file system URI:
/**
* Get the default filesystem URI from a configuration.
*
* @param conf
* the configuration to use
* @return the uri of the default filesystem
*/
public static URI getDefaultUri(Configuration conf) {
return URI.create(fixName(conf.get(FS_DEFAULT_NAME_KEY, DEFAULT_FS)));
}
public static final String FS_DEFAULT_NAME_KEY = "fs.defaultFS";
public static final String FS_DEFAULT_NAME_DEFAULT = "file:///";
So here is another configuration key we rely on all the time: the file system is resolved from fs.defaultFS, which defaults to the local file system (file:///).
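A hedged illustration (the namenode host is made up, and the HDFS client jars are assumed to be on the classpath):
Configuration conf = new Configuration();
// Normally set in core-site.xml; a hypothetical value here.
conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
FileSystem fs = FileSystem.get(conf);
System.out.println(fs.getUri());                  // hdfs://namenode-host:8020
System.out.println(fs.getClass().getName());      // ...hdfs.DistributedFileSystem
Back inside FileSystem.get(URI, Configuration) there is also a cache toggle before a file system is actually constructed: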
if (conf.getBoolean(disableCacheName, false)) {
return createFileSystem(uri, conf);
}
We won't open createFileSystem (the cached path ends up there too); it instantiates the file system for the scheme, HDFS in our configuration, and calls DistributedFileSystem.initialize:
@Override
public void initialize(URI uri, Configuration conf) throws IOException {
super.initialize(uri, conf);
setConf(conf);
String host = uri.getHost();
if (host == null) {
throw new IOException("Incomplete HDFS URI, no host: " + uri);
}
homeDirPrefix = conf.get(DFSConfigKeys.DFS_USER_HOME_DIR_PREFIX_KEY,
DFSConfigKeys.DFS_USER_HOME_DIR_PREFIX_DEFAULT);
this.dfs = new DFSClient(uri, conf, statistics);
this.uri = URI.create(uri.getScheme() + "://" + uri.getAuthority());
this.workingDir = getHomeDirectory();
}
Note that a DFSClient is created here. We won't dig into it; the class comment is enough:
/********************************************************
* DFSClient can connect to a Hadoop Filesystem and
* perform basic file tasks. It uses the ClientProtocol
* to communicate with a NameNode daemon, and connects
* directly to DataNodes to read/write block data.
*
* Hadoop DFS users should obtain an instance of
* DistributedFileSystem, which uses DFSClient to handle
* filesystem tasks.
*
********************************************************/
@InterfaceAudience.Private
public class DFSClient implements java.io.Closeable, RemotePeerFactory,
DataEncryptionKeyFactory
So at this point we have a DFSClient, which talks to the NameNode over ClientProtocol and to the DataNodes directly for block data. In this walkthrough its job is mainly to copy the local job files into the HDFS cluster.
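Conceptually, what the submission path needs from DFS here boils down to "copy these local resources into the staging directory"; a hedged sketch with made-up paths, reusing the conf from the driver:
FileSystem jtFs = FileSystem.get(conf);   // DistributedFileSystem when fs.defaultFS points at HDFS
Path localJar   = new Path("/tmp/wordcount.jar");                                        // hypothetical
Path stagingJar = new Path("/tmp/hadoop-yarn/staging/user/.staging/job_0001/job.jar");   // hypothetical
jtFs.copyFromLocalFile(false, true, localJar, stagingJar);   // keep the source, overwrite the target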
We skip some of the code in between and go straight to this method:
/**
* configure the jobconf of the user with the command line options of
* -libjars, -files, -archives.
*
* @param job
* @throws IOException
*/
private void copyAndConfigureFiles(Job job, Path jobSubmitDir)
throws IOException {
JobResourceUploader rUploader = new JobResourceUploader(jtFs);
rUploader.uploadFiles(job, jobSubmitDir);
// Set the working directory
if (job.getWorkingDirectory() == null) {
job.setWorkingDirectory(jtFs.getWorkingDirectory());
}
}
Before this method runs, the job resources still sit on the local machine; JobResourceUploader is what copies them up to HDFS. The method is long, so we pick out a few parts:
short replication = (short) conf.getInt(Job.SUBMIT_REPLICATION,
Job.DEFAULT_SUBMIT_REPLICATION);
public static final int DEFAULT_SUBMIT_REPLICATION = 10;
By default the submitted files are written with a replication factor of 10, to avoid a hotspot where every task tries to pull them from a handful of nodes at startup:
String files = conf.get("tmpfiles");
String libjars = conf.get("tmpjars");
String archives = conf.get("tmparchives");
String jobJar = job.getJar();
The resources fall into four groups: the -files entries (tmpfiles), the -libjars entries (tmpjars), the -archives entries (tmparchives), and the job jar itself.
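The rough programmatic counterparts on Job are sketched below, hedged because unlike the command-line flags they expect paths already visible to the cluster (the URIs are made up):
job.addCacheFile(new java.net.URI("hdfs:///share/lookup.txt"));       // roughly -files
job.addFileToClassPath(new Path("/share/third-party.jar"));           // roughly -libjars
job.addCacheArchive(new java.net.URI("hdfs:///share/dict.zip"));      // roughly -archives
Next comes the computation of the input splits: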
int maps = writeSplits(job, submitJobDir);
This single line determines how the input is split into map tasks; with the new API it leads to writeNewSplits:
@SuppressWarnings("unchecked")
private <T extends InputSplit> int writeNewSplits(JobContext job,
Path jobSubmitDir) throws IOException, InterruptedException,
ClassNotFoundException {
Configuration conf = job.getConfiguration();
InputFormat<?, ?> input = ReflectionUtils.newInstance(
job.getInputFormatClass(), conf);
List<InputSplit> splits = input.getSplits(job);
T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new SplitComparator());
JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
jobSubmitDir.getFileSystem(conf), array);
return array.length;
}
Because we use the new API, we end up in the method above. We won't chase the file-system-heavy details; it is enough to know that FileInputFormat carves the input up roughly along HDFS block boundaries.
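If the block-based default is not what you want, the driver can nudge the split size; a hedged example with arbitrary values (FileInputFormat here is the new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat already used by the driver):
// Effective split size is roughly max(minSize, min(maxSize, blockSize)).
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);    // 64 MB lower bound
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);   // 256 MB upper bound
With the split files written into the staging directory, the very last step is the submission itself: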
status = submitClient.submitJob(jobId, submitJobDir.toString(),
job.getCredentials());
At the very end the job is handed over to the RM: jobId is the ID obtained earlier from the cluster (via getNewJobID), submitJobDir is the staging directory to which all the job files have already been copied, and credentials carries the various security tokens. To wrap up the whole process:
- Two connections to the Hadoop cluster are involved: the RPC connection held by YARNRunner, used to talk to the RM, and the streaming interface held by DFSClient, used for file interaction with the NameNode (and DataNodes).
- Before the ApplicationMaster is actually launched, the client first asks the cluster for a globally unique ApplicationId; then the local files the program needs are shipped to HDFS, metadata going over DFSClient's RPC connection to the NameNode and the bytes themselves as streams, because the files are not small and pushing them over the RM's RPC connection would be impractical.