Untangling Apache Hadoop YARN, Part 1: Cluster and YARN Basics

In this multipart series, fully explore the tangled ball of thread that is YARN.

YARN (Yet Another Resource Negotiator) is the resource management layer for the Apache Hadoop ecosystem. YARN has been available for several releases, but many users still have fundamental questions about what YARN is, what it’s for, and how it works. This new series of blog posts is designed with the following goals in mind:

  • Provide a basic understanding of the components that make up YARN
  • Illustrate how a MapReduce job fits into the YARN model of computation. (Note: although Apache Spark integrates with YARN as well, this series will focus on MapReduce specifically. For information about Spark on YARN, see this post.)
  • Present an overview of how the YARN scheduler works and provide building-block examples for scheduler configuration

The series comprises the following parts:

  • Part 1: Cluster and YARN basics
  • Part 2: Global configuration basics
  • Part 3: Scheduler concepts
  • Part 4: FairScheduler queue basics
  • Part 5: Using FairScheduler queue properties

In this initial post, we’ll cover the fundamentals of YARN, which runs processes on a cluster similarly to the way an operating system runs processes on a standalone computer. Subsequent parts will be released every few weeks.

Cluster Basics (Master/Worker)

A host is the Hadoop term for a computer (also called a node, in YARN terminology). A cluster is two or more hosts connected by a high-speed local network. From the standpoint of Hadoop, a cluster can contain several thousand hosts.

In Hadoop, there are two types of hosts in the cluster.

Figure 1: Master host and Worker hosts

Conceptually, a master host is the communication point for a client program. A master host sends the work to the rest of the cluster, which consists of worker hosts. (In Hadoop, a cluster can technically be a single host. Such a setup is typically used for debugging or simple testing, and is not recommended for a typical Hadoop workload.)

YARN Cluster Basics (Master/ResourceManager, Worker/NodeManager)

In a YARN cluster, there are two types of hosts:

  • The ResourceManager is the master daemon that communicates with the client, tracks resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
  • A NodeManager is a worker daemon that launches and tracks processes spawned on worker hosts.

Figure 2: Master host with ResourceManager and Worker hosts with NodeManager

YARN Configuration File

The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties in this file used to configure YARN are covered in later posts in this series.
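As a minimal sketch, a bare-bones yarn-site.xml that points NodeManagers and clients at the ResourceManager might look like the following. The hostname is a placeholder, not a recommendation:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hostname of the ResourceManager; every NodeManager and client
       reads this to find the master. "rm-host.example.com" is a
       placeholder for your actual master host. -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>
  </property>
</configuration>
```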

YARN Requires a Global View

YARN currently defines two resources: vcores and memory. Each NodeManager tracks its own local resources and communicates its resource configuration to the ResourceManager, which keeps a running total of the cluster’s available resources. By keeping track of the total, the ResourceManager knows how to allocate resources as they are requested. (Vcore has a special meaning in YARN. You can think of it simply as a “usage share of a CPU core.” If you expect your tasks to be less CPU-intensive (sometimes called I/O-intensive), you can set the ratio of vcores to physical cores higher than 1:1 to maximize your use of hardware resources.)
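To make the vcore-to-physical-core ratio concrete, here is an illustrative snippet for a worker host’s yarn-site.xml. The numbers assume a hypothetical host with 8 physical cores and 32 GB of RAM running mostly I/O-intensive tasks; they are examples, not tuning recommendations:

```xml
<!-- Resources this NodeManager advertises to the ResourceManager. -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <!-- 16 vcores on 8 physical cores: a 2:1 ratio for I/O-heavy work. -->
  <value>16</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- 24 GB for containers, leaving the rest for the OS and daemons. -->
  <value>24576</value>
</property>
```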

Figure 3: ResourceManager global view of the cluster

Containers

Containers are an important YARN concept. You can think of a container as a request to hold resources on the YARN cluster. Currently, a container hold request consists of vcores and memory, as shown in Figure 4 (left).

Figure 4: Container as a hold (left), and container as a running process (right)

Once a hold has been granted on a host, the NodeManager launches a process called a task. The right side of Figure 4 shows the task running as a process inside a container. (Part 3 will cover, in more detail, how YARN schedules a container on a particular host.)
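To make the “hold” concrete, here is a minimal Java sketch of a container request using Hadoop’s AMRMClient API (the client that an application’s coordinating process, the ApplicationMaster introduced in the next section, uses to talk to the ResourceManager). The container size (1024 MB, 1 vcore) and the empty registration arguments are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerHoldSketch {
  public static void main(String[] args) throws Exception {
    // Client an ApplicationMaster uses to talk to the ResourceManager.
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new Configuration());
    rm.start();
    rm.registerApplicationMaster("", 0, ""); // host/port/URL omitted in sketch

    // A container hold request is a resource capability:
    // memory in megabytes plus a number of vcores.
    Resource capability = Resource.newInstance(1024, 1); // 1 GB, 1 vcore

    // Ask the ResourceManager to find room somewhere on the cluster
    // (no host or rack constraints here; Part 3 covers placement).
    rm.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));
  }
}
```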

YARN Cluster Basics (Running Process/ApplicationMaster)

For the next section, two new YARN terms need to be defined:

  • An application is a YARN client program that is made up of one or more tasks (see Figure 5).
  • For each running application, a special piece of code called an ApplicationMaster helps coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run after the application starts.

An application running tasks on a YARN cluster involves the following steps (a minimal client-side sketch follows the list):

  1. The application starts and talks to the ResourceManager for the cluster:

    Figure 5: Application starting up before tasks are assigned to the cluster

  2. The ResourceManager allocates a single container on behalf of the application:

    Figure 6: Application + allocated container on a cluster

  3. The ApplicationMaster starts running within that container:

    Figure 7: Application + ApplicationMaster running in the container on the cluster

  4. The ApplicationMaster requests subsequent containers from the ResourceManager that are allocated to run tasks for the application. Those tasks do most of their status communication with the ApplicationMaster (allocated in Step 3):

    Figure 8: Application + ApplicationMaster + task running in multiple containers running on the cluster

  5. Once all tasks are finished, the ApplicationMaster exits. The last container is de-allocated from the cluster.
  6. The application client exits. (The ApplicationMaster launched in a container is more specifically called a managed AM. Unmanaged ApplicationMasters run outside of YARN’s control. Llama is an example of an unmanaged AM.)
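For a feel of what step 1 looks like in code, here is a minimal client-side sketch using Hadoop’s Java YarnClient API. The application name, the AM launch command (./my-am.sh), and the container size are placeholder assumptions, not a complete working application:

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    // Step 1: the application client talks to the ResourceManager.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("hello-yarn"); // placeholder name

    // Steps 2-3: describe the single container the ResourceManager will
    // allocate for the ApplicationMaster, and the command that starts it.
    // "./my-am.sh" is a hypothetical launcher script.
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        Collections.emptyMap(),                  // local resources (jars, etc.)
        Collections.emptyMap(),                  // environment variables
        Collections.singletonList("./my-am.sh"), // command launching the AM
        null, null, null);
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(1024, 1)); // AM container size

    ApplicationId appId = yarnClient.submitApplication(ctx);
    System.out.println("Submitted application " + appId);
  }
}
```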

MapReduce Basics

In the MapReduce paradigm, an application consists of Map tasks and Reduce tasks. Map tasks and Reduce tasks align very cleanly with YARN tasks.


Figure 9: Application + Map tasks + Reduce tasks
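As a reminder of what those tasks actually execute, here is the canonical WordCount mapper and reducer from the Hadoop MapReduce tutorial, lightly commented. In YARN terms, each map task and each reduce task in Figure 9 runs code of this shape inside its own container:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Each map task runs map() over every record in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1)
      }
    }
  }

  // Each reduce task runs reduce() once per word, summing the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result); // emit (word, total count)
    }
  }
}
```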

Putting it Together: MapReduce and YARN

Figure 10 illustrates how the map tasks and the reduce tasks map cleanly to the YARN concept of tasks running in a cluster.

Figure 10: Merged MapReduce/YARN Application Running on a Cluster

In a MapReduce application, there are multiple map tasks, each running in a container on a worker host somewhere in the cluster. Similarly, there are multiple reduce tasks, also each running in a container on a worker host.

Simultaneously on the YARN side, the ResourceManager, NodeManager, and ApplicationMaster work together to manage the cluster’s resources and ensure that the tasks, as well as the corresponding application, finish cleanly.

Conclusion

Summarizing the important concepts presented in this section:

  1. A cluster is made up of two or more hosts connected by an internal high-speed network. Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster.
  2. In a cluster with YARN running, the master process is called the ResourceManager and the worker processes are called NodeManagers.
  3. The configuration file for YARN is named yarn-site.xml. There is a copy on each host in the cluster. It is required by the ResourceManager and NodeManager to run properly. YARN keeps track of two resources on the cluster, vcores and memory. The NodeManager on each host keeps track of the local host’s resources, and the ResourceManager keeps track of the cluster’s total.
  4. A container in YARN holds resources on the cluster. YARN determines which host in the cluster has room for the size of the hold requested. Once the container is allocated, its resources are usable by the process running in it.
  5. An application in YARN comprises three parts:
    1. The application client, which is how a program is run on the cluster.
    2. An ApplicationMaster, which provides YARN with the ability to perform allocation on behalf of the application.
    3. One or more tasks that do the actual work, each running as a process in a container allocated by YARN.
  6. A MapReduce application consists of map tasks and reduce tasks.
  7. A MapReduce application running in a YARN cluster looks very much like the MapReduce application paradigm, but with the addition of an ApplicationMaster as a YARN requirement.

Next Time…

Part 2 will cover calculating YARN properties for cluster configuration.

Ray Chiang is a Software Engineer at Cloudera.

Dennis Dawson is a Senior Technical Writer at Cloudera.
