Configuring the Spring YARN framework for developing Apache Hadoop YARN applications is really about configuring Spring Application Contexts rather than configuring YARN itself.
This means developers should focus on defining the context for three key components: YarnClient, YarnAppMaster, and YarnContainer.
<!-- Example of defining beans in a Spring XML configuration file -->
<beans>
    <!-- YarnClient is abstract, so it is created via its static factory method -->
    <bean id="yarnClient" class="org.apache.hadoop.yarn.client.api.YarnClient"
          factory-method="createYarnClient">
        <!-- Configuration properties go here -->
    </bean>
    <bean id="appMaster" class="com.example.MyApplicationMaster">
        <!-- Configuration properties specific to the application master -->
    </bean>
    <bean id="container" class="com.example.MyContainer">
        <!-- Container-specific configurations -->
    </bean>
</beans>
Each of the objects above represents a different aspect of interacting with YARN, and their lifecycles and relationships can be managed through standard Spring mechanisms such as dependency injection. This approach simplifies building complex distributed applications while letting programmers take advantage of Spring's rich feature set.
Spring Boot can indeed simplify the development of many kinds of Java applications, including ones that integrate with Apache YARN. However, the material given here does not directly explain how to use Spring Boot to streamline building YARN applications specifically.
Even so, one can infer how Spring Boot's features and strengths provide indirect support:
- Use the auto-configuration feature to cut down on manual setup, so that even YARN applications in complex environments get up and running faster. For example, in application.properties:
# Example of configuring YARN client connection settings
spring.application.name=my-yarn-app
yarn.resourcemanager.address=localhost:8032
- Use @SpringBootApplication to simplify the main class definition and scan components for registration in the IoC container, so developers can focus on business logic instead of infrastructure code;
- If RESTful API interaction is involved, @RestController makes it quick to implement microservice-style endpoints.
For YARN-specific functionality or API calls, you may still need libraries and supporting tools from the Hadoop ecosystem to carry out the corresponding operations.
Spring Boot cannot completely replace YARN's own management tools, but it can enhance them.
Spring Boot itself is not a framework meant to replace Hadoop's built-in YARN management tools. However, applications built on Spring Boot can integrate technology-stack components such as Spring for Hadoop to provide a simpler, friendlier way to operate on and interact with the various parts of the Hadoop ecosystem. This means that while it cannot directly replace the existing YARN management UI or command-line tools, it can certainly be used to build more user-friendly front-end applications or API services that help manage and monitor YARN cluster workloads, without affecting the core scheduling logic or service stability.
Concrete resource configuration and connection settings still rely on the definitions in the standard Hadoop configuration files: for example, fs.defaultFS points at the default filesystem address, while the yarn.resourcemanager.* family of parameters specifies the ResourceManager's service endpoints. Nothing changes in this respect; network communication and resource allocation continue to use the mechanisms that native Hadoop provides.
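As a reference, those settings live in Hadoop's own configuration files; a typical fragment (host names are placeholders) looks like:

```xml
<!-- core-site.xml: default filesystem (host/port are placeholders) -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
</property>

<!-- yarn-site.xml: ResourceManager address -->
<property>
    <name>yarn.resourcemanager.address</name>
    <value>rm-host:8032</value>
</property>
```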
In summary, Spring Boot is not intended to become another YARN management tool; rather, thanks to its flexibility and extensibility, it can add practical features on top of the existing infrastructure to meet the needs of different business scenarios.
Integrating with and invoking Hadoop MapReduce jobs from a Spring Boot application can be done in the following ways:
Create a REST endpoint that triggers the MapReduce job
// The controller class defines a REST endpoint at "/run";
// hitting this endpoint calls the Hadoop service class to start the MapReduce job.
@RestController
public class HadoopController {

    @Autowired
    private HadoopService hadoopService;

    @GetMapping("/run")
    public String runJob() {
        return hadoopService.execute();
    }
}
Configure a YML file for the HDFS connection and other settings
Make sure the project contains a correct application.yml file so the application can communicate with the Hadoop cluster and upload the necessary input data.
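The exact keys depend on which Hadoop integration you use. As a hypothetical sketch (assuming Spring for Apache Hadoop's Boot auto-configuration, whose property names vary by version), application.yml might look like:

```yaml
# Hypothetical sketch -- property names assume Spring for Apache Hadoop's
# Boot support and may differ in your version; host names are placeholders.
spring:
  hadoop:
    fsUri: hdfs://namenode-host:8020
    resourceManagerHost: rm-host
    resourceManagerPort: 8032
```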
Write dedicated service-layer logic
Design a service component like the one below that is responsible for the actual MapReduce job submission workflow; this may involve reading source text from the local or a distributed file system and then submitting a job request to the YARN ResourceManager, among other steps:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class HadoopService {

    private static final Logger log = LoggerFactory.getLogger(HadoopService.class);

    // Assume any initialization code lives here...
    public String execute() {
        try {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            // A real job would also set the jar and the mapper/reducer and
            // output key/value classes here, e.g. job.setMapperClass(...).
            FileInputFormat.addInputPath(job, new Path("input"));
            FileOutputFormat.setOutputPath(job, new Path("output"));
            boolean success = job.waitForCompletion(true);
            if (success) {
                return "Success";
            } else {
                throw new RuntimeException("Failed to complete.");
            }
        } catch (Exception e) {
            log.error(e.getMessage(), e);
            throw new RuntimeException("Error during execution", e);
        }
    }
}
The final step is to make sure all required dependencies and build plugins have been added to the project's Maven or Gradle build script, so that the necessary libraries can be downloaded.
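For example, with Maven the Hadoop client libraries could be declared as follows (the version is a placeholder; match it to your cluster's version):

```xml
<!-- pom.xml: Hadoop client libraries; the version is a placeholder -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.7</version>
</dependency>
```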
We’re super excited to let the cat out of the bag and release support for writing YARN based applications as part of the Spring for Apache Hadoop 2.0 M1 release. In this blog post I’ll introduce you to YARN, what you can do with it, and how Spring simplifies the development of YARN based applications.
If you have been following the Hadoop community over the past year or two, you’ve probably seen a lot of discussions around YARN and the next version of Hadoop’s MapReduce called MapReduce v2. YARN (Yet Another Resource Negotiator) is a component of the MapReduce project created to overcome some performance issues in Hadoop’s original design. The fundamental idea of MapReduce v2 is to split the functionalities of the JobTracker, Resource Management and Job Scheduling/Monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and a per-application Application Master (AM). A generic diagram for YARN component dependencies can be found from YARN architecture.
MapReduce Version 2 is an application running on top of YARN. It is also possible to write custom YARN based applications which have nothing to do with MapReduce; they are simply applications running on YARN. However, writing a custom YARN based application is difficult. The YARN APIs are low-level infrastructure APIs, not developer APIs. Take a look at the documentation for writing a YARN application to get an idea of what is involved.
Starting with the 2.0 version, Spring for Apache Hadoop introduces the Spring YARN sub-project to provide support for building Spring based YARN applications. This support for YARN steps in by trying to make development easier. “Spring handles the infrastructure so you can focus on your application” applies to writing Hadoop applications as well as other types of Java applications. Spring’s YARN support also makes it easier to test your YARN application.
With Spring's YARN support, you will use all the familiar concepts of the Spring Framework itself, including its configuration model and, generally speaking, everything you can already do in a Spring application. At a high level, Spring YARN provides three different components, YarnClient, YarnAppmaster and YarnContainer, which together can be called a Spring YARN Application. We provide default implementations for all components while still giving the end user the option to customize as much as he or she wants. Let's take a quick look at a very simplistic Spring YARN application which runs some custom code in a Hadoop cluster.
The YarnClient is used to communicate with YARN's Resource Manager. This provides management actions like submitting new application instances, listing applications and killing running applications. When submitting applications from the YarnClient, the main concerns relate to how the Application Master is configured and launched. Both the YarnAppmaster and YarnContainer share the same common launch context config logic, so you'll see a lot of similarities in the YarnClient and YarnAppmaster configuration. Similar to how the YarnClient defines the launch context for the YarnAppmaster, the YarnAppmaster defines the launch context for the YarnContainer. The launch context defines the commands to start the container, localized files, command line parameters, environment variables and resource limits (memory, CPU).
The YarnContainer is a worker that does the heavy lifting of what a YARN application will actually do. The YarnAppmaster is communicating with YARN Resource Manager and starts and stops YarnContainers accordingly.
You can create a Spring application that launches an ApplicationMaster by using the YARN XML namespace to define a Spring Application Context. Context configuration for YarnClient defines the launch context for YarnAppmaster. This includes resources and libraries needed by YarnAppmaster and its environment settings. An example of this is shown below.
<yarn:configuration />
<yarn:localresources>
    <yarn:hdfs path="/path/to/my/libs/*.jar"/>
</yarn:localresources>
<yarn:environment>
    <yarn:classpath/>
</yarn:environment>
<yarn:client app-name="my-yarn-app">
    <yarn:master-runner />
</yarn:client>
Note: Future releases will provide a Java based API for configuration, similar to what is done in Spring Security 3.2.
The purpose of YarnAppmaster is to control the instance of a running application. The YarnAppmaster is responsible for controlling the lifecycle of all its YarnContainers, the whole running application after the application is submitted, as well as itself.
<yarn:configuration />
<yarn:localresources>
    <yarn:hdfs path="/path/to/my/libs/*.jar"/>
</yarn:localresources>
<yarn:environment>
    <yarn:classpath/>
</yarn:environment>
<yarn:master>
    <yarn:container-allocator/>
    <yarn:container-runner/>
</yarn:master>
The example above defines a context configuration for the YarnAppmaster. Similar to what we saw in the YarnClient configuration, we define local resources for the YarnContainer and its environment. The classpath setting picks up Hadoop jars as well as your own application jars in default locations; change the setting if you want to use non-default directories. Also within the YarnAppmaster we define the components handling container allocation and bootstrapping. The allocator component interacts with the YARN resource manager, handling resource scheduling. The runner component is responsible for bootstrapping the allocated containers.
<yarn:container container-class="org.example.MyCustomYarnContainer"/>
The example above defines a simple YarnContainer context configuration.
To implement the functionality of the container, you implement the interface YarnContainer. The YarnContainer interface is similar to Java's Runnable interface: it has a run() method, as well as two additional methods for receiving environment and command line information.
Below is a simple hello world application that will be run inside of a YARN container:
public class MyCustomYarnContainer implements YarnContainer {

    private static final Log log = LogFactory.getLog(MyCustomYarnContainer.class);

    @Override
    public void run() {
        log.info("Hello from MyCustomYarnContainer");
    }

    @Override
    public void setEnvironment(Map<String, String> environment) {}

    @Override
    public void setParameters(Properties parameters) {}
}
We just showed the configuration of a Spring YARN Application and the core application logic so what remains is how to bootstrap the application to run inside the Hadoop cluster. The utility class, CommandLineClientRunner provides this functionality.
You can use CommandLineClientRunner either manually from a command line or from your own code.
java -cp <classpath> org.springframework.yarn.client.CommandLineClientRunner application-context.xml yarnClient -submit
A Spring YARN Application is packaged into a jar file which can then be transferred into HDFS with the rest of the dependencies. A YarnClient can transfer all needed libraries into HDFS during the application submit process, but generally speaking it is more advisable to do this manually in order to avoid unnecessary network I/O. Your application won't change until a new version is created, so it can be copied into HDFS prior to the first application submit, for example with Hadoop's hdfs dfs -copyFromLocal command.
Below you can see an example of a typical project setup.
src/main/java/org/example/MyCustomYarnContainer.java
src/main/resources/application-context.xml
src/main/resources/appmaster-context.xml
src/main/resources/container-context.xml
As a wild guess, we’ll make a bet that you have now figured out that you are not actually configuring YARN, instead you are configuring Spring Application Contexts for all three components, YarnClient, YarnAppmaster and YarnContainer.
We have just scratched the surface of what we can do with Spring YARN. While we’re preparing more blog posts, go ahead and check existing samples in GitHub. Basically, to reflect the concepts we described in this blog post, see the multi-context example in our samples repository.
Future blog posts will cover topics like Unit Testing and more advanced YARN application development.
The referenced material does not directly cover the specifics of how a Spring Boot application can monitor the progress of a running Hadoop job. However, an indirect approach can be constructed from what is available:
- Spring Boot's support for big-data components and tools, such as the Apache Spark and GraphX integration examples, shows that it can be combined with other big-data technologies. For example, a simple Spark job listener configuration class:
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerStageCompleted;
import org.apache.spark.scheduler.SparkListenerTaskStart;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SparkJobListenerConfig {

    @Bean
    public MyCustomSparkListener myCustomSparkListener() {
        return new MyCustomSparkListener();
    }

    static class MyCustomSparkListener extends SparkListener {
        @Override
        public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
            System.out.println("Stage completed: " + stageCompleted.stageInfo().stageId());
        }

        @Override
        public void onTaskStart(SparkListenerTaskStart taskStart) {
            System.out.println("Task started in Stage: " + taskStart.stageId());
        }
        // More overrides...
    }
}
- Building on that technology stack, in a real scenario you could create custom listeners that capture state changes of Hadoop MapReduce tasks and report them to a web application UI;
- For MR jobs running on YARN, you can call the ResourceManager REST API to retrieve information about all applications on the cluster, then parse the response to filter out a given user's active jobs and their progress.
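The filtering step in the last bullet can be sketched without any Hadoop dependency. The class below is only an illustration: AppInfo is a stand-in for one entry of the ResourceManager's /ws/v1/cluster/apps JSON response, and its field names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class AppFilter {

    // Stand-in for one application entry returned by the RM REST API;
    // field names mirror the JSON payload but are assumptions here.
    public static class AppInfo {
        public final String id;
        public final String user;
        public final String state;

        public AppInfo(String id, String user, String state) {
            this.id = id;
            this.user = user;
            this.state = state;
        }
    }

    // Keep only RUNNING applications owned by the given user.
    public static List<AppInfo> runningAppsOf(List<AppInfo> apps, String user) {
        List<AppInfo> result = new ArrayList<>();
        for (AppInfo a : apps) {
            if (user.equals(a.user) && "RUNNING".equals(a.state)) {
                result.add(a);
            }
        }
        return result;
    }
}
```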
The information provided also does not directly explain how to implement real-time tracking of YARN resource usage in a Spring Boot application. In the Hadoop ecosystem, YARN is the resource manager that allocates cluster resources to different tasks; monitoring the resources consumed by YARN-managed tasks or containers from a Spring Boot perspective typically means interacting with the YARN REST APIs.
To integrate this kind of functionality into a Spring Boot application, you could consider:
Building a custom service that calls the RESTful API exposed by the YARN ResourceManager to query details of active applications and the resource usage on each node
// Create a RestTemplate instance for issuing HTTP requests
RestTemplate restTemplate = new RestTemplate();
// Build the YARN RM web services URL
String url = "http://<resourcemanager-host>:8088/ws/v1/cluster/metrics";
// Send a GET request and parse the JSON response into a ClusterMetricsInfo object
ClusterMetricsInfo metrics =
        restTemplate.getForObject(url, ClusterMetricsInfo.class);
The snippet above is only a conceptual example of connecting to the YARN ResourceManager from Java/Spring to collect cluster-wide metrics; a real implementation needs more elaborate logic for the concrete scenario, including error handling and authentication/authorization.
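As a concrete illustration of what to do with the fetched metrics, the helper below turns them into a memory utilization percentage. It is a minimal sketch: the allocatedMB/totalMB inputs merely mirror fields in the RM metrics payload, and the class itself is not part of any Hadoop API.

```java
public class YarnUtilization {

    // Percentage of cluster memory currently allocated, guarding against
    // a zero total (e.g. when no NodeManagers have registered yet).
    public static double memoryUtilizationPercent(long allocatedMB, long totalMB) {
        if (totalMB <= 0) {
            return 0.0;
        }
        return 100.0 * allocatedMB / totalMB;
    }
}
```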
To show how to integrate and use Spark Streaming in a Spring Boot project for real-time data processing and stream computation, the following points provide guidance:
- Add the dependency
To integrate Spark Streaming into a Spring Boot project, first add the corresponding Maven dependency to the pom.xml file to bring in the required library:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.8</version>
</dependency>
- Initialize the Spark environment
Create the required configuration instance (such as SparkConf) and the context object that defines input sources and transformation operations (JavaStreamingContext), completing this initialization inside the application.
- Implement the business logic and service layer
Once the service class and its method signatures are defined, write the concrete processing flow; this is usually placed in a dedicated service bean to keep the code well organized. When everything is ready, the bean can be injected via autowiring into the application main class or wherever it is called.
- Start the job
The final step is to trigger the whole process during application startup, for example from a method annotated with @PostConstruct that calls the corresponding function on the service bean to start the actual streaming job:
@SpringBootApplication
public class YourApplication {
@Autowired
private SparkStreamingService sparkStreamingService;
public static void main(String[] args) {
SpringApplication.run(YourApplication.class, args);
}
@PostConstruct
public void startSparkStreaming() {
sparkStreamingService.processStream();
}
}


This article showed how to use the Spring framework to simplify YARN application development, covering configuration, the main components, and the application logic.