Big Data Tutorial SD Edition, Part 5: Hadoop Yarn

Article link: https://blog.youkuaiyun.com/qq_41200768/article/details/121796729

5. Hadoop Yarn

Yarn is a resource scheduling platform: it allocates resources to jobs while they run

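5.1 Yarn Architecture

ResourceManager

The master for the resources of the whole cluster

NodeManager

The master for the resources of a single node

ApplicationMaster

The master that monitors a single application

Container

An abstract representation of resources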

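5.2 How Yarn Works

1. The client asks the RM to run an application
2. The RM returns a path to which the job resources should be uploaded
3. The client uploads the resources (xxx.split, job.xml, xxx.jar) to that path
4. Once the upload succeeds, the client asks the RM to start the mrAppMaster
5. The RM places the jobs submitted by each client into the scheduling queue
6. When resources become free, the RM assigns an NM to run the mrAppMaster
7. The NM first creates a Container
8. The NM downloads the job resources to local storage
9. The mrAppMaster asks the RM for MapTask containers
10. The RM assigns NMs to run the tasks, and the containers are created on those NMs
11. The mrAppMaster sends the launch scripts to the NMs hosting the MapTasks
12. After the MapTasks finish, the mrAppMaster asks the RM for ReduceTask containers
13. Each ReduceTask fetches the data of its partition from the MapTasks and processes it
14. Once the program finishes, the MR application deregisters itself from the RM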

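5.3 Yarn Schedulers

There are three commonly used schedulers: FIFO, the Capacity Scheduler, and the Fair Scheduler.

Apache Hadoop's Yarn uses the Capacity Scheduler by default

CDH uses the Fair Scheduler by default

FIFO Scheduler

A single queue, first come, first served
Rarely used in practice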

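Capacity Scheduler

Features
- Multiple queues; each queue schedules FIFO, but several jobs in the same queue can run at the same time
- Each queue has a capacity guarantee (a guaranteed minimum and an upper limit)
- Flexible: a queue can borrow idle resources from other queues
- Multi-tenant: the amount of resources each user may take can be limited
Resource allocation algorithm
- Queue level: a depth-first search that favors the queue with the lowest resource utilization
- Job level: ordered by job priority, then by submission time
- Container level: ordered by container priority, then by data locality (the node closest to the data first)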
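
The queue-level and job-level rules above can be sketched in plain Java. This is only a simplified model under stated assumptions, not Hadoop's CapacityScheduler code: the class names (CapacitySketch, LeafQueue, Job) are hypothetical, the queue tree is flattened to leaf queues instead of being searched depth-first, resources are a single number, and a larger priority value is assumed to mean a more urgent job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified model; these are not Hadoop classes.
class CapacitySketch {

    // A submitted job: larger priority value = more urgent (assumption).
    static class Job {
        final String id;
        final int priority;
        final long submitTime;

        Job(String id, int priority, long submitTime) {
            this.id = id;
            this.priority = priority;
            this.submitTime = submitTime;
        }
    }

    // A leaf queue with its configured capacity and current usage (same unit).
    static class LeafQueue {
        final String name;
        final double capacity;
        final double used;
        final List<Job> jobs = new ArrayList<>();

        LeafQueue(String name, double capacity, double used) {
            this.name = name;
            this.capacity = capacity;
            this.used = used;
        }

        double utilization() {
            return used / capacity;
        }
    }

    // Queue level: the queue with the lowest utilization is served first.
    static LeafQueue pickQueue(List<LeafQueue> queues) {
        return queues.stream()
                .min(Comparator.comparingDouble(LeafQueue::utilization))
                .orElseThrow(() -> new IllegalArgumentException("no queues"));
    }

    // Job level: highest priority first, then earliest submission time.
    static Job pickJob(LeafQueue queue) {
        return queue.jobs.stream()
                .min(Comparator.comparingInt((Job j) -> -j.priority)
                        .thenComparingLong(j -> j.submitTime))
                .orElseThrow(() -> new IllegalStateException("queue is empty"));
    }

    public static void main(String[] args) {
        LeafQueue etl = new LeafQueue("etl", 40, 30);     // 75% utilized
        LeafQueue adhoc = new LeafQueue("adhoc", 60, 15); // 25% utilized
        adhoc.jobs.add(new Job("job_a", 1, 100L));
        adhoc.jobs.add(new Job("job_b", 5, 200L));
        adhoc.jobs.add(new Job("job_c", 5, 150L));

        LeafQueue next = pickQueue(Arrays.asList(etl, adhoc));
        System.out.println(next.name + " -> " + pickJob(next).id);
    }
}
```

In this made-up data the adhoc queue is picked because it is the least utilized, and job_c is picked because it shares the highest priority with job_b but was submitted earlier.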

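Fair Scheduler

Features
- Multiple queues, capacity guarantees, flexibility, and multi-tenancy, as in the Capacity Scheduler
- Resources go first to whoever has the largest resource deficit ratio, so that allocation ends up fair
- Each queue can set its own allocation policy (FIFO, FAIR [default], DRF)
Resource allocation flow (the same three levels as the Capacity Scheduler)
- Starvation first [minimum share, whether starved, resource allocation ratio, resource usage-to-weight ratio]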
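
One way to read the bracketed criteria is as a comparator: a starved candidate (usage below its minimum share) beats a non-starved one; two starved candidates are ordered by usage divided by minimum share; two non-starved candidates are ordered by usage divided by weight. The sketch below follows that reading for a single resource. It is an illustration under those assumptions, and the names FairSketch and Schedulable are hypothetical rather than taken from Hadoop.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified single-resource model of "starvation first".
class FairSketch {

    static class Schedulable {
        final String name;
        final double minShare; // guaranteed minimum share
        final double usage;    // resources currently in use
        final double weight;   // relative weight, must be > 0

        Schedulable(String name, double minShare, double usage, double weight) {
            this.name = name;
            this.minShare = minShare;
            this.usage = usage;
            this.weight = weight;
        }

        // Starved: using less than the guaranteed minimum.
        boolean starved() {
            return usage < minShare;
        }

        // Resource allocation ratio: how much of the minimum share is in use.
        double minShareRatio() {
            return usage / Math.max(minShare, 1.0);
        }

        // Usage-to-weight ratio: usage normalized by weight.
        double useToWeightRatio() {
            return usage / weight;
        }
    }

    // Smaller compares first, so the stream minimum is the next to be served.
    static final Comparator<Schedulable> STARVATION_FIRST = (a, b) -> {
        boolean aStarved = a.starved();
        boolean bStarved = b.starved();
        if (aStarved && !bStarved) {
            return -1;
        }
        if (!aStarved && bStarved) {
            return 1;
        }
        if (aStarved) {
            // Both starved: the one further below its minimum share goes first.
            return Double.compare(a.minShareRatio(), b.minShareRatio());
        }
        // Neither starved: the one using less per unit of weight goes first.
        return Double.compare(a.useToWeightRatio(), b.useToWeightRatio());
    };

    static Schedulable next(List<Schedulable> candidates) {
        return candidates.stream()
                .min(STARVATION_FIRST)
                .orElseThrow(() -> new IllegalArgumentException("no candidates"));
    }

    public static void main(String[] args) {
        List<Schedulable> candidates = Arrays.asList(
                new Schedulable("a", 10, 12, 1),  // not starved
                new Schedulable("b", 10, 4, 1),   // starved, ratio 0.4
                new Schedulable("c", 20, 4, 1));  // starved, ratio 0.2
        System.out.println("next: " + next(candidates).name);
    }
}
```

A real scheduler also breaks ties (for example by submission time) and handles several resource types; both are left out here to keep the comparison rule visible.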

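5.4 Common Yarn Commands

View applications and logs

# List all applications
yarn application -list
# List applications, filtered by state
yarn application -list -appStates FINISHED
# Kill an application
yarn application -kill application_xxx
# View the logs of an application
yarn logs -applicationId application_xxx
# List the attempts of an application; the output shows the container IDs
yarn applicationattempt -list application_xxx
# View the logs of one container of an application
yarn logs -applicationId application_xxx -containerId container_xxx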

View running containers

# -list takes the ID of an application attempt
yarn container -list appattempt_xxx
yarn container -status container_xxx

View cluster nodes

yarn node -list -all

Reload the queue configuration

yarn rmadmin -refreshQueues

View queue information

yarn queue -status default

Change the priority of an application

yarn application -appId application_xxx -updatePriority X

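5.5 Core Yarn Configuration Parameters

Tune these to match your workload

ResourceManager

Parameter	Description and default
yarn.resourcemanager.scheduler.class	Scheduler implementation; default is CapacityScheduler
yarn.resourcemanager.scheduler.client.thread-count	Number of RM threads that handle scheduler requests; default 50

NodeManager

Parameter	Description and default
yarn.nodemanager.resource.memory-mb	Memory the NM may use; default 8 GB
yarn.nodemanager.resource.cpu-vcores	CPU cores the NM may use; default 8

Container

Parameter	Description and default
yarn.scheduler.minimum-allocation-mb	Minimum memory per container; default 1 GB
yarn.scheduler.maximum-allocation-mb	Maximum memory per container; default 8 GB
yarn.scheduler.minimum-allocation-vcores	Minimum CPU cores per container; default 1
yarn.scheduler.maximum-allocation-vcores	Maximum CPU cores per container; default 4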
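
To make the container limits concrete, here is a small arithmetic sketch of how a memory request might be fitted between the minimum and maximum allocation. It is a simplified model, not Yarn's own code: the class and method names are hypothetical, rounding up to a multiple of the minimum is an assumption (the exact rounding step depends on the scheduler and version), and rejecting an oversized request with IllegalArgumentException merely stands in for whatever error a real cluster reports.

```java
// Hypothetical helper; a simplified model of container sizing.
class ContainerSizeSketch {

    // Fit a memory request (MB) into [minMb, maxMb], rounding up to a multiple of minMb.
    static int normalizeMb(int requestedMb, int minMb, int maxMb) {
        if (requestedMb > maxMb) {
            throw new IllegalArgumentException(
                    "requested " + requestedMb + " MB exceeds the maximum of " + maxMb + " MB");
        }
        int atLeastMin = Math.max(requestedMb, minMb);
        int roundedUp = ((atLeastMin + minMb - 1) / minMb) * minMb;
        return Math.min(roundedUp, maxMb);
    }

    // How many containers of a given size fit in one NodeManager's memory.
    static int containersPerNode(int nodeMb, int containerMb) {
        return nodeMb / containerMb;
    }

    public static void main(String[] args) {
        int minMb = 1024;  // yarn.scheduler.minimum-allocation-mb default
        int maxMb = 8192;  // yarn.scheduler.maximum-allocation-mb default
        int nodeMb = 8192; // yarn.nodemanager.resource.memory-mb default

        int granted = normalizeMb(1500, minMb, maxMb);
        System.out.println("granted MB: " + granted);
        System.out.println("containers per node: " + containersPerNode(nodeMb, granted));
    }
}
```

Under these assumptions a 1500 MB request becomes a 2048 MB container, so a node with the default 8 GB can host four such containers.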

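Virtual machine snapshots: take a snapshot of every cluster host directly in VMware, so a later mistake cannot leave the cluster beyond repair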

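5.6 Yarn Multi-Queue Setup

By default there is a single queue named default; the configuration file is capacity-scheduler.xml

Adjust it to your business needs, then refresh the queues

yarn rmadmin -refreshQueues

Specify the queue a job is submitted to

# On the command line
-D mapreduce.job.queuename=xxx
# Or in the program
conf.set("mapreduce.job.queuename", "xxx");

Jobs can also be given a priority so that they run first

To use the FairScheduler, create its configuration file (fair-scheduler.xml), point to it in yarn-site.xml, and then restart the Yarn cluster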

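5.7 Yarn Tool Interface

Lets you pass parameters on the command line when a job is submitted

Implementation steps (using word count as the example)

Define a class that implements the Tool interface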

package com.ipinyou.mapreduce.yarntool;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

import java.io.IOException;

public class WordCount implements Tool {

    private Configuration conf;

    // Core method: ToolRunner calls it after stripping generic options such as -D from args
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(conf);

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    //mapper
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text outK = new Text();
        private IntWritable outV = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String word : words) {
                outK.set(word);
                context.write(outK, outV);
            }

        }
    }

    //reducer
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable outV = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outV.set(sum);
            context.write(key, outV);
        }
    }
}

Wire it up in the driver class

package com.ipinyou.mapreduce.yarntool;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.util.Arrays;

public class WordCountDriver {

    private static Tool tool;

    public static void main(String[] args) throws Exception {
        // args
        Configuration conf = new Configuration();
        if ("wordcount".equals(args[0])) {
            tool = new WordCount();
        } else {
            throw new RuntimeException("No Tool:" + args[0]);
        }

        //run
        int run = ToolRunner.run(conf, tool, Arrays.copyOfRange(args, 1, args.length));
        System.exit(run);
    }
}

Package it as a jar and run it on the cluster

yarn jar wctool.jar com.ipinyou.mapreduce.yarntool.WordCountDriver wordcount -Dk=v /input /output
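
The reason the command above works is that the generic options are separated from the job's own arguments before run() is called, so args[0] and args[1] are still the input and output paths. The sketch below imitates that split in plain Java. It is not Hadoop's parser: GenericOptionsSketch is a hypothetical class, it only understands -Dkey=value and -D key=value, and a plain map stands in for the Configuration object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for generic option handling.
class GenericOptionsSketch {

    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remaining = new ArrayList<>();

    // Split args into -D properties and the arguments left for the tool itself.
    static GenericOptionsSketch parse(String[] args) {
        GenericOptionsSketch result = new GenericOptionsSketch();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-D") && i + 1 < args.length) {
                result.addProperty(args[++i]);      // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                result.addProperty(arg.substring(2)); // form: -Dkey=value
            } else {
                result.remaining.add(arg);
            }
        }
        return result;
    }

    private void addProperty(String keyValue) {
        int eq = keyValue.indexOf('=');
        if (eq <= 0) {
            throw new IllegalArgumentException("expected key=value but got: " + keyValue);
        }
        properties.put(keyValue.substring(0, eq), keyValue.substring(eq + 1));
    }

    public static void main(String[] args) {
        // Same shape as the arguments the driver passes on after dropping "wordcount".
        GenericOptionsSketch parsed = parse(new String[] {"-Dk=v", "/input", "/output"});
        System.out.println("properties: " + parsed.properties);
        System.out.println("remaining:  " + parsed.remaining);
    }
}
```

With the arguments from the command above, the property k=v ends up in the map and only /input and /output are left for the tool.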