处理思路
1、通过 HistoryServerRestApi 获取YARN JOB的基本信息 (包括JOB的 ID和名称,开始时间和结束时间)
http://<history server http address:port>/ws/v1/history/mapreduce/jobs?startedTimeBegin=%s&startedTimeEnd=%s
2、对 执行时间(结束时间 - 开始时间)进行排序,找到执行时间最长的JOB列表
3、对第二步找到的JOB列表,查看其对应的Task的执行时长,判断是否存在数据倾斜
http://<history server http address:port>/ws/v1/history/mapreduce/jobs/{
jobid}/tasks
4、对第二部找到的JOB列表,查看其SQL文件和Stage步骤
http://<history server http address:port>/ws/v1/history/mapreduce/jobs/{
jobid}/conf
落地演示

代码参考
import sys
import json
import time
import requests
reload(sys)
sys.setdefaultencoding('utf8')
TOP_PROCESS_TIME_JOB_NUM=30
def get_all_job_base_info():
"""
:return: 当天yarn执行的所有任务的基本信息
"""
start_end_timestamp = int(time.time()) * 1000
start_begin_timestamp = start_end_timestamp - (24 * 60 * 60 * 1000)
url = "http://localhost:19888/ws/v1/history/mapreduce/jobs?" \
"startedTimeBegin=%s&startedTimeEnd=%s" % (start_begin_timestamp, start_end_timestamp)
jobs_base_info = requests.get(url).content
return jobs_base_info
def get_yarn_job_configuration_info(yarn_job_id):
"""
Yarn Job Configuration 获取信息
:return: 返回yarn job query 配置信息
"""
job_conf_url = "http://localhost:19888/ws/v1/history/mapreduce/jobs/%s/conf" % (yarn_job_id)
key_property_list = ["mapreduce.job.name", "hive.query.id", "hive.query.string", "mapreduce.job.reduces",
"mapreduce.job.maps"]
job_conf_list = json