Flink Monitoring and Tuning
History Server
Hadoop MapReduce, Spark, and Flink each ship a history server, started and stopped via a start/stop-xxx.sh script.
Take a look at how these scripts are written. Is shell useful for big data work? Yes, though the bar is fairly low.
Configuration:
historyserver.web.address: 0.0.0.0
historyserver.web.port: 8082
historyserver.archive.fs.refresh-interval: 10000
jobmanager.archive.fs.dir: hdfs://hadoop000:8020/completed-jobs-pk/
historyserver.archive.fs.dir: hdfs://hadoop000:8020/completed-jobs-pk/
Start: ./bin/historyserver.sh start
History Server
Flink has a history server that can be used to query the statistics of completed jobs after the corresponding Flink cluster has been shut down.
Furthermore, it exposes a REST API that accepts HTTP requests and responds with JSON data.
Overview
The HistoryServer allows you to query the status and statistics of completed jobs that have been archived by a JobManager.
After you have configured the HistoryServer and JobManager, you start and stop the HistoryServer via its corresponding startup script:
# Start or stop the HistoryServer
bin/historyserver.sh (start|start-foreground|stop)
By default, this server binds to localhost and listens at port 8082.
Currently, you can only run it as a standalone process.
Configuration
The configuration keys jobmanager.archive.fs.dir and historyserver.archive.fs.refresh-interval need to be adjusted for archiving and displaying archived jobs.
JobManager
The archiving of completed jobs happens on the JobManager, which uploads the archived job information to a file system directory. You can configure the directory to archive completed jobs in flink-conf.yaml by setting a directory via jobmanager.archive.fs.dir.
# Directory to upload completed job information
jobmanager.archive.fs.dir: hdfs:///completed-jobs
HistoryServer
The HistoryServer can be configured to monitor a comma-separated list of directories via historyserver.archive.fs.dir. The configured directories are regularly polled for new archives; the polling interval can be configured via historyserver.archive.fs.refresh-interval.
# Monitor the following directories for completed jobs
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Refresh every 10 seconds
historyserver.archive.fs.refresh-interval: 10000
The contained archives are downloaded and cached in the local filesystem. The local directory for this is configured via historyserver.web.tmpdir.
Check out the configuration page for a complete list of configuration options.
Available Requests
Below is a list of available requests, with a sample JSON response. All requests are of the sample form http://hostname:8082/jobs; below we list only the path part of the URLs.
Values in angle brackets are variables; for example, http://hostname:port/jobs/<jobid>/exceptions will have to be requested, for example, as http://hostname:port/jobs/7684be6004e4e955c2a558a9bc463f65/exceptions.
/config
/jobs/overview
/jobs/<jobid>
/jobs/<jobid>/vertices
/jobs/<jobid>/config
/jobs/<jobid>/exceptions
/jobs/<jobid>/accumulators
/jobs/<jobid>/vertices/<vertexid>
/jobs/<jobid>/vertices/<vertexid>/subtasktimes
/jobs/<jobid>/vertices/<vertexid>/taskmanagers
/jobs/<jobid>/vertices/<vertexid>/accumulators
/jobs/<jobid>/vertices/<vertexid>/subtasks/accumulators
/jobs/<jobid>/vertices/<vertexid>/subtasks/<subtasknum>
/jobs/<jobid>/vertices/<vertexid>/subtasks/<subtasknum>/attempts/<attempt>
/jobs/<jobid>/vertices/<vertexid>/subtasks/<subtasknum>/attempts/<attempt>/accumulators
/jobs/<jobid>/plan
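As an illustration, here is a minimal Java sketch that queries one of these paths (/jobs/overview) and prints the raw JSON. The host hadoop000 and port 8082 match the configuration above and are assumptions; any HTTP client or JSON library could be substituted.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: fetch the overview of archived jobs from the HistoryServer.
public class HistoryServerOverviewQuery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://hadoop000:8082/jobs/overview");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                json.append(line);
            }
            // The response is a JSON document listing the archived jobs.
            System.out.println(json);
        }
    }
}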
REST API
Flink has a monitoring API that can be used to query status and statistics of running jobs, as well as recent completed jobs. This monitoring API is used by Flink’s own dashboard, but is designed to be used also by custom monitoring tools.
The monitoring API is a REST-ful API that accepts HTTP requests and responds with JSON data.
Overview
The monitoring API is backed by a web server that runs as part of the Dispatcher. By default, this server listens at port 8081, which can be configured in flink-conf.yaml via rest.port. Note that the monitoring API web server and the web dashboard web server are currently the same and thus run together at the same port. They respond to different HTTP URLs, though.
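For example, moving the endpoint to a different port is a one-line change in flink-conf.yaml (shown here with its default value):
# Port of the REST endpoint and web dashboard
rest.port: 8081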
In the case of multiple Dispatchers (for high availability), each Dispatcher will run its own instance of the monitoring API, which offers information about completed and running jobs while that Dispatcher was elected the cluster leader.
Developing
The REST API backend is in the flink-runtime project. The core class is org.apache.flink.runtime.webmonitor.WebMonitorEndpoint, which sets up the server and the request routing.
We use Netty and the Netty Router library to handle REST requests and translate URLs. This choice was made because this combination has lightweight dependencies, and the performance of Netty HTTP is very good.
To add new requests, one needs to
- add a new MessageHeaders class which serves as an interface for the new request,
- add a new AbstractRestHandler class which handles the request according to the added MessageHeaders class,
- add the handler to org.apache.flink.runtime.webmonitor.WebMonitorEndpoint#initializeHandlers().
A good example is the org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler that uses the org.apache.flink.runtime.rest.messages.JobExceptionsHeaders.
API
The REST API is versioned, with specific versions being queryable by prefixing the URL with the version prefix. Prefixes are always of the form v[version_number]. For example, to access version 1 of /foo/bar one would query /v1/foo/bar.
If no version is specified Flink will default to the oldest version supporting the request.
Querying unsupported/non-existing versions will return a 404 error.
There exist several async operations among these APIs, e.g. trigger savepoint and rescale a job. They return a triggerid to identify the operation you just POSTed; you then use that triggerid to query the status of the operation.
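A rough Java sketch of this async pattern for triggering a savepoint, assuming a JobManager reachable at hadoop000:8081 and using the example job id from above as a placeholder; extracting the trigger id from the response (its "request-id" field) is left to a JSON library.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: POST to trigger a savepoint, then poll with the returned trigger id.
public class SavepointTrigger {
    public static void main(String[] args) throws Exception {
        String jobId = "7684be6004e4e955c2a558a9bc463f65"; // placeholder job id
        String base = "http://hadoop000:8081/jobs/" + jobId + "/savepoints";

        // 1) POST to trigger the savepoint; the JSON response carries a trigger id.
        HttpURLConnection post = (HttpURLConnection) new URL(base).openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("Content-Type", "application/json");
        byte[] body = "{\"target-directory\": \"hdfs:///savepoints\", \"cancel-job\": false}"
                .getBytes(StandardCharsets.UTF_8);
        try (OutputStream os = post.getOutputStream()) {
            os.write(body);
        }
        System.out.println("trigger response: " + read(post));

        // 2) GET base + "/" + triggerid to poll the operation's status until
        //    it reports COMPLETED (parse "request-id" from the response above).
    }

    private static String read(HttpURLConnection conn) throws Exception {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) sb.append(line);
            return sb.toString();
        }
    }
}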
Details as follows: V1
Metrics
Flink exposes a metric system that allows gathering and exposing metrics to external systems.
The detailed documentation is as follows:
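For example, a user-defined counter can be registered from any rich function. A minimal sketch (the metric name eventsProcessed is an arbitrary example):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Sketch: a counter incremented for every record this operator processes.
public class CountingMapper extends RichMapFunction<String, String> {
    private transient Counter counter;

    @Override
    public void open(Configuration parameters) {
        // Register the counter with the operator's metric group; it then
        // shows up in the web UI and in any configured metric reporters.
        this.counter = getRuntimeContext()
                .getMetricGroup()
                .counter("eventsProcessed");
    }

    @Override
    public String map(String value) {
        counter.inc();
        return value;
    }
}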
Common Optimizations
Commonly used optimization strategies in Flink:
- Resources
- Parallelism
The default is 1; adjust it appropriately. It can be set in several ways (per operator, on the execution environment, via the client's -p flag, or via parallelism.default in flink-conf.yaml) ==> covered in the hands-on project.
- Data skew
Say there are 100 tasks: 98-99 have finished while 1-2 are still running very slowly ==> the job may eventually finish, or may never finish.
  - group by: two-phase aggregation (salt the key with a random prefix: random_key = random + key; aggregate; strip the random prefix; aggregate again; see the sketch after this list)
  - join on xxx=xxx:
    repartition-repartition strategy (large table joins large table)
    broadcast-forward strategy (large table joins small table)
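A minimal DataStream sketch of the two-phase (salted) aggregation, under toy assumptions: the word-count-style input, the salt width of 8, and the 5-second processing-time windows are all arbitrary choices here.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.concurrent.ThreadLocalRandom;

// Sketch: spread a hot key over several subtasks, then merge partial results.
public class TwoPhaseAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("pk", "pk", "pk", "flink", "pk", "spark", "pk")
                // Phase 1: salt the key with a random prefix (0-7) so the hot
                // key is distributed across parallel subtasks, then pre-aggregate.
                .map(word -> Tuple2.of(ThreadLocalRandom.current().nextInt(8) + "_" + word, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1)
                // Phase 2: strip the salt and aggregate the partial results
                // to obtain the final count per original key.
                .map(t -> Tuple2.of(t.f0.substring(t.f0.indexOf('_') + 1), t.f1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1)
                .print();

        env.execute("two-phase aggregation");
    }
}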