Flink Monitoring and Tuning
History Server
Hadoop MapReduce, Spark, and Flink each ship a history server, started and stopped via a start/stop-xxx.sh script.
Take a look at how these scripts are written. Is shell useful for big data work? Yes, though the bar is fairly low.
Configuration:
historyserver.web.address: 0.0.0.0
historyserver.web.port: 8082
historyserver.archive.fs.refresh-interval: 10000
jobmanager.archive.fs.dir: hdfs://hadoop000:8020/completed-jobs-pk/
historyserver.archive.fs.dir: hdfs://hadoop000:8020/completed-jobs-pk/
Start: ./bin/historyserver.sh start
History Server
Flink has a history server that can be used to query the statistics of completed jobs after the corresponding Flink cluster has been shut down.
Furthermore, it exposes a REST API that accepts HTTP requests and responds with JSON data.
Overview
The HistoryServer allows you to query the status and statistics of completed jobs that have been archived by a JobManager.
After you have configured the HistoryServer and JobManager, you start and stop the HistoryServer via its corresponding startup script:
# Start or stop the HistoryServer
bin/historyserver.sh (start|start-foreground|stop)
By default, this server binds to localhost and listens at port 8082.
Currently, you can only run it as a standalone process.
Configuration
The configuration keys jobmanager.archive.fs.dir and historyserver.archive.fs.refresh-interval need to be adjusted for archiving and displaying archived jobs.
JobManager
The archiving of completed jobs happens on the JobManager, which uploads the archived job information to a file system directory. You can configure the directory to archive completed jobs in flink-conf.yaml by setting a directory via jobmanager.archive.fs.dir.
# Directory to upload completed job information
jobmanager.archive.fs.dir: hdfs:///completed-jobs
HistoryServer
The HistoryServer can be configured to monitor a comma-separated list of directories via historyserver.archive.fs.dir. The configured directories are regularly polled for new archives; the polling interval can be configured via historyserver.archive.fs.refresh-interval.
# Monitor the following directories for completed jobs
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Refresh every 10 seconds
historyserver.archive.fs.refresh-interval: 10000
The contained archives are downloaded and cached in the local filesystem. The local directory for this is configured via historyserver.web.tmpdir.
Check out the configuration page for a complete list of configuration options.
Available Requests
Below is a list of available requests, with a sample JSON response. All requests are of the sample form http://hostname:8082/jobs; below we list only the path part of the URLs.
Values in angle brackets are variables; for example, http://hostname:port/jobs/<jobid>/exceptions will have to be requested, for example, as http://hostname:port/jobs/7684be6004e4e955c2a558a9bc463f65/exceptions.
/config
/jobs/overview
/jobs/<jobid>
/jobs/<jobid>/vertices
/jobs/<jobid>/config
/jobs/<jobid>/exceptions
/jobs/<jobid>/accumulators
/jobs/<jobid>/vertices/<vertexid>
/jobs/<jobid>/vertices/<vertexid>/subtasktimes
/jobs/<jobid>/vertices/<vertexid>/taskmanagers
/jobs/<jobid>/vertices/<vertexid>/accumulators
/jobs/<jobid>/vertices/<vertexid>/subtasks/accumulators
/jobs/<jobid>/vertices/<vertexid>/subtasks/<subtasknum>
/jobs/<jobid>/vertices/<vertexid>/subtasks/<subtasknum>/attempts/<attempt>
/jobs/<jobid>/vertices/<vertexid>/subtasks/<subtasknum>/attempts/<attempt>/accumulators
/jobs/<jobid>/plan
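As an illustration, here is a minimal Java sketch that queries one of these paths (/jobs/overview) and prints the raw JSON. The host hadoop000 and port 8082 match the configuration above and are assumptions; any HTTP client or JSON library could be substituted.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: fetch the overview of archived jobs from the HistoryServer.
public class HistoryServerOverviewQuery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://hadoop000:8082/jobs/overview");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                json.append(line);
            }
            // The response is a JSON document listing the archived jobs.
            System.out.println(json);
        }
    }
}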
REST API
Flink has a monitoring API that can be used to query status and statistics of running jobs, as well as recent completed jobs. This monitoring API is used by Flink’s own dashboard, but is designed to be used also by custom monitoring tools.
The monitoring API is a REST-ful API that accepts HTTP requests and responds with JSON data.
Overview
The monitoring API is backed by a web server that runs as part of the Dispatcher. By default, this server listens at port 8081, which can be configured in flink-conf.yaml via rest.port. Note that the monitoring API web server and the web dashboard web server are currently the same and thus run together at the same port. They respond to different HTTP URLs, though.
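For example, moving the endpoint to a different port is a one-line change in flink-conf.yaml (shown here with its default value):
# Port of the REST endpoint and web dashboard
rest.port: 8081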
In the case of multiple Dispatchers (for high availability), each Dispatcher will run its own instance of the monitoring API, which offers information about completed and running jobs while that Dispatcher was elected the cluster leader.
Developing
The REST API backend is in the flink-runtime project. The core class is org.apache.flink.runtime.webmonitor.WebMonitorEndpoint, which sets up the server and the request routing.
We use Netty and the Netty Router library to handle REST requests and translate URLs. This choice was made because this combination has lightweight dependencies, and the performance of Netty HTTP is very good.
To add new requests, one needs to
- add a new MessageHeaders class which serves as an interface for the new request,
- add a new AbstractRestHandler class which handles the request according to the added MessageHeaders class,
- add the handler to org.apache.flink.runtime.webmonitor.WebMonitorEndpoint#initializeHandlers().
A good example is the org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler that uses the org.apache.flink.runtime.rest.messages.JobExceptionsHeaders.
API
The REST API is versioned, with specific versions being queryable by prefixing the URL with the version prefix. Prefixes are always of the form v[version_number]. For example, to access version 1 of /foo/bar one would query /v1/foo/bar.
If no version is specified Flink will default to the oldest version supporting the request.
Querying unsupported/non-existing versions will return a 404 error.
There exist several async operations among these APIs, e.g. trigger savepoint and rescale a job. They return a triggerid to identify the operation you just POSTed; you then use that triggerid to query the status of the operation.
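A rough Java sketch of this async pattern for triggering a savepoint, assuming a JobManager reachable at hadoop000:8081 and using the example job id from above as a placeholder; extracting the trigger id from the response (its "request-id" field) is left to a JSON library.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: POST to trigger a savepoint, then poll with the returned trigger id.
public class SavepointTrigger {
    public static void main(String[] args) throws Exception {
        String jobId = "7684be6004e4e955c2a558a9bc463f65"; // placeholder job id
        String base = "http://hadoop000:8081/jobs/" + jobId + "/savepoints";

        // 1) POST to trigger the savepoint; the JSON response carries a trigger id.
        HttpURLConnection post = (HttpURLConnection) new URL(base).openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("Content-Type", "application/json");
        byte[] body = "{\"target-directory\": \"hdfs:///savepoints\", \"cancel-job\": false}"
                .getBytes(StandardCharsets.UTF_8);
        try (OutputStream os = post.getOutputStream()) {
            os.write(body);
        }
        System.out.println("trigger response: " + read(post));

        // 2) GET base + "/" + triggerid to poll the operation's status until
        //    it reports COMPLETED (parse "request-id" from the response above).
    }

    private static String read(HttpURLConnection conn) throws Exception {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) sb.append(line);
            return sb.toString();
        }
    }
}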
Details as follows: V1
Metrics
Flink exposes a metric system that allows gathering and exposing metrics to external systems.
The detailed documentation is as follows:
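For example, a user-defined counter can be registered from any rich function. A minimal sketch (the metric name eventsProcessed is an arbitrary example):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Sketch: a counter incremented for every record this operator processes.
public class CountingMapper extends RichMapFunction<String, String> {
    private transient Counter counter;

    @Override
    public void open(Configuration parameters) {
        // Register the counter with the operator's metric group; it then
        // shows up in the web UI and in any configured metric reporters.
        this.counter = getRuntimeContext()
                .getMetricGroup()
                .counter("eventsProcessed");
    }

    @Override
    public String map(String value) {
        counter.inc();
        return value;
    }
}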
Common Optimizations
Commonly used optimization strategies in Flink:
- Resources
- Parallelism
The default is 1; adjust it appropriately. It can be set in several ways (per operator, on the execution environment, via the client's -p flag, or via parallelism.default in flink-conf.yaml) ==> covered in the hands-on project.
- Data skew
Say there are 100 tasks: 98-99 have finished while 1-2 are still running very slowly ==> the job may eventually finish, or may never finish.
  - group by: two-phase aggregation (salt the key with a random prefix: random_key = random + key; aggregate; strip the random prefix; aggregate again; see the sketch after this list)
  - join on xxx=xxx:
    repartition-repartition strategy (large table joins large table)
    broadcast-forward strategy (large table joins small table)
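A minimal DataStream sketch of the two-phase (salted) aggregation, under toy assumptions: the word-count-style input, the salt width of 8, and the 5-second processing-time windows are all arbitrary choices here.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.concurrent.ThreadLocalRandom;

// Sketch: spread a hot key over several subtasks, then merge partial results.
public class TwoPhaseAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("pk", "pk", "pk", "flink", "pk", "spark", "pk")
                // Phase 1: salt the key with a random prefix (0-7) so the hot
                // key is distributed across parallel subtasks, then pre-aggregate.
                .map(word -> Tuple2.of(ThreadLocalRandom.current().nextInt(8) + "_" + word, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1)
                // Phase 2: strip the salt and aggregate the partial results
                // to obtain the final count per original key.
                .map(t -> Tuple2.of(t.f0.substring(t.f0.indexOf('_') + 1), t.f1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1)
                .print();

        env.execute("two-phase aggregation");
    }
}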