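Background
Goal: monitor the performance of Docker containers.
Context: a GTI stack (Grafana + Telegraf + InfluxDB) is already deployed.
Problem: how can Telegraf collect performance metrics from Docker containers?
Solution
Telegraf ships with many input plugins for collecting metrics from different kinds of targets. The configuration file /etc/telegraf/telegraf.conf contains a [[inputs.docker]] section; uncomment it and configure it as needed. A fairly complete version of that section looks like this: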
[[inputs.docker]]
# ## Docker Endpoint
# ## To use TCP, set endpoint = "tcp://[ip]:[port]"
# ## To use environment variables (ie, docker-machine), set endpoint = "ENV"
# endpoint = "unix:///var/run/docker.sock"
#
# ## Set to true to collect Swarm metrics(desired_replicas, running_replicas)
# gather_services = false
#
# ## Only collect metrics for these containers, collect all if empty
# container_names = []
#
# ## Set the source tag for the metrics to the container ID hostname, eg first 12 chars
# source_tag = false
#
# ## Containers to include and exclude. Globs accepted.
# ## Note that an empty array for both will include all containers
# container_name_include = []
# container_name_exclude = []
#
# ## Container states to include and exclude. Globs accepted.
# ## When empty only containers in the "running" state will be captured.
# ## example: container_state_include = ["created", "restarting", "running", "removing", "paused", "exited", "dead"]
# ## example: container_state_exclude = ["created", "restarting", "running", "removing", "paused", "exited", "dead"]
# # container_state_include = []
# # container_state_exclude = []
#
# ## Timeout for docker list, info, and stats commands
# timeout = "5s"
#
# ## Whether to report for each container per-device blkio (8:0, 8:1...),
# ## network (eth0, eth1, ...) and cpu (cpu0, cpu1, ...) stats or not.
# ## Usage of this setting is discouraged since it will be deprecated in favor of 'perdevice_include'.
# ## Default value is 'true' for backwards compatibility, please set it to 'false' so that 'perdevice_include' setting
# ## is honored.
# perdevice = false
#
# ## Specifies for which classes a per-device metric should be issued
# ## Possible values are 'cpu' (cpu0, cpu1, ...), 'blkio' (8:0, 8:1, ...) and 'network' (eth0, eth1, ...)
# ## Please note that this setting has no effect if 'perdevice' is set to 'true'
# perdevice_include = ["cpu"]
#
# ## Whether to report for each container total blkio and network stats or not.
# ## Usage of this setting is discouraged since it will be deprecated in favor of 'total_include'.
# ## Default value is 'false' for backwards compatibility, please set it to 'true' so that 'total_include' setting
# ## is honored.
# total = true
#
# ## Specifies for which classes a total metric should be issued. Total is an aggregated of the 'perdevice' values.
# ## Possible values are 'cpu', 'blkio' and 'network'
# ## Total 'cpu' is reported directly by Docker daemon, and 'network' and 'blkio' totals are aggregated by this plugin.
# ## Please note that this setting has no effect if 'total' is set to 'false'
# total_include = ["cpu", "blkio", "network"]
#
# ## Which environment variables should we use as a tag
# ##tag_env = ["JAVA_HOME", "HEAP_SIZE"]
#
# ## docker labels to include and exclude as tags. Globs accepted.
# ## Note that an empty array for both will include all labels as tags
# docker_label_include = []
# docker_label_exclude = []
#
# ## Optional TLS Config
# # tls_ca = "/etc/telegraf/ca.pem"
# # tls_cert = "/etc/telegraf/cert.pem"
# # tls_key = "/etc/telegraf/key.pem"
# ## Use TLS but skip chain & host verification
# # insecure_skip_verify = false
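A minimal change looks like this:
[[inputs.docker]]
# Docker API endpoint to connect to, and the containers to collect metrics from (an empty list means all containers)
endpoint = "unix:///var/run/docker.sock"
container_names = []
After saving the change and restarting Telegraf, checking the service status shows that it failed to start, with the following error: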
[inputs.docker] Error in plugin: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.so…rmission denied
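Telegraf apparently has no permission to access docker.sock. A quick search shows that the Docker daemon can expose its API on three kinds of sockets: a Unix socket file, a TCP listening port, and an fd (file descriptor, used for systemd socket activation). So let's try enabling the Docker daemon's TCP listening port.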
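Before changing anything, it can help to confirm that the error really is a file-permission problem on the socket. The following is a minimal sketch, not part of the original troubleshooting; the function name describe_socket_access is hypothetical, and the default path is the one from the Telegraf configuration above. It reports the socket's mode and owner and whether the current process could open it.

```python
import os
import stat


def describe_socket_access(path="/var/run/docker.sock"):
    """Report ownership, mode, and read/write access for `path` (illustrative helper)."""
    info = {"path": path, "exists": os.path.exists(path)}
    if not info["exists"]:
        return info
    st = os.stat(path)
    info["is_socket"] = stat.S_ISSOCK(st.st_mode)
    info["mode"] = stat.filemode(st.st_mode)
    info["owner_uid"] = st.st_uid
    info["group_gid"] = st.st_gid
    # Connecting to a Unix socket requires write permission on the socket file.
    info["can_connect"] = os.access(path, os.R_OK | os.W_OK)
    return info


if __name__ == "__main__":
    for key, value in describe_socket_access().items():
        print(f"{key}: {value}")
```

To see what Telegraf sees, the script would have to run as the user the Telegraf service runs under (often telegraf, though that is an assumption about the installation), for example via sudo -u.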
Step 1: create the file /etc/docker/daemon.json with the following content:
{
"hosts": ["tcp://0.0.0.0:2375", "unix:///var/run/docker.sock"]
}
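"unix:///var/run/docker.sock": the Unix socket; local clients connect to the Docker daemon through it.
"tcp://0.0.0.0:2375": a TCP socket; any remote client that can reach the host may connect to the Docker daemon on port 2375. This port serves plain HTTP without TLS or authentication.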
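If /etc/docker/daemon.json already exists, overwriting it would drop its other settings. The sketch below shows one way to add the two hosts entries while keeping existing keys; the function name with_hosts is hypothetical, and it works on the JSON text rather than the file so it can be tried safely.

```python
import json


def with_hosts(existing_json, hosts):
    """Return daemon.json text with `hosts` added, keeping all other keys intact."""
    config = json.loads(existing_json) if existing_json.strip() else {}
    merged = list(config.get("hosts", []))
    for host in hosts:
        if host not in merged:  # avoid duplicate listeners
            merged.append(host)
    config["hosts"] = merged
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Example input: a daemon.json that already sets another option.
    current = '{"log-driver": "json-file"}'
    print(with_hosts(current, ["tcp://0.0.0.0:2375", "unix:///var/run/docker.sock"]))
```

The printed text can then be reviewed and written to /etc/docker/daemon.json by hand.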
After changing the configuration, have systemd reload its unit files and restart the Docker service:
systemctl daemon-reload
systemctl restart docker
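The restart then failed with the following error:
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
The cause is a conflict in Docker's socket configuration: the systemd unit file that starts Docker already passes host settings on the command line, and daemon.json now sets hosts as well. Fix: remove "-H fd://" from the unit file (/usr/lib/systemd/system/docker.service), run systemctl daemon-reload again so the edited unit file is picked up, and restart the Docker service. The service then starts normally: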
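The conflict described here can be checked mechanically. The sketch below is illustrative only: host_flags and has_hosts_conflict are hypothetical names, and the ExecStart command line is assumed to be passed in as a string (for example copied from the output of systemctl cat docker.service). It flags the case where both the command line and daemon.json specify hosts.

```python
import json
import shlex


def host_flags(exec_start):
    """Return the values of -H / --host flags found in a dockerd command line."""
    tokens = shlex.split(exec_start)
    found = []
    for i, tok in enumerate(tokens):
        if tok in ("-H", "--host") and i + 1 < len(tokens):
            found.append(tokens[i + 1])           # "-H fd://"
        elif tok.startswith("--host="):
            found.append(tok.split("=", 1)[1])    # "--host=fd://"
        elif tok.startswith("-H") and len(tok) > 2:
            found.append(tok[2:].lstrip("="))     # "-Hfd://" or "-H=fd://"
    return found


def has_hosts_conflict(exec_start, daemon_json):
    """True when hosts are set both on the command line and in daemon.json."""
    hosts = json.loads(daemon_json).get("hosts", [])
    return bool(hosts) and bool(host_flags(exec_start))


if __name__ == "__main__":
    daemon_json = '{"hosts": ["tcp://0.0.0.0:2375", "unix:///var/run/docker.sock"]}'
    before = "/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock"
    after = "/usr/bin/dockerd --containerd=/run/containerd/containerd.sock"
    print(has_hosts_conflict(before, daemon_json))
    print(has_hosts_conflict(after, daemon_json))
```

The second command line matches the one shown in the status output once "-H fd://" has been removed.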
[root()@hzoffice40-152-163 system]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-03-18 17:26:37 CST; 8s ago
Docs: https://docs.docker.com
Main PID: 287684 (dockerd)
Tasks: 9
Memory: 28.4M
CGroup: /system.slice/docker.service
└─287684 /usr/bin/dockerd --containerd=/run/containerd/containerd.sock
Mar 18 17:26:36 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:36.796311180+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Mar 18 17:26:36 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:36.807023536+08:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
Mar 18 17:26:36 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:36.836291988+08:00" level=info msg="Loading containers: start."
Mar 18 17:26:36 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:36.992154660+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be use...red IP address"
Mar 18 17:26:37 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:37.045762583+08:00" level=info msg="Loading containers: done."
Mar 18 17:26:37 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:37.070431960+08:00" level=info msg="Docker daemon" commit=e2f740d graphdriver(s)=overlay2 version=20.10.10
Mar 18 17:26:37 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:37.070528736+08:00" level=info msg="Daemon has completed initialization"
Mar 18 17:26:37 hzoffice40-152-163.iflyos.org systemd[1]: Started Docker Application Container Engine.
Mar 18 17:26:37 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:37.108059171+08:00" level=info msg="API listen on [::]:2375"
Mar 18 17:26:37 hzoffice40-152-163.iflyos.org dockerd[287684]: time="2022-03-18T17:26:37.113034897+08:00" level=info msg="API listen on /var/run/docker.sock"
Hint: Some lines were ellipsized, use -l to show in full.
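The "API listen on" lines confirm that the Docker daemon is now listening on TCP port 2375 as well as on /var/run/docker.sock. The TCP (HTTP) socket is what makes remote access possible.
Edit telegraf.conf again and change the endpoint of [[inputs.docker]]:
endpoint = "tcp://xxx.xxx.xxx.xxx:2375"
Restart the Telegraf service. Telegraf now collects the containers' performance metrics successfully.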
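To check the TCP endpoint independently of Telegraf, one can call the Docker Engine API's /_ping route over plain HTTP. This is a minimal sketch using only the Python standard library; ping_url and docker_is_reachable are hypothetical helper names, and the address in the example is a placeholder for the real host.

```python
import urllib.request
from urllib.parse import urlsplit


def ping_url(endpoint):
    """Map a Telegraf-style tcp://host:port endpoint to the Engine API /_ping URL."""
    parts = urlsplit(endpoint)
    if parts.scheme != "tcp" or not parts.hostname or not parts.port:
        raise ValueError(f"expected tcp://host:port, got {endpoint!r}")
    return f"http://{parts.netloc}/_ping"


def docker_is_reachable(endpoint, timeout=5.0):
    """Return True if the daemon answers /_ping with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(ping_url(endpoint), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused, timeouts, and urllib's URLError
        return False


if __name__ == "__main__":
    # Placeholder address; substitute the host configured in telegraf.conf.
    print(docker_is_reachable("tcp://127.0.0.1:2375"))
```

If this returns False from the Telegraf host, the listener or the network path between the two machines is the first thing to look at, before the Telegraf configuration.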

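This article describes how, in an environment with Grafana + Telegraf + InfluxDB already deployed, to configure Telegraf's [[inputs.docker]] plugin and work around the permission error so that Docker container performance metrics can be collected. The process involves editing Docker's configuration file to enable the TCP listening port, resolving the resulting socket configuration conflict, and finally collecting the data successfully.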