You can use kubectl to obtain container names dynamically and combine them with the namespace and Pod name taken from environment variables, so that monitoring data is correctly associated with a specific container. The steps and code are described below.
Implementation steps
Create a service account and grant it permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-exec-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec"]
    verbs: ["create", "get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-exec-role-binding
subjects:
  - kind: ServiceAccount
    name: custom-exporter
    namespace: prom
roleRef:
  kind: ClusterRole
  name: pod-exec-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-exporter
  namespace: prom
Commands to test the permissions:
kubectl exec -it t-test-main-7858665cb9-sqnv5 --container t-test-main-0 --as=system:serviceaccount:prom:custom-exporter -- /bin/bash
kubectl auth can-i create pods/exec --as=system:serviceaccount:prom:custom-exporter
kubectl auth can-i get pods --as=system:serviceaccount:prom:custom-exporter --namespace default
1. **Specify the namespace and pod via environment variables**
In Kubernetes, the Downward API can inject the namespace and Pod name into a container's environment variables.
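A minimal sketch of such Downward API entries inside a container spec (the variable names MY_POD_NAMESPACE and MY_POD_NAME are illustrative and are not used by the manifests in this article):
env:
  - name: MY_POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
The exporter itself is deployed with the following Deployment and Service: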
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-exporter
  namespace: prom
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-exporter
  template:
    metadata:
      labels:
        app: custom-exporter
      annotations:
        prometheus.io/scrape: "true"
    spec:
      serviceAccountName: custom-exporter
      containers:
        - name: exporter
          image: registry-vpc.cn-shanghai.aliyuncs.com/test/custom-exporter:v0.17
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: custom-exporter
  namespace: prom
spec:
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  selector:
    app: custom-exporter
  type: ClusterIP
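The prometheus.io/scrape: "true" annotation above only takes effect if the Prometheus server is configured to discover Pods and keep the annotated ones. As a sketch of the common annotation-based pattern (not necessarily the exact configuration running in this cluster), the scrape job would contain something like:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true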
2. Dynamically obtain container names
Container names can be obtained dynamically with kubectl. When running inside a Pod, use kubectl get pods to query the containers of a Pod.
For example, the following command lists all containers of a given Pod:
kubectl get pod t-test-main-74b676dfbb-7zj9l -n default -o jsonpath='{.spec.containers[*].name}'
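When the exporter itself runs in the cluster, the same lookup can target its own Pod by combining this with step 1; assuming the illustrative MY_POD_NAMESPACE and MY_POD_NAME variables from the sketch above are injected, it would look like:
kubectl get pod "$MY_POD_NAME" -n "$MY_POD_NAMESPACE" -o jsonpath='{.spec.containers[*].name}'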
3. Code implementation
v1: requires the kubectl command to be available inside the container environment
In the custom exporter, Python obtains the container names dynamically and exposes them, together with the namespace and Pod name, as metric labels.
import subprocess
from prometheus_client import start_http_server, Gauge
import time
import json
# Prometheus metrics
passwd_mod_time = Gauge('container_passwd_modification_time', 'Last modification time of /etc/passwd', ['namespace', 'pod', 'container'])
max_open_files = Gauge('container_max_open_files', 'Maximum number of open files allowed', ['namespace', 'pod', 'container'])
max_processes = Gauge('container_max_processes', 'Maximum number of processes allowed', ['namespace', 'pod', 'container'])
total_processes_metric = Gauge('total_processes', 'Total number of processes', ['namespace', 'pod', 'container'])
running_processes_metric = Gauge('running_processes', 'Number of running processes', ['namespace', 'pod', 'container'])
tcp_total_connections = Gauge('tcp_total_connections', 'Total number of TCP connections', ['namespace', 'pod', 'container'])
tcp_established = Gauge('tcp_established_connections', 'Number of TCP connections in ESTABLISHED state', ['namespace', 'pod', 'container'])
tcp_fin_wait_1 = Gauge('tcp_fin_wait_1_connections', 'Number of TCP connections in FIN-WAIT-1 state', ['namespace', 'pod', 'container'])
tcp_fin_wait_2 = Gauge('tcp_fin_wait_2_connections', 'Number of TCP connections in FIN-WAIT-2 state', ['namespace', 'pod', 'container'])
tcp_last_ack = Gauge('tcp_last_ack_connections', 'Number of TCP connections in LAST-ACK state', ['namespace', 'pod', 'container'])
tcp_listen = Gauge('tcp_listen_connections', 'Number of TCP connections in LISTEN state', ['namespace', 'pod', 'container'])
tcp_syn_recv = Gauge('tcp_syn_recv_connections', 'Number of TCP connections in SYN-RECV state', ['namespace', 'pod', 'container'])
tcp_time_wait = Gauge('tcp_time_wait_connections', 'Number of TCP connections in TIME-WAIT state', ['namespace', 'pod', 'container'])
udp_total_connections = Gauge('udp_total_connections', 'Total number of UDP connections', ['namespace', 'pod', 'container'])
time_diff = Gauge('container_prometheus_time_diff_seconds', 'Time difference between container and Prometheus in seconds', ['namespace', 'pod', 'container'])
# Function to run a shell command and return the output
def run_command(command):
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    return result.stdout.decode().strip()

# Get all deployments in the cluster
def get_all_deployments():
    cmd = "kubectl get deployments --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{\"\\n\"}{end}'"
    deployments = run_command(cmd)
    return [line.split() for line in deployments.splitlines()]

# Get all pods in a specific deployment
def get_pods(namespace, deployment):
    cmd = f"kubectl get pods -n {namespace} -l app={deployment} -o jsonpath='{{.items[*].metadata.name}}'"
    pods = run_command(cmd)
    return pods.split()

# Get all containers in a specific pod
def get_containers(namespace, pod):
    cmd = f"kubectl get pod {pod} -n {namespace} -o jsonpath='{{.spec.containers[*].name}}'"
    containers = run_command(cmd)
    return containers.split()

# Convert 'unlimited' to a suitable numeric value
def convert_to_int(value):
    if value == 'unlimited':
        return float('inf')  # or pick a large number instead, e.g. return 9999999
    return int(value)

# Get TCP connections count in a specific state
def get_tcp_connections_count(state, namespace, pod, container):
    cmd = f"kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state {state} | wc -l"
    return run_command(cmd)

# Get container's current timestamp
def get_container_time(namespace, pod, container):
    cmd = f"kubectl exec -n {namespace} {pod} -c {container} -- date +%s"
    container_time = run_command(cmd)
    return int(container_time) if container_time.isdigit() else None
# Collect metrics from each container
def collect_metrics():
    deployments = get_all_deployments()
    for namespace, deployment in deployments:
        pods = get_pods(namespace, deployment)
        for pod in pods:
            containers = get_containers(namespace, pod)
            for container in containers:
                # Get modification time of /etc/passwd
                mod_time = run_command(f"kubectl exec -n {namespace} {pod} -c {container} -- stat -c %Y /etc/passwd")
                if mod_time:
                    passwd_mod_time.labels(namespace=namespace, pod=pod, container=container).set(float(mod_time))
                # Get max open files and handle 'unlimited'
                open_files = run_command(f"kubectl exec -n {namespace} {pod} -c {container} -- bash -c 'ulimit -n'")
                if open_files:
                    max_open_files.labels(namespace=namespace, pod=pod, container=container).set(convert_to_int(open_files))
                # Get max processes and handle 'unlimited'
                processes = run_command(f"kubectl exec -n {namespace} {pod} -c {container} -- bash -c 'ulimit -u'")
                if processes:
                    max_processes.labels(namespace=namespace, pod=pod, container=container).set(convert_to_int(processes))
                # Get total number of processes (all processes)
                total_processes = run_command(f"kubectl exec -n {namespace} {pod} -c {container} -- ps -e | wc -l")
                if total_processes:
                    total_processes_metric.labels(namespace=namespace, pod=pod, container=container).set(int(total_processes))
                # Get number of running processes
                running_processes = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ps -e --no-headers -o stat | grep "R" | wc -l')
                if running_processes:
                    running_processes_metric.labels(namespace=namespace, pod=pod, container=container).set(int(running_processes))
                # Get total number of TCP connections
                tcp_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant | wc -l')
                if tcp_connections:
                    tcp_total_connections.labels(namespace=namespace, pod=pod, container=container).set(int(tcp_connections) - 1 if int(tcp_connections) > 0 else int(tcp_connections))
                # Get TCP connections in ESTABLISHED state
                established_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state ESTABLISHED | wc -l')
                if established_connections:
                    tcp_established.labels(namespace=namespace, pod=pod, container=container).set(int(established_connections))
                # Get TCP connections in FIN-WAIT-1 state
                fin_wait_1_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state FIN-WAIT-1 | wc -l')
                if fin_wait_1_connections:
                    tcp_fin_wait_1.labels(namespace=namespace, pod=pod, container=container).set(int(fin_wait_1_connections))
                # Get TCP connections in FIN-WAIT-2 state
                fin_wait_2_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state FIN-WAIT-2 | wc -l')
                if fin_wait_2_connections:
                    tcp_fin_wait_2.labels(namespace=namespace, pod=pod, container=container).set(int(fin_wait_2_connections))
                # Get TCP connections in LAST-ACK state
                last_ack_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state LAST-ACK | wc -l')
                if last_ack_connections:
                    tcp_last_ack.labels(namespace=namespace, pod=pod, container=container).set(int(last_ack_connections))
                # Get TCP connections in LISTEN state
                listen_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state LISTEN | wc -l')
                if listen_connections:
                    tcp_listen.labels(namespace=namespace, pod=pod, container=container).set(int(listen_connections))
                # Get TCP connections in SYN-RECV state
                syn_recv_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state SYN-RECV | wc -l')
                if syn_recv_connections:
                    tcp_syn_recv.labels(namespace=namespace, pod=pod, container=container).set(int(syn_recv_connections))
                # Get TCP connections in TIME-WAIT state
                time_wait_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state TIME-WAIT | wc -l')
                if time_wait_connections:
                    tcp_time_wait.labels(namespace=namespace, pod=pod, container=container).set(int(time_wait_connections))
                # Get total number of UDP connections
                udp_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -anu | wc -l')
                if udp_connections:
                    udp_total_connections.labels(namespace=namespace, pod=pod, container=container).set(int(udp_connections) - 1 if int(udp_connections) > 0 else int(udp_connections))
                # Get the time difference between the container and Prometheus
                container_time = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- date +%s')
                if container_time:
                    prometheus_response = run_command('curl -s "http://101.132.104.118:9090/api/v1/query?query=time()"')
                    prometheus_time = float(json.loads(prometheus_response)["data"]["result"][1])
                    time_diff_second = abs(int(container_time) - prometheus_time)
                    time_diff.labels(namespace=namespace, pod=pod, container=container).set(float(time_diff_second))

if __name__ == '__main__':
    start_http_server(8000)  # Start Prometheus metrics server
    while True:
        collect_metrics()
        time.sleep(60)  # Collect metrics every 60 seconds
v2: uses the Python kubernetes client library (it automatically loads the service-account credentials and the cluster API address)
from kubernetes import client, config
from prometheus_client import start_http_server, Gauge
from kubernetes.stream import stream
import time
# Initialize the Kubernetes client
config.load_incluster_config()
# Prometheus metrics
passwd_mod_time = Gauge('container_passwd_modification_time', 'Last modification time of /etc/passwd', ['namespace', 'pod', 'container'])
max_open_files = Gauge('container_max_open_files', 'Maximum number of open files allowed', ['namespace', 'pod', 'container'])
max_processes = Gauge('container_max_processes', 'Maximum number of processes allowed', ['namespace', 'pod', 'container'])
def get_all_deployments():
    v1 = client.AppsV1Api()
    deployments = v1.list_deployment_for_all_namespaces()
    return [(item.metadata.namespace, item.metadata.name) for item in deployments.items]

def get_pods(namespace, deployment):
    v1 = client.CoreV1Api()
    label_selector = f'app={deployment}'
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    return [pod.metadata.name for pod in pods.items]

def get_containers(namespace, pod):
    v1 = client.CoreV1Api()
    pod_info = v1.read_namespaced_pod(pod, namespace)
    return [container.name for container in pod_info.spec.containers]

def exec_command_in_container(namespace, pod, container, command):
    try:
        v1 = client.CoreV1Api()
        exec_command = ['/bin/sh', '-c', command]
        resp = stream(v1.connect_get_namespaced_pod_exec, pod, namespace,
                      container=container, command=exec_command, stderr=True,
                      stdin=False, stdout=True, tty=False)
        return resp.strip()
    except Exception as e:
        print(f"Error executing command in {namespace}/{pod}/{container}: {e}")
        return None

def collect_metrics():
    deployments = get_all_deployments()
    for namespace, deployment in deployments:
        pods = get_pods(namespace, deployment)
        for pod in pods:
            containers = get_containers(namespace, pod)
            for container in containers:
                mod_time = exec_command_in_container(namespace, pod, container, 'stat -c %Y /etc/passwd')
                if mod_time:
                    passwd_mod_time.labels(namespace=namespace, pod=pod, container=container).set(float(mod_time))
                open_files = exec_command_in_container(namespace, pod, container, 'ulimit -n')
                if open_files:
                    max_open_files.labels(namespace=namespace, pod=pod, container=container).set(int(open_files) if open_files != 'unlimited' else float('inf'))
                processes = exec_command_in_container(namespace, pod, container, 'ulimit -u')
                if processes:
                    max_processes.labels(namespace=namespace, pod=pod, container=container).set(int(processes) if processes != 'unlimited' else float('inf'))

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        collect_metrics()
        time.sleep(60)
Dockerfile
# Use the official Python runtime as the base image
FROM python:3.9-slim
# Install kubectl and other dependencies
RUN apt-get update && apt-get install -y curl \
    bash \
    procps \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /root/.kube \
    && curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
    && chmod +x kubectl \
    && mv kubectl /usr/local/bin/kubectl
# Set the working directory inside the container
WORKDIR /app
# Copy the contents of the local .kube directory into /root/.kube in the container
COPY .kube /root/.kube
# Copy the contents of the current directory into the container
COPY . /app
# Install the required Python packages
RUN pip install prometheus-client kubernetes
# Expose the Prometheus port
EXPOSE 8000
# Command executed when the container starts
CMD ["python", "custom-exporter.py"]
Build and push the image
version=v0.17 && docker build -t custom-exporter:${version} . && docker tag custom-exporter:${version} registry.cn-shanghai.aliyuncs.com/test/custom-exporter:${version} && docker push registry.cn-shanghai.aliyuncs.com/test/custom-exporter:${version}
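Assuming the RBAC and Deployment/Service manifests above are saved as rbac.yaml and exporter.yaml (hypothetical file names), a rough deploy-and-verify sequence would be:
kubectl apply -f rbac.yaml -f exporter.yaml
kubectl -n prom port-forward deploy/custom-exporter 8000:8000
curl -s http://localhost:8000/metrics | head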