You can use kubectl to obtain container names dynamically and combine them with the namespace and Pod name taken from environment variables, so that monitoring data is correctly associated with a specific container. The steps and code are described below.
Implementation steps
Create a service account and grant it permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-exec-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec"]
    verbs: ["create", "get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-exec-role-binding
subjects:
  - kind: ServiceAccount
    name: custom-exporter
    namespace: prom
roleRef:
  kind: ClusterRole
  name: pod-exec-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-exporter
  namespace: prom
Commands to test the permissions:
kubectl exec -it t-test-main-7858665cb9-sqnv5 --container t-test-main-0 --as=system:serviceaccount:prom:custom-exporter -- /bin/bash
kubectl auth can-i create pods/exec --as=system:serviceaccount:prom:custom-exporter
kubectl auth can-i get pods --as=system:serviceaccount:prom:custom-exporter --namespace default
1. **Specify the namespace and pod via environment variables**
In Kubernetes, the Downward API can inject the namespace and Pod name into a container's environment variables.
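A minimal sketch of such Downward API entries inside a container spec (the variable names MY_POD_NAMESPACE and MY_POD_NAME are illustrative and are not used by the manifests in this article):
env:
  - name: MY_POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
The exporter itself is deployed with the following Deployment and Service: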
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-exporter
  namespace: prom
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-exporter
  template:
    metadata:
      labels:
        app: custom-exporter
      annotations:
        prometheus.io/scrape: "true"
    spec:
      serviceAccountName: custom-exporter
      containers:
        - name: exporter
          image: registry-vpc.cn-shanghai.aliyuncs.com/test/custom-exporter:v0.17
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: custom-exporter
  namespace: prom
spec:
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  selector:
    app: custom-exporter
  type: ClusterIP
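The prometheus.io/scrape: "true" annotation above only takes effect if the Prometheus server is configured to discover Pods and keep the annotated ones. As a sketch of the common annotation-based pattern (not necessarily the exact configuration running in this cluster), the scrape job would contain something like:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true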
2. Dynamically obtain container names
Container names can be obtained dynamically with kubectl. When running inside a Pod, use kubectl get pods to query the containers of a Pod.
For example, the following command lists all containers of a given Pod:
kubectl get pod t-test-main-74b676dfbb-7zj9l -n default -o jsonpath='{.spec.containers[*].name}'
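When the exporter itself runs in the cluster, the same lookup can target its own Pod by combining this with step 1; assuming the illustrative MY_POD_NAMESPACE and MY_POD_NAME variables from the sketch above are injected, it would look like:
kubectl get pod "$MY_POD_NAME" -n "$MY_POD_NAMESPACE" -o jsonpath='{.spec.containers[*].name}'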
3. Code implementation
v1: requires the kubectl command to be available inside the container environment
In the custom exporter, Python obtains the container names dynamically and exposes them, together with the namespace and Pod name, as metric labels.
import subprocess
from prometheus_client import start_http_server, Gauge
import time
import json
# Prometheus metrics
passwd_mod_time = Gauge('container_passwd_modification_time', 'Last modification time of /etc/passwd', ['namespace', 'pod', 'container'])
max_open_files = Gauge('container_max_open_files', 'Maximum number of open files allowed', ['namespace', 'pod', 'container'])
max_processes = Gauge('container_max_processes', 'Maximum number of processes allowed', ['namespace', 'pod', 'container'])
total_processes_metric = Gauge('total_processes', 'Total number of processes', ['namespace', 'pod', 'container'])
running_processes_metric = Gauge('running_processes', 'Number of running processes', ['namespace', 'pod', 'container'])
tcp_total_connections = Gauge('tcp_total_connections', 'Total number of TCP connections', ['namespace', 'pod', 'container'])
tcp_established = Gauge('tcp_established_connections', 'Number of TCP connections in ESTABLISHED state', ['namespace', 'pod', 'container'])
tcp_fin_wait_1 = Gauge('tcp_fin_wait_1_connections', 'Number of TCP connections in FIN-WAIT-1 state', ['namespace', 'pod', 'container'])
tcp_fin_wait_2 = Gauge('tcp_fin_wait_2_connections', 'Number of TCP connections in FIN-WAIT-2 state', ['namespace', 'pod', 'container'])
tcp_last_ack = Gauge('tcp_last_ack_connections', 'Number of TCP connections in LAST-ACK state', ['namespace', 'pod', 'container'])
tcp_listen = Gauge('tcp_listen_connections', 'Number of TCP connections in LISTEN state', ['namespace', 'pod', 'container'])
tcp_syn_recv = Gauge('tcp_syn_recv_connections', 'Number of TCP connections in SYN-RECV state', ['namespace', 'pod', 'container'])
tcp_time_wait = Gauge('tcp_time_wait_connections', 'Number of TCP connections in TIME-WAIT state', ['namespace', 'pod', 'container'])
udp_total_connections = Gauge('udp_total_connections', 'Total number of UDP connections', ['namespace', 'pod', 'container'])
time_diff = Gauge('container_prometheus_time_diff_seconds', 'Time difference between container and Prometheus in seconds', ['namespace', 'pod', 'container'])
# Function to run a shell command and return the output
def run_command(command):
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    return result.stdout.decode().strip()

# Get all deployments in the cluster
def get_all_deployments():
    cmd = "kubectl get deployments --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{\"\\n\"}{end}'"
    deployments = run_command(cmd)
    return [line.split() for line in deployments.splitlines()]

# Get all pods in a specific deployment
def get_pods(namespace, deployment):
    cmd = f"kubectl get pods -n {namespace} -l app={deployment} -o jsonpath='{{.items[*].metadata.name}}'"
    pods = run_command(cmd)
    return pods.split()

# Get all containers in a specific pod
def get_containers(namespace, pod):
    cmd = f"kubectl get pod {pod} -n {namespace} -o jsonpath='{{.spec.containers[*].name}}'"
    containers = run_command(cmd)
    return containers.split()

# Convert 'unlimited' to a suitable numeric value
def convert_to_int(value):
    if value == 'unlimited':
        return float('inf')  # or pick a large number instead, e.g. return 9999999
    return int(value)

# Get TCP connections count in a specific state
def get_tcp_connections_count(state, namespace, pod, container):
    cmd = f"kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state {state} | wc -l"
    return run_command(cmd)

# Get container's current timestamp
def get_container_time(namespace, pod, container):
    cmd = f"kubectl exec -n {namespace} {pod} -c {container} -- date +%s"
    container_time = run_command(cmd)
    return int(container_time) if container_time.isdigit() else None
# Collect metrics from each container
def collect_metrics():
    deployments = get_all_deployments()
    for namespace, deployment in deployments:
        pods = get_pods(namespace, deployment)
        for pod in pods:
            containers = get_containers(namespace, pod)
            for container in containers:
                # Get modification time of /etc/passwd
                mod_time = run_command(f"kubectl exec -n {namespace} {pod} -c {container} -- stat -c %Y /etc/passwd")
                if mod_time:
                    passwd_mod_time.labels(namespace=namespace, pod=pod, container=container).set(float(mod_time))
                # Get max open files and handle 'unlimited'
                open_files = run_command(f"kubectl exec -n {namespace} {pod} -c {container} -- bash -c 'ulimit -n'")
                if open_files:
                    max_open_files.labels(namespace=namespace, pod=pod, container=container).set(convert_to_int(open_files))
                # Get max processes and handle 'unlimited'
                processes = run_command(f"kubectl exec -n {namespace} {pod} -c {container} -- bash -c 'ulimit -u'")
                if processes:
                    max_processes.labels(namespace=namespace, pod=pod, container=container).set(convert_to_int(processes))
                # Get total number of processes (all processes)
                total_processes = run_command(f"kubectl exec -n {namespace} {pod} -c {container} -- ps -e | wc -l")
                if total_processes:
                    total_processes_metric.labels(namespace=namespace, pod=pod, container=container).set(int(total_processes))
                # Get number of running processes
                running_processes = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ps -e --no-headers -o stat | grep "R" | wc -l')
                if running_processes:
                    running_processes_metric.labels(namespace=namespace, pod=pod, container=container).set(int(running_processes))
                # Get total number of TCP connections
                tcp_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant | wc -l')
                if tcp_connections:
                    tcp_total_connections.labels(namespace=namespace, pod=pod, container=container).set(int(tcp_connections) - 1 if int(tcp_connections) > 0 else int(tcp_connections))
                # Get TCP connections in ESTABLISHED state
                established_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state ESTABLISHED | wc -l')
                if established_connections:
                    tcp_established.labels(namespace=namespace, pod=pod, container=container).set(int(established_connections))
                # Get TCP connections in FIN-WAIT-1 state
                fin_wait_1_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state FIN-WAIT-1 | wc -l')
                if fin_wait_1_connections:
                    tcp_fin_wait_1.labels(namespace=namespace, pod=pod, container=container).set(int(fin_wait_1_connections))
                # Get TCP connections in FIN-WAIT-2 state
                fin_wait_2_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state FIN-WAIT-2 | wc -l')
                if fin_wait_2_connections:
                    tcp_fin_wait_2.labels(namespace=namespace, pod=pod, container=container).set(int(fin_wait_2_connections))
                # Get TCP connections in LAST-ACK state
                last_ack_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state LAST-ACK | wc -l')
                if last_ack_connections:
                    tcp_last_ack.labels(namespace=namespace, pod=pod, container=container).set(int(last_ack_connections))
                # Get TCP connections in LISTEN state
                listen_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state LISTEN | wc -l')
                if listen_connections:
                    tcp_listen.labels(namespace=namespace, pod=pod, container=container).set(int(listen_connections))
                # Get TCP connections in SYN-RECV state
                syn_recv_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state SYN-RECV | wc -l')
                if syn_recv_connections:
                    tcp_syn_recv.labels(namespace=namespace, pod=pod, container=container).set(int(syn_recv_connections))
                # Get TCP connections in TIME-WAIT state
                time_wait_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -ant state TIME-WAIT | wc -l')
                if time_wait_connections:
                    tcp_time_wait.labels(namespace=namespace, pod=pod, container=container).set(int(time_wait_connections))
                # Get total number of UDP connections
                udp_connections = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- ss -anu | wc -l')
                if udp_connections:
                    udp_total_connections.labels(namespace=namespace, pod=pod, container=container).set(int(udp_connections) - 1 if int(udp_connections) > 0 else int(udp_connections))
                # Get the time difference between the container and Prometheus
                container_time = run_command(f'kubectl exec -n {namespace} {pod} -c {container} -- date +%s')
                if container_time:
                    prometheus_response = run_command('curl -s "http://101.132.104.118:9090/api/v1/query?query=time()"')
                    prometheus_time = float(json.loads(prometheus_response)["data"]["result"][1])
                    time_diff_second = abs(int(container_time) - prometheus_time)
                    time_diff.labels(namespace=namespace, pod=pod, container=container).set(float(time_diff_second))

if __name__ == '__main__':
    start_http_server(8000)  # Start Prometheus metrics server
    while True:
        collect_metrics()
        time.sleep(60)  # Collect metrics every 60 seconds
v2: uses the Python kubernetes client library (it automatically loads the service-account credentials and the cluster API address)
from kubernetes import client, config
from prometheus_client import start_http_server, Gauge
from kubernetes.stream import stream
import time
# Initialize the Kubernetes client
config.load_incluster_config()
# Prometheus metrics
passwd_mod_time = Gauge('container_passwd_modification_time', 'Last modification time of /etc/passwd', ['namespace', 'pod', 'container'])
max_open_files = Gauge('container_max_open_files', 'Maximum number of open files allowed', ['namespace', 'pod', 'container'])
max_processes = Gauge('container_max_processes', 'Maximum number of processes allowed', ['namespace', 'pod', 'container'])
def get_all_deployments():
    v1 = client.AppsV1Api()
    deployments = v1.list_deployment_for_all_namespaces()
    return [(item.metadata.namespace, item.metadata.name) for item in deployments.items]

def get_pods(namespace, deployment):
    v1 = client.CoreV1Api()
    label_selector = f'app={deployment}'
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    return [pod.metadata.name for pod in pods.items]

def get_containers(namespace, pod):
    v1 = client.CoreV1Api()
    pod_info = v1.read_namespaced_pod(pod, namespace)
    return [container.name for container in pod_info.spec.containers]

def exec_command_in_container(namespace, pod, container, command):
    try:
        v1 = client.CoreV1Api()
        exec_command = ['/bin/sh', '-c', command]
        resp = stream(v1.connect_get_namespaced_pod_exec, pod, namespace,
                      container=container, command=exec_command, stderr=True,
                      stdin=False, stdout=True, tty=False)
        return resp.strip()
    except Exception as e:
        print(f"Error executing command in {namespace}/{pod}/{container}: {e}")
        return None

def collect_metrics():
    deployments = get_all_deployments()
    for namespace, deployment in deployments:
        pods = get_pods(namespace, deployment)
        for pod in pods:
            containers = get_containers(namespace, pod)
            for container in containers:
                mod_time = exec_command_in_container(namespace, pod, container, 'stat -c %Y /etc/passwd')
                if mod_time:
                    passwd_mod_time.labels(namespace=namespace, pod=pod, container=container).set(float(mod_time))
                open_files = exec_command_in_container(namespace, pod, container, 'ulimit -n')
                if open_files:
                    max_open_files.labels(namespace=namespace, pod=pod, container=container).set(int(open_files) if open_files != 'unlimited' else float('inf'))
                processes = exec_command_in_container(namespace, pod, container, 'ulimit -u')
                if processes:
                    max_processes.labels(namespace=namespace, pod=pod, container=container).set(int(processes) if processes != 'unlimited' else float('inf'))

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        collect_metrics()
        time.sleep(60)
Dockerfile
# Use the official Python runtime as the base image
FROM python:3.9-slim
# Install kubectl and other dependencies
RUN apt-get update && apt-get install -y curl \
    bash \
    procps \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /root/.kube \
    && curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
    && chmod +x kubectl \
    && mv kubectl /usr/local/bin/kubectl
# Set the working directory inside the container
WORKDIR /app
# Copy the contents of the local .kube directory into /root/.kube in the container
COPY .kube /root/.kube
# Copy the contents of the current directory into the container
COPY . /app
# Install the required Python packages
RUN pip install prometheus-client kubernetes
# Expose the Prometheus port
EXPOSE 8000
# Command executed when the container starts
CMD ["python", "custom-exporter.py"]
Build and push the image
version=v0.17 && docker build -t custom-exporter:${version} . && docker tag custom-exporter:${version} registry.cn-shanghai.aliyuncs.com/test/custom-exporter:${version} && docker push registry.cn-shanghai.aliyuncs.com/test/custom-exporter:${version}
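Assuming the RBAC and Deployment/Service manifests above are saved as rbac.yaml and exporter.yaml (hypothetical file names), a rough deploy-and-verify sequence would be:
kubectl apply -f rbac.yaml -f exporter.yaml
kubectl -n prom port-forward deploy/custom-exporter 8000:8000
curl -s http://localhost:8000/metrics | head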