14. Microservice Monitoring, Tracing, Alerting, and Load Testing Explained

脸先着地天使

License: CC 4.0 BY-SA

Column: Practical Guide to Microservice Development. Tags: microservice monitoring, Prometheus, Jaeger

Article link: https://blog.youkuaiyun.com/jwt8token/article/details/149732619

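1. Monitoring and Observability

1.1 Running statsd and graphite locally

To see real metric output, we can run statsd and graphite locally in a Docker container. With Docker installed, run the following command; it pulls the image from Docker Hub and starts the container locally: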

docker run -d --name graphite --restart=always \
  -p 80:80 -p 2003-2004:2003-2004 -p 2023-2024:2023-2024 \
  -p 8125:8125/udp -p 8126:8126 \
  hopsoft/graphite-statsd

Now open http://localhost to view the metrics.
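
To make the push model concrete, here is a minimal Java sketch of what a client sends to the container: one StatsD line per metric, delivered over UDP. It assumes the container above is listening on UDP port 8125. The class name StatsdSketch and the metric names are hypothetical and chosen only for illustration; in a real service a metrics library such as Micrometer would normally do this for you.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Minimal StatsD client sketch: formats a metric line and pushes it over UDP.
class StatsdSketch {

    // StatsD line protocol: <name>:<value>|<type>, where "c" is a counter and "ms" a timer.
    static String format(String name, long value, String type) {
        return name + ":" + value + "|" + type;
    }

    // Fire-and-forget send: UDP gives no acknowledgement, so a missing server is not detected here.
    static void send(String line, String host, int port) throws Exception {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(host), port));
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical metric names; host and port match the docker run command above.
        send(format("message_service.requests", 1, "c"), "localhost", 8125);
        send(format("message_service.latency", 42, "ms"), "localhost", 8125);
    }
}
```

Because the client pushes and never waits for a reply, the service stays decoupled from the metrics backend, at the cost of silently losing samples when the backend is down.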

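1.2 Collecting metrics with Prometheus

Prometheus is an open-source monitoring and alerting toolkit, originally developed at SoundCloud in 2012 and inspired by Google's Borgmon. Unlike systems such as statsd, which rely on a push model, Prometheus collects metrics with a pull model: the server periodically scrapes an HTTP endpoint exposed by each service. The steps are as follows:
1. Edit build.gradle: open the message-service project and add the actuator and micrometer-registry-prometheus dependencies to build.gradle: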

group 'com.packtpub.microservices'
version '1.0-SNAPSHOT'
buildscript {
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath group: 'org.springframework.boot', name: 'spring-boot-gradle-plugin', version: '2.0.4.RELEASE'
    }
}
apply plugin: 'java'
apply plugin: 'org.springframework.boot'
sourceCompatibility = 1.8
repositories {
    mavenCentral()
}
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter-web', version: '2.0.4.RELEASE'
    compile group: 'org.springframework.boot', name: 'spring-boot-starter-actuator', version: '2.0.4.RELEASE'
    compile group: 'io.micrometer', name: 'micrometer-core', version: '1.0.6'
    compile group: 'io.micrometer', name: 'micrometer-registry-prometheus', version: '1.0.6'
    compile group: 'io.github.resilience4j', name: 'resilience4j-circuitbreaker', version: '0.11.0'
    compile group: 'log4j', name: 'log4j', version: '1.2.17'
    compile group: 'net.logstash.logback', name: 'logstash-logback-encoder', version: '5.2'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}

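2. Configure application.yml: add the following to application.yml. It enables an endpoint that exposes the metrics collected in the Prometheus metrics registry, and serves the management endpoints added by actuator on a separate port: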

server:
  port: 8082
management:
  server:
    port: 8081
  endpoint:
    metrics:
      enabled: true
    prometheus:
      enabled: true
  endpoints:
    web:
      base-path: "/manage"
      exposure:
        include: "*"
  metrics:
    export:
      prometheus:
        enabled: true

3. Check that the service exposes metrics: run the service and issue the following curl request:

$ curl http://localhost:8081/manage/prometheus

4. Configure Prometheus: create a new file named prometheus.yml in the /tmp directory that lists the scrape targets:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'message-service'
    metrics_path: '/manage/prometheus'
    static_configs:
    - targets: ['localhost:8081']

5. Run Prometheus: download and extract the Prometheus release for your platform, then start it with the configuration file created above:

$ ./prometheus --config.file=/tmp/prometheus.yml

6. View the metrics: open http://localhost:9090 in a browser to run Prometheus queries and inspect the metrics.
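
The /manage/prometheus endpoint returns plain text in the Prometheus exposition format, one sample per line. As a rough illustration of what Prometheus reads on each scrape, the Java sketch below sums sample values per metric name. It is a simplification under stated assumptions: it ignores labels and assumes samples carry no trailing timestamp. The class name and the sample lines in main are hypothetical, not output captured from message-service.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of reading the text exposition format; not a complete parser.
class ExpositionSketch {

    // Prometheus writes infinities as +Inf / -Inf, which Double.parseDouble rejects.
    static double parseValue(String text) {
        if (text.equals("+Inf")) return Double.POSITIVE_INFINITY;
        if (text.equals("-Inf")) return Double.NEGATIVE_INFINITY;
        return Double.parseDouble(text);
    }

    // Sums sample values per metric name, skipping comments (# HELP, # TYPE) and blank lines.
    static Map<String, Double> sumByName(String body) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (String raw : body.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            int valueStart = line.lastIndexOf(' ');
            if (valueStart < 0) continue; // not a "<series> <value>" line
            String series = line.substring(0, valueStart).trim();
            int brace = series.indexOf('{');
            String name = brace >= 0 ? series.substring(0, brace) : series;
            totals.merge(name, parseValue(line.substring(valueStart + 1)), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical scrape body for illustration.
        String body = String.join("\n",
                "# TYPE http_server_requests_seconds summary",
                "http_server_requests_seconds_count{method=\"GET\",status=\"200\"} 3.0",
                "http_server_requests_seconds_count{method=\"POST\",status=\"201\"} 2.0",
                "http_server_requests_seconds_sum{method=\"GET\",status=\"200\"} 1.8");
        System.out.println(sumByName(body));
    }
}
```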

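1.3 Using tracing to simplify debugging

In a microservice architecture, a single request may pass through many different services and write to several data stores and event queues. Tracing makes production incidents easier to debug by recording that path end to end. The steps below set up tracing with Jaeger:
1. Edit build.gradle: open the message-service project and replace the contents of build.gradle with: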

group 'com.packtpub.microservices'
version '1.0-SNAPSHOT'
buildscript {
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath group: 'org.springframework.boot', name: 'spring-boot-gradle-plugin', version: '2.0.4.RELEASE'
    }
}
apply plugin: 'java'
apply plugin: 'org.springframework.boot'
sourceCompatibility = 1.8
repositories {
    mavenCentral()
}
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter-web', version: '2.0.4.RELEASE'
    compile group: 'org.springframework.boot', name: 'spring-boot-starter-actuator', version: '2.0.4.RELEASE'
    compile group: 'io.micrometer', name: 'micrometer-core', version: '1.0.6'
    compile group: 'io.micrometer', name: 'micrometer-registry-statsd', version: '1.0.6'
    compile group: 'io.opentracing.contrib', name: 'opentracing-spring-cloud-starter-jaeger', version: '0.1.13'
    compile group: 'io.github.resilience4j', name: 'resilience4j-circuitbreaker', version: '0.11.0'
    compile group: 'log4j', name: 'log4j', version: '1.2.17'
    compile group: 'net.logstash.logback', name: 'logstash-logback-encoder', version: '5.2'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}

2. Configure application.yml: open application.yml in the src/main/resources directory and add the opentracing configuration:

opentracing:
  jaeger:
    udp-sender:
      host: "localhost"
      port: 6831
spring:
  application:
    name: "message-service"

3. Run Jaeger: start a Jaeger instance with Docker:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest

4. View the traces: run message-service and make a few sample requests, then open http://localhost:16686 in a browser, run a search, and explore the collected traces.
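
Tracing works because every service forwards a small piece of context with each outgoing call. The sketch below models that context in plain Java, using the uber-trace-id header layout that Jaeger clients use for propagation: {trace-id}:{span-id}:{parent-span-id}:{flags}, in hexadecimal. It assumes 64-bit IDs for simplicity, and the class and method names are hypothetical. In message-service the OpenTracing starter handles all of this automatically, so treat this as an illustration of the idea rather than code you would need to write.

```java
// Sketch of trace-context propagation; not the Jaeger client itself.
class TraceContextSketch {
    final long traceId;    // shared by every span that belongs to one request
    final long spanId;     // identifies this unit of work
    final long parentId;   // 0 for the root span
    final boolean sampled; // whether the trace should be reported

    TraceContextSketch(long traceId, long spanId, long parentId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    // A downstream call keeps the trace ID and records the caller as its parent.
    TraceContextSketch child(long newSpanId) {
        return new TraceContextSketch(traceId, newSpanId, spanId, sampled);
    }

    // Serializes as {trace-id}:{span-id}:{parent-span-id}:{flags}; flag bit 1 means sampled.
    String toHeader() {
        return String.format("%x:%x:%x:%x", traceId, spanId, parentId, sampled ? 1 : 0);
    }

    // Parses a header value produced by toHeader (64-bit IDs only).
    static TraceContextSketch fromHeader(String header) {
        String[] parts = header.split(":");
        if (parts.length != 4) {
            throw new IllegalArgumentException("malformed header: " + header);
        }
        return new TraceContextSketch(
                Long.parseUnsignedLong(parts[0], 16),
                Long.parseUnsignedLong(parts[1], 16),
                Long.parseUnsignedLong(parts[2], 16),
                (Integer.parseInt(parts[3], 16) & 1) == 1);
    }

    public static void main(String[] args) {
        TraceContextSketch root = new TraceContextSketch(0xabcL, 0x1L, 0L, true);
        TraceContextSketch child = root.child(0x2L);
        System.out.println("uber-trace-id: " + child.toHeader());
    }
}
```

Because the trace ID survives every hop, the Jaeger UI can stitch spans reported by different services back into a single request timeline.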

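1.4 Alerting on anomalies

In a microservice environment, timely and accurate alerts are essential to keeping services available. The steps below add alerting to message-service with Alertmanager:
1. Modify prometheus.yml: open the prometheus.yml file configured earlier and add the Alertmanager target and the rules file: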

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
    - "rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'message-service'
    metrics_path: '/manage/prometheus'
    static_configs:
    - targets: ['localhost:8081']

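2. Create rules.yml: create a rules.yml file in the /tmp directory that defines the alerting rule. The rule fires when the mean request latency over the last minute stays above 0.5 seconds for one minute: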

groups:
- name: message-service-latency
  rules:
  - alert: HighLatency
    expr: >-
      rate(http_server_requests_seconds_sum{job="message-service", instance="localhost:8081"}[1m])
      / rate(http_server_requests_seconds_count{job="message-service", instance="localhost:8081"}[1m])
      > 0.5
    for: 1m
    labels:
      severity: 'critical'
    annotations:
      summary: High request latency

3. Create alertmanager.yml: create an alertmanager.yml file in the /tmp directory that describes the alert routing and the webhook receiver:

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:4567/'

4. Write a Ruby script: create a Ruby script, echo.rb, that receives alerts and prints them:

require 'sinatra'
post '/' do
    body = request.body.read()
    puts body
    return body
end

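5. Run the services: run the Ruby script, restart Prometheus, and start Alertmanager. Each command runs in the foreground, so use a separate terminal for each: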

$ ruby echo.rb
$ ./prometheus --config.file=/tmp/prometheus.yml
$ ./alertmanager --config.file=/tmp/alertmanager.yml

6. Trigger the alert: open MessageController.java and add the following code so that the controller sleeps for 600 milliseconds before returning a response:

@RequestMapping(path = "/{id}", method = RequestMethod.GET, produces = "application/json")
public Message get(@PathVariable("id") String id) throws MessageNotFoundException {
    // Simulate a slow handler: 600 ms is above the 0.5 s alert threshold.
    try { Thread.sleep(600); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    return messagesStore.get(id);
}

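7. Verify the alert: run the updated message-service and make some requests. After about a minute, Prometheus should notify Alertmanager, which in turn notifies the local Ruby service, confirming that alerting works end to end.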
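
The HighLatency expression divides the rate of http_server_requests_seconds_sum by the rate of http_server_requests_seconds_count, which yields the mean latency per request over the window. The Java sketch below reproduces that arithmetic from two scrapes of the counters. It is a simplification: Prometheus's rate() also extrapolates to the window edges and handles counter resets, which this sketch ignores, and the class and parameter names are hypothetical.

```java
// Sketch of the arithmetic behind the HighLatency rule; not how Prometheus evaluates it internally.
class LatencyAlertSketch {

    // Mean seconds per request between two scrapes of the _sum and _count counters.
    // The window length cancels out because both rates share the same denominator.
    static double meanLatencySeconds(double sumBefore, double countBefore,
                                     double sumAfter, double countAfter) {
        double requests = countAfter - countBefore;
        if (requests <= 0) {
            return Double.NaN; // no traffic in the window, so there is nothing to average
        }
        return (sumAfter - sumBefore) / requests;
    }

    // NaN compares false against any threshold, so an idle service never fires.
    static boolean exceeds(double meanSeconds, double thresholdSeconds) {
        return meanSeconds > thresholdSeconds;
    }

    public static void main(String[] args) {
        // Ten requests that each slept 600 ms add about 6 seconds to the _sum counter.
        double mean = meanLatencySeconds(10.0, 20, 16.0, 30);
        System.out.println("mean latency (s): " + mean);
        System.out.println("above 0.5 s threshold: " + exceeds(mean, 0.5));
    }
}
```

The rule's `for: 1m` clause adds one more condition on top of this: the comparison has to stay true for a full minute before the alert is sent to Alertmanager.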

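2. Load Testing Microservices

2.1 Why load testing matters

Load testing is an important part of predicting how a service will behave over time. When load testing, we should not stop at simple questions such as "how many requests per second can our system handle"; we should try to understand how the whole system performs under a variety of load conditions. Answering those questions requires knowing the system's infrastructure and the dependencies between services.

2.2 Load testing microservices with Vegeta

Vegeta is an open-source load testing tool that drives HTTP services at a constant request rate. The steps below load test message-service with Vegeta:
1. Modify MessageRepository: add a new in-memory map to the MessageRepository class to store each user's inbox: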

package com.packtpub.microservices.ch08.message;
import com.packtpub.microservices.ch08.message.exceptions.MessageNotFoundException;
import com.packtpub.microservices.ch08.message.models.Message;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
public class MessageRepository {
    private ConcurrentHashMap<String, Message> messages;
    private ConcurrentHashMap<String, List<Message>> inbox;
    public MessageRepository() {
        messages = new ConcurrentHashMap<>();
        inbox = new ConcurrentHashMap<>();
    }
    public Message save(Message message) {
        UUID uuid = UUID.randomUUID();
        Message saved = new Message(uuid.toString(), message.getSender(), message.getRecipient(),
                message.getBody(), message.getAttachmentUri());
        messages.put(uuid.toString(), saved);
        List<Message> userInbox = inbox.getOrDefault(message.getRecipient(), new ArrayList<>());
        userInbox.add(saved);
        inbox.put(message.getRecipient(), userInbox);
        return saved;
    }
    public Message get(String id) throws MessageNotFoundException {
        if (messages.containsKey(id)) {
            return messages.get(id);
        } else {
            throw new MessageNotFoundException("Message " + id + " could not be found");
        }
    }
    public List<Message> getByUser(String userId) {
        return inbox.getOrDefault(userId, new ArrayList<>());
    }
}

2. Modify MessageController: add a new endpoint to the MessageController class that returns all messages for a given user:

package com.packtpub.microservices.ch08.message.controllers;
import com.packtpub.microservices.ch08.message.MessageRepository;
import com.packtpub.microservices.ch08.message.clients.SocialGraphClient;
import com.packtpub.microservices.ch08.message.exceptions.MessageNotFoundException;
import com.packtpub.microservices.ch08.message.exceptions.MessageSendForbiddenException;
import com.packtpub.microservices.ch08.message.exceptions.MessagesNotFoundException;
import com.packtpub.microservices.ch08.message.models.Message;
import com.packtpub.microservices.ch08.message.models.UserFriendships;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.scheduling.annotation.Async;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.client.RestTemplate;
import org.springframework.web.servlet.support.ServletUriComponentsBuilder;
import java.net.URI;
import java.util.List;
import java.util.concurrent.CompletableFuture;
@RestController
public class MessageController {
    @Autowired
    private MessageRepository messagesStore;
    @Autowired
    private SocialGraphClient socialGraphClient;
    @RequestMapping(path = "/{id}", method = RequestMethod.GET, produces = "application/json")
    public Message get(@PathVariable("id") String id) throws MessageNotFoundException {
        return messagesStore.get(id);
    }
    @RequestMapping(path = "/", method = RequestMethod.POST, produces = "application/json")
    public ResponseEntity<Message> send(@RequestBody Message message) throws MessageSendForbiddenException {
        List<String> friendships = socialGraphClient.getFriendships(message.getSender());
        if (!friendships.contains(message.getRecipient())) {
            throw new MessageSendForbiddenException("Must be friends to send message");
        }
        Message saved = messagesStore.save(message);
        URI location = ServletUriComponentsBuilder
                .fromCurrentRequest().path("/{id}")
                .buildAndExpand(saved.getId()).toUri();
        return ResponseEntity.created(location).build();
    }
    @RequestMapping(path = "/user/{userId}", method = RequestMethod.GET, produces = "application/json")
    public ResponseEntity<List<Message>> getByUser(@PathVariable("userId") String userId) throws MessageNotFoundException {
        List<Message> inbox = messagesStore.getByUser(userId);
        if (inbox.isEmpty()) {
            throw new MessageNotFoundException("No messages found for user: " + userId);
        }
        return ResponseEntity.ok(inbox);
    }
    @Async
    public CompletableFuture<Boolean> isFollowing(String fromUser, String toUser) {
        String url = String.format(
                "http://localhost:4567/followings?user=%s&filter=%s",
                fromUser, toUser);
        RestTemplate template = new RestTemplate();
        UserFriendships followings = template.getForObject(url,
                UserFriendships.class);
        return CompletableFuture.completedFuture(
                followings.getFriendships().isEmpty()
        );
    }
}

3. Create a mock social graph service: create a Ruby script, socialgraph.rb, that stands in for the social graph service:

require 'sinatra'
require 'json' # provides Hash#to_json
get '/friendships/:user' do
    content_type :json
    {
        username: "user:32134",
        friendships: [
            "user:12345"
        ]
    }.to_json
end

4. Install Vegeta: on macOS with Homebrew installed, Vegeta can be installed with:

$ brew update && brew install vegeta

5. Create the target files: create message-request-body.json containing the request body for creating a message:

{
    "sender": "user:32134",
    "recipient": "user:12345",
    "body": "Hello there!",
    "attachment_uri": "http://foo.com/image.png"
}

Then create targets.txt containing the two requests:

POST http://localhost:8082/
Content-Type: application/json
@message-request-body.json

GET http://localhost:8082/user/user:12345

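6. Run the load test: start message-service and the mock social graph service, then run the following command (recent Vegeta releases name the report flag -type rather than -reporter):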

$ cat targets.txt | vegeta attack -duration=60s -rate=100 | vegeta report -reporter=text

The load test produced the following results:
| Metric | Value |
| --- | --- |
| Requests [total, rate] | 6000, 100.01 |
| Duration [total, attack, wait] | 1m0.004668981s, 59.99172349s, 12.945491ms |
| Latencies [mean, 50, 95, 99, max] | 10.683968ms, 5.598656ms, 35.108562ms, 98.290388ms, 425.186942ms |
| Bytes In [total, mean] | 667057195, 111176.20 |
| Bytes Out [total, mean] | 420000, 70.00 |
| Success [ratio] | 99.80% |
| Status Codes [code:count] | 201:3000 500:12 200:2988 |
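
One thing worth checking before trusting numbers from a run like this is whether the in-memory store itself is safe under concurrent writes. MessageRepository.save reads the recipient's list with getOrDefault, appends to a plain ArrayList, and puts it back, so two concurrent writers for the same recipient can interleave. Whether that explains the twelve 500 responses above is not established here; the sketch below only shows one way to make the inbox update atomic, using computeIfAbsent and CopyOnWriteArrayList. It stores message IDs as strings to stay self-contained, and the class name InboxSketch is hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of a thread-safe inbox; a simplified stand-in for MessageRepository's inbox map.
class InboxSketch {
    private final ConcurrentHashMap<String, List<String>> inbox = new ConcurrentHashMap<>();

    // computeIfAbsent creates the list at most once per recipient, and the
    // copy-on-write list tolerates concurrent appends and concurrent readers.
    void deliver(String recipient, String messageId) {
        inbox.computeIfAbsent(recipient, key -> new CopyOnWriteArrayList<>()).add(messageId);
    }

    List<String> getByUser(String userId) {
        List<String> messages = inbox.get(userId);
        return messages == null ? Collections.<String>emptyList() : messages;
    }

    // Delivers perThread messages from each of several threads and returns the inbox size.
    static int runConcurrently(int threads, int perThread) {
        InboxSketch store = new InboxSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    store.deliver("user:12345", id + "-" + i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return store.getByUser("user:12345").size();
    }

    public static void main(String[] args) {
        System.out.println("messages delivered: " + runConcurrently(8, 500));
    }
}
```

CopyOnWriteArrayList copies the backing array on every write, so it suits inboxes that are read far more often than they are written; for write-heavy workloads a different structure would be a better fit.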

With the steps above, we can monitor, trace, alert on, and load test a microservice, which helps keep it stable and performant.

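2.3 Load testing microservices with Gatling

Gatling is a high-performance load testing tool written in Scala. It uses an asynchronous, non-blocking I/O model, which lets it simulate a large number of concurrent users efficiently. The steps below load test message-service with Gatling.

2.3.1 Project configuration

First, add the Gatling dependency to the project. For a Maven project, add the following dependency to pom.xml: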

<dependency>
    <groupId>io.gatling.highcharts</groupId>
    <artifactId>gatling-charts-highcharts</artifactId>
    <version>3.9.5</version>
    <scope>test</scope>
</dependency>

For a Gradle project, add the following dependency to build.gradle:

testImplementation 'io.gatling.highcharts:gatling-charts-highcharts:3.9.5'

2.3.2 Writing the test script

Create a new Scala file in the src/test/scala directory, for example MessageServiceSimulation.scala, and write the following test script:

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class MessageServiceSimulation extends Simulation {

  // Configure the HTTP protocol
  val httpProtocol = http
    .baseUrl("http://localhost:8082")
    .acceptHeader("application/json")
    .contentTypeHeader("application/json")

  // Define the scenario
  val scn = scenario("MessageServiceScenario")
    .exec(
      http("CreateMessage")
        .post("/")
        .body(StringBody("""{
          "sender": "user:32134",
          "recipient": "user:12345",
          "body": "Hello from Gatling!",
          "attachment_uri": "http://foo.com/image.png"
        }"""))
    )
    .pause(1)
    .exec(
      http("GetMessagesByUser")
        .get("/user/user:12345")
    )

  // Set the injection profile
  setUp(
    scn.inject(atOnceUsers(100))
  ).protocols(httpProtocol)
}

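The script defines a scenario named MessageServiceScenario with two requests: a POST request that creates a message, followed after a one-second pause by a GET request that fetches the user's messages. atOnceUsers(100) injects 100 virtual users that all start at the same time.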
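
To illustrate what "all at once" means as a load shape, the Java sketch below releases a fixed number of simulated users from a shared start gate. This is only an approximation for intuition: Gatling itself does not dedicate a thread to each virtual user, and the class name AtOnceUsersSketch and its run method are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Thread-per-user approximation of an "at once" injection; not how Gatling is implemented.
class AtOnceUsersSketch {

    // Starts `users` threads, holds them at a gate, releases them together,
    // and returns how many finished the scenario.
    static int run(int users, Runnable scenario) {
        CountDownLatch startGate = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(users);
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < users; i++) {
            new Thread(() -> {
                try {
                    startGate.await();      // wait until every user is ready
                    scenario.run();         // e.g. POST a message, pause, GET the inbox
                    completed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        startGate.countDown();              // release all users at the same moment
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        int finished = run(100, () -> { /* the scenario body would issue HTTP requests here */ });
        System.out.println("users finished: " + finished);
    }
}
```

A burst like this mostly tests how the service copes with a sudden spike; a gradual ramp-up usually says more about sustained capacity.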

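2.3.3 Running the test

Run one of the following commands in a terminal to execute the Gatling test. Both assume that the corresponding Gatling build plugin (Maven or Gradle) is already configured in the project: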

$ mvn gatling:test   # when using Maven
$ gradle gatlingRun  # when using Gradle

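When the test finishes, Gatling generates a detailed HTML report. With Maven it is written under the target/gatling directory; with the Gradle plugin it is typically written under build/reports/gatling. Open the report file to view the results.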

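2.4 Building an auto scaling cluster

In a microservice architecture, an auto scaling cluster adjusts the number of service instances to match actual traffic, which helps keep the service available and performant. The steps below build an auto scaling cluster on AWS.

2.4.1 Creating a launch configuration

A launch configuration defines the basic properties of the instances in an Auto Scaling group, including the AMI (Amazon Machine Image), the instance type, and the security groups. AWS now recommends launch templates over launch configurations, but the overall flow is the same. To create a launch configuration:
1. Sign in to the AWS Management Console and navigate to the EC2 service.
2. In the left navigation pane, choose "Launch Configurations".
3. Click "Create launch configuration".
4. Choose a suitable AMI, for example Amazon Linux 2.
5. Choose an instance type with the CPU, memory, and storage that the service needs.
6. Configure the security groups so that the instances have the network access they need.
7. Click "Create launch configuration" to finish.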

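2.4.2 Creating an Auto Scaling group

An Auto Scaling group creates and manages instances based on the launch configuration. To create one:
1. In the EC2 service, choose "Auto Scaling Groups".
2. Click "Create Auto Scaling group".
3. Select the launch configuration created earlier.
4. Configure the group details, including the number of instances and the subnets.
5. Set a scaling policy so that the instance count adjusts automatically based on metrics such as CPU utilization or network traffic.
6. Click "Create Auto Scaling group" to finish.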

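2.4.3 Configuring scaling policies

A scaling policy defines which metrics drive scaling for the Auto Scaling group. To configure one:
1. On the Auto Scaling group's details page, choose "Scaling policies".
2. Click "Create policy".
3. Choose a policy type, for example simple scaling or target tracking scaling.
4. Configure the policy parameters, such as the threshold and the number of instances to add or remove.
5. Click "Create policy" to finish.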
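
The intuition behind a target tracking policy can be sketched as a proportional rule: scale the instance count by the ratio of the observed metric to its target, then clamp to the group's minimum and maximum size. The Java sketch below shows that rule under stated assumptions. It is a simplified model for building intuition, not AWS's actual algorithm, and the class, method, and parameter names are hypothetical.

```java
// Simplified proportional scaling rule; an illustration, not the AWS implementation.
class TargetTrackingSketch {

    // desired = ceil(current * observed / target), clamped to [min, max].
    static int desiredCapacity(int current, double observedMetric, double targetMetric,
                               int min, int max) {
        if (targetMetric <= 0) {
            throw new IllegalArgumentException("target must be positive");
        }
        int desired = (int) Math.ceil(current * observedMetric / targetMetric);
        return Math.max(min, Math.min(max, desired));
    }

    public static void main(String[] args) {
        // Four instances at 80% average CPU with a 50% target: scale out.
        System.out.println(desiredCapacity(4, 80.0, 50.0, 2, 10));
        // Four instances at 25% average CPU with a 50% target: scale in.
        System.out.println(desiredCapacity(4, 25.0, 50.0, 2, 10));
    }
}
```

Rounding up errs on the side of extra capacity, and the clamp keeps a misbehaving metric from scaling the group to zero or without bound.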

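2.5 Capacity planning strategies

Capacity planning ensures that a service can meet future business demand. Some commonly used strategies follow.

2.5.1 Historical data analysis

Analyze historical data to understand the service's traffic patterns and peaks. Monitoring tools can collect request volume, response time, and similar metrics for statistical analysis. Forecasting traffic growth from that history gives capacity planning a factual basis.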
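
As a minimal example of forecasting from history, the Java sketch below fits a straight line to past weekly peak request rates with ordinary least squares and extrapolates it. A linear trend is an assumption made for illustration; real traffic is often seasonal or bursty and may call for a richer model. The class name and the sample numbers are hypothetical.

```java
// Least-squares trend line over historical peaks; a sketch, not a full forecasting model.
class TrendForecastSketch {

    // Fits y = intercept + slope * x and returns {slope, intercept}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        if (n < 2 || y.length != n) {
            throw new IllegalArgumentException("need at least two paired points");
        }
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        double denominator = n * sumXX - sumX * sumX;
        if (denominator == 0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = (n * sumXY - sumX * sumY) / denominator;
        double intercept = (sumY - slope * sumX) / n;
        return new double[] {slope, intercept};
    }

    // Evaluates the fitted line at x.
    static double forecast(double[] model, double x) {
        return model[1] + model[0] * x;
    }

    public static void main(String[] args) {
        double[] week = {0, 1, 2, 3, 4};
        double[] peakRps = {100, 110, 120, 130, 140}; // hypothetical weekly peaks
        double[] model = fit(week, peakRps);
        System.out.println("forecast peak for week 6: " + forecast(model, 6));
    }
}
```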

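2.5.2 Stress testing

Run stress tests that simulate high concurrency to find the service's performance limits. Stress testing shows how the service behaves under different loads and exposes bottlenecks to optimize. Use the results to adjust the service's capacity.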

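2.5.3 Elastic scaling

Use elastic scaling to adjust the number of service instances dynamically in response to actual traffic. Auto Scaling groups and container orchestration tools can both provide this. Elastic scaling improves resource utilization and lowers cost.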

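2.5.4 Reserved headroom

Reserve some resources as a buffer against sudden traffic growth, for example a number of spare instances or extra storage capacity. Headroom improves availability and stability, but it also increases cost.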
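
The trade-off between headroom and cost can be made concrete with a small calculation: take the expected peak request rate, add a headroom percentage, and divide by the rate one instance can sustain, rounding up. The Java sketch below does this with integer arithmetic to avoid floating-point rounding surprises. The per-instance rate would come from a load test such as the ones above; the class name and all numbers here are illustrative assumptions.

```java
// Headroom-based instance count; a back-of-the-envelope sketch with hypothetical inputs.
class CapacityPlanSketch {

    // Instances needed to serve peakRps plus headroomPercent, given perInstanceRps, rounded up.
    static long instancesNeeded(long peakRps, long perInstanceRps, int headroomPercent) {
        if (perInstanceRps <= 0 || peakRps < 0 || headroomPercent < 0) {
            throw new IllegalArgumentException("inputs must be non-negative, per-instance rate positive");
        }
        long requiredTimes100 = peakRps * (100 + headroomPercent); // required rate, scaled by 100
        long capacityTimes100 = perInstanceRps * 100;              // one instance, same scale
        return (requiredTimes100 + capacityTimes100 - 1) / capacityTimes100; // ceiling division
    }

    public static void main(String[] args) {
        // Peak of 1000 requests/s, 100 requests/s per instance, 30% headroom.
        System.out.println(instancesNeeded(1000, 100, 30));
    }
}
```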

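2.6 Summary

Monitoring, tracing, alerting, and load testing are key to keeping microservices stable and performant. Tools such as Prometheus, Jaeger, and Alertmanager provide monitoring, tracing, and alerting; Vegeta and Gatling show how a service performs under different loads; and an auto scaling cluster combined with sound capacity planning lets the service keep up with changing traffic.

The following mermaid flowchart shows the overall flow of monitoring, tracing, alerting, and load testing: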

graph LR
    A[Microservice] --> B[Metrics collection]
    B --> C[Prometheus]
    C --> D[Alerting rules]
    D --> E[Alertmanager]
    E --> F[Alert notification]
    A --> G[Trace collection]
    G --> H[Jaeger]
    H --> I[Trace analysis]
    A --> J[Load testing]
    J --> K[Vegeta/Gatling]
    K --> L[Load test report]
    M[Auto scaling cluster] --> N[Scale on metrics]
    N --> A

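With this flow in place, we have a complete system for monitoring, tracing, alerting on, and load testing microservices, which supports stable, high-performing operation.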