21. AWS Log Monitoring and Alerting Setup Guide

Wind6

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/wind6/article/details/150203739

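1. Architecture Overview

In an AWS environment, efficient log management and monitoring can be built on a log-processing stack made of Kinesis Firehose, Elasticsearch, and S3. Firehose can deliver logs to both Elasticsearch and S3, so records that fail to reach Elasticsearch are still preserved in S3. (The stream defined in this guide sets S3BackupMode to AllDocuments, so every record is also copied to S3, not only the failed ones.) The setup has four steps: create the Elasticsearch cluster, create the Kinesis Firehose stream, update the application to send its logs to the Firehose endpoint, and visualize the logs in Kibana.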

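2. Create and Launch the Elasticsearch Cluster

First, we create an Elasticsearch cluster. AWS offers a managed Elasticsearch service, and we can create the cluster with a CloudFormation template generated by troposphere.
- Create the script file: create a file named elasticsearch-cf-template.py. The script starts by importing the modules it needs: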

"""Generating CloudFormation template."""
from ipaddress import ip_network
from ipify import get_ip
from troposphere import (
    GetAtt,
    Join,
    Output,
    Export,
    Parameter,
    Ref,
    Template,
)
from troposphere.elasticsearch import (
    Domain,
    EBSOptions,
    ElasticsearchClusterConfig,
)

Create the template and look up your public IP address as a CIDR block:

t = Template()
PublicCidrIp = str(ip_network(get_ip()))
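To make the `PublicCidrIp` line concrete: applying `ip_network` to a single address yields a single-host network, which is the CIDR form that the `aws:SourceIp` condition used later expects. The sketch below is only an illustration. It uses a documentation-range address (an assumption standing in for whatever `get_ip()` returns) so that it runs without network access, and the helper name `to_cidr` is hypothetical.

```python
from ipaddress import ip_network


def to_cidr(ip):
    """Return the single-host CIDR block for an IP address string."""
    return str(ip_network(ip))


# 203.0.113.7 is a placeholder from the documentation range (RFC 5737),
# not a real public IP.
print(to_cidr("203.0.113.7"))  # 203.0.113.7/32
```

Because the resulting block covers exactly one host, the access policy only admits the machine that generated the template. If your public IP changes, the template has to be regenerated.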

Add the description and parameters:

t.add_description('Effective DevOps in AWS: Elasticsearch')
t.add_parameter(Parameter(
    "InstanceType",
    Type="String",
    Description="instance type",
    Default="t2.small.elasticsearch",
    AllowedValues=[
        "t2.small.elasticsearch",
        "t2.medium.elasticsearch",
        "m4.large.elasticsearch",
    ],
))
t.add_parameter(Parameter(
    "InstanceCount",
    Default="2",
    Type="String",
    Description="Number of instances in the cluster",
))
t.add_parameter(Parameter(
    "VolumeSize",
    Default="10",
    Type="String",
    Description="Size in GiB of the EBS volumes",
))

Create the Elasticsearch cluster (the Domain resource):

t.add_resource(Domain(
    'ElasticsearchCluster',
    DomainName="logs",
    ElasticsearchVersion="5.3",
    ElasticsearchClusterConfig=ElasticsearchClusterConfig(
        DedicatedMasterEnabled=False,
        InstanceCount=Ref("InstanceCount"),
        ZoneAwarenessEnabled=False,
        InstanceType=Ref("InstanceType"),
    ),
    AdvancedOptions={
        "indices.fielddata.cache.size": "",
        "rest.action.multi.allow_explicit_index": "true",
    },
    EBSOptions=EBSOptions(EBSEnabled=True,
                          Iops=0,
                          VolumeSize=Ref("VolumeSize"),
                          VolumeType="gp2"),
    AccessPolicies={
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                'Principal': {
                    'AWS': [Ref('AWS::AccountId')]
                },
                'Action': 'es:*',
                'Resource': '*',
            },
            {
                'Effect': 'Allow',
                'Principal': {
                    'AWS': "*"
                },
                'Action': 'es:*',
                'Resource': '*',
                'Condition': {
                    'IpAddress': {
                        'aws:SourceIp': PublicCidrIp
                    }
                }
            }
        ]
    },
))
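The second statement of the access policy admits anonymous callers only when their source address falls inside `PublicCidrIp`. The following sketch shows that idea in isolation. It is a deliberately simplified model, not how AWS evaluates policies: it only looks at `IpAddress` / `aws:SourceIp` on Allow statements and ignores principals, actions, resources, and Deny statements. The function name and the addresses are hypothetical.

```python
from ipaddress import ip_address, ip_network


def source_ip_allowed(policy, source_ip):
    """Return True if any Allow statement's aws:SourceIp condition matches.

    Simplified on purpose: statements without an IpAddress condition are
    skipped, and everything except the source IP is ignored.
    """
    addr = ip_address(source_ip)
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        condition = statement.get("Condition", {}).get("IpAddress", {})
        cidrs = condition.get("aws:SourceIp")
        if cidrs is None:
            continue
        if isinstance(cidrs, str):
            cidrs = [cidrs]  # a condition value may be one string or a list
        if any(addr in ip_network(cidr) for cidr in cidrs):
            return True
    return False


# Placeholder policy mirroring the IP-restricted statement in the template.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": "es:*",
        "Resource": "*",
        "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.7/32"}},
    }],
}

print(source_ip_allowed(policy, "203.0.113.7"))   # True
print(source_ip_allowed(policy, "198.51.100.9"))  # False
```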

Add the outputs and print the JSON:

t.add_output(Output(
    "DomainArn",
    Description="Domain Arn",
    Value=GetAtt("ElasticsearchCluster", "DomainArn"),
    Export=Export("LogsDomainArn"),
))
t.add_output(Output(
    "Kibana",
    Description="Kibana url",
    Value=Join("", [
        "https://",
        GetAtt("ElasticsearchCluster", "DomainEndpoint"),
        "/_plugin/kibana/"
    ])
))
print(t.to_json())

Run the script and create the Elasticsearch domain:

$ python elasticsearch-cf-template.py > elasticsearch-cf.template
$ git add elasticsearch-cf-template.py
$ git commit -m "Adding ElasticSearch template"
$ git push
$ aws cloudformation create-stack \
      --stack-name elasticsearch \
      --template-body file://elasticsearch-cf.template \
      --parameters \
          ParameterKey=InstanceType,ParameterValue=t2.small.elasticsearch \
          ParameterKey=InstanceCount,ParameterValue=2 \
          ParameterKey=VolumeSize,ParameterValue=10

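3. Create and Launch the Kinesis Firehose Stream

Once the Elasticsearch cluster is up and running, we create the Kinesis Firehose stream that feeds data into Elasticsearch.
- Create the script file: create a file named firehose-cf-template.py and import the required modules: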

"""Generating CloudFormation template."""
from troposphere import (
    GetAtt,
    Join,
    Ref,
    Template,
    ImportValue
)
from troposphere.firehose import (
    BufferingHints,
    CloudWatchLoggingOptions,
    DeliveryStream,
    S3Configuration,
    ElasticsearchDestinationConfiguration,
    RetryOptions,
)
from troposphere.iam import Role
from troposphere.s3 import Bucket
t = Template()
t.add_description('Effective DevOps in AWS: Kinesis Firehose Stream')

Create the S3 bucket:

t.add_resource(Bucket(
    "S3Bucket",
    DeletionPolicy="Retain"
))

Create the IAM role:

t.add_resource(Role(
    'FirehoseRole',
    ManagedPolicyArns=[
        'arn:aws:iam::aws:policy/AmazonS3FullAccess',
        'arn:aws:iam::aws:policy/AmazonESFullAccess',
    ],
    AssumeRolePolicyDocument={
        'Version': '2012-10-17',
        'Statement': [{
            'Action': 'sts:AssumeRole',
            'Principal': {'Service': 'firehose.amazonaws.com'},
            'Effect': 'Allow',
        }]
    }
))

Create the Firehose stream:

t.add_resource(DeliveryStream(
    'FirehoseLogs',
    DeliveryStreamName='FirehoseLogs',
    ElasticsearchDestinationConfiguration=ElasticsearchDestinationConfiguration(
        DomainARN=ImportValue("LogsDomainArn"),
        RoleARN=GetAtt("FirehoseRole", "Arn"),
        IndexName="logs",
        TypeName="Logs",
        IndexRotationPeriod="OneDay",
        RetryOptions=RetryOptions(
            DurationInSeconds="300"
        ),
        BufferingHints=BufferingHints(
            IntervalInSeconds=60,
            SizeInMBs=1
        ),
        S3BackupMode="AllDocuments",
        S3Configuration=S3Configuration(
            BufferingHints=BufferingHints(
                IntervalInSeconds=300,
                SizeInMBs=5
            ),
            BucketARN=Join("", [
                "arn:aws:s3:::", Ref("S3Bucket")
            ]),
            CompressionFormat='UNCOMPRESSED',
            Prefix='firehose-logs',
            RoleARN=GetAtt("FirehoseRole", "Arn"),
        ),
    )
))
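The two `BufferingHints` blocks control how long Firehose holds records before delivering them: a batch goes out when either the size limit or the time interval is reached, whichever comes first. The toy model below sketches that rule under simplifying assumptions. It is not the Firehose implementation: the class name is made up, and it only checks the clock when a new record arrives, whereas the real service flushes on a timer.

```python
class BufferSketch:
    """Toy model of a size-or-interval buffer (not the Firehose implementation)."""

    def __init__(self, interval_seconds, size_mb):
        self.interval_seconds = interval_seconds
        self.size_bytes = size_mb * 1024 * 1024
        self.records = []
        self.buffered_bytes = 0
        self.window_start = None

    def put(self, record, now):
        """Add a record (bytes) at time `now` in seconds.

        Returns the flushed batch when a limit is reached, otherwise None.
        """
        if self.window_start is None:
            self.window_start = now  # the first record opens a new window
        self.records.append(record)
        self.buffered_bytes += len(record)
        size_hit = self.buffered_bytes >= self.size_bytes
        time_hit = now - self.window_start >= self.interval_seconds
        if size_hit or time_hit:
            batch = self.records
            self.records, self.buffered_bytes, self.window_start = [], 0, None
            return batch
        return None


# Same values as the Elasticsearch destination above: 60 seconds or 1 MB.
es_buffer = BufferSketch(interval_seconds=60, size_mb=1)
print(es_buffer.put(b"x" * 1000, now=0))        # None: neither limit reached
print(len(es_buffer.put(b"y" * 1000, now=61)))  # 2: interval elapsed, flush
```

With the values used in this guide, logs should reach Elasticsearch within roughly a minute, while the S3 copy is written in larger batches (300 seconds or 5 MB).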

Print the JSON and run the script:

print(t.to_json())

$ git add firehose-cf-template.py
$ git commit -m "Adding Firehose template"
$ git push
$ python firehose-cf-template.py > firehose-cf.template
$ aws cloudformation create-stack \
      --stack-name firehose \
      --template-body file://firehose-cf.template \
      --capabilities CAPABILITY_IAM

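4. Update the Application to Send Logs to the Firehose Endpoint

With the Elasticsearch cluster and the Kinesis Firehose stream in place, we update the application so that it sends its logs directly to Kinesis Firehose instead of writing them to disk first.
- Grant EC2 permission to talk to Firehose: edit the nodeserver-cf-template.py script and add the firehose action to the MonitoringPolicy policy: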

t.add_resource(IAMPolicy(
    "MonitoringPolicy",
    PolicyName="AllowSendingDataForMonitoring",
    PolicyDocument=Policy(
        Statement=[
            Statement(
                Effect=Allow,
                Action=[
                    Action("cloudwatch", "Put*"),
                    Action("logs", "Create*"),
                    Action("logs", "Put*"),
                    Action("logs", "Describe*"),
                    Action("events", "Put*"),
                    Action("firehose", "Put*"),
                ],
                Resource=["*"])
        ]
    ),
    Roles=[Ref("Role")],
))
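The new `Action("firehose", "Put*")` entry is a wildcard, so it covers both `firehose:PutRecord` and `firehose:PutRecordBatch` without granting unrelated Firehose operations. The sketch below illustrates the wildcard matching only. It is a rough approximation built on `fnmatch`, not the IAM evaluation engine, and the helper name is hypothetical.

```python
from fnmatch import fnmatchcase


def action_allowed(patterns, action):
    """Return True if `action` matches any 'service:Pattern' entry.

    Both sides are lowercased because IAM action names are not
    case-sensitive; only the * and ? wildcards are meaningful here.
    """
    action = action.lower()
    return any(fnmatchcase(action, pattern.lower()) for pattern in patterns)


# The actions granted by MonitoringPolicy, written in service:action form.
monitoring_actions = [
    "cloudwatch:Put*",
    "logs:Create*",
    "logs:Put*",
    "logs:Describe*",
    "events:Put*",
    "firehose:Put*",
]

print(action_allowed(monitoring_actions, "firehose:PutRecord"))            # True
print(action_allowed(monitoring_actions, "firehose:PutRecordBatch"))       # True
print(action_allowed(monitoring_actions, "firehose:DeleteDeliveryStream")) # False
```

The policy still uses `Resource=["*"]`. Restricting it to the ARN of the FirehoseLogs stream would be a tighter option, at the cost of coupling the two templates.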

Save the script and deploy the update:

$ git add nodeserver-cf-template.py
$ git commit -m "Allowing our application to send logs to Firehose"
$ git push
$ python nodeserver-cf-template.py > nodeserver-cf.template
$ aws cloudformation update-stack \
      --capabilities CAPABILITY_IAM \
      --stack-name helloworld-staging \
      --template-body file://nodeserver-cf.template \
      --parameters \
          ParameterKey=InstanceType,UsePreviousValue=true \
          ParameterKey=KeyPair,UsePreviousValue=true \
          ParameterKey=PublicSubnet,UsePreviousValue=true \
          ParameterKey=ScaleCapacity,UsePreviousValue=true \
          ParameterKey=VpcId,UsePreviousValue=true
$ aws cloudformation update-stack \
      --capabilities CAPABILITY_IAM \
      --stack-name helloworld-production \
      --template-body file://nodeserver-cf.template \
      --parameters \
          ParameterKey=InstanceType,UsePreviousValue=true \
          ParameterKey=KeyPair,UsePreviousValue=true \
          ParameterKey=PublicSubnet,UsePreviousValue=true \
          ParameterKey=ScaleCapacity,UsePreviousValue=true \
          ParameterKey=VpcId,UsePreviousValue=true

Change the log transport:
- Install the winston-firehose package:

$ npm install winston-firehose@1.0.6 --save --save-exact

- Edit the `helloworld.js` file:

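// Assumes `winston` and `version` are already defined earlier in helloworld.js.
var WFirehose = require('winston-firehose')
var hostname = process.env.HOSTNAME

// Create the logger first so that its rewriters array exists.
var logger = new (winston.Logger)({
  transports: [new WFirehose({
    'streamName': 'FirehoseLogs',
    'firehoseOptions': {
      'region': 'us-east-1'
    }
  })]
})

// Attach the version, hostname, and application name to every log entry.
logger.rewriters.push(function(level, msg, meta) {
  meta.version = version
  meta.hostname = hostname
  meta.appname = "helloworld"
  return meta
})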
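The rewriter is what makes the logs searchable by host, application, and version once they land in Elasticsearch. For readers who prefer to see the transformation outside of winston, here is a rough Python equivalent. It is only a sketch: the function name, the `path` field, and the sample values are assumptions, and the exact document layout that winston-firehose sends is not shown in this article.

```python
import json


def rewrite_meta(meta, version, hostname, appname="helloworld"):
    """Return a copy of `meta` with the same three fields the rewriter adds."""
    enriched = dict(meta)  # copy so the caller's dict is left untouched
    enriched.update(version=version, hostname=hostname, appname=appname)
    return enriched


# Hypothetical values; the real ones come from the app and its environment.
meta = rewrite_meta({"path": "/"}, version="1.2.0", hostname="ip-10-0-0-12")
print(json.dumps(meta, sort_keys=True))
```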

- Commit the changes:

$ git add helloworld.js package.json node_modules
$ git commit -m "Sending logs to Firehose directly"
$ git push

Overall workflow (mermaid diagram)

graph LR
    A[Create Elasticsearch cluster] --> B[Create Kinesis Firehose stream]
    B --> C[Update application permissions]
    C --> D[Change the log transport]
    D --> E[Visualize logs with Kibana]

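Parameter reference table

| Parameter | Description | Default | Allowed values |
| ---- | ---- | ---- | ---- |
| InstanceType | Elasticsearch instance type | t2.small.elasticsearch | t2.small.elasticsearch, t2.medium.elasticsearch, m4.large.elasticsearch |
| InstanceCount | Number of instances in the cluster | 2 | Not restricted |
| VolumeSize | EBS volume size (GiB) | 10 | Not restricted |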

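5. Visualize Logs with Kibana

Once the application's logs flow through Kinesis Firehose into Elasticsearch, we can use Kibana to visualize them.
- Get the Kibana URL: inspect the outputs of the Elasticsearch CloudFormation stack with the following command to find the URL of the Kibana instance.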

$ aws cloudformation describe-stacks \
      --stack-name elasticsearch \
      --query 'Stacks[0].Outputs'

Sample output:

[
    {
        "Description": "Kibana url",
        "OutputKey": "Kibana",
        "OutputValue": "https://search-logsx7c2g5zqbrtpxotpv3b3jw2uk4.us-east-1.es.amazonaws.com/_plugin/kibana/"
    },
    {
        "Description": "Domain Arn",
        "OutputKey": "DomainArn",
        "OutputValue": "arn:aws:es:us-east-1:511912822958:domain/logs"
    }
]
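If you want to pick the Kibana URL out of that JSON in a script rather than by eye, a few lines of Python are enough. This is a minimal sketch: the helper name is hypothetical and the URL in the sample string is a made-up placeholder in the same shape as the output above.

```python
import json


def output_value(outputs, key):
    """Return the OutputValue for `key` from a describe-stacks Outputs list."""
    for output in outputs:
        if output["OutputKey"] == key:
            return output["OutputValue"]
    raise KeyError(key)


# Placeholder JSON in the same shape as the sample output; the URL is made up.
raw = """[
  {"Description": "Kibana url", "OutputKey": "Kibana",
   "OutputValue": "https://search-logs-example.us-east-1.es.amazonaws.com/_plugin/kibana/"},
  {"Description": "Domain Arn", "OutputKey": "DomainArn",
   "OutputValue": "arn:aws:es:us-east-1:123456789012:domain/logs"}
]"""

print(output_value(json.loads(raw), "Kibana"))
```

The AWS CLI can also do the filtering itself through its JMESPath `--query` option, for example `--query "Stacks[0].Outputs[?OutputKey=='Kibana'].OutputValue" --output text`.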

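Configure Kibana:
1. Open the Kibana URL in a browser to reach the initial configuration page.
2. Enter logs-* in the Index name or pattern field. Kibana then analyzes the logs to find candidate time field names.
3. Select the time field named timestamp.
4. Click the Create button.
View logs and build dashboards: after the configuration is complete, you land on the management page of the index pattern, where every metadata field has been analyzed. Click Discover to browse all the logs, explore the visualization options, and build dashboards for your logs. Searching the web for Kibana dashboard examples is a good way to get ideas for what to display.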
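The `logs-*` pattern works because the delivery stream was created with `IndexName="logs"` and `IndexRotationPeriod="OneDay"`, so Firehose writes to one index per day with a date suffix. The sketch below shows the naming rule and checks it against the pattern. The helper name is hypothetical, and the suffix format (`YYYY-MM-DD`, based on UTC) reflects my understanding of the Firehose documentation rather than anything stated in this article.

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase


def daily_index_name(index_name, when):
    """Return the index name for a OneDay rotation: <index_name>-YYYY-MM-DD (UTC)."""
    day = when.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return "{}-{}".format(index_name, day)


name = daily_index_name("logs", datetime(2017, 5, 20, 12, 0, tzinfo=timezone.utc))
print(name)                         # logs-2017-05-20
print(fnmatchcase(name, "logs-*"))  # True
```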

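6. Delete Old Logs

Unlike CloudWatch Logs, this logging stack has no built-in way to expire old logs. We can use Elastic Curator to delete them (see http://bit.ly/2rFHzUT for details). For example, a Lambda function that runs Elastic Curator once a day takes care of the periodic cleanup.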
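Because the indices are named by day, deciding what to delete comes down to parsing the date suffix and comparing it with a retention cutoff. The sketch below shows that selection logic on its own. It does not call Curator or Elasticsearch, and the function name, the 7-day retention, and the sample index names are all assumptions for illustration.

```python
from datetime import date, datetime, timedelta


def indices_to_delete(index_names, today, keep_days, prefix="logs-"):
    """Return names whose date suffix is older than `keep_days` before `today`."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of our daily log indices
        try:
            day = datetime.strptime(name[len(prefix):], "%Y-%m-%d").date()
        except ValueError:
            continue  # suffix is not a date; leave the index alone
        if day < cutoff:
            expired.append(name)
    return expired


names = ["logs-2017-05-01", "logs-2017-05-19", "logs-2017-05-20", ".kibana"]
print(indices_to_delete(names, today=date(2017, 5, 20), keep_days=7))
# ['logs-2017-05-01']
```

Skipping anything that does not match the expected naming scheme keeps indices such as `.kibana` out of harm's way.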

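7. Monitor the Infrastructure

Monitoring is never finished: there is always something to add or improve. That makes it important to prioritize the key areas when you start out or when you launch a new service. Different AWS services need different levels of monitoring attention:
| Service type | Characteristics | Monitoring needs |
| ---- | ---- | ---- |
| Serverless services (such as Lambda, S3, DynamoDB) | AWS handles most of the work, including failure handling, security patches, scaling, and high availability | Relatively low |
| EC2 | Apart from the hardware, the user controls every aspect of the instance | Requires more time and effort |

Fortunately, most AWS services integrate natively with CloudWatch. We can add monitoring to our existing templates in the following ways:
- Use CloudWatch metrics: many AWS services send metrics to CloudWatch automatically, and we can define alarms on them. For EC2 instances, for example, we can monitor CPU utilization and network traffic.
- Write custom monitoring scripts: for more specific needs, we can write our own monitoring scripts and run them on a schedule with CloudWatch Events.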
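To illustrate what an alarm on a metric such as CPU utilization boils down to, the sketch below evaluates a threshold over a number of consecutive periods. It is a simplified model, not CloudWatch itself: the real service also supports "M out of N" evaluation and configurable handling of missing data. The function name and the sample datapoints are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a greater-than-threshold alarm over the most recent periods.

    Returns 'ALARM' if the last `evaluation_periods` datapoints all exceed
    `threshold`, 'OK' if they do not, and 'INSUFFICIENT_DATA' when there
    are fewer datapoints than periods to evaluate.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


cpu = [35.0, 82.5, 91.0, 88.3]  # hypothetical 5-minute CPU averages (percent)
print(alarm_state(cpu, threshold=80.0, evaluation_periods=3))  # ALARM
print(alarm_state(cpu, threshold=90.0, evaluation_periods=3))  # OK
```

Requiring several consecutive breaches, rather than a single one, is a common way to avoid alerting on short spikes.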

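Summary and Outlook

By combining several AWS services, we built a complete log monitoring and alerting system that collects the application's logs, events, and metrics and sends them to CloudWatch, S3, and Elasticsearch. This monitoring stack can cover almost anything that matters to us and to the company.

Going forward, the system can be extended further, for example:
- Add more metrics: track additional metrics that the business needs, such as database query performance and application response time.
- Integrate third-party monitoring tools: combine the stack with third-party tools for broader monitoring and alerting coverage.
- Automate operations: use the monitoring data to drive automation such as auto scaling and automatic failure recovery.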

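Log processing flow (mermaid diagram)

graph LR
    A[Application logs] --> B[Kinesis Firehose]
    B --> C{Write to Elasticsearch}
    C -->|success| D[Logs stored in Elasticsearch]
    C -->|failure| E[Logs stored in S3]
    D --> F[Logs visualized in Kibana]

In short, a solid log monitoring and alerting system needs continuous tuning and extension to keep up with a growing and changing business. Making good use of the services and tools that AWS provides improves both the reliability and the maintainability of the system.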