Logstash Data Protection Mechanisms

Latest recommended article published on 2023-11-23 16:20:49

ColorlessCube

Categories: Architecture Design, ElasticSearch. Tags: 1024 Programmer's Day, Operations, es

Article link: https://blog.youkuaiyun.com/qq_43619899/article/details/127490380

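This article describes how Logstash protects data in a Kafka-to-ES architecture: persistent queues (PQ) guard against data loss, the retry policy handles errors, and the dead letter queue (DLQ) holds events that cannot be processed. When Logstash or ES goes down, the persistent queue keeps data safe and the dead letter queue stores failed events, so processing can resume after a restart and data integrity is preserved.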

Background

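In a Kafka => Logstash => ES architecture, an outage of Logstash or of the ES cluster raises two concerns: whether data is lost, and whether disk or memory overflows. The behavior of the Logstash ES output when its connection to ES is lost therefore needs to be tested.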

Concepts

Persistent Queues

Persistent queues (PQ)

A Logstash persistent queue helps protect against data loss during abnormal termination by storing the in-flight message queue to disk.

Helps protect against message loss during a normal shutdown and when Logstash is terminated abnormally. If Logstash is restarted while events are in-flight, Logstash attempts to deliver messages stored in the persistent queue until delivery succeeds at least once.

Can absorb bursts of events without needing an external buffering mechanism like Redis or Apache Kafka.

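By default, Logstash uses an in-memory event queue between its input plugins and the rest of the pipeline, which means that if Logstash crashes unexpectedly, every unprocessed event in the queue is lost. In addition, the in-memory queue is small and its capacity cannot be increased through configuration, so it provides very limited buffering. To address these problems, the event queue can be configured as a disk-backed persistent queue.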

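A persistent queue stores the events sent by input plugins on disk and removes an event from the queue only after the filter or output plugins have acknowledged that it was processed. When Logstash restarts after a crash, it reads the unprocessed events back from the persistent queue and processes them, so a Logstash instance with a persistent queue guarantees that each event is processed at least once.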
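The acknowledge-then-delete behavior described above can be illustrated with a small toy model. This is only a sketch of the idea, not Logstash's implementation: the names `AckQueue` and `run_once` are hypothetical, and an in-memory dictionary stands in for the on-disk queue.

```python
from collections import OrderedDict


class AckQueue:
    """Toy stand-in for a persistent queue: events stay stored until acknowledged."""

    def __init__(self):
        self._events = OrderedDict()  # seq -> event; stands in for on-disk storage
        self._next_seq = 0

    def write(self, event):
        seq = self._next_seq
        self._events[seq] = event
        self._next_seq += 1
        return seq

    def read_unacked(self):
        # Reading never removes anything; only ack() does.
        return list(self._events.items())

    def ack(self, seq):
        self._events.pop(seq, None)


def run_once(queue, delivered, crash_after=None):
    """Deliver unacked events; optionally 'crash' after N deliveries, before the ack."""
    for count, (seq, event) in enumerate(queue.read_unacked()):
        delivered.append(event)
        if crash_after is not None and count + 1 == crash_after:
            return  # simulated crash: event delivered but never acknowledged
        queue.ack(seq)


queue = AckQueue()
for name in ("a", "b", "c"):
    queue.write(name)

delivered = []
run_once(queue, delivered, crash_after=2)  # crashes right after delivering "b"
run_once(queue, delivered)                 # restart: replays "b", then delivers "c"
print(delivered)  # ['a', 'b', 'b', 'c']
```

Event "b" is delivered twice and nothing is lost, which is what "at least once" means in practice: downstream systems should tolerate duplicates.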

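To enable the persistent queue, set queue.type to persisted in logstash.yml; the default value is memory. Once enabled, queue data is stored by default in the queue directory under the Logstash data path. The data path defaults to the data directory under the Logstash installation path and can be changed with the path.data setting. The storage path of the persistent queue itself can be changed with the path.queue setting.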

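Although a persistent queue stores events on disk, disk space is not unlimited, so the queue capacity should be sized to the application's actual needs. Capacity can be limited in two ways: by number of events and by storage size. By default the capacity is 1024 MB and the number of events has no upper limit. When the queue reaches its capacity, Logstash prevents overflow by throttling the rate at which input plugins produce events, or by refusing further input events until the queue has free space again. The event-count limit is set with queue.max_events and the storage limit with queue.max_bytes.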
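As a rough illustration of the two capacity limits and of refusing input when full, consider the following toy model. It is a sketch under stated assumptions: `BoundedQueue`, `offer`, and `ack_oldest` are hypothetical names, the parameters merely mirror queue.max_events and queue.max_bytes, and real Logstash applies back-pressure to its inputs rather than returning a boolean.

```python
class BoundedQueue:
    """Toy queue capped by event count and total bytes (max_events=0 means no limit)."""

    def __init__(self, max_events=0, max_bytes=1024 * 1024 * 1024):
        self.max_events = max_events  # mirrors queue.max_events
        self.max_bytes = max_bytes    # mirrors queue.max_bytes (1024 MB by default)
        self.events = []
        self.used_bytes = 0

    def is_full(self, incoming_size):
        over_count = self.max_events > 0 and len(self.events) >= self.max_events
        over_bytes = self.used_bytes + incoming_size > self.max_bytes
        return over_count or over_bytes

    def offer(self, payload: bytes) -> bool:
        """Return False instead of storing when a limit is hit (back-pressure signal)."""
        if self.is_full(len(payload)):
            return False
        self.events.append(payload)
        self.used_bytes += len(payload)
        return True

    def ack_oldest(self):
        """Acknowledge the oldest event, freeing its space."""
        payload = self.events.pop(0)
        self.used_bytes -= len(payload)


q = BoundedQueue(max_events=0, max_bytes=10)
results = [q.offer(b"12345"), q.offer(b"12345"), q.offer(b"1")]
q.ack_oldest()                 # downstream processed one event
results.append(q.offer(b"1"))  # space is available again
print(results)  # [True, True, False, True]
```

The third write is refused because it would exceed the byte limit, and writes succeed again once an event is acknowledged.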

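In practice, Logstash's internal queue alone is far from sufficient for many high-traffic applications. Many deployments place a dedicated message queue (MQ), such as Kafka or RocketMQ, in front of Logstash so that traffic spikes do not hit the backend systems directly. These dedicated message queues can buffer tens of millions of messages, which protects downstream applications from being overwhelmed by traffic.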

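**Therefore, with the persistent queue enabled, a Logstash outage does not cause data loss.** On the Kafka side, once Logstash restarts it resumes consuming from the previous offset. On the ES side, Logstash sends events to ES in batches through the Bulk API, and the persistent queue deletes an event from disk only after the ES output plugin has acknowledged that it was processed. As a result, when Logstash restarts it can still recover the previously unprocessed data from disk.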

Retry Mechanism

Retry Policy

The retry policy has changed significantly in the 8.1.1 release. This plugin uses the Elasticsearch bulk API to optimize its imports into Elasticsearch. These requests may experience either partial or total failures. The bulk API sends batches of requests to an HTTP endpoint. Error codes for the HTTP request are handled differently than error codes for individual documents.

HTTP requests to the bulk API are expected to return a 200 response code. All other response codes are retried indefinitely.

The following document errors are handled as follows:

400 and 404 errors are sent to the dead letter queue (DLQ), if enabled. If a DLQ is not enabled, a log message will be emitted, and the event will be dropped. See DLQ Policy for more info.

409 errors (conflict) are logged as a warning and dropped.

Note that 409 exceptions are no longer retried. Please set a higher retry_on_conflict value if you experience 409 exceptions. It is more performant for Elasticsearch to retry these exceptions than this plugin.

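The retry policy described above applies from the 8.1.1 release onward. The ES output plugin imports data into ES through the Bulk API. At the HTTP request level, a 200 response means the bulk request was accepted, and any other response code is retried indefinitely. At the document level, an event that fails with 400 or 404 is written to the dead letter queue (DLQ) if the DLQ is enabled; otherwise a log message is emitted and the event is dropped. A 409 indicates a conflict: it is logged as a warning and the event is dropped without retry.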
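The decision rules in the quoted policy can be written down as a small classification function. This is a minimal sketch rather than the plugin's code: `handle_bulk_item` and `handle_bulk_response` are hypothetical names, treating any 2xx document status as success is an assumption, and so is retrying every document status the quoted text does not mention (for example 429).

```python
def handle_bulk_item(status: int, dlq_enabled: bool) -> str:
    """Map a per-document bulk status to an action, following the quoted policy."""
    if 200 <= status < 300:
        return "ok"  # assumption: any 2xx document status counts as success
    if status in (400, 404):
        # Sent to the DLQ when enabled; otherwise logged and dropped.
        return "dlq" if dlq_enabled else "drop"
    if status == 409:
        return "drop"  # conflict: logged as a warning, not retried
    return "retry"  # assumption: remaining statuses are retried


def handle_bulk_response(http_status: int, item_statuses, dlq_enabled: bool) -> dict:
    """Request-level check first: anything other than HTTP 200 retries the whole request."""
    if http_status != 200:
        return {"retry_request": True, "actions": []}
    actions = [handle_bulk_item(s, dlq_enabled) for s in item_statuses]
    return {"retry_request": False, "actions": actions}


result = handle_bulk_response(200, [201, 400, 409, 429], dlq_enabled=True)
print(result["actions"])  # ['ok', 'dlq', 'drop', 'retry']

outage = handle_bulk_response(503, [], dlq_enabled=True)
print(outage["retry_request"])  # True
```

The second call models an ES outage: the HTTP request itself fails, so nothing is dropped or sent to the DLQ and the whole batch is retried.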

Dead Letter Queue (DLQ)

Dead letter queues (DLQ)

The dead letter queue (DLQ) is designed as a place to temporarily write events that cannot be processed. The DLQ gives you flexibility to investigate problematic events without blocking the pipeline or losing the events. Your pipeline keeps flowing, and the immediate problem is averted. But those events still need to be addressed.

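The Logstash event queue sits between the input plugins and the other plugins, whereas the dead letter queue sits between the output plugins and the destination data source. If Logstash fails to process an event, the event is written to the dead letter queue.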

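The Logstash dead letter queue is stored on disk as files, which makes it possible to take remedial action on failed events. The dead letter queue is not a concept unique to Logstash; many distributed components adopt the same design. Its English name, Dead Letter Queue, is commonly abbreviated as DLQ in the literature.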

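Logstash treats an event as failed when the destination returns a 400 or 404 response status code, and the only destination that supports this logic is Elasticsearch. The dead letter queue is therefore currently supported only for the Elasticsearch output plugin, and it is disabled by default. It is enabled in the same way as the persistent queue, through logstash.yml, using the dead_letter_queue.enable setting. By default the dead letter queue is stored in the dead_letter_queue directory under the Logstash data path, which can be changed with the path.dead_letter_queue setting.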

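The dead letter queue also has a capacity limit. The default is 1024 MB, and it can be changed with the dead_letter_queue.max_bytes setting. Although the dead letter queue can hold a certain number of failed events, events are still discarded once the limit is exceeded, so some mechanism is still needed to process them. For this purpose Logstash provides a dedicated dead letter queue input plugin, which reads events out of the dead letter queue and passes them to another pipeline for processing.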
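The size cap and the need to drain the queue can be illustrated with a toy model. This is a sketch under stated assumptions and not Logstash's on-disk format: `DeadLetterQueue`, `write`, and `drain` are hypothetical names, rejecting new entries when full is an assumed policy chosen for illustration, and the event fields and failure reason are made up.

```python
import json


class DeadLetterQueue:
    """Toy DLQ: keeps failed events with a reason, up to max_bytes of serialized data."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # mirrors dead_letter_queue.max_bytes
        self.entries = []
        self.used_bytes = 0
        self.dropped = 0

    def write(self, event: dict, reason: str) -> bool:
        record = json.dumps({"event": event, "reason": reason})
        size = len(record.encode("utf-8"))
        if self.used_bytes + size > self.max_bytes:
            self.dropped += 1  # assumption: a full queue rejects the new entry
            return False
        self.entries.append(record)
        self.used_bytes += size
        return True

    def drain(self):
        """Hand all stored entries to a reprocessing step and free the space."""
        records, self.entries, self.used_bytes = self.entries, [], 0
        return [json.loads(r) for r in records]


dlq = DeadLetterQueue(max_bytes=100)
first = dlq.write({"id": 1, "age": "abc"}, "mapper_parsing_exception")
second = dlq.write({"id": 2, "age": "xyz"}, "mapper_parsing_exception")  # over the cap
recovered = dlq.drain()  # a reprocessing step would fix and resend these
```

Here the second failed event does not fit and is lost, which is why the queue should be drained and reprocessed regularly rather than treated as permanent storage.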

Therefore, when ES goes down, Logstash, upon receiving