Introduction to Monitoring Serverless Applications

Keep your performance high and your observability higher with this breakdown on what exactly monitoring entails.

Serverless computing as an architectural pattern is now widely adopted, and it has quite rightly challenged traditional approaches to application design. Serverless enhances productivity through faster deployments, bringing applications to life in minimal time. Time not spent provisioning and maintaining production infrastructure can be invested elsewhere to drive business value, because at the end of the day that's what matters!

Now, with your serverless application shipped into production, maintaining optimal performance requires you to focus on the application's journey through production. In other words, you'll need to address observability for every operation that takes place within your serverless application.

Observability

Observability ties together several aspects: monitoring, logging, tracing, and alerts. Each observation pillar provides critical insight into how your deployed serverless application is working, and collectively they tell you not just whether your application is working, but whether it is delivering real business value.

In this post, we are going to discuss each observation pillar, providing you with examples and solutions, which specifically address the serverless ecosystem.

Monitoring

Monitoring is the main pillar which tells us whether our application is working properly. "Properly" can be defined by multiple parameters:

  1. Errors: every request or event that yielded an error result
  2. Latency: the amount of time it takes for a request to be processed
  3. Traffic: the total number of requests that the resource is handling

Taken together, these parameters allow monitoring to detect error-prone services, performance degradation across our resources, and even scaling issues when we hit higher traffic rates.

Much of our serverless deployment runs on a Function-as-a-Service (FaaS) platform, which provides our base compute unit: the function. Popular examples of cloud-hosted managed FaaS services are AWS Lambda, Azure Functions, and Google Cloud Functions.

Using AWS Lambda as our FaaS of choice, we monitor functions with CloudWatch Metrics. Every deployed function automatically reports several insightful metrics, including:

  1. The number of invocations.
  2. The minimum, maximum, and average duration of invocations.
  3. Error counts and availability (derived from errors/invocations ratio).
  4. The number of throttled requests.
  5. Iterator Age — The "age" of the oldest processed record.
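
These metrics can be viewed in the CloudWatch console or pulled programmatically. As a minimal sketch (assuming boto3 with configured AWS credentials, and a hypothetical function name), the error counts for a function over the last hour could be fetched like this:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Sum of Lambda errors over the last hour, in 5-minute buckets.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "post-analysis"}],  # hypothetical function name
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])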

For serverless, these metrics still miss some unique issues such as timeouts, memory usage, and even cost, which we can monitor using a dedicated serverless monitoring and troubleshooting tool like Epsagon.

Logging

When a problem has been found according to our monitoring parameters, we then need to troubleshoot it. We accomplish this by consulting and analyzing all relevant logs.

Logs can be generated by print statements, custom logging, and exceptions. They often include very verbose information, which is exactly why they are a necessity for debugging a problem.
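
As a small sketch of what that looks like in a Lambda handler (the handler name and event field are hypothetical), using Python's standard logging module and tagging every line with the invocation's request ID so it can be correlated later:

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # context.aws_request_id uniquely identifies this invocation.
    logger.info("processing event, request_id=%s", context.aws_request_id)
    try:
        post_id = event["post_id"]  # hypothetical event field
    except KeyError:
        # logger.exception() also records the stack trace of the current exception.
        logger.exception("missing 'post_id', request_id=%s", context.aws_request_id)
        raise
    logger.info("handled post %s", post_id)
    return {"status": "ok"}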

When approaching our logs, we need to know what we are looking for, so searching and filtering within and across logs is essential. In AWS Lambda, all of our logs are shipped to AWS CloudWatch Logs. Each function is assigned to its own log group and one log stream for each container instance.
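
As a rough sketch of such a search (again with boto3 and a hypothetical function name), we can filter a function's log group for error lines over the last hour:

from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs")
now = datetime.now(timezone.utc)

# Search the function's log group for lines containing "ERROR" over the last hour.
response = logs.filter_log_events(
    logGroupName="/aws/lambda/post-analysis",  # hypothetical function name
    filterPattern="ERROR",
    startTime=int((now - timedelta(hours=1)).timestamp() * 1000),
    endTime=int(now.timestamp() * 1000),
)

for event in response["events"]:
    print(event["logStreamName"], event["message"])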

Once we find the correct log stream, we can see the error message and gain a better understanding of what initiated the error.

There are better ways to search logs than using CloudWatch Logs alone. A well-known pattern is to stream log data to a dedicated log aggregation service, for example an ELK stack. With this in place, when a function fails, you can simply query and find the corresponding log for the specific function invocation.
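
One way to wire this up, sketched below under the assumption that a log-shipping destination already exists (the log group, filter name, and destination ARN are hypothetical), is a CloudWatch Logs subscription filter that forwards every log event from the function's log group:

import boto3

logs = boto3.client("logs")

# Forward all log events from the function's log group to a log-shipping destination,
# e.g. a Kinesis stream or a Lambda function that indexes the events into Elasticsearch.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/post-analysis",  # hypothetical function name
    filterName="ship-to-elk",  # hypothetical filter name
    filterPattern="",  # an empty pattern matches every log event
    destinationArn="arn:aws:lambda:us-east-1:123456789012:function:log-shipper",  # hypothetical
)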

The main problem often attributed to logs is that they have minimal or no context. When our applications become distributed, trying to correlate logs from different services can be a nightmare. For that particular issue, distributed tracing comes to help.

Tracing

Tracing, or specifically distributed tracing, helps us correlate events captured across logs on different services and resources. When applied correctly, it lets us find the root cause of our errors with minimal effort.

Let's imagine, for example, that we've built a blog site with the following public endpoints:

  1. View an existing post /post
  2. Post a new blog post /new_post

Behind these endpoints sit several resources and services, including a Request Processor Lambda, an SNS topic acting as a message broker, and a Post Analysis Lambda.

Now, by using the monitoring and logging methods from before, we've noticed that there's an error in our Post Analysis Lambda. How do we progress from here and find the root cause of this issue?

Well, in microservices, and serverless applications specifically, we want to collect the traces from each of our services and stitch them together into one whole execution.

In order for us to analyze distributed traces, we need two main things:

  1. A distributed tracing instrumentation library
  2. A distributed tracing engine

When instrumenting our code, the most common approach is to implement OpenTracing. OpenTracing is a specification that defines the structure of traces across different programming languages for distributed tracing. Traces in OpenTracing are defined implicitly by their spans, where a span is an individual unit of work done in a distributed system. Here's an example of constructing a trace with OpenTracing in Python:

import opentracing

# The global tracer is a no-op by default; a real tracer (e.g. Jaeger's) can be installed instead.
tracer = opentracing.global_tracer()

span = tracer.start_span(operation_name='our operation')
scope = tracer.scope_manager.activate(span, True)  # finish_on_close=True
results = {}
try:
    # Do things that may trigger a KeyError.
    value = results['missing_key']
except KeyError as e:
    span.set_tag('keyerror', f'{e}')  # tags belong to the span, not the scope
finally:
    scope.close()  # closes the scope and finishes the span

It's advised to use a common standard across all your services and to rely on a vendor-neutral API. This means you don't need to make large code changes if you switch between different tracing engines. There are some downsides to this approach; for example, developers need to maintain span- and trace-related declarations across their codebase.

Once instrumented, we can publish and capture those traces in a distributed tracing engine. Some engines let us visually follow an end-to-end transaction within our system; a transaction is basically the story of how data has traveled from one end of the system to the other.

With an engine such as Jaeger, we can view the traces organized on a timeline.

This way we can pinpoint the exact time the error happened and trace back to the originating event that caused it.
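
Getting the spans from the earlier snippet into Jaeger mainly means initializing a Jaeger-backed tracer. A minimal sketch using the jaeger-client Python package (the service name is hypothetical, and the constant sampler reports every trace):

from jaeger_client import Config

config = Config(
    config={
        'sampler': {'type': 'const', 'param': 1},  # sample and report every trace
        'logging': True,
    },
    service_name='post-analysis',  # hypothetical service name
)

# Returns an OpenTracing-compatible tracer and installs it as the global tracer,
# so the span/scope example above will report its spans to the local Jaeger agent.
tracer = config.initialize_tracer()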

By utilizing Epsagon, the purpose-built distributed tracing application we introduced earlier, we can see that the errored Lambda in question received its input from a misbehaving Lambda (Request Processor) two hops earlier, which handled an authentication error and propagated faulty input to the Post Analysis Lambda via the SNS message broker.

It's important to remember that when going serverless, we have broken each of our microservices into nano-services. Each of them has an impact on the others, and attempting to figure out the root cause can be very frustrating.

Epsagon tackles this issue by visualizing the participating elements of the system as a graph and by presenting trace data directly within each node, which significantly reduces the time spent investigating the root cause.

Alerts

Last but not least come the alerts. It's pretty obvious that nobody wants to sit in front of a monitor 24/7 watching for problems.

Being able to send alerts to an incident management platform is important, so the relevant people can get notified and take action. Popular alerting platforms are PagerDuty, OpsGenie, and even Slack!

When choosing your observability platform, you'll need to make sure you can configure your alerts based on the type of issue, the involved resource, and the destination (i.e., integrations with the platforms above). For Lambda functions, basic alerts can be configured in CloudWatch Alarms, for example as sketched below.
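
A rough sketch of such an alarm with boto3 (the alarm name, function name, and SNS topic ARN are hypothetical):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when a function reports 10 or more errors in each of 2 consecutive 5-minute windows.
cloudwatch.put_metric_alarm(
    AlarmName="post-analysis-errors",  # hypothetical alarm name
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "post-analysis"}],  # hypothetical function name
    Statistic="Sum",
    Period=300,  # 5-minute windows
    EvaluationPeriods=2,  # 2 consecutive windows
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # hypothetical SNS topic
)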

In this example, we want to get notified when we breach the threshold of 10 or more errors within 2 consecutive 5-minute windows. A dedicated monitoring tool has the capability to configure more specific alerts such as:

  1. Alert if the function is timing out (rather than a general error).
  2. Alert on specific business flows KPIs.
  3. Alert regarding the performance degradation of a resource (for example Redis).

Summary

We've seen that observability is a broad term that unifies many important aspects of monitoring and troubleshooting applications, in production or not.

When going serverless, observability can become a bottleneck for development velocity, so a proper tool dedicated to the distributed and event-driven nature of serverless must be in place.

If you're curious and would like to learn more about serverless applications, join our webinar on the Best Practices to Monitor and Troubleshoot Serverless Applications, March 7th at 10 am PT.
