System Design Basics

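This article outlines the key characteristics of distributed systems: scalability, reliability, availability, efficiency, and manageability. Scalability concerns a system's ability to handle growing demand and comes in horizontal and vertical forms. Reliability is the ability of a system to keep delivering its services when components fail, while availability is the proportion of a given period during which the system remains operational. Efficiency concerns the system's response time and throughput. Finally, manageability concerns how easy the system is to operate and maintain.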

System Design Basics
Whenever we are designing a large system, we need to consider a few things:
What are the different architectural pieces that can be used?
How do these pieces work with each other?
How can we best utilize these pieces: what are the right tradeoffs?
Investing in scaling before it is needed is generally not a smart business proposition; however, some forethought in the design can save valuable time and resources in the future. The following chapters define some of the core building blocks of scalable systems. Becoming familiar with these building blocks makes distributed system concepts much easier to understand. The topics covered are Consistent Hashing, CAP Theorem, Load Balancing, Caching, Data Partitioning, Indexes, Proxies, Queues, Replication, and choosing between SQL and NoSQL.

Key Characteristics of Distributed Systems

Key characteristics of a distributed system include Scalability, Reliability, Availability, Efficiency, and Manageability. Let’s briefly review them:

Scalability

Scalability is the capability of a system, process, or network to grow and manage increased demand. Any distributed system that can continuously evolve to support a growing amount of work is considered scalable.

A system may have to scale for many reasons, such as increased data volume or an increased amount of work (e.g., number of transactions). A scalable system aims to achieve this scaling without performance loss.

Generally, the performance of a system, even one designed (or claimed) to be scalable, declines as the system grows because of management or environment costs. For instance, network communication may become slower because machines tend to be far apart from one another. More generally, some tasks cannot be distributed, either because of their inherently atomic nature or because of some flaw in the system design. At some point, such tasks limit the speed-up obtained by distribution. A scalable architecture avoids this situation and attempts to balance the load evenly across all participating nodes.
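
The limit that non-distributable tasks place on speed-up is commonly expressed by Amdahl's law, which the text above does not name. The following is a minimal sketch under that assumption; the function name `amdahl_speedup` and the parameter `serial_fraction` (the share of work that cannot be distributed) are illustrative choices, not part of the original material.

```python
def amdahl_speedup(serial_fraction: float, nodes: int) -> float:
    """Upper bound on speed-up when `serial_fraction` of the work cannot be distributed."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part always runs on one node; only the rest is divided among nodes.
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    # With 5% non-distributable work, the speed-up can never reach 1 / 0.05 = 20.
    for nodes in (1, 10, 100, 1000):
        print(nodes, round(amdahl_speedup(0.05, nodes), 2))
```

Under this model, adding nodes gives diminishing returns: the serial share alone caps the achievable speed-up, which is why a scalable design tries to keep that share small.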

Horizontal vs. Vertical Scaling: Horizontal scaling means that you scale by adding more servers to your pool of resources, whereas vertical scaling means that you scale by adding more power (CPU, RAM, storage, etc.) to an existing server.

With horizontal scaling, it is often easier to scale dynamically by adding more machines to the existing pool. Vertical scaling is limited by the capacity of a single server, and scaling beyond that capacity often involves downtime.

Cassandra and MongoDB are good examples of horizontal scaling, as both provide an easy way to scale by adding more machines to meet growing needs. Similarly, MySQL is a good example of vertical scaling, as it can be scaled by switching from a smaller machine to a bigger one. However, this process often involves downtime.
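
As a rough illustration of horizontal scaling, the sketch below models a pool of servers and shows how per-server load drops when machines are added. `ServerPool`, `add_server`, `route`, and `load_per_server` are hypothetical names introduced for this example, and round-robin routing is an assumed policy, not something the text prescribes.

```python
from collections import Counter
from itertools import count


class ServerPool:
    """Hypothetical pool that spreads requests over servers in round-robin order."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = count()  # monotonically increasing request counter

    def add_server(self, name):
        # Horizontal scaling: grow capacity by adding a machine to the pool.
        self.servers.append(name)

    def route(self):
        # Pick servers in turn so the load stays roughly even.
        return self.servers[next(self._next) % len(self.servers)]


def load_per_server(pool, requests):
    """Route `requests` requests and count how many each server received."""
    return Counter(pool.route() for _ in range(requests))


if __name__ == "__main__":
    pool = ServerPool(["s1", "s2"])
    print(load_per_server(pool, 1000))
    pool.add_server("s3")
    pool.add_server("s4")
    print(load_per_server(pool, 1000))
```

Vertical scaling has no equivalent of `add_server`: the only option is to replace the single machine with a larger one, which is where the downtime mentioned above tends to come from.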

[Figure: Vertical scaling vs. horizontal scaling]

Reliability

By definition, reliability is the probability that a system will not fail in a given period. In simple terms, a distributed system is considered reliable if it keeps delivering its services even when one or several of its software or hardware components fail. Reliability is one of the main characteristics of a distributed system, since in such a system a failing machine can be replaced by another healthy one, ensuring the completion of the requested task.

Take the example of a large electronic commerce store (like Amazon), where one of the primary requirements is that a user transaction should never be canceled due to a failure of the machine running that transaction. For instance, if a user has added an item to their shopping cart, the system is expected not to lose it. A reliable distributed system achieves this through redundancy of both the software components and the data. If the server carrying the user’s shopping cart fails, another server that holds an exact replica of the shopping cart should replace it.

Obviously, redundancy has a cost, and a reliable system has to pay it to achieve such resilience by eliminating every single point of failure.
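
The shopping cart example can be sketched as follows. This is a simplified illustration, not a production design: `CartReplica` and `ReplicatedCartStore` are hypothetical names, replicas are plain in-memory objects, and recovery of a replica that missed writes while it was down is deliberately left out.

```python
class CartReplica:
    """Hypothetical in-memory server holding a copy of every shopping cart."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.carts = {}


class ReplicatedCartStore:
    """Writes go to every live replica; reads are served by the first live one."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def _live(self):
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replica available")
        return live

    def add_item(self, user, item):
        # Redundancy: each live replica stores the same cart contents.
        for replica in self._live():
            replica.carts.setdefault(user, []).append(item)

    def get_cart(self, user):
        # If the first server has failed, another replica answers instead.
        return list(self._live()[0].carts.get(user, []))


if __name__ == "__main__":
    a, b = CartReplica("a"), CartReplica("b")
    store = ReplicatedCartStore([a, b])
    store.add_item("alice", "book")
    a.alive = False  # simulate a machine failure
    print(store.get_cart("alice"))
```

The cost of redundancy is visible even here: every write is performed once per replica, and the data is stored as many times as there are replicas.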

Availability

By definition, availability is the time a system remains operational to perform its required function in a specific period. It is a simple measure of the percentage of time that a system, service, or machine remains operational under normal conditions. An aircraft that can be flown for many hours a month without much downtime can be said to have high availability. Availability takes into account maintainability, repair time, spares availability, and other logistics considerations. If an aircraft is down for maintenance, it is considered not available during that time.
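
The percentage described above can be computed directly from uptime and downtime. The sketch below is a minimal illustration; the function names `availability` and `downtime_budget_minutes` and the 365-day year are assumptions made for the example.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the period during which the system was operational."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the observation period must be positive")
    return uptime_hours / total


def downtime_budget_minutes(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime allowed in the period while still meeting `target` availability."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_hours * 60.0


if __name__ == "__main__":
    # An aircraft flyable for 700 hours and in maintenance for 20 hours of a 720-hour month.
    print(round(availability(700, 20), 4))
    # Yearly downtime allowed at 99.99% availability, in minutes.
    print(round(downtime_budget_minutes(0.9999), 2))
```

At 99.99% availability the yearly downtime budget is roughly 52.6 minutes, which shows how little room a high availability target leaves for repairs.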

Reliability is availability over time, considering the full range of real-world conditions that can occur. An aircraft that can make it through any possible weather safely is more reliable than one that is vulnerable to some of those conditions.

Reliability vs. Availability
If a system is reliable, it is available. However, if it is available, it is not necessarily reliable. In other words, high reliability contributes to high availability, but it is possible to achieve high availability even with an unreliable product by minimizing repair time and ensuring that spares are always available when they are needed. Let’s take the example of an online retail store that has 99.99% availability for the first two years after its launch. However, the system was launched without any information security testing. The customers are happy with the system, but they don’t realize that it isn’t very reliable because it is vulnerable to likely risks. In the third year, the system experiences a series of information security incidents that suddenly result in extremely low availability for extended periods of time. This results in reputational and financial damage to the customers.

Efficiency

To understand how to measure the efficiency of a distributed system, let’s assume we have an operation that runs in a distributed manner and delivers a set of items as its result. Two standard measures of its efficiency are the response time (or latency), which denotes the delay to obtain the first item, and the throughput (or bandwidth), which denotes the number of items delivered in a given time unit (e.g., a second). The two measures correspond to the following unit costs:

Number of messages globally sent by the nodes of the system, regardless of the message size.
Size of messages representing the volume of data exchanges.
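
The two measures defined above, response time to the first item and throughput, can be sketched for any operation that delivers items one at a time. The helper name `measure` and its injectable `clock` parameter are assumptions made for this example; the clock is injectable only so that the function can be exercised deterministically.

```python
import time
from typing import Callable, Iterable, Tuple


def measure(items: Iterable,
            clock: Callable[[], float] = time.perf_counter) -> Tuple[float, float, int]:
    """Return (latency_to_first_item_s, throughput_items_per_s, item_count)."""
    start = clock()
    first_item_at = None
    delivered = 0
    for _ in items:
        if first_item_at is None:
            first_item_at = clock()  # response time: delay until the first item
        delivered += 1
    end = clock()
    if delivered == 0:
        raise ValueError("the operation delivered no items")
    elapsed = end - start
    latency = first_item_at - start
    # Throughput: items delivered per unit of time over the whole run.
    throughput = delivered / elapsed if elapsed > 0 else float("inf")
    return latency, throughput, delivered


if __name__ == "__main__":
    def slow_results():
        for i in range(5):
            time.sleep(0.01)  # stand-in for network and processing delay
            yield i

    print(measure(slow_results()))
```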

The complexity of operations supported by distributed data structures (e.g., searching for a specific key in a distributed index) can be characterized as a function of one of these cost units. Generally speaking, analyzing a distributed structure only in terms of the number of messages is over-simplistic. It ignores the impact of many aspects, including the network topology, the network load and its variation, and the possible heterogeneity of the software and hardware components involved in data processing and routing. However, it is quite difficult to develop a precise cost model that accurately takes all these performance factors into account; therefore, we have to live with rough but robust estimates of the system behavior.
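
One example of a rough estimate that combines both cost units is a linear model: a fixed delay per message plus the time to push the total data volume through the network. This model is an assumption introduced here for illustration, as are the function and parameter names; it deliberately ignores topology, load variation, and heterogeneity, exactly the factors listed above.

```python
def estimated_transfer_seconds(num_messages: int,
                               total_bytes: float,
                               per_message_latency_s: float,
                               bandwidth_bytes_per_s: float) -> float:
    """Rough cost estimate: per-message overhead plus data volume over bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    message_cost = num_messages * per_message_latency_s  # 'number of messages' unit
    volume_cost = total_bytes / bandwidth_bytes_per_s    # 'size of messages' unit
    return message_cost + volume_cost


if __name__ == "__main__":
    one_mb = 1_000_000
    # Same 1 MB of data sent as 1000 small messages or as 10 batched ones,
    # assuming 1 ms per message and 100 MB/s of bandwidth.
    print(estimated_transfer_seconds(1000, one_mb, 0.001, 100_000_000))
    print(estimated_transfer_seconds(10, one_mb, 0.001, 100_000_000))
```

Even this crude model shows why counting messages alone is not enough: the same data volume can cost very different amounts of time depending on how it is split into messages.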

Serviceability or Manageability

Another important consideration while designing a distributed system is how easy it is to operate and maintain. Serviceability or manageability is the simplicity and speed with which a system can be repaired or maintained; if the time to fix a failed system increases, then availability will decrease. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate (i.e., does it routinely operate without failure or exceptions?).
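
The link between repair time and availability can be made concrete with a common steady-state approximation that the text above does not state: availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The function name below is an illustrative choice.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Same failure rate, increasingly slow repairs: availability falls.
    for mttr in (0.5, 1, 10, 50):
        print(mttr, round(steady_state_availability(1000, mttr), 5))
```

With the failure rate held constant, every increase in repair time lowers availability, which is why ease of diagnosis and repair matters as much as avoiding failures in the first place.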

Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center (without human intervention) when the system experiences a fault.
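
One simple way to detect faults early is a heartbeat check: each node reports in periodically, and a monitor raises an alert when a node stays silent for too long. The sketch below assumes this approach; `HeartbeatMonitor`, its methods, and the `notify` callback (standing in for the automatic call to a service center) are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical monitor that reports nodes whose heartbeats have stopped."""

    def __init__(self, timeout_s, notify):
        self.timeout_s = timeout_s
        self.notify = notify    # called once per detected fault, no human involved
        self.last_seen = {}     # node name -> time of last heartbeat
        self.reported = set()   # nodes already reported as faulty

    def heartbeat(self, node, now):
        # A fresh heartbeat marks the node healthy again.
        self.last_seen[node] = now
        self.reported.discard(node)

    def check(self, now):
        for node, seen in self.last_seen.items():
            if now - seen > self.timeout_s and node not in self.reported:
                self.reported.add(node)
                self.notify(node)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=5, notify=lambda n: print("fault detected:", n))
    monitor.heartbeat("db-1", now=0)
    monitor.check(now=3)  # within the timeout, nothing is reported
    monitor.check(now=6)  # silent for more than 5 seconds, the fault is reported
```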
