CAP Theorem
In a distributed system, consistency, availability, and partition tolerance cannot all be achieved; at most two of the three can be satisfied at the same time.
C (Consistency):
A read is guaranteed to return the most recent write for a given client.
Whether all replicas of the data in a distributed system hold the same value at the same instant (equivalently, all nodes access the same, most recent copy of the data).
Note: data modified on one node can be read back, already updated, from another node.
An update made on one node is visible to subsequent reads that go through other nodes:
if the update is visible immediately, this is strong consistency;
if some or all readers may fail to see the update, this is weak consistency;
if the update is guaranteed to become visible after some period of time (usually not a fixed one), this is eventual consistency.
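To make the three levels tangible, here is a minimal toy sketch in Java (not Kafka; the class, the two maps, and the 100 ms delay are all invented for this note). Writes land on a primary replica and are copied to a backup asynchronously, so a read from the backup straight after a write can miss the update, yet the backup converges once replication runs, which is the eventual-consistency behavior described above; replicating synchronously before acknowledging the write would instead give strong consistency.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class EventualConsistencyDemo {
        static final Map<String, String> primary = new ConcurrentHashMap<>();
        static final Map<String, String> backup = new ConcurrentHashMap<>();
        static final ScheduledExecutorService replicator =
                Executors.newSingleThreadScheduledExecutor();

        // Acknowledge the write immediately, replicate 100 ms later.
        static void write(String key, String value) {
            primary.put(key, value);
            replicator.schedule(() -> backup.put(key, value), 100, TimeUnit.MILLISECONDS);
        }

        public static void main(String[] args) throws InterruptedException {
            write("x", "1");
            // Read through the backup right away: the update is not visible yet.
            System.out.println("backup right after write: " + backup.get("x")); // null
            Thread.sleep(200);
            // After replication has run, the backup has converged.
            System.out.println("backup after propagation: " + backup.get("x")); // 1
            replicator.shutdown();
        }
    }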
A (Availability):
A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
Whether the cluster as a whole can still respond to client read and write requests after some of its nodes fail (high availability for data updates).
Note: return data within a delay the user can tolerate.
A node that has not failed must return a reasonable result within a bounded amount of time.
Vs. database HA: when one node goes down, the other nodes remain usable.
P (Partition Tolerance):
The system will continue to function when network partitions occur.
In practical terms, a partition amounts to a time limit on communication: if the system cannot reach data consistency within that limit, a partition is deemed to have occurred, and the current operation must choose between C and A.
Note: the distributed cluster splits into several smaller clusters; nodes within each small cluster can still communicate with one another, and all of this is transparent to the user.
Ref: https://blog.youkuaiyun.com/Happy_Sunshine_Boy/article/details/86285746
Kafka is usually classed as CA, because while Kafka is running only one copy of a partition (the leader) actually serves requests, and the other copies (for example, two of them) serve only as backups that replicate the leader's data.
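In practice Kafka lets you trade along the C/A axis through configuration. Below is a hedged sketch (the broker address localhost:9092 and the topic demo-topic are placeholders) of a producer tuned toward consistency: acks=all makes the broker acknowledge a write only after all in-sync replicas have it, and combined with the broker/topic settings min.insync.replicas=2 and unclean.leader.election.enable=false, Kafka will reject writes (give up availability) rather than lose acknowledged data.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ConsistentProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Wait for every in-sync replica before treating the write as done.
            props.put("acks", "all");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
            } // close() flushes the outstanding send
        }
    }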
How does Kafka’s notion of streams compare to a traditional enterprise messaging system?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren’t multi-subscriber—once one process reads the data it’s gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
The advantage of Kafka’s model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.
Kafka has stronger ordering guarantees than a traditional messaging system, too.
A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of “exclusive consumer” that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
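To make the ordering guarantee concrete, here is a small sketch (the topic user-events, the key user-42, and the broker address are assumptions for this note): records that share a key are hashed to the same partition by the default partitioner, so they are appended, and therefore consumed, in order.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class OrderedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Same key => same partition => the three events for user-42
                // are stored, and hence consumed, in this exact order.
                for (String event : new String[]{"created", "updated", "deleted"}) {
                    producer.send(new ProducerRecord<>("user-events", "user-42", event));
                }
            }
        }
    }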
To recap: one model is the queue, where a pool of consumers (a set of unrelated processes) reads from the server and each record is handed to exactly one of them; once a consumer has read a record it is gone from the queue, whether or not the processing succeeded.
The other model is publish-subscribe: every subscriber receives the broadcast, so the same message is inevitably consumed more than once. This is where consumer groups come in: within a group each message is delivered to only one member, which guarantees uniqueness of consumption inside the group, while different groups subscribing to the same topic each still receive the full stream.
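A sketch of the consumer side (the group id billing-service, the topic user-events, and the broker address are invented for this note): instances that share a group.id divide the partitions among themselves like a queue, while a process started with a different group.id receives the full stream again, like publish-subscribe.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            props.put("group.id", "billing-service"); // members of this group share the work
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("user-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                r.partition(), r.offset(), r.value());
                    }
                }
            }
        }
    }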
In addition, as a message queue, Kafka's delivery can be framed as either a push model or a pull model.
push vs. pull
An initial question we considered is whether consumers should pull data from brokers or brokers should push data to the consumer. In this respect Kafka follows a more traditional design, shared by most messaging systems, where data is pushed to the broker from the producer and pulled from the broker by the consumer. Some logging-centric systems, such as Scribe and Apache Flume, follow a very different push-based path where data is pushed downstream. There are pros and cons to both approaches. However, a push-based system has difficulty dealing with diverse consumers as the broker controls the rate at which data is transferred. The goal is generally for the consumer to be able to consume at the maximum possible rate; unfortunately, in a push system this means the consumer tends to be overwhelmed when its rate of consumption falls below the rate of production (a denial of service attack, in essence). A pull-based system has the nicer property that the consumer simply falls behind and catches up when it can. This can be mitigated with some kind of backoff protocol by which the consumer can indicate it is overwhelmed, but getting the rate of transfer to fully utilize (but never over-utilize) the consumer is trickier than it seems. Previous attempts at building systems in this fashion led us to go with a more traditional pull model.
Another advantage of a pull-based system is that it lends itself to aggressive batching of data sent to the consumer. A push-based system must choose to either send a request immediately or accumulate more data and then send it later without knowledge of whether the downstream consumer will be able to immediately process it. If tuned for low latency, this will result in sending a single message at a time only for the transfer to end up being buffered anyway, which is wasteful. A pull-based design fixes this as the consumer always pulls all available messages after its current position in the log (or up to some configurable max size). So one gets optimal batching without introducing unnecessary latency.
The deficiency of a naive pull-based system is that if the broker has no data the consumer may end up polling in a tight loop, effectively busy-waiting for data to arrive. To avoid this we have parameters in our pull request that allow the consumer request to block in a “long poll” waiting until data arrives (and optionally waiting until a given number of bytes is available to ensure large transfer sizes).
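Those long-poll knobs surface in the Java consumer as the fetch.min.bytes and fetch.max.wait.ms settings. A minimal sketch (broker address, group id, and topic are placeholders):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class LongPollConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            props.put("group.id", "long-poll-demo");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // The broker holds each fetch until at least 64 KB are available,
            // or until 500 ms have passed, so an idle consumer blocks
            // server-side instead of busy-polling in a tight loop.
            props.put("fetch.min.bytes", 64 * 1024);
            props.put("fetch.max.wait.ms", 500);
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("user-events")); // placeholder topic
                System.out.println(consumer.poll(Duration.ofSeconds(5)).count() + " records");
            }
        }
    }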
You could imagine other possible designs which would be only pull, end-to-end. The producer would locally write to a local log, and brokers would pull from that with consumers pulling from them. A similar type of “store-and-forward” producer is often proposed. This is intriguing but we felt not very suitable for our target use cases which have thousands of producers. Our experience running persistent data systems at scale led us to feel that involving thousands of disks in the system across many applications would not actually make things more reliable and would be a nightmare to operate. And in practice we have found that we can run a pipeline with strong SLAs at large scale without a need for producer persistence.
In push mode, it is easy to overfeed the consumer until it is stuffed. Pull mode is more passive: it is like raising children. If I do not earn much and four babies are all crying to be fed, I cannot keep up and they simply keep crying. Another pull variant is to prepare the bottles ahead of time and leave them next to the baby (on disk), but shelf life, the baby wasting half a bottle, and the difficulty of tracking production dates when too much is stored can all cause serious problems. So using Kafka well genuinely requires combining theory with hands-on practice.
Ref: https://stackoverflow.com/questions/51375187/why-kafka-is-not-p-in-cap-theorem/51379079