Notes on Cassandra batches

Latest recommended article published 2021-06-18 15:03:26

GeoAnt

Licensed under CC 4.0 BY-SA

Tags: cassandra, database

Article link: https://blog.youkuaiyun.com/mathgeophysics/article/details/52736505

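Batching in Cassandra does not always improve efficiency, especially in a distributed environment. Logged batches do not work efficiently when the data spans partitions, and they can increase the load on a single node. Appropriate use is mostly limited to operations on data within the same partition.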

https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/

However, a lot of people are used to databases where explicit batching is a performance improvement. If you do this in Cassandra, you are very likely to see performance drop. You'd end up with some code like this (the code to build a single bound statement has been extracted out to a helper method):

Looks good, right? Surely this means we get to send all our inserts in one go and the database can handle them in one storage action? Well, put simply, no. Cassandra is a distributed database; no single node can handle this type of insert, even if you had a single replica per partition.

Cassandra is a distributed database: a batch improves efficiency only when all the data in it belongs to the same partition.
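As a rough illustration of the same-partition rule, the sketch below groups rows by partition key before any batching takes place, so that each group could be sent as its own single-partition batch. It uses only the Java standard library; the `Event` type, the `customerId` field standing in for the partition key, and the `groupByPartitionKey` helper are all hypothetical names chosen for this example, not part of any Cassandra driver.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionGrouping {
    // Hypothetical row type: customerId stands in for the partition key.
    static final class Event {
        final String customerId;
        final String eventType;

        Event(String customerId, String eventType) {
            this.customerId = customerId;
            this.eventType = eventType;
        }
    }

    // Split the rows so that every list holds rows for exactly one partition key.
    static Map<String, List<Event>> groupByPartitionKey(List<Event> events) {
        Map<String, List<Event>> groups = new LinkedHashMap<>();
        for (Event e : events) {
            groups.computeIfAbsent(e.customerId, key -> new ArrayList<>()).add(e);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("alice", "LOGIN"),
                new Event("bob", "LOGIN"),
                new Event("alice", "PURCHASE"));
        // Each map entry is a candidate for one single-partition batch.
        groupByPartitionKey(events).forEach((key, rows) ->
                System.out.println(key + ": " + rows.size() + " statement(s)"));
    }
}
```

In a real application each group would then be turned into one batch with whatever driver is in use. Rows whose partition key occurs only once gain nothing from batching and can simply be sent as individual inserts.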

Cassandra anti-pattern: Logged batches

I've previously blogged about other anti-patterns:

Distributed joins
Unlogged batches

This post is similar to the unlogged batches post but is instead about logged batches.

We'll again go through an example Java application.

The good news is that the common misuse is virtually the same as the last article on unlogged batches, so you know what not to do. The bad news is if you do happen to misuse them it is even worse!

Let's see why. Logged batches are used to ensure that all the statements will eventually succeed. Cassandra achieves this by first writing all the statements to a batch log. That batch log is replicated to two other nodes in case the coordinator fails. If the coordinator fails then another replica for the batch log will take over.
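The sequence described above can be sketched as a toy in-memory model. This is not Cassandra's implementation: the `BatchLogSketch` class, its method names, and the use of a plain map with idempotent puts as the "table" are assumptions made purely to show the order of events (store the batch on two other nodes, apply the statements, and replay from a batch-log replica if the coordinator dies part-way through).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchLogSketch {
    // Toy stand-in for the cluster's data; each write is an idempotent upsert.
    final Map<String, String> table = new HashMap<>();
    // Copies of the batch held by the two batch-log replicas in this model.
    final List<List<String[]>> batchLogReplicas = new ArrayList<>();

    // Step 1: store the whole batch on two other nodes before applying anything.
    void writeBatchLog(List<String[]> statements) {
        batchLogReplicas.add(new ArrayList<>(statements));
        batchLogReplicas.add(new ArrayList<>(statements));
    }

    // Step 2: the coordinator applies the statements one by one.
    // failAfter simulates a crash after that many statements; returns true on full success.
    boolean applyAsCoordinator(List<String[]> statements, int failAfter) {
        int applied = 0;
        for (String[] kv : statements) {
            if (applied == failAfter) {
                return false; // simulated coordinator failure
            }
            table.put(kv[0], kv[1]);
            applied++;
        }
        batchLogReplicas.clear(); // everything applied, so the log is no longer needed
        return true;
    }

    // Step 3: after a coordinator failure, a batch-log replica replays the whole batch.
    void replayFromReplica() {
        if (batchLogReplicas.isEmpty()) {
            return;
        }
        for (String[] kv : batchLogReplicas.get(0)) {
            table.put(kv[0], kv[1]); // replaying an already applied write is harmless here
        }
        batchLogReplicas.clear();
    }

    public static void main(String[] args) {
        BatchLogSketch cluster = new BatchLogSketch();
        List<String[]> batch = List.of(
                new String[] {"a", "1"}, new String[] {"b", "2"}, new String[] {"c", "3"});
        cluster.writeBatchLog(batch);
        boolean ok = cluster.applyAsCoordinator(batch, 1); // crash after the first write
        if (!ok) {
            cluster.replayFromReplica();
        }
        System.out.println("rows written: " + cluster.table.size());
    }
}
```

Even in this simplified form, the extra steps are visible: the batch is written in full to two additional places before a single statement is applied.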

Now that sounds like a lot of work. So if you try to use logged batches as a performance improvement then you'll be very disappointed! For a logged batch with 8 insert statements (equally distributed) in an 8-node cluster it will look something like this:

The coordinator has to do a lot more work than any other node in the cluster. Whereas if we were to just do them as regular inserts, it would look like this:

A nice even workload.
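The contrast between the two cases can be approximated with a simple count of work per node. The model below is a deliberate simplification with several assumptions that are not from the original post: one replica per partition, statement `s` owned by node `s % nodes`, the batch log stored on the two nodes after the coordinator, one unit of work for every message sent and every write or log operation performed, and a token-aware client that sends each regular insert straight to its owning node. The class and method names are hypothetical.

```java
import java.util.Arrays;

class BatchWorkload {
    // Work per node for one logged batch handled by the given coordinator.
    static int[] loggedBatch(int nodes, int statements, int coordinator) {
        if (nodes < 3) {
            throw new IllegalArgumentException("model needs at least 3 nodes");
        }
        int[] work = new int[nodes];
        // Write the batch log to two other nodes (send + store).
        for (int r = 1; r <= 2; r++) {
            work[coordinator]++;
            work[(coordinator + r) % nodes]++;
        }
        // Forward every statement to its owner, which applies the write.
        for (int s = 0; s < statements; s++) {
            int owner = s % nodes;
            if (owner != coordinator) {
                work[coordinator]++; // forwarding message
            }
            work[owner]++; // the write itself
        }
        // Remove the batch log from both replicas once all writes are done.
        for (int r = 1; r <= 2; r++) {
            work[coordinator]++;
            work[(coordinator + r) % nodes]++;
        }
        return work;
    }

    // Work per node when each insert goes directly to its owning node.
    static int[] singleInserts(int nodes, int statements) {
        int[] work = new int[nodes];
        for (int s = 0; s < statements; s++) {
            work[s % nodes]++;
        }
        return work;
    }

    public static void main(String[] args) {
        System.out.println("logged batch:   " + Arrays.toString(loggedBatch(8, 8, 0)));
        System.out.println("single inserts: " + Arrays.toString(singleInserts(8, 8)));
    }
}
```

Under these assumptions the coordinator of the logged batch accumulates 12 units of work while most other nodes do 1, whereas the regular inserts give every node exactly 1. The absolute numbers depend entirely on how work is counted in the model; the point is the imbalance.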

So when would you want to use logged batches?

Short answer: consistent denormalisation. In most cases you won't want to use them, because they are a performance hit. However, for some tables where you have denormalised, you can decide to make sure that both statements succeed. Let's go back to our customer event table from the previous post but also add a customer events by staff id table: