Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory

This article explains how to accurately estimate the cardinality of sets containing billions of distinct elements using small data structures. By comparing counting techniques such as a HashSet, a linear probabilistic counter, and a HyperLogLog counter, it shows how to save memory dramatically while keeping the error within a known bound.

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

 

This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month.

At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality of the set is large.

To better understand the challenge of determining the cardinality of large sets let's imagine that you have a 16 character ID and you'd like to count the number of distinct IDs that you've seen in your logs. Here is an example:

4f67bfc603106cb2

These 16 characters represent 128 bits. 65,536 such IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs alone require 384,000,000,000 bits, or roughly 45 gigabytes of storage. And that is just the space the ID field requires! To get the cardinality of IDs in our daily events we could take a simplistic approach: the most straightforward idea is an in-memory hash set containing the unique list of IDs seen in the input files. Even if we assume that only 1 in 3 records is unique, the hash set would still take 119 gigabytes of RAM, not including the overhead Java requires to store objects in memory. You would need a machine with several hundred gigabytes of memory to count distinct elements this way, and that covers only a single day's worth of unique IDs. The problem only gets harder if we want to count weeks or months of data. We certainly don't have a single machine with several hundred gigabytes of free memory sitting around, so we needed a better solution.
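For concreteness, the naive approach looks like this (a minimal Python sketch with one real-looking ID from the example above and one made-up ID; the actual inputs are billions of log entries):

```python
# Naive exact counting: keep every distinct ID in an in-memory set.
# Memory grows roughly linearly with the number of distinct IDs,
# which is what makes this infeasible at billions of elements.
seen = set()
for event_id in ("4f67bfc603106cb2", "deadbeef00000001", "4f67bfc603106cb2"):
    seen.add(event_id)

distinct = len(seen)  # exact cardinality: 2
```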

One common approach to this problem is the use of bitmaps. Bitmaps can be used to quickly and accurately get the cardinality of a given input. The basic idea with a bitmap is mapping the input dataset to a bit field using a hash function where each input element uniquely maps to one of the bits in the field. This produces zero collisions, and reduces the space required to count each unique element to 1 bit. While bitmaps drastically reduce the space requirements from the naive set implementation described above, they are still problematic when the cardinality is very high and/or you have a very large number of different sets to count. For example, if you want to count to one billion using a bitmap, you will need one billion bits, or roughly 120 megabytes for each counter. Sparse bitmaps can be compressed to gain space efficiency, but that is not always helpful.
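The bitmap idea can be sketched as follows (an illustrative Python version assuming the idealized collision-free mapping described above; here the IDs are small integers, so the ID itself serves as the bit index):

```python
# One bit per possible ID: a 1,000,000-ID space needs 125,000 bytes.
bitmap = bytearray(1_000_000 // 8)

def set_bit(i):
    # Mark bit i of the bitmap; duplicates just re-set the same bit.
    bitmap[i // 8] |= 1 << (i % 8)

for event_id in (42, 7, 42, 99999, 7):
    set_bit(event_id)

# The cardinality is simply the number of set bits (population count).
count = sum(bin(b).count("1") for b in bitmap)
```

At this rate, a one-billion-ID space needs one billion bits, which is where the ~120 megabytes per counter above comes from.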

Luckily, cardinality estimation is a popular area of research. We've leveraged this research to provide an open-source implementation of cardinality estimators, set membership detection, and top-k algorithms.

Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare's works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:

Counter        Bytes Used    Count    Error
HashSet        10,447,016    67,801   0%
Linear         3,384         67,080   1%
HyperLogLog    512           70,002   3%

The table shows that we can count the words with a 3% error rate using only 512 bytes of space. Compare that to a perfect count using a HashSet that requires nearly 10 megabytes of space and you can easily see why cardinality estimators are useful. In applications where accuracy is not paramount, which is true for most web-scale and network counting scenarios, using a probabilistic counter can result in tremendous space savings.

Linear Probabilistic Counter

The Linear Probabilistic Counter is space efficient and allows the implementer to specify the desired level of accuracy. This algorithm is useful when space efficiency is important but you need to be able to control the error in your results. This algorithm works in a two-step process. The first step allocates a bitmap in memory, initialized to all zeros. A hash function is then applied to each entry in the input data. The result of the hash function maps the entry to a bit in the bitmap, and that bit is set to 1. In the second step, the algorithm counts the number of empty bits and uses that count as input to the following equation to get the estimate.

n = -m * ln(Vn)

In the equation, m is the size of the bitmap and Vn is the ratio of empty bits to the size of the map. The important thing to note is that the size of the original bitmap can be much smaller than the expected max cardinality. How much smaller depends on how much error you can tolerate in the result. Because the size of the bitmap, m, is smaller than the total number of distinct elements, there will be collisions. These collisions are what make the structure space-efficient, but they are also the source of the error in the estimate. So by controlling the size of the original map we can estimate the number of collisions, and therefore the amount of error we will see in the end result.
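The two steps above can be sketched as follows (an illustrative Python version; SHA-1 stands in for whatever well-mixed hash function a production implementation would use):

```python
import hashlib
import math

def linear_count(items, m=4096):
    """Estimate cardinality with a linear probabilistic counter.

    Step 1: hash each item to a bit in an m-bit map and set it.
    Step 2: count empty bits and apply n = -m * ln(Vn).
    """
    bits = [0] * m
    for item in items:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) % m
        bits[h] = 1
    vn = bits.count(0) / m  # ratio of empty bits to map size
    return -m * math.log(vn)

# 2,000 distinct items hashed into a map of only 4,096 bits (512 bytes):
estimate = linear_count(str(i) for i in range(2000))
```

Even though m is only about twice the true cardinality here, the estimate typically lands within a few percent of 2,000, illustrating how collisions trade a controlled amount of error for space.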

Hyper LogLog

The Hyper LogLog Counter's name is self-descriptive. The name comes from the fact that you can estimate the cardinality of a set with cardinality Nmax using just loglog(Nmax) + O(1) bits. Like the Linear Counter, the Hyper LogLog counter allows the designer to specify the desired accuracy tolerances. In Hyper LogLog's case this is done by defining the desired relative standard deviation and the max cardinality you expect to count. Most counters work by taking an input data stream, M, and applying a hash function to that set, h(M). This yields an observable result S = h(M), a multiset of {0,1}^∞ strings. Hyper LogLog extends this concept by splitting the hashed input stream into m substreams and maintaining one observable for each of the m substreams. Taking the average of these observables yields a counter whose accuracy improves as m grows, while still requiring only a constant number of operations per element of the input set. The result is that, according to the authors of this paper, this counter can count one billion distinct items with an accuracy of 2% using only 1.5 kilobytes of space. Compare that to the 120 megabytes required by the bitmap implementation and the efficiency of this algorithm becomes obvious.
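A stripped-down Hyper LogLog can be written in a few dozen lines. The following is an illustrative Python sketch, not the full published algorithm: the register count p, the SHA-1-based hash, and the omission of the large-range correction are all simplifying assumptions.

```python
import hashlib
import math

def hash64(value):
    # Deterministic 64-bit hash (SHA-1 truncated to 8 bytes).
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                   # 2^p registers (substreams)
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        x = hash64(value)
        j = x >> (64 - self.p)                    # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)        # remaining bits
        rho = (64 - self.p) - w.bit_length() + 1  # position of leftmost 1-bit
        self.registers[j] = max(self.registers[j], rho)

    def count(self):
        # Normalized harmonic mean of 2^register over all m registers.
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:                   # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return est
```

With p = 12 this uses 4,096 registers and has a relative standard error of roughly 1.04/sqrt(m) ≈ 1.6%; the paper's 1.5 KB / 2% figure reflects a similar trade-off between m and accuracy.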

Merging Distributed Counters

We've shown that using the counters described above we can estimate the cardinality of large sets. However, what can you do if your raw input dataset does not fit on a single machine? This is exactly the problem we face at Clearspring. Our data is spread out over hundreds of servers, and each server contains only a partial subset of the total dataset. This is where the fact that we can merge the contents of a set of distributed counters is crucial. The idea is a little mind-bending, but if you take a moment to think about it the concept is not much different from basic cardinality estimation. Because the counters represent the cardinality as a set of bits in a map, we can take two compatible counters and merge their bits into a single map. The algorithms already handle collisions, so we can still get a cardinality estimation with the desired precision even though we never brought all of the input data to a single machine. This is terribly useful and saves us a lot of time and effort moving data around our network.
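For a linear probabilistic counter, merging is just a bitwise OR of the per-server bitmaps (an illustrative, self-contained sketch: each "server" here is a list of string IDs, and SHA-1 stands in for the production hash function):

```python
import hashlib
import math

M = 1 << 14  # 16,384 bits, shared by every server's counter

def build_bitmap(items, m=M):
    """One server's linear-counter bitmap over its shard of the data."""
    bits = [0] * m
    for item in items:
        bits[int(hashlib.sha1(item.encode()).hexdigest(), 16) % m] = 1
    return bits

def estimate(bits):
    m = len(bits)
    return -m * math.log(bits.count(0) / m)  # n = -m * ln(Vn)

# Two servers see overlapping shards of the same ID space.
server_a = build_bitmap(str(i) for i in range(6000))         # IDs 0..5999
server_b = build_bitmap(str(i) for i in range(4000, 10000))  # IDs 4000..9999

# Merge = bitwise OR; no raw IDs ever leave either server.
merged = [a | b for a, b in zip(server_a, server_b)]
union_estimate = estimate(merged)  # close to the true union size, 10,000
```

The same property holds for Hyper LogLog: two counters built with the same number of registers merge by taking the element-wise max of their register arrays.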

Next Steps

Hopefully this post has helped you better understand the concept and application of probabilistic counters. If estimating the cardinality of large sets is a problem and you happen to use a JVM based language then you should check out the stream-lib project — it provides implementations of the algorithms described above as well as several other stream-processing utilities.

Related Articles

 
