Use HBase to Solve Page Access Problem

This article discusses storing massive website access records in HBase and aggregating them with MapReduce, supporting both near-real-time queries and offline batch processing.


Currently I'm working on something like calculating how many hits a page gets. The problem is that the raw data can be huge, so it may not scale if you use an RDBMS.

The raw input is as follows.

Date Page User
-----------
date1 page1 user1
date1 page1 user2
date1 page2 user1
date1 page2 user3

... ...

So I need to answer questions like "for page1 on day1, how many distinct users visited it?" or "on day1, how many distinct users visited the web site as a whole?" That is, we need to support roll-up and drill-down on some columns.
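To make the target queries concrete, here is a minimal sketch (plain Python, no HBase, using the sample records above) of the two kinds of distinct-user counts we want to support:

```python
from collections import defaultdict

# Sample raw records: (date, page, user), as in the table above.
records = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]

# Drill-down: distinct users per (date, page).
per_page = defaultdict(set)
# Roll-up: distinct users per date across the whole site.
per_day = defaultdict(set)

for date, page, user in records:
    per_page[(date, page)].add(user)
    per_day[date].add(user)

print(len(per_page[("date1", "page1")]))  # 2 distinct users on page1
print(len(per_day["date1"]))              # 3 distinct users site-wide
```

Note that distinct-user counting requires keeping (or deduplicating) the user dimension; a plain hit counter alone cannot answer it.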

Before coding, I read some articles related to my problem. Here are the references.

1. [url]http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/[/url] (English)

2. [url]http://www.cnblogs.com/panfeng412/archive/2011/11/19/hbase-application-in-data-statistics.html[/url] (Chinese)

For the raw data we can simply write each record into HBase; the remaining challenge is how to calculate the aggregated result. One solution, mentioned in [2], is to keep a separate table holding the aggregated result: whenever a raw record is put into HBase, you also update the corresponding aggregated record. E.g.,

Table: Day_Page_Access

key value
20130304_page1 3400
20130304_page2 7800

When a raw record (20130304, page1, Tom) is processed, you read the current access count for row key 20130304_page1 (3400), increase it by 1, and write it back.
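The read-increment-write cycle can be sketched with an in-memory dict standing in for the Day_Page_Access table (a simulation, not HBase client code; in real HBase you would use the client's atomic increment operation, e.g. incrementColumnValue, rather than a separate get and put, to avoid lost updates under concurrency):

```python
# In-memory stand-in for the Day_Page_Access table from [2].
counters = {"20130304_page1": 3400, "20130304_page2": 7800}

def record_access(date: str, page: str, user: str) -> None:
    """On each raw record, bump the aggregated hit counter for that day/page."""
    row_key = f"{date}_{page}"  # same composite row-key scheme as the table above
    counters[row_key] = counters.get(row_key, 0) + 1

record_access("20130304", "page1", "Tom")
print(counters["20130304_page1"])  # 3401
```

Note this counts total hits per day/page; answering the distinct-user question this way would additionally need a per-(day, page, user) "seen" flag so each user is counted only once.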

I think the problem is that when large volumes of writes are mixed with these updates, overall performance will drop severely. But the benefit of this approach is that the aggregated result is available at any time, so you can serve queries in near real time.

The other solution, from [1], does a good job of leveraging Hadoop MapReduce to calculate the aggregated result. After all the raw data is loaded into HBase, a MapReduce job sums the access counts for each (day, page) pair. This solution suits scenarios that allow offline batch processing, and it can sustain great write throughput.
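The batch approach can be sketched as a map/shuffle/reduce over the raw records in plain Python (a simulation of the job structure in [1], adapted here to count distinct users rather than raw frequency; in a real job the shuffle is done by the framework):

```python
from collections import defaultdict

records = [
    ("20130304", "page1", "user1"),
    ("20130304", "page1", "user2"),
    ("20130304", "page1", "user1"),  # repeat visit by user1
    ("20130304", "page2", "user3"),
]

# Map: emit (day_page, user) for every raw record.
mapped = [(f"{date}_{page}", user) for date, page, user in records]

# Shuffle: group values by key (the MapReduce framework does this for you).
groups = defaultdict(list)
for key, user in mapped:
    groups[key].append(user)

# Reduce: count distinct users per key; each output row goes to the result table.
result = {key: len(set(users)) for key, users in groups.items()}
print(result)  # {'20130304_page1': 2, '20130304_page2': 1}
```

The same skeleton yields total hit counts if the reducer returns len(users) instead of len(set(users)).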

By the way, I'm working on the first solution (the real-time counter table).