A note to self: start by translating the HBase docs. Hadoop is not necessarily needed in full, but HBase is essential here (it is built on Hadoop's HDFS and so actually depends on Hadoop; for single-machine testing you do not need to install or configure Hadoop, but for a distributed deployment you still do). I also took a look at Cassandra and Accumulo; they are all much the same, though I have not gone down to the source-code level.
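To illustrate that standalone-versus-distributed point, here is a minimal sketch of hbase-site.xml. The file:///tmp/hbase path and the hdfs://namenode:8020 address are made-up placeholders, not tested values; only the property names (hbase.rootdir, hbase.cluster.distributed) come from HBase itself.

```xml
<!-- hbase-site.xml sketch for standalone mode: data goes to the local
     filesystem, so no Hadoop installation is needed. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- placeholder path on the local filesystem -->
    <value>file:///tmp/hbase</value>
  </property>
  <!-- For a fully-distributed setup you would instead point hbase.rootdir
       at HDFS (e.g. hdfs://namenode:8020/hbase, a placeholder address) and
       set hbase.cluster.distributed to true; that is where Hadoop comes in. -->
</configuration>
```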
When Would I Use HBase?
Use HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
- Easy to use Java API for client access (see the client sketch after this list).
- Block cache and Bloom Filters for real-time queries.
- Query predicate push down via server side Filters
- Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
- Extensible jruby-based (JIRB) shell
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
(A side note on the Bloom filter item above: a Bloom filter is a fairly efficient hash-based in-memory index with a built-in tension: on the one hand it can test very quickly whether an element is present, on the other hand it unavoidably has some false-positive rate.)
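To make the "Java API" and "server-side Filters" items concrete, here is a minimal client sketch. It assumes the pre-1.0 HBase client API (HTable-style); the table name 'test', column family 'cf', and the row keys and values are made-up placeholders for illustration, and the table is assumed to exist already.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        // Assumes a table 'test' with column family 'cf' already exists.
        HTable table = new HTable(conf, "test");
        try {
            // Write one cell: row 'row1', column 'cf:a', value 'value1'.
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("value1"));
            table.put(put);

            // Random, realtime read of the same row.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            System.out.println("get: " + Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"))));

            // Scan with a server-side filter: the predicate is evaluated on
            // the RegionServers, so only matching rows come back to the client.
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("cf"), Bytes.toBytes("a"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("value1")));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    System.out.println("scan: " + r);
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```

The filter on the scan is the "query predicate push down" from the feature list: the comparison runs where the data lives, so unmatched rows never cross the network.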
How to start JIRB:
$ ./bin/hbase org.jruby.Main PATH_TO_SCRIPT
PATH_TO_SCRIPT points to a .rb file. Ruby and Python really are quite popular these days...
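Besides running a script file this way, the same JIRB-based shell can be used interactively via bin/hbase shell. A quick throwaway session might look like the following (the table name 'test' and column family 'cf' are placeholders, matching the sketch above):

```
$ ./bin/hbase shell
hbase> create 'test', 'cf'
hbase> put 'test', 'row1', 'cf:a', 'value1'
hbase> get 'test', 'row1'
hbase> scan 'test'
```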