Studying HBase in Depth with "HBase: The Definitive Guide": Table Definition and Basic Operations

License: CC 4.0 BY-SA


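This article walks through creating a user table in HBase, obtaining a table instance, the basic operation types (Put, Get, Delete, Scan), and batch operations (Batch), along with the caveats of each.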


Defining a user table (HTable) in HBase takes only a few steps:


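                  import org.apache.hadoop.conf.Configuration;
                  import org.apache.hadoop.hbase.HBaseConfiguration;
                  import org.apache.hadoop.hbase.HColumnDescriptor;
                  import org.apache.hadoop.hbase.HTableDescriptor;
                  import org.apache.hadoop.hbase.client.HBaseAdmin;

                  Configuration conf = HBaseConfiguration.create(); // build a Configuration instance from the HBase configuration factory
                  HBaseAdmin admin = new HBaseAdmin(conf);
                  HTableDescriptor htableDesc = new HTableDescriptor("users"); // declare a table named "users"
                  HColumnDescriptor columnFamilyDesc = new HColumnDescriptor("info"); // declare a column family named "info"
                  columnFamilyDesc.setMaxVersions(3); // keep up to 3 versions per cell
                  htableDesc.addFamily(columnFamilyDesc); // add the "info" column family to the "users" table
                  admin.createTable(htableDesc); // create the "users" table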

There are two ways to obtain an instance of a user table when working with it:
1. Create an HTable instance directly:


                  Configuration conf = HBaseConfiguration.create();
                  HTableInterface userTable = new HTable(conf,"users");

2. Obtain the table from an HTablePool connection pool:


                  HTablePool htablePool = new HTablePool();
                  //HTablePool htablePool = new HTablePool(30);
                  HTableInterface userTable = htablePool.getTable("users");

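HTable has the following characteristics:
1. [b]An HTable instance is not thread-safe.[/b] The API documentation states:
[i]"This class is not thread safe for updates; the underlying write buffer can be corrupted if multiple threads contend over a single HTable instance."[/i]
2. Configuration instances should be shared as much as possible. With the first approach, every call creates a new Configuration object, and each Configuration effectively stands for a separate HBase connection, so nothing is shared between the resulting tables. It is better to create all HTable instances from the same Configuration instance.
3. Creating an HTable instance is a very expensive operation.
Given these characteristics, prefer the second approach. An HTablePool holds a single shared Configuration and offers a thread-confinement approach: each thread borrows its own table instance, which keeps multithreaded use safe.
An HTable obtained the first way must be closed after use. For one obtained the second way, calling userTable.close() returns the object to the htablePool.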
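
The borrow-and-return cycle described above can be pictured with a small sketch. This is a toy pool in plain Java, not HTablePool itself: the class TablePoolSketch, its Handle type, and the method names borrow and giveBack are all made up for illustration, on the assumption that creating a handle is the expensive step worth avoiding.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy object pool illustrating reuse of expensive handles; not HBase code. */
class TablePoolSketch {
    /** Hypothetical stand-in for a table handle that is costly to create. */
    static final class Handle {
        final int id;
        Handle(int id) { this.id = id; }
    }

    private final Deque<Handle> idle = new ArrayDeque<>();
    private int created = 0;

    /** Hands out an idle handle if one exists, otherwise creates a new one. */
    synchronized Handle borrow() {
        Handle h = idle.poll();
        return h != null ? h : new Handle(++created);
    }

    /** Models what close() means for a pooled table: the handle goes back to the pool. */
    synchronized void giveBack(Handle h) {
        idle.push(h);
    }

    /** Number of handles that had to be created so far. */
    synchronized int created() {
        return created;
    }

    public static void main(String[] args) {
        TablePoolSketch pool = new TablePoolSketch();
        Handle first = pool.borrow();
        pool.giveBack(first);             // "close" returns it instead of destroying it
        Handle second = pool.borrow();    // the same handle is reused
        System.out.println("reused=" + (first == second) + ", created=" + pool.created());
    }
}
```

Each caller holds a handle exclusively between borrow and giveBack, which is the confinement idea: no two threads use the same handle at the same time, while the creation cost is paid only once per handle.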

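HBase has [b]4[/b] basic operation types: [b]Put, Get, Delete, Scan[/b]. By default, every Put, Get, or Delete call on an HTable performs one RPC, and every iteration over the result set of a Scan is also one RPC. In a bulk-loading scenario this adds up: submitting 1000 Put operations one by one means 1000 RPC round trips to the server, which is a lot of avoidable network overhead.
HBase has a built-in client-side write buffer that can send multiple writes to the server in a single RPC. To use it, call
[b] userTable.setAutoFlush(false); // the default is true[/b]
to turn off the default auto-flush behavior. Puts are then held in client memory until
[b]userTable.flushCommits(); or userTable.close();[/b]
is called to submit the buffered changes. Data can be read back only after it has been submitted. Disabling auto-flush has one drawback: if something goes wrong on the client when the RPC is made, part of the data may be lost.
[b]Note: userTable.close() implicitly calls userTable.flushCommits().[/b]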
[img]http://dl.iteye.com/upload/attachment/0083/8797/81dc431a-0d79-3d14-b54d-642b25250cc0.jpg[/img]
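
To make the RPC arithmetic above concrete, here is a toy model of a client-side write buffer in plain Java. It is not HBase code: the class WriteBufferSketch and its fields are hypothetical, the "server" is just a list, and the model ignores that a real client also flushes on its own once the buffer reaches its configured size.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a client-side write buffer; it is not HBase code. */
class WriteBufferSketch {
    private final boolean autoFlush;
    private final List<String> buffer = new ArrayList<>(); // writes held on the client
    private final List<String> server = new ArrayList<>(); // stands in for the server
    private int rpcCount = 0;

    WriteBufferSketch(boolean autoFlush) {
        this.autoFlush = autoFlush;
    }

    void put(String row) {
        buffer.add(row);
        if (autoFlush) {
            flushCommits(); // one round trip per put
        }
    }

    /** Sends everything buffered so far in a single simulated RPC. */
    void flushCommits() {
        if (buffer.isEmpty()) {
            return;
        }
        server.addAll(buffer);
        buffer.clear();
        rpcCount++;
    }

    /** Returns {rpcs, rows visible before the final flush, rows visible after it}. */
    static int[] run(boolean autoFlush, int puts) {
        WriteBufferSketch table = new WriteBufferSketch(autoFlush);
        for (int i = 0; i < puts; i++) {
            table.put("row-" + i);
        }
        int visibleBefore = table.server.size();
        table.flushCommits();
        return new int[] { table.rpcCount, visibleBefore, table.server.size() };
    }

    public static void main(String[] args) {
        int[] auto = run(true, 1000);
        int[] buffered = run(false, 1000);
        System.out.println("autoFlush=true : rpcs=" + auto[0]);
        System.out.println("autoFlush=false: rpcs=" + buffered[0]
                + ", visible before flush=" + buffered[1]);
    }
}
```

With auto-flush on, the model makes 1000 simulated RPCs; with it off, it makes one, and nothing is visible on the "server" until the flush. Anything still in the buffer when the client dies is lost, which is the drawback noted above.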

[b]Put operation:[/b]
[b]A Put covers both inserting and updating rows in an HBase table.[/b] Besides submitting a single Put object, you can also submit a list of Puts in one call:


                userTable.put(put);
                userTable.put(putList);

For example:


          HTablePool htablePool = new HTablePool(); 
          HTableInterface userTable = htablePool.getTable("users"); 

          /**
           * Submit a single change
           */
          Put singlePut = new Put(Bytes.toBytes("张三丰13560204"));
          singlePut.add(Bytes.toBytes("info"),Bytes.toBytes("sex"),Bytes.toBytes("male"));
          userTable.put(singlePut);

          /**
           * Submit several changes as one list
           */
          List<Put> putList = new ArrayList<Put>(3);
          singlePut = new Put(Bytes.toBytes("杨过12760204"));
          singlePut.add(Bytes.toBytes("info"),Bytes.toBytes("address"),Bytes.toBytes("湖北"));
          putList.add(singlePut);

          singlePut = new Put(Bytes.toBytes("小龙女12760204"));
          singlePut.add(Bytes.toBytes("info"),Bytes.toBytes("address"),Bytes.toBytes("湖北"));

          putList.add(singlePut);
          singlePut = new Put(Bytes.toBytes("段誉11760204"));
          singlePut.add(Bytes.toBytes("info"),Bytes.toBytes("address"),Bytes.toBytes("大理"));
          putList.add(singlePut);
          userTable.put(putList); // submit the whole list at once
          userTable.close();

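[b]Note: unlike a transaction in a traditional relational database, userTable.put(putList) does not guarantee that the puts either all succeed or all fail. The following example shows this:[/b]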


          HTablePool htablePool = new HTablePool(); 
          HTableInterface userTable = htablePool.getTable("users"); 

          List<Put> putList = new ArrayList<Put>(3);
          Put put1 = new Put(Bytes.toBytes("李三19260204"));
          put1.add(Bytes.toBytes("info"),Bytes.toBytes("address"),Bytes.toBytes("陕西"));
          putList.add(put1);

          Put put2 = new Put(Bytes.toBytes("王五19760204"));   
          put2.add(Bytes.toBytes("info"),Bytes.toBytes("sex"),Bytes.toBytes("female"));
          putList.add(put2);

          Put put3 = new Put(Bytes.toBytes("王五19460204"));
          put3.add(Bytes.toBytes("empty"),Bytes.toBytes("sex"), Bytes.toBytes("female")); // note: the users table was defined without an "empty" column family
          putList.add(put3);

          try{
              userTable.put(putList);
          }catch(Exception e){
              System.err.println("Error: " + e);
              userTable.flushCommits();
          }
          userTable.close();

After this runs, only the rows with row keys "李三19260204" and "王五19760204" are saved; the row with rowkey="王五19460204" is not. Submitting the list reports the following errors when it reaches that row:


           Error:java.lang.IllegalArgumentException: No columns to insert
           Exception in thread "main"
           org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
           Failed 1 action: NoSuchColumnFamilyException: 1 time,
           servers with issues:10.0.0.57:51640

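The Error line comes from a client-side check; the second exception is the remote, server-side exception surfaced by userTable.flushCommits().
[b]Note: because userTable.setAutoFlush(false) was called earlier and enabled the client-side write buffer, the client-side check does not fail immediately; it is deferred until the buffer is flushed. In that situation, checkAndPut(), which is sent to the server right away instead of being buffered, reports the error without the delay.[/b]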
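
The "some succeed, some fail" behavior of a list of puts can be sketched in plain Java. This is a toy model under stated assumptions, not the HBase client: the class PartialBatchSketch, its FAMILIES set, and the {rowKey, family} array encoding are invented for illustration, and ASCII row keys replace the ones used in the example above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a non-atomic list of puts; not HBase code. */
class PartialBatchSketch {
    // Column families the hypothetical table defines.
    static final Set<String> FAMILIES = Collections.singleton("info");

    // Rows that were stored, in insertion order: row key -> family.
    final Map<String, String> stored = new LinkedHashMap<>();

    /** Applies each {rowKey, family} pair on its own and returns the row keys that failed. */
    List<String> putAll(List<String[]> puts) {
        List<String> failed = new ArrayList<>();
        for (String[] put : puts) {
            if (FAMILIES.contains(put[1])) {
                stored.put(put[0], put[1]); // this put succeeds independently
            } else {
                failed.add(put[0]);         // this put fails; earlier ones are not rolled back
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        PartialBatchSketch table = new PartialBatchSketch();
        List<String[]> puts = new ArrayList<>();
        puts.add(new String[] { "row-1", "info" });
        puts.add(new String[] { "row-2", "info" });
        puts.add(new String[] { "row-3", "empty" }); // "empty" is not a defined family
        List<String> failed = table.putAll(puts);
        System.out.println("stored=" + table.stored.keySet() + " failed=" + failed);
    }
}
```

The point of the sketch is only that there is no rollback: the caller has to inspect which operations failed and decide whether to retry or compensate.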

[b]Get operation:[/b]
[b]A Get is a read operation on an HBase table.[/b] You can read one record at a time or several at once:


          userTable.get(get);
          userTable.get(getList);

A Get can fetch an entire row, only one column family of a row, or a single specific cell:


          HTablePool htablePool = new HTablePool(); 
          HTableInterface userTable = htablePool.getTable("users"); 

          /**
           * Fetch an entire row
           */
          Get get = new Get(Bytes.toBytes("张三丰13560204"));
          Result result = userTable.get(get); 

          /**
           * Fetch only the "info" column family of a row
           */
          get = new Get(Bytes.toBytes("张三丰13560204"));
          get.addFamily(Bytes.toBytes("info"));
          result = userTable.get(get); 

          /**
           * Fetch a single specific cell
           */
          get = new Get(Bytes.toBytes("张三丰13560204"));
          get.addColumn(Bytes.toBytes("info"),Bytes.toBytes("address"));
          result = userTable.get(get);

An example that issues several Gets in a single RPC:


          HTablePool htablePool = new HTablePool(); 
          HTableInterface userTable = htablePool.getTable("users");

          List<Get> getList = new ArrayList<Get>(3);
          Get get = new Get(Bytes.toBytes("张三丰13560204"));          
          getList.add(get);

          get = new Get(Bytes.toBytes("段誉11760204"));
          getList.add(get);//2

          get = new Get(Bytes.toBytes("小龙女12760204"));
          getList.add(get);//3

          Result[] results = userTable.get(getList);
          for(Result result : results){
              // process each result here
          }

When a List<Get> fetches several rows in one RPC, a failure in any single Get fails the whole call; there is no partial success as with a batched List<Put>. For example:


         HTablePool htablePool = new HTablePool(); 
         HTableInterface userTable = htablePool.getTable("users");

         List<Get> getList = new ArrayList<Get>(3);
         Get get = new Get(Bytes.toBytes("张三丰13560204"));
         get.addColumn(Bytes.toBytes("info"),Bytes.toBytes("address"));
         getList.add(get);//1

         get = new Get(Bytes.toBytes("段誉11760204"));
         get.addColumn(Bytes.toBytes("info"),Bytes.toBytes("sex"));
         getList.add(get);//2

         get = new Get(Bytes.toBytes("小龙女12760204"));
         get.addColumn(Bytes.toBytes("no_such_cf"),Bytes.toBytes("address")); // "no_such_cf" is not defined on the table
         getList.add(get);//3

         Result[] results = userTable.get(getList);
         for(Result result : results){
             // process each result here
         }

Because users has no "no_such_cf" column family, [i]userTable.get(getList);[/i] throws the following exception:


         org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:  
         Failed 1 action: NoSuchColumnFamilyException: 1 time,  
         servers with issues:10.0.0.57:51640

[b]Get operations are atomic at the row level.[/b]

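[b]Delete operation:[/b]
[b]A Delete removes data from an HBase table.[/b] As earlier chapters explained, HBase has no true in-place "delete": executing a delete on an HTable actually appends a record flagged as a delete marker to the table.
A delete can remove an entire row, a specific cell, or a whole column family. It can remove one row at a time or several rows at once.
Naming a column family that does not exist in a delete raises an error; checkAndDelete() can be called so that the exception is caught and handled on the client.
When a List<Delete> is batched and one of the deletes fails, the behavior is the same as for a batched List<Put>.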
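
The idea that a delete appends a marker rather than erasing data can be sketched for a single cell in plain Java. This is a toy model, not HBase's storage format: the class TombstoneSketch, its Entry type, and the compact() method are hypothetical, and compact() only mimics the notion that masked data is physically dropped later (in HBase, during a major compaction).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of delete markers on one cell; not HBase code. */
class TombstoneSketch {
    private static final class Entry {
        final long ts;
        final String value; // null marks a delete marker
        Entry(long ts, String value) { this.ts = ts; this.value = value; }
    }

    // Append-only log: puts and delete markers are both just new entries.
    private final List<Entry> log = new ArrayList<>();

    void put(long ts, String value) { log.add(new Entry(ts, value)); }

    void delete(long ts) { log.add(new Entry(ts, null)); }

    private long newestMarker() {
        long newest = Long.MIN_VALUE;
        for (Entry e : log) {
            if (e.value == null) newest = Math.max(newest, e.ts);
        }
        return newest;
    }

    /** Newest value that is not masked by a delete marker, or null if none. */
    String get() {
        long masked = newestMarker();
        Entry best = null;
        for (Entry e : log) {
            if (e.value != null && e.ts > masked && (best == null || e.ts > best.ts)) {
                best = e;
            }
        }
        return best == null ? null : best.value;
    }

    int physicalEntries() { return log.size(); }

    /** Drops the markers and every entry they mask. */
    void compact() {
        final long masked = newestMarker();
        log.removeIf(e -> e.value == null || e.ts <= masked);
    }

    public static void main(String[] args) {
        TombstoneSketch cell = new TombstoneSketch();
        cell.put(1, "a");
        cell.put(2, "b");
        cell.delete(3);
        System.out.println(cell.get() + " / entries=" + cell.physicalEntries());
        cell.compact();
        System.out.println("entries after compact=" + cell.physicalEntries());
    }
}
```

After the delete, reads see nothing even though all three entries are still physically present; only the compaction step shrinks the log.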

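[b]Scan operation:[/b]
A Scan is a sequential, disk-based read over a table, comparable to a cursor in a relational database. By default, each object fetched while iterating over the result set costs one RPC, so for performance reasons a scan can be given a cache.
Scan and Get are both read operations, but they differ in one clear way: a Get requires a specific rowkey, while a Scan does not; a Scan usually queries a range of rows.
Scan is a powerful operation and provides the following constructors: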


         Scan()
         Scan(byte[] startRow,Filter filter)
         Scan(byte[] startRow)
         Scan(byte[] startRow,byte[] stopRow)

The scan results are obtained through the following methods:


         ResultScanner getScanner(Scan scan) throws IOException
         ResultScanner getScanner(byte[] family) throws IOException
         ResultScanner getScanner(byte[] family,byte[] qualifier) throws IOException

Note: a Scan that names a column family the table does not define fails with an error.
The following code is an example of a scan:


         HTablePool htablePool = new HTablePool();   
         HTableInterface userTable = htablePool.getTable("users"); 

         Scan scan = new Scan();
         scan.addFamily(Bytes.toBytes("info"));
         scan.setStartRow(Bytes.toBytes("段誉11760204"));
         //scan.setStopRow(Bytes.toBytes("张三丰13560204"));
         ResultScanner scanner = userTable.getScanner(scan);
         for(Result rs : scanner){
             // do something here
         }
         scanner.close(); // always close the scanner once iteration is done
         userTable.close();
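
The range semantics of a scan can be illustrated without a cluster. The sketch below is a toy model in plain Java: ScanRangeSketch and sampleTable are hypothetical names, a TreeMap stands in for the sorted table, and ASCII row keys are used so that String ordering matches the byte-wise ordering HBase applies to row keys. It assumes the usual convention that the start row is inclusive and the stop row is exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of a range scan over sorted row keys; not HBase code. */
class ScanRangeSketch {
    /**
     * Returns the row keys in [startRow, stopRow).
     * A null stopRow means "scan to the end of the table".
     */
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        NavigableMap<String, String> range = (stopRow == null)
                ? table.tailMap(startRow, true)                 // start row is inclusive
                : table.subMap(startRow, true, stopRow, false); // stop row is exclusive
        return new ArrayList<>(range.keySet());
    }

    /** Builds a small sorted table; the cell value is a placeholder. */
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String row : new String[] { "row-3", "row-10", "row-1", "row-2" }) {
            table.put(row, "placeholder");
        }
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        System.out.println(scan(table, "row-10", "row-3")); // [row-10, row-2]
        System.out.println(scan(table, "row-2", null));     // [row-2, row-3]
    }
}
```

Note that "row-10" sorts before "row-2": keys are compared character by character, not numerically, which is why row key design matters for range scans.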

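As noted above, iterating over a scanner's results triggers one RPC per row by default, with the client looping over RPC calls. That is clearly a significant performance cost. To speed up a Scan by fetching several rows per RPC, enable scanner caching, which is disabled by default.
There are several places to enable it. At the table level, set the number of rows the client-side scanner cache holds, for example: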


         table.setScannerCaching(20); // table is an HTable; the default is 1

The scanner cache size can also be set globally in HBase's hbase-site.xml configuration file:


         <property>
             <name>hbase.client.scanner.caching</name>
             <value>20</value>
         </property>

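It can also be set for an individual scan, on the Scan instance:


         scan.setCaching(20);

Any one of these settings enables the cache. Take care to choose a sensible size, though: a value that is too large can lead to out-of-memory errors and scanner timeouts.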
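
The trade-off behind the caching value is simple arithmetic, sketched below in plain Java. The class ScannerCachingSketch and both method names are hypothetical, and the model counts only data-fetch round trips: a real scanner also spends calls on opening and closing, and the average row size is an assumption the caller has to supply.

```java
/** Back-of-the-envelope arithmetic for scanner caching; not HBase code. */
class ScannerCachingSketch {
    /** Data-fetch round trips needed to iterate rows when each fetch returns up to caching rows. */
    static long fetchRpcs(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    /** Rough client memory held by one fetched batch, for an assumed average row size. */
    static long bytesPerFetch(int caching, long avgRowBytes) {
        return (long) caching * avgRowBytes;
    }

    public static void main(String[] args) {
        System.out.println(fetchRpcs(1000, 1));         // 1000
        System.out.println(fetchRpcs(1000, 20));        // 50
        System.out.println(bytesPerFetch(20, 10_000));  // 200000
    }
}
```

Raising the value cuts round trips roughly in proportion, while the memory held per fetch, and the time the client spends processing one batch before asking for the next, grow with it. That is where the out-of-memory and timeout risks come from.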

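[b]Batch operation:[/b]
The List<Put>, List<Get>, and List<Delete> calls described above can each handle only one type of operation per RPC. A Batch combines Put, Get, Delete, and similar operations into one batched request handled by a single RPC, as shown below: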


         HTablePool htablePool = new HTablePool();   
         HTableInterface userTable = htablePool.getTable("users");

         List<Row> batch = new ArrayList<Row>();

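         Get get = new Get(Bytes.toBytes("张三丰13560204"));
         batch.add(get);

         Delete delete = new Delete(Bytes.toBytes("王五19460204"));
         batch.add(delete);

         Put put = new Put(Bytes.toBytes("小张20120406"));
         put.add(Bytes.toBytes("info"),Bytes.toBytes("sex"),Bytes.toBytes("male")); // illustrative column; a Put needs at least one
         batch.add(put);

         Object[] results = new Object[batch.size()]; // one result slot per operation
         try{
             userTable.batch(batch,results);
         }catch(Exception e){
             System.err.println("Error: " + e);
         }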
         }