Hadoop Components: The HBase Storage System and MapReduce (Part 3)
Through HBase's Java API we can run MapReduce jobs that read from or write to HBase. For example, we can use MapReduce to import data from the local file system into an HBase table, or read raw data out of HBase and analyze it with MapReduce.
-
Official HBase MapReduce

1. View the classpath entries required to run HBase MapReduce jobs
$ bin/hbase mapredcp

2. Export the environment variables
$ export HBASE_HOME=/opt/module/hbase
$ export HADOOP_HOME=/opt/module/hadoop-2.7.2
$ export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`

3. Run the official MapReduce jobs

Case 1: count the number of rows in the student table
$ cd /opt/module/hbase
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp` \
  hadoop jar lib/hbase-mapreduce-2.1.8.jar rowcounter student

Case 2: use MapReduce to import local data into an HBase table
1) Create a local TSV file, fruit.tsv:
1001	apple	red
1002	pear	yellow
1003	pineapple	yellow
2) Create the HBase table:
hbase(main):001:0> create 'fruit','info'
3) Create the /input_fruit directory in HDFS and upload fruit.tsv:
$ hdfs dfs -mkdir /input_fruit/
$ hdfs dfs -put fruit.tsv /input_fruit/
4) Run the importtsv MapReduce job to load the data into the fruit table:
$ hadoop jar lib/hbase-mapreduce-2.1.8.jar importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:color fruit \
  hdfs://hadoop102:9000/input_fruit
5) Use scan to check the imported data:
hbase(main):001:0> scan 'fruit'

Case 3: bulk-load data into the cluster
Bulk loading uses a MapReduce job to write table data in HBase's internal data format (StoreFiles) and then loads the generated StoreFiles directly into a running cluster. Compared with loading through the HBase API, bulk loading uses less CPU and network. A hedged Java sketch of the same two-step flow follows this list.
1) Prepare the data with a MapReduce job that writes StoreFiles, either with the importtsv tool or with a custom MapReduce job that writes its output through HFileOutputFormat2:
$ hadoop jar lib/hbase-mapreduce-2.1.8.jar importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:color \
  -Dimporttsv.bulk.output=hdfs://hadoop102:9000/storefileoutput fruit_2 \
  hdfs://hadoop102:9000/input_fruit
2) Complete the load:
$ hadoop jar lib/hbase-mapreduce-2.1.8.jar completebulkload \
  hdfs://hadoop102:9000/storefileoutput fruit_2
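For comparison with the two shell commands above, here is a minimal, hedged Java sketch of the same bulk-load flow driven from code. It assumes the HBase 2.1.x client API (HFileOutputFormat2.configureIncrementalLoad and LoadIncrementalHFiles); the BulkLoadDriver/BulkLoadMapper class names and the paths are illustrative assumptions, not part of the official tooling.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // Hypothetical mapper: parses "rowkey \t name \t color" TSV lines into Put objects.
    public static class BulkLoadMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            byte[] rowKey = Bytes.toBytes(fields[0]);
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(fields[1]));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("color"), Bytes.toBytes(fields[2]));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Path input = new Path("hdfs://hadoop102:9000/input_fruit");           // assumed TSV input
        Path storefiles = new Path("hdfs://hadoop102:9000/storefileoutput");  // assumed HFile output

        Job job = Job.getInstance(conf, "fruit_2 bulk load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(BulkLoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, storefiles);

        TableName tableName = TableName.valueOf("fruit_2");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(tableName);
             RegionLocator locator = connection.getRegionLocator(tableName);
             Admin admin = connection.getAdmin()) {

            // Step 1: have the job write HFiles (StoreFiles) sorted and partitioned
            // to match the current region boundaries of fruit_2.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }

            // Step 2: move the generated StoreFiles into the live table; this is the
            // programmatic counterpart of the completebulkload tool shown above.
            new LoadIncrementalHFiles(conf).doBulkLoad(storefiles, admin, table, locator);
        }
    }
}

configureIncrementalLoad wires in the sort and partition logic so the generated HFiles line up with the table's regions, which is what lets step 2 simply move files into place instead of pushing every cell through the normal write path.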
-
Custom HBase MapReduce (1)
Goal: copy part of the data in the fruit table into the fruit_mr table with a MapReduce job.
1. Custom Mapper class that reads data from HBase
package com.ityouxin.mapreduce;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

import java.io.IOException;
import java.util.List;

/**
 * @program: hbase
 * @description: Custom Mapper class: copies every cell of each row read from HBase into a Put
 * @author: lhx
 * @create: 2019-12-17 11:52
 **/
public class Mapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // Build one Put per row, keyed by the row key
        Put put = new Put(key.get());
        List<Cell> cellList = value.listCells();
        for (Cell cell : cellList) {
            put.addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell),
                    CellUtil.cloneValue(cell));
        }
        // Emit the row once, after all of its cells have been added
        context.write(key, put);
    }
}
2. Custom Reducer class that forwards the data
package com.ityouxin.mapreduce;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;

import java.io.IOException;
import java.util.Iterator;

/**
 * @program: hbase
 * @description: Custom Reducer: forwards each incoming Put to the output table unchanged
 * @author: lhx
 * @create: 2019-12-17 13:02
 **/
public class Reducer extends TableReducer<ImmutableBytesWritable, Put, ImmutableBytesWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable key, Iterable<Put> values, Context context)
            throws IOException, InterruptedException {
        // Write every Put straight through to the table configured in the driver
        Iterator<Put> iterator = values.iterator();
        while (iterator.hasNext()) {
            context.write(key, iterator.next());
        }
    }
}
3. Custom Driver class
package com.ityouxin.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;

/**
 * @program: hbase
 * @description: Custom driver class: wires the table mapper and table reducer into one job
 * @author: lhx
 * @create: 2019-12-17 13:06
 **/
public class Driver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = HBaseConfiguration.create();
        Job job = Job.getInstance(configuration);
        job.setJarByClass(Driver.class);

        // Read from the fruit table with a full-table Scan
        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob("fruit", scan, Mapper.class,
                ImmutableBytesWritable.class, Put.class, job);
        // Write the mapper output into the fruit_mr table
        TableMapReduceUtil.initTableReducerJob("fruit_mr", Reducer.class, job);

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
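Optionally, the Scan handed to initTableMapperJob is often tuned for MapReduce access patterns. A minimal sketch of those settings as a helper follows; the class name ScanTuning and the caching value 500 are illustrative assumptions, not from the original text.

import org.apache.hadoop.hbase.client.Scan;

public class ScanTuning {
    // Returns a Scan configured the way scan-heavy MapReduce jobs usually want it.
    public static Scan mapReduceScan() {
        Scan scan = new Scan();
        scan.setCaching(500);        // fetch more rows per RPC during the sequential scan
        scan.setCacheBlocks(false);  // avoid filling the block cache with one-off scan data
        return scan;
    }
}

The returned Scan could stand in for the plain new Scan() in the driver above.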
4. Package and run the job
$ /opt/module/hadoop-2.7.2/bin/yarn jar /opt/module/datas/hbasestudy-0.0.1-SNAPSHOT.jar \
  com.ityouxin.mapreduce.Driver

Tip: before running the job, create the target table if it does not already exist. A small Java sketch of that pre-check follows below.
Tip: Maven package commands: -P local clean package, or -P dev clean package install (bundles the third-party jars as well; requires the maven-shade-plugin).
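The pre-check in the first tip can also be done programmatically. A minimal sketch, assuming the HBase 2.x Admin API and the fruit_mr table with the info column family used in the driver above; the class name EnsureTable is an illustrative choice.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class EnsureTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            TableName tableName = TableName.valueOf("fruit_mr");
            if (!admin.tableExists(tableName)) {
                // Create the target table with the 'info' column family before submitting the job
                admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                        .build());
            }
        }
    }
}

Running this once before submitting the job guarantees the reducer has a table to write into.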
-
Custom HBase MapReduce (2)
Goal: write data from HDFS into an HBase table.
Step-by-step implementation:
1. Build the ReadFruitFromHDFSMapper class to read data from HDFS
package com.ityouxin.hbase.mr;

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReadFruitFromHDFSMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One line read from HDFS
        String lineValue = value.toString();
        // Split the line on \t into a String array
        String[] values = lineValue.split("\t");
        // Pick out the fields by position
        String rowKey = values[0];
        String name = values[1];
        String color = values[2];
        // Initialize the row key
        ImmutableBytesWritable rowKeyWritable = new ImmutableBytesWritable(Bytes.toBytes(rowKey));
        // Initialize the Put object
        Put put = new Put(Bytes.toBytes(rowKey));
        // Arguments: column family, column, value
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(name));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("color"), Bytes.toBytes(color));
        context.write(rowKeyWritable, put);
    }
}
2. Build the WriteFruitMRFromTxtReducer class
package com.ityouxin.hbase.mr;

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.NullWritable;

public class WriteFruitMRFromTxtReducer extends TableReducer<ImmutableBytesWritable, Put, NullWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable key, Iterable<Put> values, Context context)
            throws IOException, InterruptedException {
        // Write every Put read from the file into the table configured by the driver
        for (Put put : values) {
            context.write(NullWritable.get(), put);
        }
    }
}
3. Build the Txt2FruitRunner class to assemble the Job
// Inside the Txt2FruitRunner class (which extends Configured and implements Tool):
public int run(String[] args) throws Exception {
    // Get the Configuration
    Configuration conf = this.getConf();
    // Create the Job
    Job job = Job.getInstance(conf, this.getClass().getSimpleName());
    job.setJarByClass(Txt2FruitRunner.class);
    Path inPath = new Path("hdfs://hadoop102:9000/input_fruit/fruit.tsv");
    FileInputFormat.addInputPath(job, inPath);
    // Set the Mapper
    job.setMapperClass(ReadFruitFromHDFSMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    // Set the Reducer and the output table
    TableMapReduceUtil.initTableReducerJob("fruit_mr", WriteFruitMRFromTxtReducer.class, job);
    // Set the number of reduce tasks, at least 1
    job.setNumReduceTasks(1);
    boolean isSuccess = job.waitForCompletion(true);
    if (!isSuccess) {
        throw new IOException("Job running with error");
    }
    return isSuccess ? 0 : 1;
}
4. Invoke and run the Job
public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    int status = ToolRunner.run(conf, new Txt2FruitRunner(), args);
    System.exit(status);
}
5. Package and run
$ /opt/module/hadoop-2.7.2/bin/yarn jar hbasestudy-0.0.1-SNAPSHOT.jar com.ityouxin.hbase.mr.Txt2FruitRunner

Tip: before running the job, create the target table if it does not already exist.
Tip: Maven package commands: -P local clean package, or -P dev clean package install (bundles the third-party jars as well; requires the maven-shade-plugin).