Bulk-loading data into HBase with Spark
1. Query the data into a DataFrame
Run the source SQL and load the result into a DataFrame:
val basicData = sql(imSQL)
2. Convert to an RDD in the required format and sort it
Turn the DataFrame into an RDD of (ImmutableBytesWritable, KeyValue) pairs and sort them:
import scala.collection.JavaConverters._

val res = basicData.rdd
  .flatMap { row =>
    // KeyValues within a single row must be ordered by HBase's own comparator
    val kvs = new util.TreeSet[KeyValue](KeyValue.COMPARATOR)
    val uuid = row.getAs[String]("uuid")
    // Derive the rowkey from the uuid (MD5 hash)
    val rowkeyValue = MD5Util.getMD5forUserid(uuid)
    val rowkey = new ImmutableBytesWritable()
    rowkey.set(Bytes.toBytes(rowkeyValue))
    // TODO: populate the column KeyValues
    serColumn2KVs(kvs)
    kvs.asScala.toSeq.map(kv => (rowkey, kv))
  }
  .sortBy(_._2.getKeyString)
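The `serColumn2KVs` helper is left as a TODO above. A minimal sketch of what it might do is shown below; note that this is an assumption, not the original implementation: the column family name `"cf"`, the field list, and the explicit `rowkey`/`row` parameters (the original call site passes only `kvs`, presumably closing over the row) are all hypothetical.

```scala
// Hypothetical sketch: build one KeyValue per column and add it to the
// TreeSet, which keeps them in KeyValue.COMPARATOR order.
// The family "cf" and the field names are assumptions for illustration.
def serColumn2KVs(kvs: util.TreeSet[KeyValue],
                  rowkey: Array[Byte],
                  row: Row): Unit = {
  val family = Bytes.toBytes("cf")
  Seq("uuid", "name").foreach { field => // assumed field names
    val value = row.getAs[String](field)
    if (value != null) {
      kvs.add(new KeyValue(rowkey, family,
        Bytes.toBytes(field), Bytes.toBytes(value)))
    }
  }
}
```

Because the set is ordered by `KeyValue.COMPARATOR`, the per-row KeyValues come out already sorted by column, which HFile writing requires.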
3. Load into HBase
Write the RDD out as HFiles, then bulk-load the files into the HBase table:
def save2Hbase(res: RDD[(ImmutableBytesWritable, KeyValue)]): Unit = {
  // Write the sorted (rowkey, KeyValue) pairs out as HFiles
  res.saveAsNewAPIHadoopFile(OUTPUT, classOf[ImmutableBytesWritable],
    classOf[KeyValue], classOf[HFileOutputFormat2])
  val conf = HBaseConfiguration.create
  conf.set("hbase.zookeeper.quorum", "***")
  conf.set("hbase.rootdir", "hdfs://hbase-58-cluster/home/hbase")
  val table = new HTable(conf, Constant.TABLE_NAME)
  // Move the generated HFiles into the table's regions
  val bulkLoader = new LoadIncrementalHFiles(conf)
  bulkLoader.doBulkLoad(new Path(OUTPUT), table)
}
4. Issues encountered
1. The data must be sorted by rowkey and by column before the HFiles are written; otherwise the write fails with an ordering error.
2. The HFile output path must be a fresh path each time the job runs; writing to an existing directory fails.
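For the second issue, one common workaround is to delete any stale output directory before writing. A sketch using the standard Hadoop `FileSystem` API (the helper name `cleanOutputDir` is mine; `OUTPUT` is the path from the code above):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove a leftover HFile output directory so saveAsNewAPIHadoopFile
// does not fail because the output path already exists.
def cleanOutputDir(conf: Configuration, output: String): Unit = {
  val fs = FileSystem.get(conf)
  val path = new Path(output)
  if (fs.exists(path)) {
    fs.delete(path, true) // recursive delete
  }
}
```

Call it with the same Hadoop configuration and `OUTPUT` path used by the job, before `saveAsNewAPIHadoopFile`.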