Reading and Writing HBase Data with Spark

This post walks through three common data-interaction tasks in big-data development: creating an HBase table and defining its properties with Spark, reading HBase data and writing it to Elasticsearch, and writing data from Hive into HBase.
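All three examples below are Scala jobs and assume that the HBase client API (the 1.x API used in these snippets), the Spark SQL/Hive modules, and the elasticsearch-hadoop connector are on the classpath. A minimal build.sbt sketch follows; the artifact versions are assumptions, so match them to your own Spark/HBase/Elasticsearch deployment:

// build.sbt -- illustrative only; version numbers are assumptions
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-hive" % "2.4.8" % "provided",
  "org.apache.hbase"  % "hbase-client" % "1.4.13",
  "org.apache.hbase"  % "hbase-common" % "1.4.13",
  "org.apache.hbase"  % "hbase-server" % "1.4.13",
  "org.elasticsearch" %% "elasticsearch-spark-20" % "7.10.2"
)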


1. Creating an HBase table with Spark and defining its properties

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HConstants, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.{Admin, ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseCreateTable {
  def main(args: Array[String]): Unit = {
    val TABLE_NAME = "test_yuan"
    val hBaseConf = HBaseConfiguration.create()
    hBaseConf.set(HConstants.ZOOKEEPER_QUORUM, "bq2.bq.cn,bq1.bq.cn")
    hBaseConf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
    val connect = ConnectionFactory.createConnection(hBaseConf)
    val admin = connect.getAdmin
    try {
      // 1. Drop the table if it already exists
      if (admin.tableExists(TableName.valueOf(TABLE_NAME))) {
        admin.disableTable(TableName.valueOf(TABLE_NAME))
        admin.deleteTable(TableName.valueOf(TABLE_NAME))
      }
      // 2. Build the table descriptor
      val h_table = new HTableDescriptor(TableName.valueOf(TABLE_NAME))
      val column = new HColumnDescriptor("base".getBytes())
      //column.setBlockCacheEnabled(true)
      //column.setBlocksize(2222222)
      // Add the column families
      h_table.addFamily(column)
      h_table.addFamily(new HColumnDescriptor("gps".getBytes()))
      // 3. Create the table
      admin.createTable(h_table)
      val table = connect.getTable(TableName.valueOf(TABLE_NAME))

      // Insert 5 rows
      for (i <- 1 to 5) {
        // The Put constructor takes the row key
        val put = new Put(Bytes.toBytes("row" + i))
        // The column family must already exist; the qualifier can be arbitrary
        put.addColumn(Bytes.toBytes("base"), Bytes.toBytes("name"), Bytes.toBytes("value " + i))
        put.addColumn(Bytes.toBytes("base"), Bytes.toBytes("famm"), Bytes.toBytes("value " + i))
        table.put(put)
      }
      table.close()
    } catch {
      case ex: Exception => ex.printStackTrace()
    } finally {
      releaseConn(admin)
    }
  }

  def releaseConn(admin: Admin): Unit = {
    try {
      if (admin != null) {
        admin.close()
      }
    } catch {
      case ex: Exception => ex.getMessage
    }
  }
}
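To sanity-check the inserts, the rows can be read back with a Get before the connection is released. The fragment below is only a sketch meant to sit inside main() right before table.close(), reusing the table, TABLE_NAME and Bytes values already in scope:

// Hypothetical verification step: read row1 back and print its base:name cell
val get = new org.apache.hadoop.hbase.client.Get(Bytes.toBytes("row1"))
val res = table.get(get)
val name = Bytes.toString(res.getValue(Bytes.toBytes("base"), Bytes.toBytes("name")))
println(s"row1 base:name = $name")   // expected output: value 1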

2. Reading HBase data and writing it to Elasticsearch

import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.rdd.EsSpark

object HbaseToES {
  def main(args: Array[String]): Unit = {
    val zookeeper_quorum = "bq2.bq.cn,bq1.bq.cn"
    val zookeeper_client_port = "2181"
    // ConfigUtil is a project-local helper that loads the application configuration
    val config = ConfigUtil.getConfig
    val sparkConf = new SparkConf().setAppName("HbaseToES")
      .set("es.nodes", config.getString("app.es.ips"))
      .set("es.port", config.getString("app.es.port"))
      .set("es.index.auto.create", "true")
      .set("es.net.http.auth.user", config.getString("app.es.es_user_name"))
      .set("es.net.http.auth.pass", config.getString("app.es.es_user_pass"))

    val ssc = SparkSession.builder().appName("SparkFromHBase").master("local[*]").config(sparkConf).getOrCreate()
    val sc = ssc.sparkContext

    val tableName = "test_yuan"
    val hBaseConf = HBaseConfiguration.create()
    hBaseConf.set(HConstants.ZOOKEEPER_QUORUM, zookeeper_quorum)
    hBaseConf.set(HConstants.ZOOKEEPER_CLIENT_PORT, zookeeper_client_port)
    hBaseConf.set(TableInputFormat.INPUT_TABLE, tableName)
    // Read the table into an RDD; TableInputFormat must come from the org.apache.hadoop.hbase.mapreduce package
    val hbaseRDD = sc.newAPIHadoopRDD(hBaseConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    val result = hbaseRDD.map(x => x._2).map { result =>
        (result.getRow,
          result.getValue(Bytes.toBytes("base"), Bytes.toBytes("name")),
          result.getValue(Bytes.toBytes("base"), Bytes.toBytes("address")),
          result.getValue(Bytes.toBytes("gps"), Bytes.toBytes("geohash")))
      }.map(row => testInsert(new String(row._1), new String(row._2), new String(row._3), new String(row._4)))
    println("row count: " + result.count())
    //result.take(10).foreach(println)

    EsSpark.saveToEs(result, "test/hbase")
  }

  case class testInsert(row_id: String,
                        name: String,
                        address: String,
                        geohash: String)

}
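By default elasticsearch-hadoop assigns its own document ids, so re-running the job duplicates documents under the test/hbase resource. If the HBase row key should become the Elasticsearch _id instead, saveToEs also accepts a per-call settings map; a small sketch, assuming the row_id field of the case class above:

// Sketch: use the row_id field as the document id so repeated runs overwrite
// existing documents instead of duplicating them
EsSpark.saveToEs(result, "test/hbase", Map("es.mapping.id" -> "row_id"))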

3. Writing data from Hive into HBase

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import scala.util.Try

object HiveToHBase {
  def main(args: Array[String]): Unit = {
    val zookeeper_quorum = "bq2.bq.cn,bq1.bq.cn"
    val zookeeper_client_port = "2181"
    val TABLE_NAME = "test_yuan"
    val sparkConf = new SparkConf().setAppName("HiveToHBase")
      .setMaster("local[*]")
    val ssc = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
    val dataFrame = ssc.sql("select mobiletelephone,customername,address,gps,geohash from graph.user_07_10 where mobiletelephone is not null limit 10")
    dataFrame.show(10)
    dataFrame.rdd.map(x => {
      val phone = Try(x(0).asInstanceOf[String]).getOrElse("0")
      val name = Try(x(1).asInstanceOf[String]).getOrElse("")
      val address = Try(x(2).asInstanceOf[String]).getOrElse("")
      val gps = Try(x(3).asInstanceOf[String]).getOrElse("")
      val geohash = Try(x(4).asInstanceOf[String]).getOrElse("")
      // Row key: the phone number
      // Each addColumn takes column family, qualifier, value
      val p = new Put(Bytes.toBytes(phone))
      p.addColumn(Bytes.toBytes("base"), Bytes.toBytes("name"), Bytes.toBytes(name))
      p.addColumn(Bytes.toBytes("base"), Bytes.toBytes("address"), Bytes.toBytes(address))
      p.addColumn(Bytes.toBytes("gps"), Bytes.toBytes("gps"), Bytes.toBytes(gps))
      p.addColumn(Bytes.toBytes("gps"), Bytes.toBytes("geohash"), Bytes.toBytes(geohash))
    }).foreachPartition(iter => {
      // Initialize the JobConf; TableOutputFormat must come from the org.apache.hadoop.hbase.mapred package!
      val jobConf = new JobConf(HBaseConfiguration.create())
      jobConf.set("hbase.zookeeper.quorum", zookeeper_quorum)
      jobConf.set("hbase.zookeeper.property.clientPort", zookeeper_client_port)
      // Route the writes through the MapReduce OutputFormat
      jobConf.setOutputFormat(classOf[TableOutputFormat])
      val table = new HTable(jobConf, TableName.valueOf(TABLE_NAME))
      import scala.collection.JavaConversions._
      table.put(seqAsJavaList(iter.toSeq))
      table.close()
    })
  }
}
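Opening an HTable per partition works for small batches, but an alternative worth noting is to hand the RDD of Puts directly to TableOutputFormat via saveAsHadoopDataset, letting Hadoop manage the table connections. The sketch below uses the same assumptions as above (HBase 1.x API, same quorum and table); putRdd is an assumed name for the RDD[Put] produced by the map over dataFrame.rdd:

// Alternative sketch: write the Puts through TableOutputFormat instead of a manual HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

val jobConf = new JobConf(HBaseConfiguration.create())
jobConf.set("hbase.zookeeper.quorum", zookeeper_quorum)
jobConf.set("hbase.zookeeper.property.clientPort", zookeeper_client_port)
jobConf.set(TableOutputFormat.OUTPUT_TABLE, TABLE_NAME)
jobConf.setOutputFormat(classOf[TableOutputFormat])

putRdd
  .map(p => (new ImmutableBytesWritable, p))   // TableOutputFormat expects (key, Put) pairs
  .saveAsHadoopDataset(jobConf)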