Scenario
A large number of parquet files are stored on HDFS, and I need to extract a few of their fields and write them into HBase.
Along the way I ran into a problem: a single batch apparently submitted too much data, so the job kept hanging with this message:
INFO AsyncRequestFutureImpl: #3, waiting for 172558 actions to finish on table:
I could not find a solution online; in the end I solved it by optimizing the code myself: instead of handing the HBase client an entire partition's worth of Puts in one call, the Put list is flushed in fixed-size chunks.
Code
Main function: HBase2HDFSLocalTest
import org.apache.spark.sql.{DataFrame, SparkSession}

object HBase2HDFSLocalTest extends Serializable {

  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\Lwb\\hadoop-3.1.1")
    // This line is required, otherwise a permission error is thrown; "hdfs" is the Hadoop user that has permission to write to HDFS
    System.setProperty("HADOOP_USER_NAME", "hdfs")

    val spark = SparkSession
      .builder
      .appName("ReadParquet")
      .master("local[*]")
      .getOrCreate()

    // HDFS HA settings for the "lwb" nameservice
    val sc = spark.sparkContext
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://lwb")
    sc.hadoopConfiguration.set("dfs.nameservices", "lwb")
    sc.hadoopConfiguration.set("dfs.ha.namenodes.lwb", "nn1,nn2")
    sc.hadoopConfiguration.set("dfs.namenode.rpc-address.lwb.nn1", "namenode1:8020")
    sc.hadoopConfiguration.set("dfs.namenode.rpc-address.lwb.nn2", "namenode2:8020")
    sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.lwb", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

    val allData = spark.read.parquet("/hdfsdata/2021/10/")
    allData.persist()

    // Only the fields that go into HBase are selected
    val rowKeyData = allData.select("id", "name")
    // println(rowKeyData.count())
    HBaseDAO.batchInsertHbase(rowKeyData)
  }
}
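One assumption in the job above is that every row has a non-null id; new Put(rowkey.getBytes()) throws a NullPointerException otherwise. If that is not guaranteed in your parquet data, a minimal safeguard (not part of the original code; safeRowKeyData is just an illustrative name) is to drop such rows before handing the DataFrame to HBaseDAO:

// Illustrative safeguard, not in the original code: keep only rows with a non-null rowkey
val safeRowKeyData = allData
  .select("id", "name")
  .na.drop(Seq("id"))
HBaseDAO.batchInsertHbase(safeRowKeyData)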
Code for inserting into HBase: HBaseDAO
import java.util

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Put, Table}
import org.apache.spark.sql.DataFrame

object HBaseDAO extends Serializable {

  val tableName = "2:TestTable"

  def batchInsertHbase(df: DataFrame): Unit = {
    df.foreachPartition(iterator => {
      var putList = new util.ArrayList[Put]
      val table: Table = ManagerHBaseConnection.getConnection().getTable(TableName.valueOf(tableName))
      iterator.foreach(message => {
        val rowkey = message.getAs[String]("id")
        val name = message.getAs[String]("name")
        val put: Put = new Put(rowkey.getBytes())
        put.addColumn("data".getBytes(), "name".getBytes(), name.getBytes())
        putList.add(put)
        // This block is the fix for the "waiting for ... actions" hang:
        // flush every 1000 Puts instead of accumulating the whole partition in one batch
        if (putList.size() > 1000) {
          table.put(putList)
          putList = new util.ArrayList[Put]
        }
      })
      // Flush whatever is left over from the last chunk
      if (putList.size() > 0) {
        table.put(putList)
      }
      table.close()
    })
    println("batch-insert:TestTable")
  }
}
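Manually flushing the Put list in chunks is what solved the hang here, but the HBase client also provides BufferedMutator, which buffers mutations on the client side and flushes them automatically when its write buffer fills. As a sketch of an alternative (the method name batchInsertWithMutator and the 4 MB buffer size are illustrative, not from the original code), the same partition loop could look like this:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{BufferedMutator, BufferedMutatorParams, Put}
import org.apache.spark.sql.DataFrame

// Alternative sketch: let BufferedMutator handle batching instead of chunking by hand
def batchInsertWithMutator(df: DataFrame): Unit = {
  df.foreachPartition(iterator => {
    val params = new BufferedMutatorParams(TableName.valueOf("2:TestTable"))
      .writeBufferSize(4 * 1024 * 1024)   // 4 MB client-side buffer; illustrative value
    val mutator: BufferedMutator = ManagerHBaseConnection.getConnection().getBufferedMutator(params)
    iterator.foreach(message => {
      val put = new Put(message.getAs[String]("id").getBytes())
      put.addColumn("data".getBytes(), "name".getBytes(), message.getAs[String]("name").getBytes())
      mutator.mutate(put)                 // buffered; flushed automatically when the buffer fills
    })
    mutator.flush()                       // push anything still buffered
    mutator.close()
  })
}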
HBase connection management
ManagerHBaseConnection
import java.io.FileInputStream
import java.util.Properties

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

object ManagerHBaseConnection extends Serializable {

  var connection: Connection = _
  var quorumString: String = ""

  // Lazily create one shared connection; synchronized so that concurrent
  // partitions (e.g. on local[*]) do not each open their own connection
  def getConnection(): Connection = synchronized {
    if (connection == null) {
      val props = new Properties()
      // When running locally, use the absolute path of the properties file here
      props.load(new FileInputStream("application.properties"))
      quorumString = props.getProperty("hbase.zookeeper.quorum")
      println("zookeeper-quorum:" + quorumString)

      val conf = HBaseConfiguration.create()
      conf.set("hbase.zookeeper.property.clientPort", "2181")
      conf.set("spark.executor.memory", "3000m")
      conf.set("hbase.zookeeper.quorum", quorumString)
      conf.set("zookeeper.znode.parent", "/hbase")
      connection = ConnectionFactory.createConnection(conf)
    }
    connection
  }
}
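For reference, the application.properties file read above only needs the quorum entry; something like the following, where the host names are placeholders for your own ZooKeeper ensemble:

# application.properties (illustrative)
hbase.zookeeper.quorum=zk1,zk2,zk3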