Spark Streaming Output to HBase

I. Environment

Development environment:
    OS: Windows 10
    IDE: scala-eclipse-IDE
    Build tool: Maven 3.6.0
    JDK 1.8
    Scala 2.11.11
    Spark 2.4.3
    HBase 1.2.9

Job runtime environment:
    OS: Linux CentOS 7 (two machines: a master node and a slave node, 2 cores each)
        master : 192.168.190.200
        slave1 : 192.168.190.201
    JDK 1.8
    Scala 2.11.11
    Spark 2.4.3
    Hadoop 2.9.2
    ZooKeeper 3.4.14
    HBase 1.2.9

II. Use Case

1. Building on the earlier post "Spark Streaming之流式词频统计(Socket数据源)" (Spark Streaming word-frequency counting from a Socket source), this example reads words from a Socket data source, counts them, and writes the counts to an HBase table. (Note: that post also covers installing and using Netcat.)

III. Code Implementation

1. pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com</groupId>
  <artifactId>DStreamOutputHBase</artifactId>
  <version>0.1</version>
  <dependencies>
  	<dependency><!-- Spark core dependency -->
  		<groupId>org.apache.spark</groupId>
  		<artifactId>spark-core_2.11</artifactId>
  		<version>2.4.3</version>
  		<scope>provided</scope><!-- provided at runtime, not packaged; the Spark cluster already ships it -->
  	</dependency>
  	<dependency><!-- Spark Streaming dependency -->
  		<groupId>org.apache.spark</groupId>
  		<artifactId>spark-streaming_2.11</artifactId>
  		<version>2.4.3</version>
  		<scope>provided</scope><!-- provided at runtime, not packaged; the Spark cluster already ships it -->
  	</dependency>
  	<dependency><!-- log4j logging dependency -->
  		<groupId>log4j</groupId>
  		<artifactId>log4j</artifactId>
  		<version>1.2.17</version>
  	</dependency>
  	<dependency><!-- SLF4J-to-log4j logging binding -->
  		<groupId>org.slf4j</groupId>
  		<artifactId>slf4j-log4j12</artifactId>
  		<version>1.7.12</version>
  	</dependency>
  	<dependency><!-- HBase client dependency -->
  		<groupId>org.apache.hbase</groupId>
  		<artifactId>hbase-client</artifactId>
  		<version>1.2.9</version>
  	</dependency>
  	<dependency><!-- HBase common dependency -->
  		<groupId>org.apache.hbase</groupId>
  		<artifactId>hbase-common</artifactId>
  		<version>1.2.9</version>
  	</dependency>
  	<dependency><!-- HBase server dependency -->
  		<groupId>org.apache.hbase</groupId>
  		<artifactId>hbase-server</artifactId>
  		<version>1.2.9</version>
  	</dependency>
  </dependencies>
  <build>
  	<plugins>
  		<!-- mixed Scala/Java compilation -->
  		<plugin><!-- Scala compiler plugin -->
  			<groupId>org.scala-tools</groupId>
  			<artifactId>maven-scala-plugin</artifactId>
  			<executions>
  				<execution>
  					<id>compile</id>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  					<phase>compile</phase>
  				</execution>
  				<execution>
  					<id>test-compile</id>
  					<goals>
  						<goal>testCompile</goal>
  					</goals>
  					<phase>test-compile</phase>
  				</execution>
  				<execution>
  					<phase>process-resources</phase>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  				</execution>
  			</executions>
  		</plugin>
  		<plugin>
  			<artifactId>maven-compiler-plugin</artifactId>
  			<configuration>
  				<source>1.8</source><!-- Java source level -->
  				<target>1.8</target>
  			</configuration>
  		</plugin>
  		<!-- for fatjar -->
  		<plugin><!-- bundle all dependencies into a single jar -->
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-assembly-plugin</artifactId>
  			<version>2.4</version>
  			<configuration>
  				<descriptorRefs>
  					<!-- suffix appended to the jar file name -->
  					<descriptorRef>jar-with-dependencies</descriptorRef>
  				</descriptorRefs>
  			</configuration>
  			<executions>
  				<execution>
  					<id>assemble-all</id>
  					<phase>package</phase>
  					<goals>
  						<goal>single</goal>
  					</goals>
  				</execution>
  			</executions>
  		</plugin>
  		<plugin><!-- Maven jar plugin -->
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-jar-plugin</artifactId>
  			<configuration>
  				<archive>
  					<manifest>
  						<!-- add the classpath to the manifest -->
  						<addClasspath>true</addClasspath>
  						<!-- set the program's main class -->
  						<mainClass>dstream.output.driver.WordCount</mainClass>
  					</manifest>
  				</archive>
  			</configuration>
  		</plugin>
  	</plugins>
  </build>
  <repositories>
  	<repository>  
		<id>alimaven</id>  
		<name>aliyun maven</name>  
		<url>http://maven.aliyun.com/nexus/content/groups/public/</url>  
		<releases>  
			<enabled>true</enabled>  
		</releases>  
		<snapshots>  
			<enabled>false</enabled>  
		</snapshots>  
	</repository>
  </repositories>
</project>

2. HBase connection utility class:

1) A lazily initialized singleton is still used to obtain the HBase connection;

2) The original plan was to use a counter of connection requests to decide when to close the connection, but the Spark job never produced any real concurrency, so this could not be tested;

3) A Spark Streaming job normally never stops, so closing the connection may not be necessary at all. (TODO: revisit later; see the alternative sketch after the utility class below.)

package dstream.output.hbase

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.Connection

object HbaseUtil extends Serializable {
  // configuration
  private val conf = HBaseConfiguration.create()
  conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
  conf.set(HConstants.ZOOKEEPER_QUORUM, "master,slave1")
  // HBase connection
  @volatile private var connection: Connection = _
  // counter of outstanding connection requests (close when it reaches 0)
  @volatile private var num = 0

  // get (and lazily create) the HBase connection
  def getHBaseConn: Connection = {
    synchronized {
      if (connection == null || connection.isClosed() || num == 0) {
        connection = ConnectionFactory.createConnection(conf)
        println("conn is created! " + Thread.currentThread().getName())
      }
      // increment the counter for every connection request
      num = num + 1
      println("request conn num: " + num + " " + Thread.currentThread().getName())
    }
    connection
  }
  
  // close the HBase connection
  def closeHbaseConn(): Unit = {
    synchronized {
      if (num <= 0) {
        println("no conn to close!")
        return
      }
      // decrement the counter for every close request
      num = num - 1
      println("request close num: " + num + " " + Thread.currentThread().getName())
      // close the connection once the counter drops to 0
      if (num == 0 && connection != null && !connection.isClosed()) {
        connection.close()
        println("conn is closed! " + Thread.currentThread().getName())
      }
    }
  }
  
}
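Note 3) above leaves the close question open. One possible alternative, sketched here under my own assumptions (HbaseUtilWithShutdownHook is a hypothetical name, not part of the original project), is to drop the request counter entirely and let a JVM shutdown hook close the shared connection when the executor process exits:

package dstream.output.hbase

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.Connection

// Hypothetical variant of HbaseUtil: no request counter; the shared connection
// is closed by a shutdown hook when the executor JVM terminates.
object HbaseUtilWithShutdownHook extends Serializable {
  private val conf = HBaseConfiguration.create()
  conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
  conf.set(HConstants.ZOOKEEPER_QUORUM, "master,slave1")

  @volatile private var connection: Connection = _

  // get (and lazily create) the shared HBase connection
  def getHBaseConn: Connection = synchronized {
    if (connection == null || connection.isClosed()) {
      connection = ConnectionFactory.createConnection(conf)
      // register the hook right after the connection is (re)created
      sys.addShutdownHook {
        if (connection != null && !connection.isClosed()) connection.close()
      }
    }
    connection
  }
}

Since a Spark Streaming job usually runs until the process is killed, the hook fires exactly at the point where closing the connection finally makes sense.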

3. Program entry point:

package dstream.output.driver

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import scala.collection.Iterator
import dstream.output.hbase.HbaseUtil
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

object WordCount extends App {
    // Spark configuration
    val conf = new SparkConf()
      .setAppName("SocketWordFreq")
      .setMaster("spark://master:7077")
    // create the streaming context with a 2-second batch interval
    val ssc = new StreamingContext(conf, Seconds(2))
    // create a DStream connected to the given hostname:port, e.g. master:9999
    val lines = ssc.socketTextStream("master", 9999)  //DS1
    // split each received line into individual words
    val words = lines.flatMap(_.split(" "))  //DS2
    // count the words within each batch
    val pairs = words.map(word => (word, 1))  //DS3
    // aggregate the counts per word
    val wordCounts = pairs.reduceByKey(_ + _)  //DS4
    // after the reduce aggregation, write the results to HBase (output operation)
    wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
      // skip empty RDDs; otherwise every partition would still acquire a database connection for nothing
      if (!rdd.isEmpty()) {
        // each partition writes its records as one batch
        rdd.foreachPartition((partition: Iterator[(String, Int)]) => {
          // each partition is processed by its own task thread; more partitions means better resource utilization
          // the partition count can be set with "--conf spark.default.parallelism=20"
          if (!partition.isEmpty) {
            // partition and record both live on the local Worker node, so conn and table need not be serialized and shipped
            // if several partitions run on the same Worker, they share the connection (same JVM memory)
            // get the HBase connection
            val conn = HbaseUtil.getHBaseConn
            if (conn == null) {
              println("conn is null.")  //在Worker节点的Executor中打印
            } else {
              println("conn is not null." + Thread.currentThread().getName())
              partition.foreach((record: (String, Int)) => {
                // records within a partition are processed on the same thread
                println("record : " + Thread.currentThread().getName())
                // table name
                val tableName = TableName.valueOf("wordfreq")
                // get a Table instance from the connection
                val table = conn.getTable(tableName)
                try {
                  // row key (the word)
                  val put = new Put(Bytes.toBytes(record._1))
                  // add the column value (the word count)
                  // three arguments: column family, column qualifier, value
                  put.addColumn(Bytes.toBytes("statistics"),
                    Bytes.toBytes("cnt"),
                    Bytes.toBytes(record._2))
                  // execute the insert
                  table.put(put)
                  println("insert (" + record._1 + "," + record._2 + ") into hbase success.")
                } catch {
                  case e: Exception => e.printStackTrace()
                } finally {
                  table.close()
                }
              })
              // closing the HBase connection here would run at the end of every partition task, opening and closing connections frequently and wasting resources
//              HbaseUtil.closeHbaseConn()
            }
          }
        })
        // closing the HBase connection here would only run on the Driver node, so it has no effect
//        HbaseUtil.closeHbaseConn()
      }
    })
    // print the first 10 elements of each RDD generated from the DStream to the console
    wordCounts.print()  // print() is an output operation; it shows the first 10 records by default
    ssc.start()  // start the computation
    ssc.awaitTermination()  // wait for the computation to terminate
}
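For reference, the write loop above opens and closes a Table for every single record. A leaner per-partition variant (my own sketch, not the original code; it drops into the existing foreachRDD and reuses the same imports as WordCount) opens the table once per partition and sends all Puts in one batched Table.put(List[Put]) call:

        rdd.foreachPartition((partition: Iterator[(String, Int)]) => {
          if (!partition.isEmpty) {
            val conn = HbaseUtil.getHBaseConn
            // one Table per partition instead of one per record
            val table = conn.getTable(TableName.valueOf("wordfreq"))
            try {
              val puts = new java.util.ArrayList[Put]()
              partition.foreach { case (word, cnt) =>
                val put = new Put(Bytes.toBytes(word))
                put.addColumn(Bytes.toBytes("statistics"), Bytes.toBytes("cnt"), Bytes.toBytes(cnt))
                puts.add(put)
              }
              // one batched RPC for the whole partition
              table.put(puts)
            } finally {
              table.close()
            }
          }
        })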

IV. Packaging and Running

1. Open a command-line window in the project root directory (Shift + right-click inside the folder and choose the PowerShell window)
  Run the following command to build the code:
    > mvn clean install
    After a successful build, two jar files are produced under ".\target\";
    DStreamOutputHBase-0.1-jar-with-dependencies.jar is the one to submit to the Spark cluster

In terminal A (e.g. Windows PowerShell), log in to the master node via ssh and run:
2. Start the Socket server on the master node (port 9999)
    $ nc -lk 9999
    >

In terminal B (e.g. Windows PowerShell; several sessions can be connected), log in to the master node via ssh and run:
3. Copy the jar to the master node and submit the Spark job:
  Submit the Spark job (the SPARK_HOME environment variable must be configured first):
    $ spark-submit \
      --class dstream.output.driver.WordCount \
      --executor-cores 2 \
      --conf spark.default.parallelism=20 \
      /opt/DStreamOutputHBase-0.1-jar-with-dependencies.jar
    Note 1: the trailing "\" on each line is a shell line continuation; the command is split across lines here only for readability
    Note 2: the submitted jar is placed under the /opt/ directory
    Note 3: --executor-cores 2 increases the number of concurrent task threads on each Worker node
    Note 4: spark.default.parallelism=20 increases the number of partitions (one task thread per partition)
    
    Output while the streaming job runs (the batch time header is printed every 2 s)
    -----------------------------
    Time: 1560853450000 ms
    -----------------------------

    -----------------------------
    Time: 1560853452000 ms
    -----------------------------

    -----------------------------
    Time: 1560853454000 ms
    -----------------------------

4. On the master node, open the HBase shell and create the table that will receive the data
    hbase(main):001:0> create 'wordfreq', 'statistics'

5. In terminal A, type the input data
    $ nc -lk 9999
    > hello word hello words hello world
    >

6. The word-count output appears in terminal B
    -----------------------------
    Time: 1561040076000 ms
    -----------------------------
    (hello,3)
    (word,1)
    (words,1)
    (world,1)
    -----------------------------
    Time: 1561040078000 ms
    -----------------------------

7. Meanwhile, check the received data in the HBase shell
    1) Query the count for the word hello
    hbase(main):002:0> get 'wordfreq', 'hello'
    COLUMN                          CELL
    statistics:cnt                 timestamp=1561040009657, value=\x00\x00\x00\x03
    <meaning: the cnt value for hello is 3; it displays as \x00\x00\x00\x03 because Bytes.toBytes(Int) stores the count as a 4-byte big-endian integer>
    
    2) Scan the whole table
    hbase(main):034:0> scan 'wordfreq'
    ROW         COLUMN+CELL
    hello       column=statistics:cnt, timestamp=1561040009657, value=\x00\x00\x00\x03
    word        column=statistics:cnt, timestamp=1561040009680, value=\x00\x00\x00\x01
    words       column=statistics:cnt, timestamp=1561040009671, value=\x00\x00\x00\x01
    world       column=statistics:cnt, timestamp=1561040009660, value=\x00\x00\x00\x01

    <At this point the stream data has been written to HBase successfully>
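If a human-readable value is preferred in the shell output, one small tweak (my own variation, not the original code) is to store the count as a string rather than a 4-byte integer, at the cost of having to parse it back when reading:

                  // store the count as text, e.g. "3", so the shell shows value=3
                  put.addColumn(Bytes.toBytes("statistics"),
                    Bytes.toBytes("cnt"),
                    Bytes.toBytes(record._2.toString))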

Experimental test:

When the number sequence 1 2 3 ... 30 was typed on the Socket server side, with the job launched with --executor-cores 2 and --conf spark.default.parallelism=20, the stdout logs showed the following: no tasks actually ran concurrently; everything executed in sequence, although plenty of partitions were indeed created (a small check for the partition count is sketched below).
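A quick way to confirm how many partitions each batch actually has (a hypothetical check, not part of the original job) is to log rdd.getNumPartitions inside the existing foreachRDD before writing to HBase:

      // getNumPartitions is an upper bound on how many task threads this batch can use
      println("batch " + time + " partitions: " + rdd.getNumPartitions)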

Logs viewed from the time perspective:
conn is created!1561024578954
request conn num: 1 1561024578955
conn is not null.1561024578955
insert (22,1) into hbase success.
request conn num: 2 1561024579415
conn is not null.1561024579415
insert (23,1) into hbase success.
request conn num: 3 1561024579434
conn is not null.1561024579434
insert (24,1) into hbase success.
request conn num: 4 1561024579454
conn is not null.1561024579454
insert (25,1) into hbase success.
request conn num: 5 1561024579467
conn is not null.1561024579467
insert (26,1) into hbase success.
request conn num: 6 1561024579477
conn is not null.1561024579477
insert (27,1) into hbase success.
request conn num: 7 1561024579487
conn is not null.1561024579487
insert (28,1) into hbase success.
request conn num: 8 1561024579496
conn is not null.1561024579496
insert (10,1) into hbase success.
insert (29,1) into hbase success.
request conn num: 9 1561024579511
conn is not null.1561024579511
insert (11,1) into hbase success.
request conn num: 10 1561024579520
conn is not null.1561024579520
insert (30,1) into hbase success.
insert (12,1) into hbase success.
insert (1,1) into hbase success.
request conn num: 11 1561024579536
conn is not null.1561024579536
insert (2,2) into hbase success.
insert (13,1) into hbase success.
request conn num: 12 1561024579548
conn is not null.1561024579548
insert (14,1) into hbase success.
insert (3,1) into hbase success.
request conn num: 13 1561024579558
conn is not null.1561024579559
insert (4,1) into hbase success.
insert (15,1) into hbase success.
request conn num: 14 1561024579574
conn is not null.1561024579574
insert (5,1) into hbase success.
insert (16,1) into hbase success.
request conn num: 15 1561024579588
conn is not null.1561024579588
insert (6,1) into hbase success.
insert (17,1) into hbase success.
request conn num: 16 1561024579601
conn is not null.1561024579601
insert (7,1) into hbase success.
insert (18,1) into hbase success.
request conn num: 17 1561024579614
conn is not null.1561024579614
insert (8,1) into hbase success.
insert (19,1) into hbase success.
request conn num: 18 1561024579626
conn is not null.1561024579626
insert (9,1) into hbase success.
request conn num: 19 1561024579634
conn is not null.1561024579634
insert (20,1) into hbase success.
request conn num: 20 1561024579643
conn is not null.1561024579643
insert (21,1) into hbase success.
Logs viewed from the partition task-thread perspective:
conn is created!Executor task launch worker for task 212
request conn num: 1 Executor task launch worker for task 212
conn is not null.Executor task launch worker for task 212
record time: Executor task launch worker for task 212
insert (10,1) into hbase success.
request conn num: 2 Executor task launch worker for task 213
conn is not null.Executor task launch worker for task 213
record time: Executor task launch worker for task 213
insert (11,1) into hbase success.
request conn num: 3 Executor task launch worker for task 214
conn is not null.Executor task launch worker for task 214
record time: Executor task launch worker for task 214
insert (12,1) into hbase success.
record time: Executor task launch worker for task 214
insert (1,1) into hbase success.
request conn num: 4 Executor task launch worker for task 215
conn is not null.Executor task launch worker for task 215
record time: Executor task launch worker for task 215
insert (2,1) into hbase success.
record time: Executor task launch worker for task 215
insert (13,1) into hbase success.
request conn num: 5 Executor task launch worker for task 216
conn is not null.Executor task launch worker for task 216
record time: Executor task launch worker for task 216
insert (14,1) into hbase success.
record time: Executor task launch worker for task 216
insert (3,1) into hbase success.
request conn num: 6 Executor task launch worker for task 217
conn is not null.Executor task launch worker for task 217
record time: Executor task launch worker for task 217
insert (4,1) into hbase success.
record time: Executor task launch worker for task 217
insert (15,1) into hbase success.
request conn num: 7 Executor task launch worker for task 218
conn is not null.Executor task launch worker for task 218
record time: Executor task launch worker for task 218
insert (5,1) into hbase success.
request conn num: 8 Executor task launch worker for task 219
conn is not null.Executor task launch worker for task 219
record time: Executor task launch worker for task 219
insert (6,1) into hbase success.
request conn num: 9 Executor task launch worker for task 220
conn is not null.Executor task launch worker for task 220
record time: Executor task launch worker for task 220
insert (7,1) into hbase success.
request conn num: 10 Executor task launch worker for task 221
conn is not null.Executor task launch worker for task 221
record time: Executor task launch worker for task 221
insert (8,1) into hbase success.
request conn num: 11 Executor task launch worker for task 222
conn is not null.Executor task launch worker for task 222
record time: Executor task launch worker for task 222
insert (9,1) into hbase success.

 

V. References

1. 《Spark Streaming 实时流式大数据处理实战》(Spark Streaming real-time stream processing in practice)

2. 《HBase 实战》(HBase in Action)

3. "Spark Streaming中,增大任务并发度的方法有哪些?" (ways to increase task parallelism in Spark Streaming)

4. Spark official configuration guide: configuration.html
