I. Environment
Development environment:
System: Windows 10
IDE: Scala IDE for Eclipse
Build tool: Maven 3.6.0
JDK 1.8
Scala 2.11.11
Spark 2.4.3
HBase 1.2.9
Job runtime environment:
System: Linux CentOS 7 (two machines, master and slave nodes, 2 cores)
master : 192.168.190.200
slave1 : 192.168.190.201
JDK 1.8
Scala 2.11.11
Spark 2.4.3
Hadoop 2.9.2
ZooKeeper 3.4.14
HBase 1.2.9
II. Use Case
1. Building on the earlier post "Spark Streaming之流式词频统计(Socket数据源)" (Spark Streaming word count with a socket source), read words from the socket source, count them, and write the results to an HBase table. (Note: for installing and using Netcat, see that same post.)
III. Implementation
1. pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com</groupId>
<artifactId>DStreamOutputHBase</artifactId>
<version>0.1</version>
<dependencies>
<dependency><!-- Spark core -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
<scope>provided</scope><!-- provided at runtime by the Spark cluster, so not bundled into the jar -->
</dependency>
<dependency><!-- Spark Streaming -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.3</version>
<scope>provided</scope><!-- provided at runtime by the Spark cluster, so not bundled into the jar -->
</dependency>
<dependency><!-- log4j logging -->
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency><!-- SLF4J binding for log4j -->
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.12</version>
</dependency>
<dependency><!-- HBase client -->
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.2.9</version>
</dependency>
<dependency><!-- HBase common -->
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>1.2.9</version>
</dependency>
<dependency><!-- HBase server -->
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.9</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- mixed Scala/Java compilation -->
<plugin><!-- Scala compiler plugin -->
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<id>compile</id>
<goals>
<goal>compile</goal>
</goals>
<phase>compile</phase>
</execution>
<execution>
<id>test-compile</id>
<goals>
<goal>testCompile</goal>
</goals>
<phase>test-compile</phase>
</execution>
<execution>
<phase>process-resources</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source><!-- Java source level -->
<target>1.8</target>
</configuration>
</plugin>
<!-- for fatjar -->
<plugin><!-- bundle all dependencies into a single fat jar -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4</version>
<configuration>
<descriptorRefs>
<!-- suffix of the assembled jar's file name -->
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>assemble-all</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin><!-- Maven jar plugin -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<!-- add the classpath to the manifest -->
<addClasspath>true</addClasspath>
<!-- set the program's main entry class -->
<mainClass>dstream.output.driver.WordCount</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
<repositories>
<repository>
<id>alimaven</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
</project>
2. HBase connection utility class:
1) As before, a lazy singleton is used to obtain the HBase connection;
2) The original plan was to decide when to close the connection by counting outstanding connection requests, but the Spark job never actually ran tasks concurrently, so this could not be tested;
3) A Spark Streaming job normally runs indefinitely, so closing the connection may not be necessary at all (TODO: revisit later; an alternative based on a JVM shutdown hook is sketched after the class below).
package dstream.output.hbase
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.Connection
object HbaseUtil extends Serializable {
// HBase configuration
private val conf = HBaseConfiguration.create()
conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
conf.set(HConstants.ZOOKEEPER_QUORUM, "master,slave1")
// shared HBase connection
@volatile private var connection: Connection = _
// counter of outstanding connection requests (close the connection when it reaches 0)
@volatile private var num = 0
// obtain the HBase connection
def getHBaseConn: Connection = {
synchronized {
if (connection == null || connection.isClosed() || num == 0) {
connection = ConnectionFactory.createConnection(conf)
println("conn is created! " + Thread.currentThread().getName())
}
// increment the counter for every connection request
num = num + 1
println("request conn num: " + num + " " + Thread.currentThread().getName())
}
connection
}
// close the HBase connection
def closeHbaseConn(): Unit = {
synchronized {
if (num <= 0) {
println("no conn to close!")
return
}
// decrement the counter for every close request
num = num - 1
println("request close num: " + num + " " + Thread.currentThread().getName())
// close the connection once the counter drops to 0
if (num == 0 && connection != null && !connection.isClosed()) {
connection.close()
println("conn is closed! " + Thread.currentThread().getName())
}
}
}
}
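Regarding note 3: an alternative to the request counter, sketched here as an assumption rather than taken from the original project, is to register a JVM shutdown hook inside HbaseUtil so that the shared connection is closed when the executor JVM exits (for instance when the streaming job is stopped):
// A sketch (would live inside the HbaseUtil object): close the shared
// connection when the JVM shuts down, instead of reference counting.
sys.addShutdownHook {
  synchronized {
    if (connection != null && !connection.isClosed()) {
      connection.close()
      println("conn closed by JVM shutdown hook " + Thread.currentThread().getName())
    }
  }
}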
3. Main program (entry point):
package dstream.output.driver
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import scala.collection.Iterator
import dstream.output.hbase.HbaseUtil
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
object WordCount extends App {
// Spark configuration
val conf = new SparkConf()
.setAppName("SocketWordFreq")
.setMaster("spark://master:7077")
// create the streaming context with a 2-second batch interval
val ssc = new StreamingContext(conf, Seconds(2))
// create a DStream that connects to hostname:port, e.g. master:9999
val lines = ssc.socketTextStream("master", 9999) //DS1
// split each received line into individual words
val words = lines.flatMap(_.split(" ")) //DS2
// count word occurrences within each batch
val pairs = words.map(word => (word, 1)) //DS3
// aggregate the counts per word
val wordCounts = pairs.reduceByKey(_ + _) //DS4
// after the reduce, write the results to HBase (output operation)
wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
// skip empty RDDs; otherwise every partition would still acquire a database connection for nothing
if (!rdd.isEmpty()) {
// each partition performs its own batch of writes
rdd.foreachPartition((partition: Iterator[(String, Int)]) => {
// each partition is processed by its own task thread; more partitions give better resource utilization
// the partition count can be configured with: "--conf spark.default.parallelism=20"
if (!partition.isEmpty) {
// the partition and its records are local to the worker node, so the connection does not need to be serialized and shipped
// partitions running in the same executor JVM on a worker share the connection (same memory space)
// obtain the HBase connection
val conn = HbaseUtil.getHBaseConn
if (conn == null) {
println("conn is null.") //在Worker节点的Executor中打印
} else {
println("conn is not null." + Thread.currentThread().getName())
partition.foreach((record: (String, Int)) => {
// all records of a partition are handled in the same thread
println("record : " + Thread.currentThread().getName())
// table name
val tableName = TableName.valueOf("wordfreq")
// get a handle to the table
val table = conn.getTable(tableName)
try {
// row key: the word
val put = new Put(Bytes.toBytes(record._1))
// add the column value: the word count
// parameters: column family, column qualifier, value
put.addColumn(Bytes.toBytes("statistics"),
Bytes.toBytes("cnt"),
Bytes.toBytes(record._2))
// execute the put
table.put(put)
println("insert (" + record._1 + "," + record._2 + ") into hbase success.")
} catch {
case e: Exception => e.printStackTrace()
} finally {
table.close()
}
})
// closing the HBase connection here would run at the end of every partition task, repeatedly opening and closing the connection and wasting resources
// HbaseUtil.closeHbaseConn()
}
}
})
// closing the HBase connection here would only run on the driver, so it has no effect on the executors
// HbaseUtil.closeHbaseConn()
}
})
// print the first 10 elements of each RDD generated from the DStream
wordCounts.print() // print() is an output operation; it shows the first 10 records by default
ssc.start() // start the computation
ssc.awaitTermination() // wait for the computation to terminate
}
IV. Packaging and Running
1. Open a command-line window in the project root directory (Shift + right-click in the folder and choose the PowerShell option).
Run the following command to build the project:
> mvn clean install
After a successful build, two jar files are produced under ".\target\";
DStreamOutputHBase-0.1-jar-with-dependencies.jar is the one to submit to the Spark cluster.
2. Start a socket server on the master node (port 9999).
In terminal A (e.g. Windows PowerShell), log in to the master node via ssh and run:
$ nc -lk 9999
>
3. Copy the jar to the master node and run the Spark job.
In terminal B (e.g. another Windows PowerShell session; several can be opened), log in to the master node via ssh and submit the job (the SPARK_HOME environment variable must be configured first):
$ spark-submit \
--class dstream.output.driver.WordCount \
--executor-cores 2 \
--conf spark.default.parallelism=20 \
/opt/DStreamOutputHBase-0.1-jar-with-dependencies.jar
Note 1: the trailing "\" at the end of each line is a shell line-continuation character; the command can be entered across several lines as shown, or on a single line without the backslashes.
Note 2: the submitted jar is placed under the /opt/ directory.
Note 3: --executor-cores 2 sets the number of concurrent task threads available on each worker's executor.
Note 4: spark.default.parallelism=20 increases the number of partitions (one task thread per partition).
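For reference, the same tuning can also be set programmatically on the SparkConf; a minimal sketch (note that values set in code take precedence over spark-submit --conf flags):
import org.apache.spark.SparkConf

// A sketch: equivalent tuning set directly on the SparkConf in the driver code.
// Values set here override spark-submit --conf flags.
val conf = new SparkConf()
  .setAppName("SocketWordFreq")
  .setMaster("spark://master:7077")
  .set("spark.default.parallelism", "20") // number of partitions used by reduceByKey
  .set("spark.executor.cores", "2")       // task threads per executor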
While the streaming job runs, it keeps printing a batch time header every 2 seconds:
-----------------------------
Time: 1560853450000 ms
-----------------------------
-----------------------------
Time: 1560853452000 ms
-----------------------------
-----------------------------
Time: 1560853454000 ms
-----------------------------
4. On the master node, open the HBase shell and create the table that will receive the data:
hbase(main):001:0> create 'wordfreq', 'statistics'
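If you prefer to create the table from code instead of the shell, here is a sketch using the HBase 1.x Admin API (same table and column family names as above; the object name is illustrative):
import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor, TableName}
import dstream.output.hbase.HbaseUtil

// A sketch: create the 'wordfreq' table with a 'statistics' column family,
// equivalent to the shell command: create 'wordfreq', 'statistics'
object CreateWordFreqTable extends App {
  val admin = HbaseUtil.getHBaseConn.getAdmin
  val tableName = TableName.valueOf("wordfreq")
  if (!admin.tableExists(tableName)) {
    val desc = new HTableDescriptor(tableName)
    desc.addFamily(new HColumnDescriptor("statistics"))
    admin.createTable(desc)
    println("table wordfreq created")
  }
  admin.close()
}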
5. In terminal A (the nc session), type some input data:
$ nc -lk 9999
> hello word hello words hello world
>
6. In terminal B, the word counts appear in the job output:
-----------------------------
Time: 1561040076000 ms
-----------------------------
(hello,3)
(word,1)
(words,1)
(world,1)
-----------------------------
Time: 1561040078000 ms
-----------------------------
7. Meanwhile, query the received data in the HBase shell:
1) Look up the count for the word "hello":
hbase(main):002:0> get 'wordfreq', 'hello'
COLUMN CELL
statistics:cnt timestamp=1561040009657, value=\x00\x00\x00\x03
<meaning: the count (cnt) for the word "hello" is 3>
2) Scan the whole table:
hbase(main):034:0> scan 'wordfreq'
ROW COLUMN+CELL
hello column=statistics:cnt, timestamp=1561040009657, value=\x00\x00\x00\x03
word column=statistics:cnt, timestamp=1561040009680, value=\x00\x00\x00\x01
words column=statistics:cnt, timestamp=1561040009671, value=\x00\x00\x00\x01
world column=statistics:cnt, timestamp=1561040009660, value=\x00\x00\x00\x01
<at this point, the stream data has been written to HBase successfully>
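The values appear as raw bytes such as \x00\x00\x00\x03 because the count is written with Bytes.toBytes(record._2), i.e. a 4-byte Int. Here is a small sketch of reading a value back and decoding it with the HBase client (the object name is illustrative, not from the original project):
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes
import dstream.output.hbase.HbaseUtil

// A sketch: read the row for "hello" and decode the 4-byte Int value.
object ReadWordFreq extends App {
  val table = HbaseUtil.getHBaseConn.getTable(TableName.valueOf("wordfreq"))
  try {
    val result = table.get(new Get(Bytes.toBytes("hello")))
    val raw = result.getValue(Bytes.toBytes("statistics"), Bytes.toBytes("cnt"))
    if (raw != null) println("hello -> " + Bytes.toInt(raw)) // hello -> 3
  } finally {
    table.close()
  }
}
Writing the count as a string instead (Bytes.toBytes(record._2.toString)) would make the values human-readable directly in the shell.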
Exploratory test:
When the numbers 1 2 3 ... 30 are typed into the socket server, with the job submitted with --executor-cores 2 and --conf spark.default.parallelism=20, the stdout logs below show that the tasks never actually run concurrently: they execute one after another, even though the data is indeed split across many partitions. A likely explanation (not verified here) is that the socket receiver permanently occupies one of the two executor cores, leaving a single core for the batch tasks. (A small instrumentation sketch follows the logs below.)
Logs viewed from the timing angle (for this run, a millisecond timestamp is printed in place of the thread name):
conn is created!1561024578954
request conn num: 1 1561024578955
conn is not null.1561024578955
insert (22,1) into hbase success.
request conn num: 2 1561024579415
conn is not null.1561024579415
insert (23,1) into hbase success.
request conn num: 3 1561024579434
conn is not null.1561024579434
insert (24,1) into hbase success.
request conn num: 4 1561024579454
conn is not null.1561024579454
insert (25,1) into hbase success.
request conn num: 5 1561024579467
conn is not null.1561024579467
insert (26,1) into hbase success.
request conn num: 6 1561024579477
conn is not null.1561024579477
insert (27,1) into hbase success.
request conn num: 7 1561024579487
conn is not null.1561024579487
insert (28,1) into hbase success.
request conn num: 8 1561024579496
conn is not null.1561024579496
insert (10,1) into hbase success.
insert (29,1) into hbase success.
request conn num: 9 1561024579511
conn is not null.1561024579511
insert (11,1) into hbase success.
request conn num: 10 1561024579520
conn is not null.1561024579520
insert (30,1) into hbase success.
insert (12,1) into hbase success.
insert (1,1) into hbase success.
request conn num: 11 1561024579536
conn is not null.1561024579536
insert (2,2) into hbase success.
insert (13,1) into hbase success.
request conn num: 12 1561024579548
conn is not null.1561024579548
insert (14,1) into hbase success.
insert (3,1) into hbase success.
request conn num: 13 1561024579558
conn is not null.1561024579559
insert (4,1) into hbase success.
insert (15,1) into hbase success.
request conn num: 14 1561024579574
conn is not null.1561024579574
insert (5,1) into hbase success.
insert (16,1) into hbase success.
request conn num: 15 1561024579588
conn is not null.1561024579588
insert (6,1) into hbase success.
insert (17,1) into hbase success.
request conn num: 16 1561024579601
conn is not null.1561024579601
insert (7,1) into hbase success.
insert (18,1) into hbase success.
request conn num: 17 1561024579614
conn is not null.1561024579614
insert (8,1) into hbase success.
insert (19,1) into hbase success.
request conn num: 18 1561024579626
conn is not null.1561024579626
insert (9,1) into hbase success.
request conn num: 19 1561024579634
conn is not null.1561024579634
insert (20,1) into hbase success.
request conn num: 20 1561024579643
conn is not null.1561024579643
insert (21,1) into hbase success.
Logs viewed from the partition task-thread angle:
conn is created!Executor task launch worker for task 212
request conn num: 1 Executor task launch worker for task 212
conn is not null.Executor task launch worker for task 212
record time: Executor task launch worker for task 212
insert (10,1) into hbase success.
request conn num: 2 Executor task launch worker for task 213
conn is not null.Executor task launch worker for task 213
record time: Executor task launch worker for task 213
insert (11,1) into hbase success.
request conn num: 3 Executor task launch worker for task 214
conn is not null.Executor task launch worker for task 214
record time: Executor task launch worker for task 214
insert (12,1) into hbase success.
record time: Executor task launch worker for task 214
insert (1,1) into hbase success.
request conn num: 4 Executor task launch worker for task 215
conn is not null.Executor task launch worker for task 215
record time: Executor task launch worker for task 215
insert (2,1) into hbase success.
record time: Executor task launch worker for task 215
insert (13,1) into hbase success.
request conn num: 5 Executor task launch worker for task 216
conn is not null.Executor task launch worker for task 216
record time: Executor task launch worker for task 216
insert (14,1) into hbase success.
record time: Executor task launch worker for task 216
insert (3,1) into hbase success.
request conn num: 6 Executor task launch worker for task 217
conn is not null.Executor task launch worker for task 217
record time: Executor task launch worker for task 217
insert (4,1) into hbase success.
record time: Executor task launch worker for task 217
insert (15,1) into hbase success.
request conn num: 7 Executor task launch worker for task 218
conn is not null.Executor task launch worker for task 218
record time: Executor task launch worker for task 218
insert (5,1) into hbase success.
request conn num: 8 Executor task launch worker for task 219
conn is not null.Executor task launch worker for task 219
record time: Executor task launch worker for task 219
insert (6,1) into hbase success.
request conn num: 9 Executor task launch worker for task 220
conn is not null.Executor task launch worker for task 220
record time: Executor task launch worker for task 220
insert (7,1) into hbase success.
request conn num: 10 Executor task launch worker for task 221
conn is not null.Executor task launch worker for task 221
record time: Executor task launch worker for task 221
insert (8,1) into hbase success.
request conn num: 11 Executor task launch worker for task 222
conn is not null.Executor task launch worker for task 222
record time: Executor task launch worker for task 222
insert (9,1) into hbase success.
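To dig into this further, one option (not part of the original job) is to instrument the output stage and log, for every batch, the partition count and the executor thread that handles each partition:
// A sketch: log how many partitions each batch has and which thread
// processes each partition (the per-partition line appears in executor stdout).
wordCounts.foreachRDD { rdd =>
  println("numPartitions = " + rdd.getNumPartitions) // printed on the driver
  rdd.foreachPartition { iter =>
    println("partition with " + iter.size + " record(s) on " + Thread.currentThread().getName)
  }
}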