This example demonstrates how to use Spark Streaming to process, as a continuous stream, data received from a TCP socket.
1. Create a Maven project and add the required dependency jars. The key sections of pom.xml are shown below (the enclosing <project> element is omitted):
<properties>
<scala.version>2.11.8</scala.version>
</properties>
<repositories>
<repository>
<id>repos</id>
<name>Repository</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</repository>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>repos</id>
<name>Repository</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</pluginRepository>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<!-- Spark Core dependency -->
<dependency> <!-- The Scala binary version of the Spark artifacts (the _2.11 suffix) must match
the scala-library version declared below; otherwise you get errors such as
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object; -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
<scope>provided</scope><!-- provided by the Spark cluster at runtime; not bundled into the jar -->
</dependency>
<!-- Spark SQL dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
<scope>provided</scope><!-- provided by the Spark cluster at runtime; not bundled into the jar -->
</dependency>
<!-- Spark Streaming dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.0</version>
<scope>provided</scope><!-- provided by the Spark cluster at runtime; not bundled into the jar -->
</dependency>
<!-- must match the Scala binary version (2.11) of the Spark artifacts above -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<!-- mixed Scala/Java compilation -->
<plugin><!-- Scala compiler plugin -->
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<goals>
<goal>compile</goal>
</goals>
<configuration>
<includes>
<include>**/*.scala</include>
</includes>
</configuration>
</execution>
<execution>
<id>scala-test-compile</id>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source><!-- Java source level -->
<target>1.8</target>
</configuration>
</plugin>
<plugin><!-- bundle all dependencies into a single jar -->
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<appendAssemblyId>false</appendAssemblyId>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef><!-- this id would be appended to the jar name, but appendAssemblyId is false above -->
</descriptorRefs>
<archive>
<manifest>
<mainClass>org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordsFrep</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin><!-- Maven jar plugin -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.4</version>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath><!-- add a Class-Path entry to the manifest -->
<!-- set the program's entry class -->
<mainClass>org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordsFrep</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
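As the comment on the Spark Core dependency warns, a java.lang.NoSuchMethodError on scala.Predef almost always means the Scala binary version of the Spark artifacts (the _2.11 suffix) does not match the scala-library on the classpath. A minimal sanity check, runnable in spark-shell or temporarily inside main(), is to print the Scala version actually loaded:

// Prints the Scala version on the classpath; it should report 2.11.x to match spark-*_2.11.
println(scala.util.Properties.versionString) // e.g. "version 2.11.8"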
2. The Scala code is as follows:
package org.jy.data.yh.bigdata.drools.scala.sparkstreaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * Word-frequency counting over a data stream with Spark Streaming
 */
object SparkStreamingWordsFrep {
def main(args: Array[String]): Unit = {
// Spark configuration
val sparkConf = new SparkConf()
.setAppName("SparkStreamingWordsFrep")
.setMaster("spark://centoshadoop1:7077,centoshadoop2:7077")
// Create the streaming context with a 2-second batch interval
val sparkStreamContext = new StreamingContext(sparkConf, Seconds(2))
// Create a DStream that connects to the given hostname:port, e.g. localhost:9999
val lines = sparkStreamContext.socketTextStream("centoshadoop1", 9999)
// Split each received line into words
val words = lines.flatMap(line => line.split(" "))
// Map each word to a (word, 1) pair for per-batch counting
val pairs = words.map(word => (word, 1))
// Aggregate the counts: values of pairs sharing the same key are summed
val wordCounts = pairs.reduceByKey(_ + _)
// Print up to the first 10,000 elements of each RDD generated from the DStream
wordCounts.print(10000)
sparkStreamContext.start() // Start the computation
sparkStreamContext.awaitTermination() // Wait for the computation to terminate
}
}
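For a quick local test without a cluster, the same logic can run with a local master; a socket receiver occupies one thread, so the master must be local[n] with n >= 2. A minimal sketch, assuming nc is listening on localhost:9999 (the object name SparkStreamingWordsFrepLocal is hypothetical):

package org.jy.data.yh.bigdata.drools.scala.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingWordsFrepLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkStreamingWordsFrepLocal")
      .setMaster("local[2]") // at least two threads: one for the receiver, one for processing
    val ssc = new StreamingContext(conf, Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}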
3. Install the nmap-ncat (nc) utility on the Linux system:
yum install nc
4. Package the project (e.g. with mvn clean package) and submit it to the Spark cluster with the following command:
bin/spark-submit \
--class org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordsFrep \
--num-executors 4 \
--driver-memory 2G \
--executor-memory 1g \
--executor-cores 1 \
--conf spark.default.parallelism=1000 \
/home/hadoop/tools/SSO-Scala-SparkStreaming-1.0-SNAPSHOT.jar
5. Open two Linux terminals; in one of them, start a listener with nc and type the text to be counted:
[hadoop@centoshadoop1 ~]$ nc -lk 9999
Spark streaming is an extension of the core Spark API
The output looks like the following:
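For the input line above, each 2-second batch prints output of roughly the following shape (the timestamp is illustrative and the order of the pairs may vary):

-------------------------------------------
Time: 1569480000000 ms
-------------------------------------------
(Spark,2)
(streaming,1)
(is,1)
(an,1)
(extension,1)
(of,1)
(the,1)
(core,1)
(API,1)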