Spark Streaming Hands-On Summary

This article walks through getting started with Spark Streaming in Scala: integrating Scala with Maven, creating a StreamingContext, creating an InputDStream connected to Kafka plus a SparkSession connected to MongoDB, and counting the words consumed from Kafka and saving the counts to MongoDB.

1. Getting started with Scala

To make good use of the Spark ecosystem you need to learn Scala, and the first step in learning any language is writing a hello world program. Link: Scala 菜鸟教程 (the Runoob Scala tutorial).
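
A minimal sketch of that first program, using the standard object-with-main entry point (the object and file names are arbitrary):

object HelloWorld {
  def main(args: Array[String]): Unit = {
    // Print a greeting to standard output to confirm the toolchain works
    println("Hello World!")
  }
}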

2. Integrating Scala with Maven

1) Download an IDE that supports Scala and Maven (e.g. Scala IDE for Eclipse)

2) Create a Maven project

3) Choose a suitable Scala version and define the scala.version property in the pom file

4) Define the scala.tools.version property in the pom file (it must match scala.version)

5) Look up suitable versions of the following libraries in the Maven central repository; each must match scala.tools.version, and the libraries must also be mutually compatible with one another

(The Compile Dependencies table on each artifact's detail page in the Maven central repository shows which library versions go together.)

spark-core, spark-sql, spark-streaming, kafka, spark-streaming-kafka, mongo-spark-connector

Example pom file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <modelVersion>4.0.0</modelVersion>

  <groupId>cn.com.flaginfo.demo</groupId>

  <artifactId>scala-test</artifactId>

  <version>0.0.1-SNAPSHOT</version>

  <name>${project.artifactId}</name>

  <description>My wonderful scala app</description>

  <inceptionYear>2010</inceptionYear>

  <licenses>

    <license>

      <name>My License</name>

      <url>http://....</url>

      <distribution>repo</distribution>

    </license>

  </licenses>

 

  <properties>

    <maven.compiler.source>1.6</maven.compiler.source>

    <maven.compiler.target>1.6</maven.compiler.target>

    <encoding>UTF-8</encoding>

    <scala.tools.version>2.11</scala.tools.version>

    <scala.version>2.11.0</scala.version>

  </properties>

 

  <dependencies>

    <dependency>

      <groupId>org.scala-lang</groupId>

      <artifactId>scala-library</artifactId>

      <version>${scala.version}</version>

    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->

    <dependency>

      <groupId>org.apache.spark</groupId>

      <artifactId>spark-core_${scala.tools.version}</artifactId>

      <version>2.2.2</version>

    </dependency>

    <dependency>

      <groupId>org.apache.spark</groupId>

      <artifactId>spark-sql_${scala.tools.version}</artifactId>

      <version>2.2.2</version>

    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->

    <dependency>

      <groupId>org.apache.spark</groupId>

      <artifactId>spark-streaming_${scala.tools.version}</artifactId>

      <version>2.2.2</version>

    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->

    <dependency>

      <groupId>org.apache.kafka</groupId>

      <artifactId>kafka_${scala.tools.version}</artifactId>

      <version>0.8.2.1</version>

    </dependency>

    <dependency>

      <groupId>org.apache.spark</groupId>

      <artifactId>spark-streaming-kafka_${scala.tools.version}</artifactId>

      <version>1.6.3</version>

    </dependency>

    <!-- https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector -->

    <dependency>

      <groupId>org.mongodb.spark</groupId>

      <artifactId>mongo-spark-connector_${scala.tools.version}</artifactId>

      <version>2.2.4</version>

    </dependency>

    <dependency>

      <groupId>org.projectlombok</groupId>

      <artifactId>lombok</artifactId>

      <optional>true</optional>

      <version>1.18.4</version>

      <scope>provided</scope>

    </dependency>

 

     

    <!-- Test -->

    <dependency>

      <groupId>junit</groupId>

      <artifactId>junit</artifactId>

      <version>4.11</version>

      <scope>test</scope>

    </dependency>

  </dependencies>

 

  <build>

    <sourceDirectory>src/main/scala</sourceDirectory>

    <testSourceDirectory>src/test/scala</testSourceDirectory>

    <plugins>

      <plugin>

        <!-- see http://davidb.github.com/scala-maven-plugin -->

        <groupId>net.alchim31.maven</groupId>

        <artifactId>scala-maven-plugin</artifactId>

        <version>3.1.3</version>

        <executions>

          <execution>

            <goals>

              <goal>compile</goal>

              <goal>testCompile</goal>

            </goals>

            <configuration>

              <args>

                <arg>-dependencyfile</arg>

                <arg>${project.build.directory}/.scala_dependencies</arg>

              </args>

            </configuration>

          </execution>

        </executions>

      </plugin>

      <plugin>

        <groupId>org.apache.maven.plugins</groupId>

        <artifactId>maven-surefire-plugin</artifactId>

        <version>2.13</version>

        <configuration>

          <useFile>false</useFile>

          <disableXmlReport>true</disableXmlReport>

          <!-- If you have classpath issue like NoDefClassError,... -->

          <!-- useManifestOnlyJar>false</useManifestOnlyJar -->

          <includes>

            <include>**/*Test.*</include>

            <include>**/*Suite.*</include>

          </includes>

        </configuration>

      </plugin>

    </plugins>

  </build>

</project>

3. Creating the StreamingContext

val conf = new SparkConf().setMaster("local[2]").setAppName("MyTestScalaApp")

val ssc = new StreamingContext(conf, Seconds(1))

4. Creating a SparkSession for connecting to MongoDB

Note that this must be done after the StreamingContext has been created.

val sparkSession = SparkSession.builder()

      .config("spark.mongodb.input.uri""mongodb://127.0.0.1/test.myCollection")

      .config("spark.mongodb.output.uri""mongodb://127.0.0.1/test.myCollection")

      .getOrCreate()

5. Creating an InputDStream connected to Kafka

var topic = Set("sparkTestTopic")

var groupId = "spark-test-consumer"

val kafkaParam = Map[String, String](

  "bootstrap.servers" -> "localhost:9092,localhost:9093,localhost:9094",  // 本机搭建的kafka集群

  "group.id" -> groupId,

  "auto.offset.reset" -> "largest",

  "enable.auto.commit" -> "true"

  )

var kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParam, topic)

6. Counting the words consumed from Kafka

val lines = kafkaStream.map(_._2)   // _1 is the topic name, _2 is the message body; in this example the body is a space-separated list of words, e.g. "apple desk game"

// Split each line into words

val words = lines.flatMap(_.split(" "))

// Count each word in each batch

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)
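
To quickly verify the pipeline before wiring in MongoDB, you can also print each batch's counts to the console; DStream.print() shows up to the first ten elements of every batch:

// Print up to 10 (word, count) pairs per batch to the driver's stdout
wordCounts.print()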

7. Saving the counts to MongoDB

val writeConfig = WriteConfig(Map("collection" -> "sparkTest", "writeConcern.w" -> "majority"), Some(WriteConfig(sparkSession)))

wordCounts.foreachRDD { rdd =>

  var formattedRdd = rdd.map(row => {   // each element of the RDD is a tuple such as ("apple", 3); convert it into a MongoDB Document

    var doc = new Document()

    doc.put("name", row._1)

    doc.put("value", row._2)

    (doc)

  })

  MongoSpark.save(formattedRdd, writeConfig)

}
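
To confirm that the documents were actually written, you can read the collection back through the connector, for example after the streaming job has stopped. This is an extra sketch rather than part of the original walkthrough; it reuses the sparkSession and the "sparkTest" collection from above:

// Read the "sparkTest" collection back as a DataFrame and display it
val readConfig = ReadConfig(Map("collection" -> "sparkTest"), Some(ReadConfig(sparkSession)))
val savedCounts = MongoSpark.load(sparkSession, readConfig)
savedCounts.show()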

8. Complete hello-world example code

package xxx.demo

 

import kafka.serializer.{StringDecoder, DefaultDecoder}

import org.apache.spark._

import org.apache.spark.streaming._

import org.apache.spark.streaming.StreamingContext._

import org.apache.spark.streaming.kafka._

import org.apache.spark.sql.SparkSession

import org.bson.Document

import com.mongodb.spark.config._

import com.mongodb.spark._

import com.mongodb._

 

/**

 * @author ${user.name}

 */

object App {

   

  def foo(x : Array[String]) = x.foldLeft("")((a,b) => a + b)

   

  def main(args : Array[String]) {

    println( "Hello World!" )

    println("concat arguments = " + foo(args))

    // Create a local StreamingContext with two working thread and batch interval of 1 second.

    // The master requires 2 cores to prevent from a starvation scenario.

    val conf = new SparkConf().setMaster("local[2]").setAppName("MyTestScalaApp")

    val ssc = new StreamingContext(conf, Seconds(1))

    val sparkSession = SparkSession.builder()

      .config("spark.mongodb.input.uri""mongodb://127.0.0.1/test.myCollection")

      .config("spark.mongodb.output.uri""mongodb://127.0.0.1/test.myCollection")

      .getOrCreate()

    // kafka configuration

    var topic = Set("sparkTestTopic")

    var groupId = "spark-test-consumer"

    val kafkaParam = Map[String, String](

      "bootstrap.servers" -> "localhost:9092,localhost:9093,localhost:9094",

      "group.id" -> groupId,

      "auto.offset.reset" -> "largest",

      "enable.auto.commit" -> "true"

      )

    var kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParam, topic)

    // _._1 : topic name, _._2 : message body

    val lines = kafkaStream.map(_._2)

    // Split each line into words

    val words = lines.flatMap(_.split(" "))

    // Count each word in each batch

    val pairs = words.map(word => (word, 1))

    val wordCounts = pairs.reduceByKey(_ + _)

    // write to mongodb

    val writeConfig = WriteConfig(Map("collection" -> "sparkTest", "writeConcern.w" -> "majority"), Some(WriteConfig(sparkSession)))

    wordCounts.foreachRDD { rdd =>

      var formattedRdd = rdd.map(row => {

        var doc = new Document()

        doc.put("name", row._1)

        doc.put("value", row._2)

        (doc)

      })

      MongoSpark.save(formattedRdd, writeConfig)

      rdd.foreach { record =>

        println( record )

      }

    }

    ssc.start()              // Start the computation

    ssc.awaitTerminationOrTimeout( 1000000 )

  }

}
