Spark Streaming Summary
Official documentation
http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html
Overview
Spark Streaming is similar to Apache Storm and is used to process streaming data. According to the official documentation, Spark Streaming provides high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, it can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be written to many destinations, such as HDFS, Redis, or HBase. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.
Spark Streaming architecture diagram
As the diagram shows, Spark Streaming is responsible only for the data-processing step in the middle of the pipeline.
It also shows that the components of the Spark ecosystem support one another and can be combined during computation, which greatly improves development efficiency and capability.
What is a DStream
A Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data, either the input stream or the result stream produced by applying Spark primitives. Internally, a DStream is represented as a sequence of consecutive RDDs, each containing the data of one time interval, as shown below:
Operations on the data are performed at the granularity of an RDD.
The computation itself is carried out by the Spark engine.
1-1)、DStream operations
The primitives on a DStream are similar to those on an RDD and fall into two groups: Transformations and Output Operations. Among the transformations there are also some special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.
1-2)、Transformations on DStreams
Transformation | Meaning |
reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel. |
countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
1-3)、Special Transformations
- UpdateStateByKey Operation
The updateStateByKey primitive maintains state across batches; the stateful Word Count example later in this document relies on it. Without updateStateByKey, each batch is analyzed, its result is output, and nothing is kept for the next batch.
- Transform Operation
The transform primitive lets you apply an arbitrary RDD-to-RDD function to every RDD of a DStream, which makes it easy to go beyond what the DStream API offers. It is also the hook through which MLlib (machine learning) and GraphX are combined with Spark Streaming; see the sketch below.
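Below is a minimal, self-contained sketch of transform(); the host name, port, and blacklist contents are illustrative assumptions, not part of the original example.
// Sketch of transform(): filter each batch against a static blacklist RDD.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object TransformExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TransformExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Static RDD of blacklisted words, joined against every batch
    val blacklist = ssc.sparkContext.parallelize(Seq(("spam", true)))
    val words = ssc.socketTextStream("hadoop2", 8888).flatMap(_.split(" ")).map((_, 1))
    val cleaned = words.transform { rdd =>
      rdd.leftOuterJoin(blacklist)                       // (word, (1, Option[Boolean]))
        .filter { case (_, (_, flag)) => flag.isEmpty }  // keep words that are not blacklisted
        .map { case (word, (count, _)) => (word, count) }
    }
    cleaned.reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}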
- Window Operations
Window operations are somewhat similar to state in Storm: by setting the window length and the slide interval you can continuously obtain the state of the stream over the most recent window.
reduceByKeyAndWindow(_ + _, _ - _, Seconds(6), Seconds(10)) passes two functions; the second is the inverse of the first, which is a performance optimization: the previous window's result is reused and updated incrementally for the next computation instead of being recomputed from scratch, improving efficiency and speed. A sketch of this form follows.
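A sketch of the two-function form is shown below, assuming a socket source; the host, port, window length, and slide interval are illustrative. This variant requires a checkpoint directory, and both the window length and the slide interval must be multiples of the batch interval.
// Incremental window aggregation: add counts entering the window and subtract counts leaving it.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object IncrementalWindowExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("IncrementalWindowExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("E://ck-window")                      // hypothetical checkpoint directory
    val wordAndOne = ssc.socketTextStream("hadoop2", 8888).flatMap(_.split(" ")).map((_, 1))
    val windowedCounts = wordAndOne.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,                         // add counts entering the window
      (a: Int, b: Int) => a - b,                         // subtract counts leaving the window
      Seconds(30),                                       // window length (illustrative)
      Seconds(10))                                       // slide interval (illustrative)
    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}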
Output Operations on DStreams
Output operations write the data of a DStream out to an external database or file system. Only when an output operation is invoked (analogous to an action on an RDD) does the streaming program actually start computing; a sketch of the typical foreachRDD pattern follows the table.
Output Operation | Meaning |
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. |
saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
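The fragment below sketches the recommended foreachRDD pattern, assuming a result DStream like the ones built later in this document; the Connection class is a hypothetical stand-in for a real client (JDBC, Jedis, etc.), created once per partition on the executors rather than on the driver.
// Hypothetical placeholder for a real external-store client.
class Connection {
  def send(record: String): Unit = println(record)
  def close(): Unit = ()
}
result.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val conn = new Connection()                          // created on the executor, once per partition
    partition.foreach(record => conn.send(record.toString))
    conn.close()
  }
}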
Real-time WordCount with Spark Streaming
1-1)、Diagram
1-2)、Install nc
[root@hadoop1 ~]# yum install -y nc
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.neusoft.edu.cn
* extras: mirrors.neusoft.edu.cn
* updates: mirrors.nwsuaf.edu.cn
1-3)、Common options
[root@hadoop1 ~]# nc -help
Ncat 6.40 ( http://nmap.org/ncat )
Usage: ncat [options] [hostname] [port]
Options taking a time assume seconds. Append 'ms' for milliseconds,
's' for seconds, 'm' for minutes, or 'h' for hours (e.g. 500ms).
-4 Use IPv4 only
-6 Use IPv6 only
-U, --unixsock Use Unix domain sockets only
-C, --crlf Use CRLF for EOL sequence
-c, --sh-exec <command> Executes the given command via /bin/sh
-e, --exec <command> Executes the given command
--lua-exec <filename> Executes the given Lua script
-g hop1[,hop2,...] Loose source routing hop points (8 max)
-G <n> Loose source routing hop pointer (4, 8, 12, ...)
-m, --max-conns <n> Maximum <n> simultaneous connections
-h, --help Display this help screen
-d, --delay <time> Wait between read/writes
-o, --output <filename> Dump session data to a file
-x, --hex-dump <filename> Dump session data as hex to a file
-i, --idle-timeout <time> Idle read/write timeout
-p, --source-port port Specify source port to use
-s, --source addr Specify source address to use (doesn't affect -l)
-l, --listen Bind and listen for incoming connections
-k, --keep-open Accept multiple connections in listen mode
-n, --nodns Do not resolve hostnames via DNS
-t, --telnet Answer Telnet negotiations
-u, --udp Use UDP instead of default TCP
--sctp Use SCTP instead of default TCP
-v, --verbose Set verbosity level (can be used several times)
-w, --wait <time> Connect timeout
--append-output Append rather than clobber specified output files
--send-only Only send data, ignoring received; quit on EOF
--recv-only Only receive data, never send anything
--allow Allow only given hosts to connect to Ncat
--allowfile A file of hosts allowed to connect to Ncat
--deny Deny given hosts from connecting to Ncat
--denyfile A file of hosts denied from connecting to Ncat
--broker Enable Ncat's connection brokering mode
--chat Start a simple Ncat chat server
--proxy <addr[:port]> Specify address of host to proxy through
--proxy-type <type> Specify proxy type ("http" or "socks4")
--proxy-auth <auth> Authenticate with HTTP or SOCKS proxy server
--ssl Connect or listen with SSL
--ssl-cert Specify SSL certificate file (PEM) for listening
--ssl-key Specify SSL private key (PEM) for listening
--ssl-verify Verify trust and domain name of certificates
--ssl-trustfile PEM file containing trusted SSL certificates
--version Display Ncat's version information and exit
See the ncat(1) manpage for full options, descriptions and usage examples
1-4)、Start nc
[root@hadoop1 ~]# nc -lk 8888
dfhf
fbfr
ere
Gfr
1-5)、Code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Created by Administrator on 2016/11/6.
*/
object SparkstringTest {
def main(args: Array[String]) {
// Point to the Hadoop winutils installation when running on Windows
System.setProperty("hadoop.home.dir",
"E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4");
// Initialize the Spark configuration
val conf = new SparkConf().setAppName("SparkstringTest").setMaster("local[2]")
// Create the streaming context with a 5-second batch interval
val ssc = new StreamingContext(conf, Seconds(5))
// Connect to the TCP source
val textStream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop2", 8888)
// Split each line into words
val map: DStream[String] = textStream.flatMap(_.split(" "))
// Map each word to a (word, 1) pair
val map1: DStream[(String, Int)] = map.map((_, 1))
// Sum the counts for each word within the batch
val key: DStream[(String, Int)] = map1.reduceByKey(_ + _)
// Print the results
key.print()
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
ssc.awaitTermination()
}
}
key.print() prints the first 10 elements of each batch by default.
1-6)、Output
*************
-------------------------------------------
Time: 1478415525000 ms
-------------------------------------------
(dff,1)
(a,2)
(dfed,1)
************
reduceByKey only aggregates the data within the current batch; it does not accumulate counts across batches.
Reading data from a TCP port and accumulating counts across batches
Prepare the JARs
Required dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.1</version>
</dependency>
Diagram
Implementation with updateStateByKey
package streams
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Created by Admin on 2016/8/26.
*/
object StateFulStreamingWordCount {
val updateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
//it.map(t => (t._1, t._2.sum + t._3.getOrElse(0)))
it.map { case (x, y, z) => (x, y.sum + z.getOrElse(0)) }
}
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir",
"E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4")
LoggerLevels.setStreamingLogLevels()
val conf = new SparkConf().setAppName("StateFulStreamingWordCount").setMaster("local[2]")
// Create the StreamingContext with a 5-second batch interval
val ssc = new StreamingContext(conf, Seconds(5))
// Set the checkpoint directory (required for updateStateByKey)
ssc.checkpoint("E://ck0826")
// Create a DStream from the TCP socket
val lines: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop2", 8888)
val words: DStream[String] = lines.flatMap(_.split(" "))
val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
val result: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
// Print the results (first 10 elements of each batch by default)
result.print()
// Start the computation
ssc.start()
// Wait for termination
ssc.awaitTermination()
ssc.awaitTermination()
}
}
[root@hadoop2 sbin]# nc -lk 8888
a b c d e f
a b c d e f
djf ffgrg rghr rigrg righrg
a b c d e f
d d d d d d d
*********************
-------------------------------------------
Time: 1478421805000 ms
-------------------------------------------
(d,10)
(ffgrg,1)
(b,3)
(,1)
(f,3)
(djf,1)
(e,3)
(rghr,1)
(rigrg,1)
(a,3)
package streams
import org.apache.log4j.{Logger, Level}
import org.apache.spark.Logging
object LoggerLevels extends Logging {
def setStreamingLogLevels() {
val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
if (!log4jInitialized) {
logInfo("Setting log level to [WARN] for streaming example." +
" To override add a custom log4j.properties to the classpath.")
Logger.getRootLogger.setLevel(Level.WARN)
}
}
}
Implementation with reduceByKeyAndWindow
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
object WindowReduce {
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4")
val conf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[4]")
val sc = new SparkContext(conf)
// Create the StreamingContext with a 2-second batch interval
val ssc = new StreamingContext(sc, Seconds(2))
// Use a TCP socket as the data source
val lines = ssc.socketTextStream("skycloud1", 8888, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
// Window operation: count the words within each window
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(6), Seconds(10))
// Print the results
wordCounts.print()
// Start the computation
ssc.start()
// Wait for termination
ssc.awaitTermination()
ssc.awaitTermination()
}
}
Seconds(2): the batch interval is 2 seconds.
Seconds(6): the window length is 6 seconds, i.e. the most recent 6 seconds of data.
Seconds(10): the slide interval; a new window computation is triggered every 10 seconds. Both the window length and the slide interval must be multiples of the batch interval.
-------------------------------------------
Time: 1487556534000 ms
-------------------------------------------
-------------------------------------------
Time: 1487556554000 ms
-------------------------------------------
(edef,2)
(de,1)
(wedfe,1)
(wefef,1)
(ewfef,1)
**********************************
For more details see: http://blog.youkuaiyun.com/xfg0218/article/details/56008383
Spark with Flume
1-1)、Upload the JARs to Flume's lib directory
JAR download link (contact the author if the link does not work):
Link: http://pan.baidu.com/s/1kVz3bvT  Password: btnf
commons-lang3-3.3.2.jar scala-library-2.10.5.jar spark-streaming-flume-sink_2.10-1.6.1.jar
1-2)、Edit the Flume configuration file
[root@hadoop1 configurationFile]# vi flume-poll.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/flume/testDate
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = hadoop1
a1.sinks.k1.port = 8888
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
1-3)、Start Flume
[root@hadoop1 configurationFile]# flume-ng agent -n a1 -c conf -f flume-poll.conf -Dflume.root.logger=WARN,console
***************
16/11/06 01:48:46 INFO sink.SparkSink: Starting Avro server for sink: k1
16/11/06 01:48:46 INFO sink.SparkSink: Blocking Sink Runner, sink will continue to run..
1-4)、Add the Flume Maven dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
1-5)、Code
package streams
import java.net.InetSocketAddress
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object FlumeStreamingWordCount {
def main(args: Array[String]) {
// Point to the Hadoop winutils installation when running on Windows
System.setProperty("hadoop.home.dir",
"E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4");
val conf = new SparkConf().setAppName("FlumeStreamingWordCount").setMaster("local[2]")
LoggerLevels.setStreamingLogLevels()
// Create the StreamingContext with a 5-second batch interval
val ssc = new StreamingContext(conf, Seconds(5))
// Create a polling DStream from the Flume Spark Sink
val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, Array(new InetSocketAddress("hadoop1", 8888)), StorageLevel.MEMORY_AND_DISK)
// Read the body of each Flume event and split it into words
val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" "))
val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
// Print the results
result.print()
// Start the computation
ssc.start()
// Wait for termination
ssc.awaitTermination()
ssc.awaitTermination()
}
}
1-6)、Test data
[root@hadoop1 conf]# cp flume-conf.properties.template /usr/local/flume/testDate/
1-7)、Output
**********************
-------------------------------------------
Time: 1478427325000 ms
-------------------------------------------
(Unless,1)
(config,1)
(this,5)
(KIND,,1)
(case,,1)
(is,3)
(under,4)
(follows.,1)
(memoryChannel,3)
(sinks,1)
Spark with Kafka
1-1)、Start Kafka
[root@hadoop1 start_sh]# cat kafka_start.sh
cat /usr/local/start_sh/slave |while read line
do
{
echo $line
ssh $line "source /etc/profile;nohup kafka-server-start.sh /usr/local/kafka/config/server.properties > /dev/null 2>&1&"
}&
wait
done
1-2)、Create a topic
[root@hadoop1 start_sh]# kafka-topics.sh --create --zookeeper hadoop1:2181 --replication-factor 2 --partitions 3 --topic lines
Created topic "lines".
1-3)、List all topics
[root@hadoop1 start_sh]# kafka-topics.sh --list --zookeeper hadoop1:2181
lines
1-4)、Describe the topic
[root@hadoop1 start_sh]# kafka-topics.sh --describe --zookeeper hadoop1:2181 --topic lines
Topic:lines PartitionCount:3 ReplicationFactor:2 Configs:
Topic: lines Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: lines Partition: 1 Leader: 2 Replicas: 2,1 Isr: 2,1
Topic: lines Partition: 2 Leader: 0 Replicas: 0,2 Isr: 0,2
1-5)、Start a producer and send messages
[root@hadoop1 start_sh]# kafka-console-producer.sh --broker-list hadoop1:9092 --topic lines
aaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccccccccccccccc
1-6)、Start a consumer to read the data
[root@hadoop1 start_sh]# kafka-console-consumer.sh --zookeeper hadoop1:2181 --from-beginning --topic lines
aaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccccccccccccccc
1-7)、Code
package streams
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Created by root on 2016/5/21.
*/
object KafkaWordCount {
val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
//iter.flatMap(it=>Some(it._2.sum + it._3.getOrElse(0)).map(x=>(it._1,x)))
iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(i => (x, i)) }
}
def main(args: Array[String]) {
LoggerLevels.setStreamingLogLevels()
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
// Set the checkpoint directory (required for updateStateByKey)
ssc.checkpoint("c://ck200")
//"alog-2016-04-16,alog-2016-04-17,alog-2016-04-18"
//"Array((alog-2016-04-16, 2), (alog-2016-04-17, 2), (alog-2016-04-18, 2))"
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK_SER)
val words = data.map(_._2).flatMap(_.split(" "))
val wordCounts = words.map((_, 1)).updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
ssc.checkpoint() sets the checkpoint directory; checkpointing is required for stateful operations such as updateStateByKey and also allows a long-running application to recover after a failure (the corresponding method on SparkContext is setCheckpointDir()).
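As a sketch of how checkpointing also enables driver recovery (an assumption, not shown in the original code), the pipeline definition can be moved into a createContext() function and passed to StreamingContext.getOrCreate(), which rebuilds the context from the checkpoint if one exists:
// Rebuild the StreamingContext from the checkpoint directory if present,
// otherwise construct a fresh one with createContext().
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object RecoverableKafkaWordCount {
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("c://ck200")
    // ... define the Kafka DStream and the updateStateByKey pipeline here ...
    ssc
  }
  def main(args: Array[String]) {
    val ssc = StreamingContext.getOrCreate("c://ck200", createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}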
1-8)、Program arguments
hadoop1:2181,hadoop2:2181,hadoop3:2181 g1 lines 1
1-9)、Test data
[root@hadoop1 start_sh]# kafka-console-producer.sh --broker-list hadoop1:9092 --topic lines
aaaa
bbb
ccc
dddd
eeee
fff
gggg
1-10)、Output
**************************
-------------------------------------------
Time: 1478444825000 ms
-------------------------------------------
(aaaa,1)
(bbb,1)
(dddd,1)
(ddddddddddddd,1)
(eeee,1)
(fff,1)
(f,1)
(gggg,1)
(fffffffffffffff,1)
(ccc,1)
1-11)、Submit to the cluster and check the result
[root@hadoop1 sparkJar]# spark-submit --class streams.KafkaWordCount --master spark://hadoop1:7077,hadoop2:7077 --executor-memory 1g --total-executor-cores 2 /usr/local/spark/sparkJar/sparkKafka.jar hadoop1:2181,hadoop2:2181,hadoop3:2181 sparkKafka lines 2
Spark with Redis
1-1)、Produce data into Kafka
import java.util.Properties
import kafka.javaapi.producer.Producer
import kafka.producer.{KeyedMessage, ProducerConfig}
import org.codehaus.jettison.json.JSONObject
import scala.util.Random
object KafkaEventProducer {
private val users = Array("4A4D769EB9679C054DE81B973ED5D768",
"8dfeb5aaafc027d89349ac9a20b3930f",
"011BBF43B89BFBF266C865DF0397AA71",
"f2a8474bf7bd94f0aabbd4cdd2c06dcf",
"068b746ed4620d25e26055a9f804385f",
"97edfc08311c70143401745a03a50706",
"d7f141563005d1b5d0d3dd30138f3f62",
"c8ee90aade1671a21336c721512b817a",
"6b67c8c700427dee7552f81f3228c927", "a95f22eabc4fd4b580c011a3161a9d9d")
private val random = new Random()
private var pointer = -1
def getUserID: String = {
pointer = pointer + 1
if (pointer >= users.length) {
pointer = 0
users(pointer)
} else {
users(pointer)
}
}
def click(): Double = {
random.nextInt(10)
}
def main(args: Array[String]) {
val topic = "user_event"
// Multiple broker addresses can be listed here
val brokers = "hadoop1:9092"
val props = new Properties()
props.put("metadata.broker.list", brokers)
props.put("serializer.class", "kafka.serializer.StringEncoder")
val kafka = new ProducerConfig(props)
val producer = new Producer[String, String](kafka)
while (true) {
val event = new JSONObject()
event.put("uid", getUserID)
event.put("event_time", System.currentTimeMillis.toString)
event.put("os_type", "Android")
event.put("click_count", click)
producer.send(new KeyedMessage[String, String](topic, event.toString))
println("Message sent:" + event)
Thread.sleep(2000)
}
}
}
1-2)、Consume from Kafka and write to Redis
package test
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.codehaus.jettison.json.JSONObject
object UserClickCountAnalytics {
def main(args: Array[String]) {
var masterUrl = "local[1]"
if (args.length > 0) {
masterUrl = args(0)
}
val conf = new SparkConf().setMaster(masterUrl).setAppName(this.getClass.getName)
val ssc = new StreamingContext(conf, Seconds(5))
// The topic must match the one used by KafkaEventProducer
val topic = Set("user_event")
val brokers = "hadoop1:9092"
val kafkaParames = Map[String, String]("metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")
val dbindex = 1
val clickHashKey = "app:users::click"
// Direct (receiver-less) Kafka stream; each message value is a JSON event string
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParames, topic)
// Parse the JSON event out of the message value
val events = kafkaStream.flatMap(line => {
Some(new JSONObject(line._2))
})
// Extract (uid, click_count) pairs
val userClicks = events.map(x => (x.getString("uid"), x.getLong("click_count")))
userClicks.foreachRDD(rdd => {
rdd.foreachPartition(partititonOfRecords => {
partititonOfRecords.foreach(pair => {
val uid = pair._1
val clickCount = pair._2
// Take a Jedis connection from the pool on the executor and increment the hash field
val jedis = RedisClient.pool.getResource
jedis.select(dbindex)
jedis.hincrBy(clickHashKey, uid, clickCount)
RedisClient.pool.returnResource(jedis)
})
})
})
ssc.start()
ssc.awaitTermination()
}
}
1-3)、Redis connection pool
package test
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool
object RedisClient extends Serializable {
val redisHost = "hadoop1"
val redisPort = "6379".toInt
val redisTimeout = 300000
lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
lazy val hook = new Thread {
override def run = {
println("hook thread:" + this)
pool.destroy()
}
}
sys.addShutdownHook(hook.run)
}
Some indicates that a value is present, which makes downstream processing convenient; when no value is produced, None is returned instead.
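A tiny illustration of Option, independent of the streaming code above:
// Option models the presence or absence of a value without using null.
val present: Option[Int] = Some(3)
val absent: Option[Int] = None
println(present.getOrElse(0)) // prints 3
println(absent.getOrElse(0))  // prints 0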
Ways for Spark Streaming to obtain its data source
1-1)、Creating the context directly against a cluster
val ssc = new StreamingContext("spark://hadoop1:7077", "WordCount", Seconds(1), [sparkHome], [jars])
The first argument specifies the master URL of the cluster, and the third argument specifies the batch interval; here a Spark job is run on the data once every second.
1-2)、Reading data from a network port
val lines = ssc.socketTextStream("localhost", 9999)
Data is received over the network and then processed.
Summary of 《Spark 大数据处理技术》
Overview
The notes below summarize the book 《Spark 大数据处理技术》 and are meant to be easy to follow. If you run into any problems, contact the author or visit: http://blog.youkuaiyun.com/xfg0218/article/details/55272083
The materials are available at: http://pan.baidu.com/s/1pKRAXjt (access code: ohpj)
Chapter 1
1-1)、Expressive power of RDDs
A)、Iterative computation
B)、Relational queries
C)、MapReduce-style batch processing
1-2)、Spark subsystems
1-3)、The Spark ecosystem
1-4)、Characteristics of the Spark ecosystem
Chapter 2
1-1)、Spark RDDs and the programming interface
C)、Spark RDD
1-2)、Preferred locations of an RDD
1-3)、RDD dependencies
1-4)、partitions
1-5)、preferredLocations
1-6)、dependencies
1-7)、compute
1-8)、partitioner
1-2)、Basic RDD transformations
A)、Persistence operations
B)、Repartitioning an RDD
C)、Set operations
E)、Key-value RDD transformations
F)、combineByKey
G)、Actions
See below for details:
Chapter 3
1-1)、Spark execution modes and principles
Details below
Chapter 4
1-1)、Spark scheduling principles
1-1)、SparkContext
A)、DAGScheduler
B)、The scheduling logic is based on the Akka Actor mechanism
Details below
Chapter 5
1-1)、Spark storage management
1-2)、The storage layer
1-4)、How shuffle blocks are stored
1-5)、The two ways shuffle data is read and transferred
1-1)、StorageLevel
Details below
Chapter 6
1-1)、The Stages page of the web UI
1-2)、The Storage page of the web UI
Details below
Chapter 7
1-1)、Spark architecture, installation, and deployment
Details below
Chapter 8
1-1)、User-defined functions
1-2)、CLI commands related to user-defined function extensions
1-3)、Key points about UDFs
Details below
Chapter 9
1-1)、Spark SQL
Details below
Chapter 10
1-1)、Spark Streaming
1-2)、Performance tuning
Details below