Spark Streaming Summary
Official documentation
http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html
Overview
Spark Streaming is similar to Apache Storm and is used to process streaming data. According to the official documentation, Spark Streaming provides high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, it can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be written to many destinations, such as HDFS, Redis, or HBase. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.
Spark Streaming architecture diagram
As the diagram shows, Spark Streaming is responsible only for the data-processing step in the middle of the pipeline.
It also shows that the components of the Spark ecosystem support one another and can be combined during computation, which greatly improves development efficiency and capability.
What is a DStream
A Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data, either the input stream or the result stream produced by applying Spark primitives. Internally, a DStream is represented as a sequence of consecutive RDDs, each containing the data of one time interval, as shown below:
Operations on the data are performed at the granularity of an RDD.
The computation itself is carried out by the Spark engine.
1-1)、DStream operations
The primitives on a DStream are similar to those on an RDD and fall into two groups: Transformations and Output Operations. Among the transformations there are also some special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.
1-2)、Transformations on DStreams
Transformation | Meaning |
reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel. |
countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
1-3)、Special Transformations
- UpdateStateByKey Operation
The updateStateByKey primitive maintains state across batches; the stateful Word Count example later in this document relies on it. Without updateStateByKey, each batch is analyzed, its result is output, and nothing is kept for the next batch.
- Transform Operation
The transform primitive lets you apply an arbitrary RDD-to-RDD function to every RDD of a DStream, which makes it easy to go beyond what the DStream API offers. It is also the hook through which MLlib (machine learning) and GraphX are combined with Spark Streaming; see the sketch below.
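Below is a minimal, self-contained sketch of transform(); the host name, port, and blacklist contents are illustrative assumptions, not part of the original example.
// Sketch of transform(): filter each batch against a static blacklist RDD.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object TransformExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TransformExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Static RDD of blacklisted words, joined against every batch
    val blacklist = ssc.sparkContext.parallelize(Seq(("spam", true)))
    val words = ssc.socketTextStream("hadoop2", 8888).flatMap(_.split(" ")).map((_, 1))
    val cleaned = words.transform { rdd =>
      rdd.leftOuterJoin(blacklist)                       // (word, (1, Option[Boolean]))
        .filter { case (_, (_, flag)) => flag.isEmpty }  // keep words that are not blacklisted
        .map { case (word, (count, _)) => (word, count) }
    }
    cleaned.reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}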
- Window Operations
Window operations are somewhat similar to state in Storm: by setting the window length and the slide interval you can continuously obtain the state of the stream over the most recent window.
reduceByKeyAndWindow(_ + _, _ - _, Seconds(6), Seconds(10)) passes two functions; the second is the inverse of the first, which is a performance optimization: the previous window's result is reused and updated incrementally for the next computation instead of being recomputed from scratch, improving efficiency and speed. A sketch of this form follows.
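A sketch of the two-function form is shown below, assuming a socket source; the host, port, window length, and slide interval are illustrative. This variant requires a checkpoint directory, and both the window length and the slide interval must be multiples of the batch interval.
// Incremental window aggregation: add counts entering the window and subtract counts leaving it.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object IncrementalWindowExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("IncrementalWindowExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("E://ck-window")                      // hypothetical checkpoint directory
    val wordAndOne = ssc.socketTextStream("hadoop2", 8888).flatMap(_.split(" ")).map((_, 1))
    val windowedCounts = wordAndOne.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,                         // add counts entering the window
      (a: Int, b: Int) => a - b,                         // subtract counts leaving the window
      Seconds(30),                                       // window length (illustrative)
      Seconds(10))                                       // slide interval (illustrative)
    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}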
Output Operations on DStreams
Output operations write the data of a DStream out to an external database or file system. Only when an output operation is invoked (analogous to an action on an RDD) does the streaming program actually start computing; a sketch of the typical foreachRDD pattern follows the table.
Output Operation | Meaning |
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. |
saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
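The fragment below sketches the recommended foreachRDD pattern, assuming a result DStream like the ones built later in this document; the Connection class is a hypothetical stand-in for a real client (JDBC, Jedis, etc.), created once per partition on the executors rather than on the driver.
// Hypothetical placeholder for a real external-store client.
class Connection {
  def send(record: String): Unit = println(record)
  def close(): Unit = ()
}
result.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val conn = new Connection()                          // created on the executor, once per partition
    partition.foreach(record => conn.send(record.toString))
    conn.close()
  }
}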
Real-time WordCount with Spark Streaming
1-1)、Diagram
1-2)、Install nc
[root@hadoop1 ~]# yum install -y nc
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.neusoft.edu.cn
* extras: mirrors.neusoft.edu.cn
* updates: mirrors.nwsuaf.edu.cn
1-3)、Common options
[root@hadoop1 ~]# nc -help
Ncat 6.40 ( http://nmap.org/ncat )
Usage: ncat [options] [hostname] [port]
Options taking a time assume seconds. Append 'ms' for milliseconds,
's' for seconds, 'm' for minutes, or 'h' for hours (e.g. 500ms).
-4 Use IPv4 only
-6 Use IPv6 only
-U, --unixsock Use Unix domain sockets only
-C, --crlf Use CRLF for EOL sequence
-c, --sh-exec <command> Executes the given command via /bin/sh
-e, --exec <command> Executes the given command
--lua-exec <filename> Executes the given Lua script
-g hop1[,hop2,...] Loose source routing hop points (8 max)
-G <n> Loose source routing hop pointer (4, 8, 12, ...)
-m, --max-conns <n> Maximum <n> simultaneous connections
-h, --help Display this help screen
-d, --delay <time> Wait between read/writes
-o, --output <filename> Dump session data to a file
-x, --hex-dump <filename> Dump session data as hex to a file
-i, --idle-timeout <time> Idle read/write timeout
-p, --source-port port Specify source port to use
-s, --source addr Specify source address to use (doesn't affect -l)
-l, --listen Bind and listen for incoming connections
-k, --keep-open Accept multiple connections in listen mode
-n, --nodns Do not resolve hostnames via DNS
-t, --telnet Answer Telnet negotiations
-u, --udp Use UDP instead of default TCP
--sctp Use SCTP instead of default TCP
-v, --verbose Set verbosity level (can be used several times)
-w, --wait <time> Connect timeout
--append-output Append rather than clobber specified output files
--send-only Only send data, ignoring received; quit on EOF
--recv-only Only receive data, never send anything
--allow Allow only given hosts to connect to Ncat
--allowfile A file of hosts allowed to connect to Ncat
--deny Deny given hosts from connecting to Ncat
--denyfile A file of hosts denied from connecting to Ncat
--broker Enable Ncat's connection brokering mode
--chat Start a simple Ncat chat server
--proxy <addr[:port]> Specify address of host to proxy through
--proxy-type <type> Specify proxy type ("http" or "socks4")
--proxy-auth <auth> Authenticate with HTTP or SOCKS proxy server
--ssl Connect or listen with SSL
--ssl-cert Specify SSL certificate file (PEM) for listening
--ssl-key Specify SSL private key (PEM) for listening
--ssl-verify Verify trust and domain name of certificates
--ssl-trustfile PEM file containing trusted SSL certificates
--version Display Ncat's version information and exit
See the ncat(1) manpage for full options, descriptions and usage examples
1-4)、Start nc
[root@hadoop1 ~]# nc -lk 8888
dfhf
fbfr
ere
Gfr
1-5)、Code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Created by Administrator on 2016/11/6.
*/
object SparkstringTest {
def main(args: Array[String]) {
// Point to the Hadoop winutils installation when running on Windows
System.setProperty("hadoop.home.dir",
"E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4");
// Initialize the Spark configuration
val conf = new SparkConf().setAppName("SparkstringTest").setMaster("local[2]")
// Create the streaming context with a 5-second batch interval
val ssc = new StreamingContext(conf, Seconds(5))
// Connect to the TCP source
val textStream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop2", 8888)
// Split each line into words
val map: DStream[String] = textStream.flatMap(_.split(" "))
// Map each word to a (word, 1) pair
val map1: DStream[(String, Int)] = map.map((_, 1))
// Sum the counts for each word within the batch
val key: DStream[(String, Int)] = map1.reduceByKey(_ + _)
// Print the results
key.print()
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
ssc.awaitTermination()
}
}
key.print() prints the first 10 elements of each batch by default.
1-6)、Output
*************
-------------------------------------------
Time: 1478415525000 ms
-------------------------------------------
(dff,1)
(a,2)
(dfed,1)
************
reduceByKey only aggregates the data within the current batch; it does not accumulate counts across batches.
Reading data from a TCP port and accumulating counts across batches
Prepare the JARs
Required dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.1</version>
</dependency>
Diagram
Implementation with updateStateByKey
package streams
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Created by Admin on 2016/8/26.
*/
object StateFulStreamingWordCount {
val updateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
//it.map(t => (t._1, t._2.sum + t._3.getOrElse(0)))
it.map { case (x, y, z) => (x, y.sum + z.getOrElse(0)) }
}
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir",
"E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4")
LoggerLevels.setStreamingLogLevels()
val conf = new SparkConf().setAppName("StateFulStreamingWordCount").setMaster("local[2]")
// Create the StreamingContext with a 5-second batch interval
val ssc = new StreamingContext(conf, Seconds(5))
// Set the checkpoint directory (required for updateStateByKey)
ssc.checkpoint("E://ck0826")
// Create a DStream from the TCP socket
val lines: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop2", 8888)
val words: DStream[String] = lines.flatMap(_.split(" "))
val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
val result: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
// Print the results (first 10 elements of each batch by default)
result.print()
// Start the computation
ssc.start()
// Wait for termination
ssc.awaitTermination()
ssc.awaitTermination()
}
}
[root@hadoop2 sbin]# nc -lk 8888
a b c d e f
a b c d e f
djf ffgrg rghr rigrg righrg
a b c d e f
d d d d d d d
*********************
-------------------------------------------
Time: 1478421805000 ms
-------------------------------------------
(d,10)
(ffgrg,1)
(b,3)
(,1)
(f,3)
(djf,1)
(e,3)
(rghr,1)
(rigrg,1)
(a,3)
package streams
import org.apache.log4j.{Logger, Level}
import org.apache.spark.Logging
object LoggerLevels extends Logging {
def setStreamingLogLevels() {
val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
if (!log4jInitialized) {
logInfo("Setting log level to [WARN] for streaming example." +
" To override add a custom log4j.properties to the classpath.")
Logger.getRootLogger.setLevel(Level.WARN)
}
}
}
Implementation with reduceByKeyAndWindow
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
object WindowReduce {
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4")
val conf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[4]")
val sc = new SparkContext(conf)
// Create the StreamingContext with a 2-second batch interval
val ssc = new StreamingContext(sc, Seconds(2))
// Use a TCP socket as the data source
val lines = ssc.socketTextStream("skycloud1", 8888, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
// Window operation: count the words within each window
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(6), Seconds(10))
// Print the results
wordCounts.print()
// Start the computation
ssc.start()
// Wait for termination
ssc.awaitTermination()
ssc.awaitTermination()
}
}
Seconds(2): the batch interval is 2 seconds.
Seconds(6): the window length is 6 seconds, i.e. the most recent 6 seconds of data.
Seconds(10): the slide interval; a new window computation is triggered every 10 seconds. Both the window length and the slide interval must be multiples of the batch interval.
-------------------------------------------
Time: 1487556534000 ms
-------------------------------------------
-------------------------------------------
Time: 1487556554000 ms
-------------------------------------------
(edef,2)
(de,1)
(wedfe,1)
(wefef,1)
(ewfef,1)
**********************************
For more details see: http://blog.youkuaiyun.com/xfg0218/article/details/56008383
Spark with Flume
1-1)、Upload the JARs to Flume's lib directory
JAR download link (contact the author if the link does not work):
Link: http://pan.baidu.com/s/1kVz3bvT  Password: btnf
commons-lang3-3.3.2.jar scala-library-2.10.5.jar spark-streaming-flume-sink_2.10-1.6.1.jar
1-2)、Edit the Flume configuration file
[root@hadoop1 configurationFile]# vi flume-poll.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/flume/testDate
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = hadoop1
a1.sinks.k1.port = 8888
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
1-3)、Start Flume
[root@hadoop1 configurationFile]# flume-ng agent -n a1 -c conf -f flume-poll.conf -Dflume.root.logger=WARN,console
***************
16/11/06 01:48:46 INFO sink.SparkSink: Starting Avro server for sink: k1
16/11/06 01:48:46 INFO sink.SparkSink: Blocking Sink Runner, sink will continue to run..
1-4)、Add the Flume Maven dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
1-5)、Code
package streams
import java.net.InetSocketAddress
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object FlumeStreamingWordCount {
def main(args: Array[String]) {
// Point to the Hadoop winutils installation when running on Windows
System.setProperty("hadoop.home.dir",
"E:\\winutils-hadoop-2.6.4\\hadoop-2.6.4");
val conf = new SparkConf().setAppName("FlumeStreamingWordCount").setMaster("local[2]")
LoggerLevels.setStreamingLogLevels()
// Create the StreamingContext with a 5-second batch interval
val ssc = new StreamingContext(conf, Seconds(5))
// Create a polling DStream from the Flume Spark Sink
val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, Array(new InetSocketAddress("hadoop1", 8888)), StorageLevel.MEMORY_AND_DISK)
// Read the body of each Flume event and split it into words
val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" "))
val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
// Print the results
result.print()
// Start the computation
ssc.start()
// Wait for termination
ssc.awaitTermination()
ssc.awaitTermination()
}
}
1-6)、Test data
[root@hadoop1 conf]# cp flume-conf.properties.template /usr/local/flume/testDate/
1-7)、Output
**********************
-------------------------------------------
Time: 1478427325000 ms
-------------------------------------------
(Unless,1)
(config,1)
(this,5)
(KIND,,1)
(case,,1)
(is,3)
(under,4)
(follows.,1)
(memoryChannel,3)
(sinks,1)
Spark with Kafka
1-1)、Start Kafka
[root@hadoop1 start_sh]# cat kafka_start.sh
cat /usr/local/start_sh/slave |while read line
do
{
echo $line
ssh $line "source /etc/profile;nohup kafka-server-start.sh /usr/local/kafka/config/server.properties > /dev/null 2>&1&"
}&
wait
done
1-2)、Create a topic
[root@hadoop1 start_sh]# kafka-topics.sh --create --zookeeper hadoop1:2181 --replication-factor 2 --partitions 3 --topic lines
Created topic "lines".
1-3)、List all topics
[root@hadoop1 start_sh]# kafka-topics.sh --list --zookeeper hadoop1:2181
lines
1-4)、Describe the topic
[root@hadoop1 start_sh]# kafka-topics.sh --describe --zookeeper hadoop1:2181 --topic lines
Topic:lines PartitionCount:3 ReplicationFactor:2 Configs:
Topic: lines Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: lines Partition: 1 Leader: 2 Replicas: 2,1 Isr: 2,1
Topic: lines Partition: 2 Leader: 0 Replicas: 0,2 Isr: 0,2
1-5)、Start a producer and send messages
[root@hadoop1 start_sh]# kafka-console-producer.sh --broker-list hadoop1:9092 --topic lines
aaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccccccccccccccc
1-6)、Start a consumer to read the data
[root@hadoop1 start_sh]# kafka-console-consumer.sh --zookeeper hadoop1:2181 --from-beginning --topic lines
aaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccccccccccccccc
1-7)、Code
package streams
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Created by root on 2016/5/21.
*/
object KafkaWordCount {
val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
//iter.flatMap(it=>Some(it._2.sum + it._3.getOrElse(0)).map(x=>(it._1,x)))
iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(i => (x, i)) }
}
def main(args: Array[String]) {
LoggerLevels.setStreamingLogLevels()
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
// Set the checkpoint directory (required for updateStateByKey)
ssc.checkpoint("c://ck200")
//"alog-2016-04-16,alog-2016-04-17,alog-2016-04-18"
//"Array((alog-2016-04-16, 2), (alog-2016-04-17, 2), (alog-2016-04-18, 2))"
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK_SER)
val words = data.map(_._2).flatMap(_.split(" "))
val wordCounts = words.map((_, 1)).updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
ssc.checkpoint() sets the checkpoint directory; checkpointing is required for stateful operations such as updateStateByKey and also allows a long-running application to recover after a failure (the corresponding method on SparkContext is setCheckpointDir()).
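As a sketch of how checkpointing also enables driver recovery (an assumption, not shown in the original code), the pipeline definition can be moved into a createContext() function and passed to StreamingContext.getOrCreate(), which rebuilds the context from the checkpoint if one exists:
// Rebuild the StreamingContext from the checkpoint directory if present,
// otherwise construct a fresh one with createContext().
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object RecoverableKafkaWordCount {
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("c://ck200")
    // ... define the Kafka DStream and the updateStateByKey pipeline here ...
    ssc
  }
  def main(args: Array[String]) {
    val ssc = StreamingContext.getOrCreate("c://ck200", createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}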
1-8)、Program arguments
hadoop1:2181,hadoop2:2181,hadoop3:2181 g1 lines 1
1-9)、Test data
[root@hadoop1 start_sh]# kafka-console-producer.sh --broker-list hadoop1:9092 --topic lines
aaaa
bbb
ccc
dddd
eeee
fff
gggg
1-10)、Output
**************************
-------------------------------------------
Time: 1478444825000 ms
-------------------------------------------
(aaaa,1)
(bbb,1)
(dddd,1)
(ddddddddddddd,1)
(eeee,1)
(fff,1)
(f,1)
(gggg,1)
(fffffffffffffff,1)
(ccc,1)
1-11)、Submit to the cluster and check the result
[root@hadoop1 sparkJar]# spark-submit --class streams.KafkaWordCount --master spark://hadoop1:7077,hadoop2:7077 --executor-memory 1g --total-executor-cores 2 /usr/local/spark/sparkJar/sparkKafka.jar hadoop1:2181,hadoop2:2181,hadoop3:2181 sparkKafka lines 2
Spark with Redis
1-1)、Produce data into Kafka
import java.util.Properties
import kafka.javaapi.producer.Producer
import kafka.producer.{KeyedMessage, ProducerConfig}
import org.codehaus.jettison.json.JSONObject
import scala.util.Random
object KafkaEventProducer {
private val users = Array("4A4D769EB9679C054DE81B973ED5D768",
"8dfeb5aaafc027d89349ac9a20b3930f",
"011BBF43B89BFBF266C865DF0397AA71",
"f2a8474bf7bd94f0aabbd4cdd2c06dcf",
"068b746ed4620d25e26055a9f804385f",
"97edfc08311c70143401745a03a50706",
"d7f141563005d1b5d0d3dd30138f3f62",
"c8ee90aade1671a21336c721512b817a",
"6b67c8c700427dee7552f81f3228c927", "a95f22eabc4fd4b580c011a3161a9d9d")
private val random = new Random()
private var pointer = -1
def getUserID: String = {
pointer = pointer + 1
if (pointer >= users.length) {
pointer = 0
users(pointer)
} else {
users(pointer)
}
}
def click(): Double = {
random.nextInt(10)
}
def main(args: Array[String]) {
val topic = "user_event"
// Multiple broker addresses can be listed here
val brokers = "hadoop1:9092"
val props = new Properties()
props.put("metadata.broker.list", brokers)
props.put("serializer.class", "kafka.serializer.StringEncoder")
val kafka = new ProducerConfig(props)
val producer = new Producer[String, String](kafka)
while (true) {
val event = new JSONObject()
event.put("uid", getUserID)
event.put("event_time", System.currentTimeMillis.toString)
event.put("os_type", "Android")
event.put("click_count", click)
producer.send(new KeyedMessage[String, String](topic, event.toString))
println("Message sent:" + event)
Thread.sleep(2000)
}
}
}
1-2)、Consume from Kafka and write to Redis
package test
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.codehaus.jettison.json.JSONObject
object UserClickCountAnalytics {
def main(args: Array[String]) {
var masterUrl = "local[1]"
if (args.length > 0) {
masterUrl = args(0)
}
val conf = new SparkConf().setMaster(masterUrl).setAppName(this.getClass.getName)
val ssc = new StreamingContext(conf, Seconds(5))
// The topic must match the one used by KafkaEventProducer
val topic = Set("user_event")
val brokers = "hadoop1:9092"
val kafkaParames = Map[String, String]("metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")
val dbindex = 1
val clickHashKey = "app:users::click"
// Direct (receiver-less) Kafka stream; each message value is a JSON event string
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParames, topic)
// Parse the JSON event out of the message value
val events = kafkaStream.flatMap(line => {
Some(new JSONObject(line._2))
})
// Extract (uid, click_count) pairs
val userClicks = events.map(x => (x.getString("uid"), x.getLong("click_count")))
userClicks.foreachRDD(rdd => {
rdd.foreachPartition(partititonOfRecords => {
partititonOfRecords.foreach(pair => {
val uid = pair._1
val clickCount = pair._2
// Take a Jedis connection from the pool on the executor and increment the hash field
val jedis = RedisClient.pool.getResource
jedis.select(dbindex)
jedis.hincrBy(clickHashKey, uid, clickCount)
RedisClient.pool.returnResource(jedis)
})
})
})
ssc.start()
ssc.awaitTermination()
}
}
1-3)、Redis connection pool
package test
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool
object RedisClient extends Serializable {
val redisHost = "hadoop1"
val redisPort = "6379".toInt
val redisTimeout = 300000
lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
lazy val hook = new Thread {
override def run = {
println("hook thread:" + this)
pool.destroy()
}
}
sys.addShutdownHook(hook.run)
}
Some indicates that a value is present, which makes downstream processing convenient; when no value is produced, None is returned instead.
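A tiny illustration of Option, independent of the streaming code above:
// Option models the presence or absence of a value without using null.
val present: Option[Int] = Some(3)
val absent: Option[Int] = None
println(present.getOrElse(0)) // prints 3
println(absent.getOrElse(0))  // prints 0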
Ways for Spark Streaming to obtain its data source
1-1)、Creating the context directly against a cluster
val ssc = new StreamingContext("spark://hadoop1:7077", "WordCount", Seconds(1), [sparkHome], [jars])
The first argument specifies the master URL of the cluster, and the third argument specifies the batch interval; here a Spark job is run on the data once every second.
1-2)、Reading data from a network port
val lines = ssc.socketTextStream("localhost", 9999)
Data is received over the network and then processed.
Summary of 《Spark 大数据处理技术》
Overview
The notes below summarize the book 《Spark 大数据处理技术》 and are meant to be easy to follow. If you run into any problems, contact the author or visit: http://blog.youkuaiyun.com/xfg0218/article/details/55272083
The materials are available at: http://pan.baidu.com/s/1pKRAXjt (access code: ohpj)
Chapter 1
1-1)、Expressive power of RDDs
A)、Iterative computation
B)、Relational queries
C)、MapReduce-style batch processing
1-2)、Spark subsystems
1-3)、The Spark ecosystem
1-4)、Characteristics of the Spark ecosystem
Chapter 2
1-1)、Spark RDDs and the programming interface
C)、Spark RDD
1-2)、Preferred locations of an RDD
1-3)、RDD dependencies
1-4)、partitions
1-5)、preferredLocations
1-6)、dependencies
1-7)、compute
1-8)、partitioner
1-2)、Basic RDD transformations
A)、Persistence operations
B)、Repartitioning an RDD
C)、Set operations
E)、Key-value RDD transformations
F)、combineByKey
G)、Actions
See below for details:
Chapter 3
1-1)、Spark execution modes and principles
Details below
Chapter 4
1-1)、Spark scheduling principles
1-1)、SparkContext
A)、DAGScheduler
B)、The scheduling logic is based on the Akka Actor mechanism
Details below
Chapter 5
1-1)、Spark storage management
1-2)、The storage layer
1-4)、How shuffle blocks are stored
1-5)、The two ways shuffle data is read and transferred
1-1)、StorageLevel
Details below
Chapter 6
1-1)、The Stages page of the web UI
1-2)、The Storage page of the web UI
Details below
Chapter 7
1-1)、Spark architecture, installation, and deployment
Details below
Chapter 8
1-1)、User-defined functions
1-2)、CLI commands related to user-defined function extensions
1-3)、Key points about UDFs
Details below
Chapter 9
1-1)、Spark SQL
Details below
Chapter 10
1-1)、Spark Streaming
1-2)、Performance tuning
Details below