Using Kafka
1. Start ZooKeeper:
hadoop@moon:/usr/local/cloud/zookeeper-3.4.6$ ./bin/zkServer.sh start &
2. Start the Kafka server:
./bin/kafka-server-start.sh config/server.properties &
3. Create a test topic:
$ ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
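To confirm the topic exists, you can optionally list the topics registered in ZooKeeper (assuming the same localhost:2181 address used above):
$ ./bin/kafka-topics.sh --list --zookeeper localhost:2181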
Hands-on
1. Download the spark-streaming-kafka library and its dependencies:
$ cd /usr/local/spark/spark-data
$ wget http://central.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.10/1.2.0/spark-streaming-kafka_2.10-1.2.0.jar
$ wget http://central.maven.org/maven2/org/apache/kafka/kafka_2.10/0.8.1/kafka_2.10-0.8.1.jar
$ wget http://central.maven.org/maven2/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar
$ wget http://central.maven.org/maven2/com/101tec/zkclient/0.4/zkclient-0.4.jar
2. Start the Spark shell and provide the spark-streaming-kafka library:
$ spark-shell --jars /usr/local/spark/spark-data/spark-streaming-kafka_2.10-1.2.0.jar,/usr/local/spark/spark-data/kafka_2.10-0.8.1.jar,/usr/local/spark/spark-data/metrics-core-2.2.0.jar,/usr/local/spark/spark-data/zkclient-0.4.jar
3. Stream-specific imports:
scala> import org.apache.spark.streaming.{Seconds, StreamingContext}
4. Import the implicit conversions:
scala> import org.apache.spark._
scala> import org.apache.spark.streaming._
scala> import org.apache.spark.streaming.StreamingContext._
scala> import org.apache.spark.streaming.kafka.KafkaUtils
5. Create a StreamingContext with a 2-second batch interval:
scala> val ssc = new StreamingContext(sc, Seconds(2))
6. Set the Kafka-specific variables:
scala> val zkQuorum = "localhost:2181"
scala> val group = "test-group"
scala> val topics = "test"
scala> val numThreads = 1
7. Create topicMap (a short illustration follows this step):
scala> val topicMap = topics.split(",").map((_,numThreads.toInt)).toMap
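For the single topic used here this evaluates to Map(test -> 1), that is, consume the test topic with one consumer thread. With several comma-separated topics, the same expression maps each topic to the thread count; as a quick illustrative snippet (the topic names news and logs are made up):
scala> "news,logs".split(",").map((_, 2)).toMap
// yields Map(news -> 2, logs -> 2)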
8. Create the Kafka DStream:
scala> val lineMap = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
9. Pull the message values out of lineMap (each element is a (key, message) tuple):
scala> val lines = lineMap.map(_._2)
10. Create a flatMap of the values, splitting each line into words (a quick sketch follows this step):
scala> val words = lines.flatMap(_.split(" "))
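As a standalone sketch of what flatMap does to each batch (plain Scala, not part of the stream):
scala> Seq("hello world", "hello spark").flatMap(_.split(" "))
// yields List(hello, world, hello, spark): each line is split into words and the results are flattened into one collection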
11. Create key-value pairs of (word, occurrence):
scala> val pair = words.map( x => (x,1))
12. Compute the word counts over a sliding window (a note on the inverse reduce function follows this step):
scala> val wordCounts = pair.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
scala> wordCounts.print
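Here _ + _ folds in values entering the window and _ - _ subtracts values leaving it, so each 2-second slide only processes the batches that changed rather than recomputing the full 10-minute window (this is also why checkpointing is needed in the next step). If recomputing the whole window is acceptable, a simpler sketch without the inverse function would be:
scala> val wordCounts = pair.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Seconds(2))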
13. Set the checkpoint directory (required for stateful operations such as the windowed count above):
scala> ssc.checkpoint("hdfs://localhost:9000/user/hduser/checkpoint")
14. Start the StreamingContext:
scala> ssc.start
scala> ssc.awaitTermination
15. In another terminal, start a Kafka console producer client and publish data to the test topic:
$ ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
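For example, typing a couple of lines into the producer console (hypothetical sample input; each line becomes one Kafka message):
hello spark streaming
hello kafka
After the next 2-second batch, the wordCounts.print output in the Spark shell should include pairs such as (hello,2), (spark,1), (streaming,1), and (kafka,1) under a Time: ... ms header.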
Advanced
If you want to maintain a running count of the number of occurrences of each word, Spark Streaming has a feature called the updateStateByKey operation. It lets you maintain any arbitrary state while continually updating it with new information as it arrives.
This arbitrary state can be an aggregated value, or just a change of state.
1. Call updateStateByKey on the pair DStream:
scala> val runningCounts = pair.updateStateByKey( (values: Seq[Int], state: Option[Int]) => Some(state.sum + values.sum))
The updateStateByKey operation returns a new “state” DStream where
the state for each key is updated by applying the given function on the
previous state of the key and the new values for the key. This can be used to
maintain arbitrary state data for each key.
There are two steps involved in making this operation work:
- Define the state
- Define the state update function
The update function is called once for each key in every batch: values is the sequence of new values that arrived for that key in the batch (much like the values handed to a reducer in MapReduce), and state is the arbitrary running state, which we chose to make an Option[Int]. With every call, the previous state gets updated by adding the sum of the current values to it; a named version of the same update function is sketched after this explanation.
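A minimal named equivalent of the anonymous update function used in step 1 (the name updateCount is only for illustration; getOrElse(0) makes the handling of a missing previous state explicit):
// previous running count (or 0 if this key is new) plus the new occurrences in this batch
def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newValues.sum)

val runningCounts = pair.updateStateByKey(updateCount _)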
2. Print the results:
scala> runningCounts.print
3. The following are all the steps combined to maintain the arbitrary state using the
updateStateByKey operation:
scala> :paste
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc, Seconds(2))
val zkQuorum = "localhost:2181"
val group = "test-group"
val topics = "test"
val numThreads = 1
val topicMap = topics.split(",").map((_,numThreads.toInt)).toMap
val lineMap = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
val lines = lineMap.map(_._2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(x => (x,1))
val runningCounts = pairs.updateStateByKey( (values: Seq[Int], state: Option[Int]) => Some(state.sum + values.sum))
runningCounts.print
ssc.checkpoint("hdfs://localhost:9000/user/hduser/checkpoint")
ssc.start
ssc.awaitTermination
Press Ctrl+D to execute the code above.
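As a side note, if you omit ssc.awaitTermination while experimenting in the shell, the prompt stays usable and the stream can later be stopped without shutting down the shell's SparkContext (a minimal sketch using StreamingContext.stop):
ssc.stop(stopSparkContext = false)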