Spark Streaming, Part 2

This post covers the basics of Spark Streaming: creating and configuring a StreamingContext, defining and operating on DStreams, and processing a socket data stream through a worked example. It also looks at input sources and receivers, the difference between basic and advanced streaming sources, and how to configure and run a Spark Streaming program in practice.


Streaming Context

To initialize a Spark Streaming program, a StreamingContext object has to be created; it is the main entry point of all Spark Streaming functionality.

Constructors

  /**
   * Create a StreamingContext using an existing SparkContext.
   * @param sparkContext existing SparkContext
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(sparkContext: SparkContext, batchDuration: Duration) = {
    this(sparkContext, null, batchDuration)
  }
  /**
   * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
   * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
  }
The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local[*]” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local[*]” to run Spark Streaming in-process (detects the number of cores in the local system). Note that this internally creates a SparkContext (starting point of all Spark functionality) which can be accessed as ssc.sparkContext.
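A minimal creation sketch, assuming a hypothetical application name and a 1-second batch interval (both chosen only for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical app name; "local[*]" uses all local cores, suitable for testing.
val conf = new SparkConf().setAppName("MyStreamingApp").setMaster("local[*]")

// 1-second batch interval, chosen only for illustration.
val ssc = new StreamingContext(conf, Seconds(1))

// The SparkContext created internally is reachable as:
val sc = ssc.sparkContext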

The batch interval must be set based on the latency requirements of your application and the available cluster resources. See the Performance Tuning section for more details.

After a context is defined, you have to do the following (a minimal code skeleton is sketched right after these steps).

Define the input sources by creating input DStreams.
Define the streaming computations by applying transformation and output operations to DStreams.
Start receiving data and processing it using streamingContext.start().
Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
The processing can be manually stopped using streamingContext.stop().
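A minimal skeleton of these steps, reusing the ssc created earlier; the host, port and word-count logic below are placeholders, not prescribed by the steps:

// 1. Define the input source by creating an input DStream (hypothetical host/port).
val lines = ssc.socketTextStream("localhost", 9999)

// 2. Define the streaming computation with transformations and an output operation.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

// 3. Start receiving data and processing it.
ssc.start()

// 4. Wait for the processing to be stopped, manually or due to an error.
ssc.awaitTermination()

// The processing can also be stopped manually, e.g. from another thread:
// ssc.stop()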

DStreams (Discretized Streams)

A DStream is represented by a continuous series of RDDs.

Each RDD in a DStream contains data from a certain interval.

Applying an operator such as map or flatMap to a DStream is, under the hood, translated into applying the same operation to every RDD inside the DStream, because a DStream is made up of the RDDs of its successive batches.
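As a rough illustration, applying flatMap directly to a DStream gives the same result as applying flatMap to each batch RDD through transform (the input DStream lines is assumed to exist already):

// Per-DStream operator: flatMap runs on every RDD (every batch) of `lines`.
val words1 = lines.flatMap(_.split(" "))

// Roughly equivalent: transform exposes each batch RDD, and the same RDD operation is applied to it.
val words2 = lines.transform(rdd => rdd.flatMap(_.split(" ")))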

 

Input DStreams and Receivers

Every input DStream (except file stream, discussed later in this section) is associated with a Receiver object which receives the data from a source and stores it in Spark's memory for processing.

(The exception is the file stream, which does not require a receiver; it is discussed later in this section.)

  /**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @see [[socketStream]]
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

// socketTextStream above returns a ReceiverInputDStream, the base class for input streams that run a receiver:
abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)
  extends InputDStream[T](_ssc) {

// textFileStream, in contrast, is defined on StreamingContext and does not need a receiver:
 /**
   * Create an input stream that monitors a Hadoop-compatible filesystem
   * for new files and reads them as text files (using key as LongWritable, value
   * as Text and input format as TextInputFormat). Files must be written to the
   * monitored directory by "moving" them from another location within the same
   * file system. File names starting with . are ignored.
   * @param directory HDFS directory to monitor for new file
   */
  def textFileStream(directory: String): DStream[String] = withNamedScope("text file stream") {
    fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
  }
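A usage sketch of textFileStream, assuming a hypothetical monitored HDFS directory:

// Files moved into this directory after the stream starts are read as text, one batch at a time.
val fileLines = ssc.textFileStream("hdfs://namenode:8020/user/streaming/input")
fileLines.print()

Since no receiver is involved, a file stream does not reserve one of the cores discussed below for receiver-based streams.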
Spark Streaming provides two categories of built-in streaming sources.

Basic sources: Sources directly available in the StreamingContext API. Examples: file systems and socket connections.
Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.
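For example, the Kafka integration for Spark 2.x lives in a separate artifact; a build.sbt line might look like the following (the artifact and version must match your Spark and Kafka versions, so treat them as assumptions):

// build.sbt (illustrative; adjust to your Spark build)
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.8"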
When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. 
Either of these means that only one thread will be used for running tasks locally. 
If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, 
leaving no thread for processing the received data. 
Hence, when running locally, always use “local[n]” as the master URL,
 where n > number of receivers to run (see Spark Properties for information on how to set the master).

Extending the logic to running on a cluster, 
the number of cores allocated to the Spark Streaming application must be more than the number of receivers.
 Otherwise the system will receive data, but not be able to process it.

Transformations on DStreams

Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. Some of the common ones are shown in the sketch below.
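The sketch assumes a DStream of text lines named lines (e.g. from socketTextStream); all of these are standard DStream transformations:

val words    = lines.flatMap(_.split(" "))   // one line -> many words
val pairs    = words.map(word => (word, 1))  // transform each element
val longOnes = words.filter(_.length > 3)    // keep only elements matching a predicate
val counts   = pairs.reduceByKey(_ + _)      // aggregate by key within each batch
val merged   = words.union(longOnes)         // merge two DStreams
val total    = words.count()                 // DStream of per-batch element counts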

Output Operations on DStreams

Output operations allow a DStream's data to be pushed out to external systems like a database or a file system.
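Common output operations include print(), saveAsTextFiles(), and foreachRDD(). A short sketch, assuming counts is a pair DStream of (word, count) as above; the HDFS output prefix is hypothetical:

counts.print()  // print the first elements of every batch on the driver

// One output directory is generated per batch, named from this prefix plus the batch time.
counts.saveAsTextFiles("hdfs://namenode:8020/user/streaming/out/wordcount")

// Arbitrary per-batch logic, e.g. writing to an external database.
counts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // open one connection per partition and write the records here (details omitted)
  }
}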

 

Processing socket data with Spark Streaming

package com.rachel

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one core for the socket receiver, one for processing the received data
    val sparkConf = new SparkConf().setAppName("StatefulWordCount")
      .setMaster("local[2]")
    // 5-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Receive '\n'-delimited UTF-8 text over TCP from 192.168.1.6:1111
    val lines = ssc.socketTextStream("192.168.1.6", 1111)

    // Split each line into words and count them within each batch
    // (no state is carried across batches despite the object name)
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
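To try this locally, a tool such as netcat can feed the socket, e.g. running nc -lk 1111 on the host the program connects to and typing words into that terminal; each 5-second batch then prints its word counts. The address 192.168.1.6 above is specific to the author's environment. Note that carrying counts forward across batches would require updateStateByKey and a checkpoint directory.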

 

 
