Take the following code as an example (SocketInputDStream):
The code that reads data from a socket in Spark Streaming lives in SocketReceiver's receive method. Setting failure handling aside (the Receiver has a reconnection mechanism, the restart method; by default, after the Receiver dies it re-establishes the socket connection two seconds later, a delay controlled by spark.streaming.receiverRestartDelay), each record read is persisted by calling store(...). Following the data flow requires answering four questions (a minimal driver sketch follows the list):
1. Where is the data stored?
2. In what structure is the data stored?
3. When is the data read?
4. How is the data received within one batch interval turned into an RDD?
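Before stepping through the receiver, it helps to see where SocketInputDStream comes from in the first place. The following minimal driver sketch (host localhost, port 9999, and the 2-second batch interval are placeholder values, not from the article) shows the entry point: StreamingContext.socketTextStream creates a SocketInputDStream, and the SocketReceiver analyzed below runs on an executor on its behalf.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")
    val ssc = new StreamingContext(conf, Seconds(2)) // batch interval: 2 seconds

    // socketTextStream returns a SocketInputDStream[String]; its SocketReceiver
    // connects to localhost:9999 and feeds records through the path traced below.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}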
1. SocketReceiver#receive
/** Create a socket connection and receive data until receiver is stopped */
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    logInfo("Stopped receiving")
    restart("Retrying connecting to " + host + ":" + port)
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case t: Throwable =>
      restart("Error receiving data", t)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
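A note on bytesToObjects: for socketTextStream this is SocketReceiver.bytesToLines, which wraps the socket's InputStream in a BufferedReader and exposes the '\n'-delimited lines as a lazy iterator, so the while loop above blocks on the socket until the next line arrives. A simplified, self-contained equivalent (a sketch of the idea, not Spark's exact implementation):

import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

def bytesToLines(in: InputStream): Iterator[String] = {
  val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
  // readLine() blocks until a full line arrives and returns null once the
  // peer closes the connection, which terminates the iterator.
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}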
2. SocketReceiver#receive => SocketReceiver#store
/**
 * Store a single item of received data to Spark's memory.
 * These single items will be aggregated together into data blocks before
 * being pushed into Spark's memory.
 */
def store(dataItem: T) {
  executor.pushSingle(dataItem)
}
Data storage is one of the executor's responsibilities: store delegates to the executor's pushSingle operation (executor here is the receiver's ReceiverSupervisor). "Single" can be understood as a single read, and dataItem is the object produced by that read.
3. SocketReceiver#store => executor.pushSingle (ReceiverSupervisorImpl#pushSingle)
/** Push a single record of received data into block generator. */
def pushSingle(data: Any) {
  blockGenerator.addData(data)
}
At this point the data has landed in the blockGenerator data structure. blockGenerator, of type BlockGenerator, is, as the name suggests, a block generator: at a fixed interval (200 ms by default; in the source, private val blockInterval, read from the spark.streaming.blockInterval configuration) Spark Streaming packages the records accumulated since the last tick into a Block, and a dedicated block-pushing thread writes those blocks into the BlockManager (see the sketch below).
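The pattern can be condensed into the following sketch (a toy reconstruction, not Spark's actual BlockGenerator, which drives the tick with a RecurringTimer and synchronizes more carefully): a timer swaps the receive buffer out every blockInterval, and a dedicated pushing thread drains completed blocks toward storage.

import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

class MiniBlockGenerator(blockIntervalMs: Long, pushBlock: ArrayBuffer[Any] => Unit) {
  @volatile private var currentBuffer = new ArrayBuffer[Any]
  private val blocksForPushing = new ArrayBlockingQueue[ArrayBuffer[Any]](10)
  @volatile private var stopped = false

  // Counterpart of BlockGenerator.addData: each received record lands in the buffer.
  def addData(data: Any): Unit = synchronized { currentBuffer += data }

  // Runs once per blockIntervalMs: swap the buffer out and enqueue it as one block.
  private def updateCurrentBuffer(): Unit = synchronized {
    if (currentBuffer.nonEmpty) {
      blocksForPushing.put(currentBuffer)
      currentBuffer = new ArrayBuffer[Any]
    }
  }

  private val timerThread = new Thread {
    override def run(): Unit =
      while (!stopped) { Thread.sleep(blockIntervalMs); updateCurrentBuffer() }
  }

  // Counterpart of the block-pushing thread: hands each completed block to
  // the storage layer (in Spark, ultimately the BlockManager).
  private val pushingThread = new Thread {
    override def run(): Unit =
      while (!stopped || !blocksForPushing.isEmpty) {
        val block = blocksForPushing.poll(100, TimeUnit.MILLISECONDS)
        if (block != null) pushBlock(block)
      }
  }

  def start(): Unit = { timerThread.start(); pushingThread.start() }
  def stop(): Unit = { updateCurrentBuffer(); stopped = true } // flush leftovers
}

Here pushBlock stands in for what ReceiverSupervisorImpl actually does with a finished block: store it through the BlockManager and report the block's metadata to the ReceiverTracker on the driver.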

To recap: this article traced how Spark Streaming receives data through SocketReceiver, from store into the BlockGenerator and on to the BlockManager. The BlockGenerator produces a Block at a fixed interval and its block-pushing thread writes each one into the BlockManager, while the ReceiverTracker keeps track of the block metadata to ensure the data is processed completely and consistently.