Spark Streaming: Executor Fault Tolerance

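This article examines how the Write-Ahead Log (WAL) mechanism in Spark works, including how data is recorded in a log before it is operated on, and the roles that BlockManager and WriteAheadLogBasedBlockHandler play in storing data.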

Contents:

1 Executor WAL

Message replay

Other



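Main text:

The previous section mentioned two ways of storing received data. Both ensure that more than one copy of the data exists:

a) WriteAheadLogBasedBlockHandler

b) BlockManagerBasedBlockHandler

This chapter looks at WriteAheadLogBasedBlockHandler in detail.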

private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    // The WAL is written under the checkpoint directory, so one must be configured
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
If the WAL is enabled but no checkpoint directory has been specified, the code above throws an exception.
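The selection rule above can be modelled without Spark on the classpath. The following is a minimal sketch under stated assumptions: `HandlerSelection`, `HandlerChoice` and `chooseHandler` are hypothetical names invented for illustration, and the configuration is represented as a plain `Map`. Only the key `spark.streaming.receiver.writeAheadLog.enable` is taken from Spark itself.

```scala
// Simplified model of the handler selection; these are not Spark's classes.
object HandlerSelection {
  sealed trait HandlerChoice
  final case class WalBased(checkpointDir: String) extends HandlerChoice
  case object BlockManagerBased extends HandlerChoice

  // Spark configuration key that turns on the receiver write-ahead log.
  val ReceiverWalEnableKey = "spark.streaming.receiver.writeAheadLog.enable"

  // conf is a hypothetical stand-in for SparkConf.
  def chooseHandler(conf: Map[String, String], checkpointDir: Option[String]): HandlerChoice = {
    val walEnabled = conf.get(ReceiverWalEnableKey).exists(_.toBoolean)
    if (walEnabled) {
      checkpointDir match {
        case Some(dir) => WalBased(dir)
        case None =>
          // Same failure mode as the original: WAL requested, nowhere to write it.
          throw new IllegalStateException(
            "Cannot enable receiver write-ahead log without checkpoint directory set.")
      }
    } else {
      BlockManagerBased
    }
  }
}
```

Calling `chooseHandler` with the key set to `"true"` and no checkpoint directory fails immediately, which mirrors why a missing `streamingContext.checkpoint(...)` call surfaces as soon as the receiver starts.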

// BlockManagerBasedBlockHandler
private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords: Option[Long] = None

    // Each block type is handed to the matching BlockManager put method
    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        // The record count is only known after the iterator has been consumed
        val countIterator = new CountingIterator(iterator)
        val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,
          tellMaster = true)
        numRecords = countIterator.count
        putResult
      case ByteBufferBlock(byteBuffer) =>
        blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
      case o =>
        throw new SparkException(
          s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
    }
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
    BlockManagerBasedStoreResult(blockId, numRecords)
  }
}

The key work in this method is delegated to BlockManager: both blockManager.putIterator and blockManager.putBytes end up calling the doPut method in BlockManager. See the Spark Core analysis part of the DT Big Data IMF course.
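The IteratorBlock branch above reads the record count only after putIterator has returned, because a wrapped iterator cannot know its size until it has been drained. The class below is a stand-alone sketch of that idea, modelled loosely on Spark's CountingIterator. The accessor name `recordCount` is an assumption made here (the listing above calls `count`), and details may differ from the Spark version you are reading.

```scala
// Stand-alone sketch: wraps an iterator and counts the records pulled through it.
class CountingIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  private var seen = 0L

  override def hasNext: Boolean = underlying.hasNext

  override def next(): T = {
    val item = underlying.next()
    seen += 1
    item
  }

  // The count is only reported once the underlying iterator is exhausted;
  // before that, any number would be a partial count.
  def recordCount: Option[Long] = if (underlying.hasNext) None else Some(seen)
}
```

Wrapping `Iterator("a", "b", "c")` and asking for `recordCount` before consuming it yields `None`; after the consumer has drained it, the result is `Some(3)`.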

Next we look at the WAL approach, in which the data is recorded in a log before it is operated on.

We start with the following code, because storeBlock uses writeAheadLog.

// Write ahead log manages
private val writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
  conf, checkpointDirToLogDir(checkpointDir, streamId), hadoopConf)

private[streaming] object WriteAheadLogBasedBlockHandler {
  def checkpointDirToLogDir(checkpointDir: String, streamId: Int): String = {
    new Path(checkpointDir, new Path("receivedData", streamId.toString)).toString
  }
}

The checkpointDirToLogDir method in the WriteAheadLogBasedBlockHandler object builds the path under which the received data is logged.
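To make the resulting layout concrete, here is a small sketch that builds the same `<checkpointDir>/receivedData/<streamId>` shape with plain strings. `WalPaths` and `receiverLogDir` are hypothetical names, and the sketch does not reproduce the normalisation that Hadoop's `Path` performs.

```scala
object WalPaths {
  // Plain-string approximation of checkpointDirToLogDir; Hadoop's Path is not used here.
  def receiverLogDir(checkpointDir: String, streamId: Int): String = {
    val base = checkpointDir.stripSuffix("/")
    s"$base/receivedData/$streamId"
  }
}
```

Each receiver stream therefore gets its own subdirectory, keyed by its stream id, under the shared checkpoint directory.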

Next, the createLogForReceiver method, which simply delegates to createLog:

def createLogForReceiver(
    sparkConf: SparkConf,
    fileWalLogDirectory: String,
    fileWalHadoopConf: Configuration
  ): WriteAheadLog = {
  createLog(false, sparkConf, fileWalLogDirectory, fileWalHadoopConf)
}

private def createLog(
    isDriver: Boolean,
    sparkConf: SparkConf,
    fileWalLogDirectory: String,
    fileWalHadoopConf: Configuration
  ): WriteAheadLog = {

  // A user-supplied WAL class, if configured, takes precedence
  val classNameOption = if (isDriver) {
    sparkConf.getOption(DRIVER_WAL_CLASS_CONF_KEY)
  } else {
    sparkConf.getOption(RECEIVER_WAL_CLASS_CONF_KEY)
  }
  val wal = classNameOption.map { className =>
    try {
      instantiateClass(
        Utils.classForName(className).asInstanceOf[Class[_ <: WriteAheadLog]], sparkConf)
    } catch {
      case NonFatal(e) =>
        throw new SparkException(s"Could not create a write ahead log of class $className", e)
    }
  }.getOrElse {
    // Otherwise fall back to the file-based implementation
    new FileBasedWriteAheadLog(sparkConf, fileWalLogDirectory, fileWalHadoopConf,
      getRollingIntervalSecs(sparkConf, isDriver), getMaxFailures(sparkConf, isDriver),
      shouldCloseFileAfterWrite(sparkConf, isDriver))
  }
  if (isBatchingEnabled(sparkConf, isDriver)) {
    new BatchedWriteAheadLog(wal, sparkConf)
  } else {
    wal
  }
}

By default a FileBasedWriteAheadLog is created. It is not wrapped in a BatchedWriteAheadLog, because createLogForReceiver passes isDriver = false.
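The decision made by createLog can be summarised in a few lines. The sketch below is a simplified model, not Spark code: `WalChoice`, `Wal`, `choose`, `customClass` and `allowBatching` are hypothetical names, and the rule that batching applies only when isDriver is true is an assumption about isBatchingEnabled, whose body is not shown in the listing above.

```scala
// Simplified model of which write-ahead log implementation gets created.
object WalChoice {
  sealed trait Wal
  final case class Custom(className: String) extends Wal
  case object FileBased extends Wal
  final case class Batched(inner: Wal) extends Wal

  // customClass models the optional user-configured WAL class name.
  // allowBatching models the batching setting (assumed to be driver-only).
  def choose(isDriver: Boolean, customClass: Option[String], allowBatching: Boolean = true): Wal = {
    val base: Wal = customClass.map(Custom(_)).getOrElse(FileBased)
    if (isDriver && allowBatching) Batched(base) else base
  }
}
```

Under these assumptions, `choose(isDriver = false, customClass = None)` gives the plain file-based log, which is the receiver-side case discussed here.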

Now back to the storeBlock method of WriteAheadLogBasedBlockHandler:

def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

  var numRecords: Option[Long] = None
  // Serialize the block so that it can be inserted into both
  val serializedBlock = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.dataSerialize(blockId, arrayBuffer.iterator)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
      numRecords = countIterator.count
      serializedBlock
    case ByteBufferBlock(byteBuffer) =>
      byteBuffer
    case _ =>
      throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
  }

  // Store the block in block manager
  val storeInBlockManagerFuture = Future {
    val putResult =
      blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
  }

  // Store the block in write ahead log
  val storeInWriteAheadLogFuture = Future {
    writeAheadLog.write(serializedBlock, clock.getTimeMillis())
  }

  // Combine the futures, wait for both to complete, and return the write ahead log record handle
  val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
  val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
  WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
}

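The source comments and the logic of this code show that the block is stored in the BlockManager and written to the write-ahead log concurrently. storeInBlockManagerFuture stores the data in the BlockManager, and storeInWriteAheadLogFuture writes it to the WAL. Most importantly, the two futures are combined: the method waits for both to complete and then returns the walRecordHandle. The step after that is to send a message to ReceiverTrackerEndpoint.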
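The zip-then-await pattern is easier to see in isolation. The following is a minimal sketch using only the Scala standard library. `ParallelStore`, `storeInMemory` and `appendToLog` are hypothetical stand-ins for the BlockManager put and the WAL write, and the 30-second default timeout is an arbitrary value chosen for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelStore {
  // Hypothetical stand-in for the in-memory store; rejects empty blocks.
  def storeInMemory(data: Array[Byte]): Unit =
    require(data.nonEmpty, "empty block")

  // Hypothetical stand-in for the log write; returns a handle describing the record.
  def appendToLog(data: Array[Byte], time: Long): String =
    s"segment@$time:${data.length}"

  def store(data: Array[Byte], time: Long, timeout: FiniteDuration = 30.seconds): String = {
    // Both futures start running as soon as they are created.
    val memoryFuture = Future { storeInMemory(data) }
    val logFuture = Future { appendToLog(data, time) }
    // zip succeeds only if both succeed; keep just the log handle.
    val combined = memoryFuture.zip(logFuture).map(_._2)
    Await.result(combined, timeout)
  }
}
```

Because of `zip`, a failure in either write fails the whole call, so a handle is never returned for a block that only reached one of the two stores.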


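Summary:

  • When the WAL is enabled, the data is saved under the checkpoint directory; if no checkpoint directory is configured, an exception is thrown.
    First, WriteAheadLogBasedBlockHandler: once the WAL is enabled, the BlockManager no longer needs to keep replicas of the data. Replicating in addition to writing the WAL would be duplicated fault-tolerance work and would reduce performance.
    Next, BlockManagerBasedBlockHandler: it simply hands the data to the BlockManager, which stores it at the user-defined storage level. The usual default storage level is MEMORY_AND_DISK_SER_2; if data safety requirements are low, the replica can be dropped.
  • Message replay is a very efficient approach. When data is read through Kafka's Direct API, the offset range is computed first. If a job fails, the Kafka offset is reset from the consumed offsets and reading resumes from the point of failure. Kafka itself serves as the storage system here, much like HDFS.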
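The replay idea in the last bullet can be sketched with an in-memory log standing in for a Kafka topic partition. This is an illustration under stated assumptions, not the Kafka or Spark API: `OffsetReplay`, `ReplayableLog` and `runBatch` are hypothetical names, and offsets are plain integers.

```scala
import scala.util.control.NonFatal

object OffsetReplay {
  // In-memory stand-in for one partition of a replayable source.
  final class ReplayableLog(records: Vector[String]) {
    def read(fromOffset: Int, untilOffset: Int): Vector[String] =
      records.slice(fromOffset, untilOffset)
    def latestOffset: Int = records.length
  }

  // Processes the range [committed, latest) and returns the new committed offset.
  // If processing fails, the committed offset is returned unchanged, so the
  // next attempt reads exactly the same range again.
  def runBatch(log: ReplayableLog, committed: Int)(process: Vector[String] => Unit): Int = {
    val until = log.latestOffset
    try {
      process(log.read(committed, until))
      until
    } catch {
      case NonFatal(_) => committed
    }
  }
}
```

No copy of the data is kept on the Spark side in this scheme; recovery relies entirely on the source retaining the records for the offsets that have not been committed yet.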



    ---------------------------------EOF------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    Note: this article is based on the DT Big Data custom class and also draws on excellent blog posts by other students.




