Spark Streaming: Miscellaneous Notes

This post takes a close look at a number of the smaller components and the time abstractions in Spark Streaming, including utilities such as ContextWaiter, RateLimitedOutputStream, and RawTextSender. It also covers how WriteAheadLogManager manages log files, along with the WriteAheadLogWriter and WriteAheadLogReader classes it uses. Finally, it describes how components such as Receiver and RateLimiter work and what they are for, giving developers an overall map of this part of Spark Streaming.


@(spark)

ContextWaiter

Essentially just a lock: StreamingContext uses it to block (for example in awaitTermination()) until the context is stopped or an error is reported.

Time & Duration & Interval

Simple wrappers around the notion of time: Time is a millisecond timestamp, Duration a length of time, and Interval a (begin, end) pair of Times.
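
A small illustration of the public pieces (a minimal sketch; the exact operator set varies slightly across Spark versions):

```scala
import org.apache.spark.streaming.{Duration, Minutes, Seconds, Time}

object TimeDemo {
  def main(args: Array[String]): Unit = {
    val batchInterval: Duration = Seconds(10)      // 10 000 ms
    val window: Duration = Minutes(1)              // 60 000 ms

    // Durations expose their length and simple checks.
    println(window.milliseconds)                   // 60000
    println(window.isMultipleOf(batchInterval))    // true

    // Time is just a wrapper around a millisecond timestamp.
    val t = Time(System.currentTimeMillis())
    val next: Time = t + batchInterval             // shift a timestamp by a Duration
    println(next - t)                              // difference of two Times is a Duration
    println(t.floor(batchInterval))                // align down to a batch boundary
  }
}
```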

Utils

HdfsUtils

A thin wrapper around reading files on HDFS.

RawTextHelper

Simple text-based operations such as split, top-k and so on. Nothing fancy; just plain utility functions.

RateLimitedOutputStream

Extends java.io.OutputStream.
The basic logic: as long as the actual send rate is still below the requested rate, keep writing; once it gets ahead of the requested rate, sleep long enough to bring the average throughput back down to the target.
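
The same idea as a self-contained sketch (this is not Spark's RateLimitedOutputStream; the ThrottledOutputStream name and its bookkeeping are made up for illustration):

```scala
import java.io.OutputStream

/** Illustrative only: throttle writes so long-run throughput stays near bytesPerSec. */
class ThrottledOutputStream(out: OutputStream, bytesPerSec: Int) extends OutputStream {
  private val startTime = System.nanoTime()
  private var bytesWritten = 0L

  override def write(b: Int): Unit = {
    waitIfNeeded(1)
    out.write(b)
    bytesWritten += 1
  }

  override def write(bytes: Array[Byte], off: Int, len: Int): Unit = {
    waitIfNeeded(len)
    out.write(bytes, off, len)
    bytesWritten += len
  }

  /** If we are ahead of schedule, sleep until the desired rate is restored. */
  private def waitIfNeeded(len: Int): Unit = {
    val elapsedSec = (System.nanoTime() - startTime) / 1e9
    val targetSec = (bytesWritten + len).toDouble / bytesPerSec
    val sleepMs = ((targetSec - elapsedSec) * 1000).toLong
    if (sleepMs > 0) Thread.sleep(sleepMs)
  }

  override def flush(): Unit = out.flush()
  override def close(): Unit = out.close()
}
```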

RawTextSender

A helper program that sends blocks of Kryo-serialized text strings out on a socket at a specified rate. Used to feed data into RawInputDStream.
Note that this is a standalone program.

WriteAheadLogManager

/**                                                                                                                                                                     
 * This class manages write ahead log files.                                                                                                                            
 * - Writes records (bytebuffers) to periodically rotating log files.                                                                                                   
 * - Recovers the log files and the reads the recovered records upon failures.                                                                                          
 * - Cleans up old log files.                                                                                                                                           
 *                                                                                                                                                                      
 * Uses [[org.apache.spark.streaming.util.WriteAheadLogWriter]] to write                                                                                                
 * and [[org.apache.spark.streaming.util.WriteAheadLogReader]] to read.                                                                                                 
 *                                                                                                                                                                      
 * @param logDirectory Directory when rotating log files will be created.                                                                                               
 * @param hadoopConf Hadoop configuration for reading/writing log files.                                                                                                
 * @param rollingIntervalSecs The interval in seconds with which logs will be rolled over.                                                                              
 *                            Default is one minute.                                                                                                                    
 * @param maxFailures Max number of failures that is tolerated for every attempt to write to log.                                                                       
 *                    Default is three.                                                                                                                                 
 * @param callerName Optional name of the class who is using this manager.                                                                                              
 * @param clock Optional clock that is used to check for rotation interval.                                                                                             
 */                                                                                                                                                                     
private[streaming] class WriteAheadLogManager(  

The logDir is generated as follows:

  def checkpointDirToLogDir(checkpointDir: String, streamId: Int): String = {                                                                                           
    new Path(checkpointDir, new Path("receivedData", streamId.toString)).toString                                                                                       
  }   

WriteAheadLogWriter

/**                                                                                                                                                                     
 * A writer for writing byte-buffers to a write ahead log file.                                                                                                         
 */                                                                                                                                                                     
private[streaming] class WriteAheadLogWriter(path: String, hadoopConf: Configuration)                                                                                   
  extends Closeable {     

WriteAheadLogReader

/**                                                                                                                                                                     
 * A reader for reading write ahead log files written using                                                                                                             
 * [[org.apache.spark.streaming.util.WriteAheadLogWriter]]. This reads                                                                                                  
 * the records (bytebuffers) in the log file sequentially and return them as an                                                                                         
 * iterator of bytebuffers.                                                                                                                                             
 */                                                                                                                                                                     
private[streaming] class WriteAheadLogReader(path: String, conf: Configuration)                                                                                         
  extends Iterator[ByteBuffer] with Closeable with Logging {  

WriteAheadLogBackedBlockRDD

/**                                                                                                                                                                     
 * This class represents a special case of the BlockRDD where the data blocks in                                                                                        
 * the block manager are also backed by segments in write ahead logs. For reading                                                                                       
 * the data, this RDD first looks up the blocks by their ids in the block manager.                                                                                      
 * If it does not find them, it looks up the corresponding file segment.                                                                                                
 *                                                                                                                                                                      
 * @param sc SparkContext                                                                                                                                               
 * @param blockIds Ids of the blocks that contains this RDD's data                                                                                                      
 * @param segments Segments in write ahead logs that contain this RDD's data                                                                                            
 * @param storeInBlockManager Whether to store in the block manager after reading from the segment                                                                      
 * @param storageLevel storage level to store when storing in block manager                                                                                             
 *                     (applicable when storeInBlockManager = true)                                                                                                     
 */                                                                                                                                                                     
private[streaming]                                                                                                                                                      
class WriteAheadLogBackedBlockRDD[T: ClassTag](                                                                                                                         
    @transient sc: SparkContext,                                                                                                                                        
    @transient blockIds: Array[BlockId],                                                                                                                                
    @transient segments: Array[WriteAheadLogFileSegment],                                                                                                               
    storeInBlockManager: Boolean,                                                                                                                                       
    storageLevel: StorageLevel)                                                                                                                                         
  extends BlockRDD[T](sc, blockIds) {   

Receiver

Abstract class of a receiver that can be run on worker nodes to receive external data. A custom receiver can be defined by implementing the functions onStart() and onStop(): onStart() should define the setup steps necessary to start receiving data, and onStop() should define the cleanup steps necessary to stop receiving data. Exceptions while receiving can be handled either by restarting the receiver with restart(…) or by stopping it completely with stop(…).

@DeveloperApi                                                                                                                                                           
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {   
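
The canonical pattern, following the Spark custom receiver guide, is to subclass Receiver and call store() from a background thread. A sketch (the SocketLineReceiver name and the host/port details are just illustrative):

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// onStart() launches a non-blocking thread, store() hands records to Spark,
// and restart() asks the supervisor to retry after an error.
class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // onStart() must not block: do the actual receiving in a separate thread.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: receive() checks isStopped() and exits on its own.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)                       // hand the record to Spark for storage
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      // Ask the supervisor to restart us so we reconnect when the server comes back.
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException => restart(s"Could not connect to $host:$port", e)
      case t: Throwable                 => restart("Error receiving data", t)
    }
  }
}
```

It would then be hooked up with ssc.receiverStream(new SocketLineReceiver("localhost", 9999)).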

RateLimiter

/** Provides waitToPush() method to limit the rate at which receivers consume data.                                                                                     
  *                                                                                                                                                                     
  * waitToPush method will block the thread if too many messages have been pushed too quickly,                                                                          
  * and only return when a new message has been pushed. It assumes that only one message is                                                                             
  * pushed at a time.                                                                                                                                                   
  *                                                                                                                                                                     
  * The spark configuration spark.streaming.receiver.maxRate gives the maximum number of messages                                                                       
  * per second that each receiver will accept.                                                                                                                          
  *                                                                                                                                                                     
  * @param conf spark configuration                                                                                                                                     
  */                                                                                                                                                                    
private[receiver] abstract class RateLimiter(conf: SparkConf) extends Logging {    
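
The gist of waitToPush() in a stand-alone, illustrative form (not Spark's actual implementation; SimpleRateLimiter and its one-second-window bookkeeping are made up for this sketch):

```scala
// Blocks the caller once maxRatePerSec messages have been pushed
// within the current one-second window; assumes one message per call.
class SimpleRateLimiter(maxRatePerSec: Long) {
  private var windowStart = System.currentTimeMillis()
  private var pushedInWindow = 0L

  def waitToPush(): Unit = synchronized {
    if (maxRatePerSec <= 0) return             // non-positive rate means "unlimited"
    val now = System.currentTimeMillis()
    if (now - windowStart >= 1000) {           // start a new one-second window
      windowStart = now
      pushedInWindow = 0
    }
    if (pushedInWindow >= maxRatePerSec) {     // window full: wait for the next one
      Thread.sleep(1000 - (now - windowStart))
      windowStart = System.currentTimeMillis()
      pushedInWindow = 0
    }
    pushedInWindow += 1
  }
}
```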

ReceiverSupervisor

/**                                                                                                                                                                     
 * Abstract class that is responsible for supervising a Receiver in the worker.                                                                                         
 * It provides all the necessary interfaces for handling the data received by the receiver.                                                                             
 */                                                                                                                                                                     
private[streaming] abstract class ReceiverSupervisor(                                                                                                                   
    receiver: Receiver[_],                                                                                                                                              
    conf: SparkConf                                                                                                                                                     
  ) extends Logging {      

ReceivedBlockHandler

/** Trait that represents a class that handles the storage of blocks received by receiver */                                                                            
private[streaming] trait ReceivedBlockHandler {                                                                                                                         

  /** Store a received block with the given block id and return related metadata */                                                                                     
  def storeBlock(blockId: StreamBlockId, receivedBlock: ReceivedBlock): ReceivedBlockStoreResult                                                                        

  /** Cleanup old blocks older than the given threshold time */                                                                                                         
  def cleanupOldBlocks(threshTime: Long)                                                                                                                                
}     

WriteAheadLogBasedBlockHandler

/**                                                                                                                                                                     
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which                                                                               
 * stores the received blocks in both, a write ahead log and a block manager.                                                                                           
 */                                                                                                                                                                     
private[streaming] class WriteAheadLogBasedBlockHandler(  
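
The key point is that the block goes to both places before the store is acknowledged. An illustrative sketch of that shape, using plain Futures and made-up parameter names rather than Spark's internal types:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Illustrative only: write the block to the block manager and to the write
// ahead log in parallel, and wait for both before returning the WAL handle.
object DualStoreSketch {
  def storeBoth(bytes: Array[Byte],
                storeInBlockManager: Array[Byte] => Unit,
                writeToLog: Array[Byte] => String)        // returns a WAL segment handle
               (implicit ec: ExecutionContext): String = {
    val inMemory = Future { storeInBlockManager(bytes) }  // fast path for local reads
    val inLog    = Future { writeToLog(bytes) }           // durable path for recovery
    Await.result(inMemory, 30.seconds)
    Await.result(inLog, 30.seconds)                       // handle becomes block metadata
  }
}
```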

ReceiverSupervisorImpl

/**                                                                                                                                                                     
 * Concrete implementation of [[org.apache.spark.streaming.receiver.ReceiverSupervisor]]                                                                                
 * which provides all the necessary functionality for handling the data received by                                                                                     
 * the receiver. Specifically, it creates a [[org.apache.spark.streaming.receiver.BlockGenerator]]                                                                      
 * object that is used to divide the received data stream into blocks of data.                                                                                          
 */                                                                                                                                                                     
private[streaming] class ReceiverSupervisorImpl(   

One of its core methods is:

  /** Store block and report it to driver */                                                                                                                            
  def pushAndReportBlock(  
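
Roughly speaking, it stores the block through the ReceivedBlockHandler shown above and then reports the resulting metadata to the driver's ReceiverTracker. A stand-alone outline with simplified, made-up types (not the real signature):

```scala
// Illustrative outline only; the real method lives inside ReceiverSupervisorImpl.
case class StreamBlockId(streamId: Int, uniqueId: Long)
case class ReceivedBlock(data: Seq[Any])
case class ReceivedBlockStoreResult(blockId: StreamBlockId)
case class ReceivedBlockInfo(streamId: Int, numRecords: Long, result: ReceivedBlockStoreResult)

trait BlockHandler { def storeBlock(id: StreamBlockId, b: ReceivedBlock): ReceivedBlockStoreResult }
trait DriverEndpoint { def addBlock(info: ReceivedBlockInfo): Boolean }

class SupervisorSketch(streamId: Int, handler: BlockHandler, driver: DriverEndpoint) {
  /** Store the block locally (block manager and/or WAL), then tell the driver about it. */
  def pushAndReportBlock(blockId: StreamBlockId, block: ReceivedBlock): Unit = {
    val result = handler.storeBlock(blockId, block)        // may write to BlockManager + WAL
    val info = ReceivedBlockInfo(streamId, block.data.size.toLong, result)
    if (!driver.addBlock(info)) {                           // driver records block for scheduling
      throw new RuntimeException(s"Driver refused block $blockId")
    }
  }
}
```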

BlockGenerator

/**                                                                                                                                                                     
 * Generates batches of objects received by a                                                                                                                           
 * [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately                                                                                    
 * named blocks at regular intervals. This class starts two threads,                                                                                                    
 * one to periodically start a new batch and prepare the previous batch of as a block,                                                                                  
 * the other to push the blocks into the block manager.                                                                                                                 
 */                                                                                                                                                                     
private[streaming] class BlockGenerator(                                                                                                                                
    listener: BlockGeneratorListener,                                                                                                                                   
    receiverId: Int,                                                                                                                                                    
    conf: SparkConf                                                                                                                                                     
  ) extends RateLimiter(conf) with Logging {  
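
The two-thread shape described in the comment, as a stand-alone sketch (TwoThreadBlockGenerator and its details are made up; the real class also wires in the RateLimiter and a listener):

```scala
import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// One thread seals the current buffer into a "block" on every tick,
// the other drains completed blocks and pushes them downstream.
class TwoThreadBlockGenerator[T](blockIntervalMs: Long, push: Seq[T] => Unit) {
  private var currentBuffer = new ArrayBuffer[T]
  private val blocksForPushing = new ArrayBlockingQueue[Seq[T]](10)
  private val timer = Executors.newSingleThreadScheduledExecutor()
  private val pusher = Executors.newSingleThreadExecutor()
  @volatile private var stopped = false

  def addData(data: T): Unit = synchronized { currentBuffer += data }

  def start(): Unit = {
    // Thread 1: every blockIntervalMs, turn the current buffer into a block.
    timer.scheduleAtFixedRate(new Runnable {
      def run(): Unit = {
        val block = TwoThreadBlockGenerator.this.synchronized {
          val b = currentBuffer; currentBuffer = new ArrayBuffer[T]; b
        }
        if (block.nonEmpty) blocksForPushing.put(block.toSeq)
      }
    }, blockIntervalMs, blockIntervalMs, TimeUnit.MILLISECONDS)

    // Thread 2: drain the queue and push each completed block.
    pusher.execute(new Runnable {
      def run(): Unit = while (!stopped) {
        val block = blocksForPushing.poll(100, TimeUnit.MILLISECONDS)
        if (block != null) push(block)
      }
    })
  }

  def stop(): Unit = { stopped = true; timer.shutdown(); pusher.shutdown() }
}
```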

ActorReceiver

A receiver built on top of Akka actors.
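
In the Spark 1.x line this was exposed roughly as below: an Akka actor mixes in ActorHelper and calls store(), and the stream is created with StreamingContext.actorStream. This API was deprecated and later moved out of Spark core, so treat the sketch as version-specific:

```scala
import akka.actor.{Actor, Props}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.receiver.ActorHelper

// The actor forwards each message it receives to Spark via store().
class LineForwardingActor extends Actor with ActorHelper {
  def receive = {
    case line: String => store(line)
  }
}

object ActorReceiverDemo {
  def wire(ssc: StreamingContext): Unit = {
    // actorStream creates an input DStream backed by the supervising ActorReceiver.
    val lines = ssc.actorStream[String](Props[LineForwardingActor], "LineForwardingActor")
    lines.count().print()
  }
}
```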
