Spark Streaming: Monitoring an HDFS Input Stream

This post shows how to use Spark Streaming to monitor HDFS for newly added data files in real time. The code example also walks through recovering a StreamingContext from a checkpoint directory and handling a common error that occurs on restart.

The application uses Spark Streaming to watch, in real time, for data written into an HDFS directory.

Below is the annotated code, followed by the source of the key library methods it relies on:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SourceHdfs {

  /**
    * Definition HDFS checkpoint directory address
    */
  val checkpointDirectory = "hdfs://master:9000/sparkStreaming/Checkpoint_Data"

  def main(args: Array[String]): Unit = {

    /**
      * Either recreate a StreamingContext from checkpoint data or create a new StreamingContext.
      */
    val context = StreamingContext.getOrCreate(checkpointDirectory, createContext _)

    /**
      * Configure inputDStream source that HDFS address
      * No Receiver, SparkStreaming application monitor batch by timer
      */
    val DStream = context.textFileStream("hdfs://master:9000/quality/clipper_erp/2017-07-11")

    val wordCount = DStream.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_+_)

    wordCount.print()

    context.start()
    context.awaitTermination()

  }

  /**
    * Create spark streamingContext function for getOrCreate method
    */
  def createContext(): StreamingContext = {

    val conf = new SparkConf()
      .setAppName("HDFSInputData")
      .setMaster("spark://master:7077")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDirectory)
    ssc
  }

}

The complete code is available at: https://github.com/DragonTong/Streaming/blob/master/src/main/scala/streaming/SourceHdfs.scala
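One practical note (my addition, not from the original post; the file name below is hypothetical): textFileStream only processes files that appear in the monitored directory after the application has started, so to see any word counts you need to copy a fresh file into that directory while the job is running, for example:

hdfs dfs -put words.txt hdfs://master:9000/quality/clipper_erp/2017-07-11/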

1. checkpoint: persisting DStream state for fault tolerance

  /**
   * Set the context to periodically checkpoint the DStream operations for driver
   * fault-tolerance.
   * @param directory HDFS-compatible directory where the checkpoint data will be reliably stored.
   *                  Note that this must be a fault-tolerant file system like HDFS.
   */
  def checkpoint(directory: String) {
    if (directory != null) {
      val path = new Path(directory)
      val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
      fs.mkdirs(path)
      val fullPath = fs.getFileStatus(path).getPath().toString
      sc.setCheckpointDir(fullPath)
      checkpointDir = fullPath
    } else {
      checkpointDir = null
    }
  }

2. getOrCreate: creating or recovering the StreamingContext

 /**
   * Either recreate a StreamingContext from checkpoint data or create a new StreamingContext.
   * If checkpoint data exists in the provided `checkpointPath`, then StreamingContext will be
   * recreated from the checkpoint data. If the data does not exist, then the StreamingContext
   * will be created by calling the provided `creatingFunc`.
   *
   * @param checkpointPath Checkpoint directory used in an earlier StreamingContext program
   * @param creatingFunc   Function to create a new StreamingContext
   * @param hadoopConf     Optional Hadoop configuration if necessary for reading from the
   *                       file system
   * @param createOnError  Optional, whether to create a new StreamingContext if there is an
   *                       error in reading checkpoint data. By default, an exception will be
   *                       thrown on error.
   */
  def getOrCreate(
      checkpointPath: String,
      creatingFunc: () => StreamingContext,
      hadoopConf: Configuration = SparkHadoopUtil.get.conf,
      createOnError: Boolean = false
    ): StreamingContext = {
    val checkpointOption = CheckpointReader.read(
      checkpointPath, new SparkConf(), hadoopConf, createOnError)
    checkpointOption.map(new StreamingContext(null, _, null)).getOrElse(creatingFunc())
  }
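
As a hedged illustration of the signature above (not part of the original post): hadoopConf has a default, and createOnError can be set to true so that an unreadable checkpoint falls back to the creating function instead of throwing:

  val context = StreamingContext.getOrCreate(
    checkpointDirectory,    // same directory passed to ssc.checkpoint(...)
    createContext _,        // only invoked when no usable checkpoint exists
    createOnError = true)   // fall back to createContext() if the checkpoint cannot be read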

Compile the program and run it on the cluster:
spark-submit --class streaming.SourceHdfs --master spark://master:7077 spark.streaming.pro-1.0-SNAPSHOT.jar

Note: the second run fails with the exception below; the workaround is to delete the contents of the Checkpoint_Data directory. I will fix this bug properly after digging deeper into how checkpointing works (a sketch of the likely fix follows the stack trace).

The error output:

Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ShuffledDStream@321ca237 has not been initialized
    at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:312)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
    at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
	at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
    at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:234)
	at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:229)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:229)
    at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:98)
    at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:102)
    at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:583)
	at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:578)
    at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:578)
	at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
	at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:578)
	at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:572)
	at streaming.SourceHdfs$.main(SourceHdfs.scala:21)
	at streaming.SourceHdfs.main(SourceHdfs.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/07/11 19:19:27 INFO spark.SparkContext: Invoking stop() from shutdown hook
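
The likely root cause (my reading, not stated in the original post): when StreamingContext.getOrCreate restores a context from the checkpoint, the DStream operations defined afterwards in main (textFileStream, flatMap, reduceByKey, print) are not part of the recovered graph, so the recovered ShuffledDStream reports "has not been initialized". A minimal sketch of the usual remedy, building the whole DStream graph inside createContext:

  // Sketch only (assumption, not the author's final fix): define the entire
  // DStream graph inside createContext so a context recovered from the
  // checkpoint already carries these operations.
  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("HDFSInputData")
      .setMaster("spark://master:7077")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDirectory)

    val lines = ssc.textFileStream("hdfs://master:9000/quality/clipper_erp/2017-07-11")
    val wordCount = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCount.print()

    ssc
  }

  // main then only recovers (or creates) the context and starts it:
  //   val context = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
  //   context.start()
  //   context.awaitTermination()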