How-to: efficiently store Kafka data into HDFS via Spark Streaming

This post describes an improved Spark Streaming approach: a custom consumer collects data every few minutes and produces one file per hour, reducing wasted HDFS storage. Kafka topic data is collected into memory in real time, written to files named by timestamp, and then stored in HDFS for efficient storage and management. With suitable parameters and configuration the data ends up organized and quick to access, which suits large-scale real-time data processing.


This is an improvement to

How-to: make spark streaming collect data from Kafka topics and store data into hdfs

In that post, Spark Streaming generates a part-000* file for each line of log, which wastes HDFS storage, since a single log line can be far smaller than a block. Here is a new consumer that collects data every few minutes and writes one file per hour.

import java.util.Properties

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import java.nio.charset.Charset
import java.util.Calendar
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.FSDataOutputStream
import org.apache.spark.rdd.RDD

object KafkaWriteHDFSConsumer {

  def main(args: Array[String]) {
    if (args.length < 6) {
      System.err.println("Usage: KafkaWriteHDFSConsumer <zkQuorum> <group> <topics> <numThreads> <output> <time>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads, output, time) = args
    val sparkConf = new SparkConf().setAppName("KafkaWriteHDFSConsumer")
    val ssc = new StreamingContext(sparkConf, Minutes(time.toLong))
    ssc.checkpoint("checkpoint")

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // createStream returns (key, message) pairs; keep only the message value.
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

    lines.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val config = new Configuration()
        config.addResource(new Path(System.getenv("HADOOP_HOME") + "/etc/hadoop/core-site.xml"))
        config.addResource(new Path(System.getenv("HADOOP_HOME") + "/etc/hadoop/hdfs-site.xml"))
        val fs = FileSystem.get(config)
        // Interesting here: the file name cannot be fixed once at start-up. Because of
        // Spark's lazy evaluation this block runs when each batch is processed, so the
        // date/hour in the name follows the wall clock and a new file starts every hour.
        val now = Calendar.getInstance()
        val day = s"${now.get(Calendar.YEAR)}-${now.get(Calendar.MONTH) + 1}-${now.get(Calendar.DATE)}"
        val filenamePath = new Path(s"$output/$day/$day-${now.get(Calendar.HOUR_OF_DAY)}.txt")
        // Create the hourly file on its first batch, append on later batches of the same hour.
        val fin = if (!fs.exists(filenamePath)) fs.create(filenamePath) else fs.append(filenamePath)
        // collect() brings the batch to the driver so a single writer produces one file per hour.
        // Write raw UTF-8 bytes; writeUTF would prefix every record with a 2-byte length field.
        rdd.collect().foreach { string =>
          fin.write((string + "\n").getBytes("UTF-8"))
        }
        fin.close()
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }

}
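
A note on the file names: the Calendar fields above are not zero-padded, so the listing shows names like 2015-6-13-15.txt, which do not always sort lexicographically in time order (hour 9 sorts after hour 15). If that matters, the same naming scheme can be written with SimpleDateFormat and padded patterns. The helper below is only a sketch of that idea; HourlyPath and forNow are names made up for the illustration, not part of the consumer above.

import java.text.SimpleDateFormat
import java.util.Date

object HourlyPath {
  // "yyyy-MM-dd" and "HH" are zero-padded, so hourly files list in time order.
  private val dayFmt = new SimpleDateFormat("yyyy-MM-dd")
  private val hourFmt = new SimpleDateFormat("yyyy-MM-dd-HH")

  // Build "<output>/<day>/<day-hour>.txt" for the current wall-clock time,
  // mirroring the path built inside foreachRDD above.
  def forNow(output: String): String = {
    val now = new Date()
    output + "/" + dayFmt.format(now) + "/" + hourFmt.format(now) + ".txt"
  }
}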


You can save this file as examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWriteHDFSConsumer.scala in the Spark project and build a new spark-examples jar with "mvn -Pyarn -DskipTests clean package" in the examples directory.

Run it with spark-submit:
${SPARK_HOME}/bin/spark-submit --master yarn-cluster --class org.apache.spark.examples.streaming.KafkaWriteHDFSConsumer ${SPARK_HOME}/lib/spark-examples-1.3.0-cdh5.4.1-hadoop2.6.0-cdh5.4.1.jar zk_node:2181 hdfs-consumer topics 1 output 10
The data is then stored in HDFS as:
[hadoop@master01 ~]$ hadoop fs -ls /user/chenfangfang/pinback/2015-6-13/
Found 2 items
-rw-r--r--   3 chenfangfang supergroup  127147607 2015-07-13 15:59 /user/chenfangfang/pinback/2015-6-13/2015-6-13-15.txt
-rw-r--r--   3 chenfangfang supergroup    6190824 2015-07-13 16:02 /user/chenfangfang/pinback/2015-6-13/2015-6-13-16.txt
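
Since each hourly file is plain UTF-8 text, it can be read straight back into Spark for batch processing. A minimal check from spark-shell (sc is the SparkContext the shell provides, and the path is one of the files listed above):

val hour = sc.textFile("/user/chenfangfang/pinback/2015-6-13/2015-6-13-15.txt")
println("records in this hour: " + hour.count())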
