How do you append data to an HDFS directory with Spark?

This article shows how to append data to an HDFS directory with Spark by customizing the OutputFormat. It walks through the HDFS read APIs on SparkContext and the output APIs on PairRDDFunctions, in particular saveAsHadoopFile, and achieves append-style output by subclassing FileOutputFormat and modifying the key pieces of code.


When processing data stored on HDFS with Spark, we often need to keep writing into the same output directory, for example landing each hour's batch of Kafka data into a single HDFS directory. There are many ways to meet this requirement; what we explore here is how to do it by modifying Spark's data output component. A sketch of where we are heading follows.
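Before walking through the relevant Spark APIs, here is a minimal sketch of the end result, assuming Hadoop's old `mapred` API: a custom OutputFormat that skips the existing-directory check and gives every run's part files a unique prefix, plugged in through `saveAsHadoopFile`. The class name `AppendTextOutputFormat`, the run-id prefix, and the HDFS path are illustrative assumptions, not the article's exact code:

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical OutputFormat: lets repeated jobs write into one directory by
// (1) disabling the "output directory already exists" check and
// (2) prefixing each run's part files with a per-run id so nothing is overwritten.
class AppendTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {

  private val runId = System.currentTimeMillis()

  // Default names are "part-00000", "part-00001", ...; make them unique per run.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"$runId-$name"

  // FileOutputFormat.checkOutputSpecs throws FileAlreadyExistsException when the
  // output directory exists; relax it so the same directory can be reused.
  override def checkOutputSpecs(ignored: FileSystem, job: JobConf): Unit = ()
}

object AppendToHdfsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("append-to-hdfs"))

    // Replace with the real data source (e.g. the batch pulled from Kafka each hour).
    val rdd = sc.parallelize(Seq("a", "b", "c"))
      .map(line => (NullWritable.get(), new Text(line)))

    // saveAsHadoopFile is where the custom OutputFormat is plugged in; each run
    // adds new part files under the same directory instead of failing.
    rdd.saveAsHadoopFile(
      "hdfs://namenode:8020/data/output",   // hypothetical namenode address and path
      classOf[NullWritable],
      classOf[Text],
      classOf[AppendTextOutputFormat])

    sc.stop()
  }
}

Strictly speaking this does not append to existing files; it keeps adding new, uniquely named files under the same directory, which is usually what the "land every hour into one directory" requirement calls for.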

1.1 The SparkContext class provides several APIs for reading files from HDFS, as shown in the code below:

/**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   * @param path path to the text file on a supported file system
   * @param minPartitions suggested minimum number of partitions for the resulting RDD
   * @return RDD of lines of the text file
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

  /**
   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
   * key-value pair, where the key is the path of each file, the value is the content of each file.
   *
   * <p> For example, if you have the following files:
   * {{{
   *   hdfs://a-hdfs-path/part-00000
   *   hdfs://a-hdfs-path/part-00001
   *   ...
   *   hdfs://a-hdfs-path/part-nnnnn
   * }}}
   *
   * Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
   *
   * <p> then `rdd` contains
   * {{{
   *   (a-hdfs-path/part-00000, its content)
   *   (a-hdfs-path/part-00001, its content)
   *   ...
   *   (a-hdfs-path/part-nnnnn, its content)
   * }}}
   *
   * @note Small files are preferred, large file is also allowable, but may cause bad performance.
   * @note On some filesystems, `.../path/&#42;` can be a more efficient way to read all files
   *       in a directory rather than `.../path/` or `.../path`
   * @note Partitioning is determined by data locality. This may result in too few partitions
   *       by default.
   *
   * @param path Directory to the input data files, the path can be comma separated paths as the
   *             list of inputs.
   * @param minPartitions A suggestion value of the minimal splitting number for input data.
   * @return RDD representing tuples of file path and the corresponding file content
   */
  def wholeTextFiles(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
    assertNotStopped()
    val job = NewHadoopJob.getInstance(hadoopConfiguration)
    // Use setInputPaths so that wholeTextFiles aligns with hadoopFile/textFile in taking
    // comma separated files as input. (see SPARK-7155)
    NewFileInputFormat.setInputPaths(job, path)
    val updateConf = job.getConfiguration
    new WholeTextFileRDD(
      this,
      classOf[WholeTextFileInputFormat],
      classOf[Text],
      classOf[Text],
      updateConf,
      minPartitions).map(record => (record._1.toString, record._2.toString)).setName(path)
  }

  /**
   * Get an RDD for a Hadoop-readable dataset as PortableDataStream for each file
   * (useful for binary data)
   *
   * For example, if you have the following files:
   * {{{
   *   hdfs://a-hdfs-path/part-00000
   *   hdfs://a-hdfs-path/part-00001
   *   ...
   *   hdfs://a-hdfs-path/part-nnnnn
   * }}}
   *
   * Do
   * `val rdd = sparkContext.binaryFiles("hdfs://a-hdfs-path")`,
   *
   * then `rdd` contains
   * {{{
   *   (a-hdfs-path/part-00000, its content)
   *   (a-hdfs-path/part-00001, its content)
   *   ...
   *   (a-hdfs-path/part-nnnnn, its content)
   * }}}
   *
   * @note Small files are preferred; very large files may cause bad performance.
   * @note On some filesystems, `.../path/&#42;` can be a more efficient way to read all files
   *       in a directory rather than `.../path/` or `.../path`
   * @note Partitioning is determined by data locality. This may result in too few partitions
   *       by default.
   *
   * @param path Directory to the input data files, the path can be comma separated paths as the
   *             list of inputs.
   * @param minPartitions A suggestion value of the minimal splitting number for input data.
   * @return RDD representing tuples of file path and corresponding file content
   */
  def binaryFiles(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[(String, PortableDataStream)] = withScope {
    assertNotStopped()
    val job = NewHadoopJob.getInstance(hadoopConfiguration)
    // Use setInputPaths so that binaryFiles aligns with hadoopFile/textFile in taking
    // comma separated files as input. (see SPARK-7155)
    NewFileInputFormat.setInputPaths(job, path)
    val updateConf = job.getConfiguration
    new BinaryFileRDD(
      this,
      classOf[StreamInputFormat],
      classOf[String],
      classOf[PortableDataStream],
      updateConf,
      minPartitions).setName(path)
  }

  /**
   * Load data from a flat binary file, assuming the length of each record is constant.
   *
   * @note We ensure that the byte array for each record in the resulting RDD
   * has the provided record length.
   *
   * @param path Directory to the input data files, the path can be comma separated paths as the
   *             list of inputs.
   * @param recordLength The length at which to split the records
   * @param conf Configuration for setting up the dataset.
   *
   * @return An RDD of data with values, represented as byte arrays
   */
  def binaryRecords(
      path: String,
      recordLength: Int,
      conf: Configuration = hadoopConfiguration): RDD[Array[Byte]] = withScope {
    assertNotStopped()
    conf.setInt(FixedLengthBinaryInputFormat.RECORD_LENGTH_PROPERTY, recordLength)
    val br = newAPIHadoopFile[LongWritable, BytesWritable, FixedLengthBinaryInputFormat](path,
      classOf[FixedLengthBinaryInputFormat],
      classOf[LongWritable],
      classOf[BytesWritable],
      conf = conf)
    br.map { case (k, v) =>
      val bytes = v.copyBytes()
      assert(bytes.length == recordLength, "Byte array does not have correct length")
      bytes
    }
  }

  /**
   * Get an RDD for a Hadoop-readable dataset from a Hadoop JobConf given its InputFormat and other
   * necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable),
   * using the older MapReduce API (`org.apache.hadoop.mapred`).
   *
   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
   *             sure you won't modify the conf. A safe approach is always creating a new conf for
   *             a new RDD.
   * @param inputFormatClass storage format of the data to be read
   * @param keyClass `Class` of the key associated with the `inputFormatClass` parameter
   * @param valueClass `Class` of the value associated with the `inputFormatClass` parameter
   * @param minPartitions Minimum number of Hadoop Splits to generate.
   * @return RDD of tuples of key and corresponding value
   *
   * @note Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
   * operation will create many references to the same object.
   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
   * copy them using a `map` function.
   */
  def hadoopRDD[K, V](
      conf: JobConf,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()

    // This is a hack to enforce loading hdfs-site.xml.
    // See SPARK-11227 for details.
    FileSystem.getLocal(conf)

    // Add necessary security credentials to the JobConf before broadcasting it.
    SparkHadoopUtil.get.addCredentials(conf)
    new HadoopRDD(this, conf, inputFormatClass, keyClass, valueClass, minPartitions)
  }

  /** Get an RDD for a Hadoop file with an arbitrary InputFormat
   *
   * @note Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
   * operation will create many references to the same object.
   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
   * copy them using a `map` function.
   * @param path directory to the input data files, the path can be comma separated paths
   * as a list of inputs
   * @param inputFormatClass storage format of the data to be read
   * @param keyClass `Class` of the key associated with the `inputFormatClass` parameter
   * @param valueClass `Class` of the value associated with the `inputFormatClass` parameter
   * @param minPartitions suggested minimum number of partitions for the resulting RDD
   * @return RDD of tuples of key and corresponding value
   */
  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()

    // This is a hack to enforce loading hdfs-site.xml.
    // See SPARK-11227 for details.
    FileSystem.getLocal(hadoopConfiguration)

    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }
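For reference, here is a short, self-contained sketch of how the read APIs listed above are typically called; the namenode address, paths, and record length are placeholders:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ReadFromHdfsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-from-hdfs"))
    val base = "hdfs://namenode:8020/data"   // placeholder namenode address

    // textFile: one String per line.
    val lines = sc.textFile(s"$base/logs")

    // wholeTextFiles: one (path, content) pair per file; best for many small files.
    val files = sc.wholeTextFiles(s"$base/small-files")

    // binaryFiles: one (path, PortableDataStream) pair per file.
    val streams = sc.binaryFiles(s"$base/images")

    // binaryRecords: fixed-length binary records, 128 bytes each here (illustrative).
    val records = sc.binaryRecords(s"$base/fixed", recordLength = 128)

    // hadoopFile: an arbitrary InputFormat via the old mapred API.
    val kv = sc.hadoopFile[LongWritable, Text, TextInputFormat](s"$base/logs")

    println(s"lines=${lines.count()}, files=${files.count()}, " +
      s"streams=${streams.count()}, records=${records.count()}, kv=${kv.count()}")
    sc.stop()
  }
}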