RDD explained in detail:
https://blog.youkuaiyun.com/u013850277/article/details/73648742
RDD creation method 1:
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
//val distData = sc.parallelize(data, 5)   // the optional second argument sets the number of partitions
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we might call distData.reduce((a, b) => a + b) to add up the elements of the array. We describe operations on distributed datasets later on.
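Continuing from the distData defined above, the reduce call sums the five elements; in spark-shell the result looks like this (res0 is just the shell's default result name):
scala> distData.reduce((a, b) => a + b)
res0: Int = 15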
RDD creation method 2:
RDDs can also be created by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3a://, etc. URI) and reads it as a collection of lines. Here is an example invocation:
//If a local file is used as the data source in a distributed environment, the file must be present on every node, otherwise an error is thrown
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26
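Once loaded, distFile can be operated on like any other RDD. A small sketch continuing from the distFile above, summing the lengths of all lines:
scala> distFile.map(s => s.length).reduce((a, b) => a + b)   // total number of characters across all lines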
Some notes on how Spark reads files:
- If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
- All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
- The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks. (A short sketch of this second argument follows the list.)
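A minimal sketch of that second argument, assuming a hypothetical HDFS path; getNumPartitions shows how many partitions were actually created:
scala> val distFile = sc.textFile("hdfs:///user/test/data.txt", 8)   // ask for at least 8 partitions
scala> distFile.getNumPartitions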
Why are there two tasks when creating one RDD?
Task count = partition count. I think the reason it is two is that, when the data is read, the default parallelism is 2, so 2 partitions are created; each partition corresponds to one task, hence 2 tasks. The default parallelism varies with the deployment; see the following link:
https://blog.youkuaiyun.com/HaixWang/article/details/79458341
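A quick way to inspect the defaults behind this in spark-shell (all of these are standard SparkContext/RDD calls):
scala> sc.defaultParallelism                       // default parallelism for this deployment
scala> sc.defaultMinPartitions                     // min(defaultParallelism, 2), used by textFile
scala> sc.parallelize(1 to 10).getNumPartitions    // partitions created by parallelize
scala> sc.textFile("data.txt").getNumPartitions    // partitions created by textFile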
Reading SequenceFiles with Spark:
Reference link: https://blog.youkuaiyun.com/pelick/article/details/37650187
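As a hedged sketch of what that link covers, here is one way to write and then read back a SequenceFile in spark-shell; the HDFS path and the IntWritable/Text key-value types are assumptions for illustration:
import org.apache.hadoop.io.{IntWritable, Text}
// write a small SequenceFile of (Int, String) pairs, then read it back (path is hypothetical)
sc.parallelize(Seq((1, "a"), (2, "b"))).saveAsSequenceFile("hdfs:///user/test/seq_demo")
val seqData = sc.sequenceFile("hdfs:///user/test/seq_demo", classOf[IntWritable], classOf[Text])
val pairs = seqData.map { case (k, v) => (k.get, v.toString) }   // convert Hadoop Writables to plain Scala types
pairs.collect().foreach(println)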