RDD explained in detail:
https://blog.youkuaiyun.com/u013850277/article/details/73648742
RDD creation method 1:
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
//val distData = sc.parallelize(data, 5)   // the optional second argument sets the number of partitions
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we might call distData.reduce((a, b) => a + b) to add up the elements of the array. We describe operations on distributed datasets later on.
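Continuing from the distData defined above, the reduce call sums the five elements; in spark-shell the result looks like this (res0 is just the shell's default result name):
scala> distData.reduce((a, b) => a + b)
res0: Int = 15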
RDD creation method 2:
RDDs can also be created by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3a://, etc. URI) and reads it as a collection of lines. Here is an example invocation:
//If a local file is used as the data source in a distributed environment, the file must be present on every node, otherwise an error is thrown
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26
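Once loaded, distFile can be operated on like any other RDD. A small sketch continuing from the distFile above, summing the lengths of all lines:
scala> distFile.map(s => s.length).reduce((a, b) => a + b)   // total number of characters across all lines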
Some notes on how Spark reads files:
- If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
- All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
- The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks. (A short sketch of this second argument follows the list.)
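A minimal sketch of that second argument, assuming a hypothetical HDFS path; getNumPartitions shows how many partitions were actually created:
scala> val distFile = sc.textFile("hdfs:///user/test/data.txt", 8)   // ask for at least 8 partitions
scala> distFile.getNumPartitions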
Why are there two tasks when creating one RDD?
Task count = partition count. I think the reason it is two is that, when the data is read, the default parallelism is 2, so 2 partitions are created; each partition corresponds to one task, hence 2 tasks. The default parallelism varies with the deployment; see the following link:
https://blog.youkuaiyun.com/HaixWang/article/details/79458341
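A quick way to inspect the defaults behind this in spark-shell (all of these are standard SparkContext/RDD calls):
scala> sc.defaultParallelism                       // default parallelism for this deployment
scala> sc.defaultMinPartitions                     // min(defaultParallelism, 2), used by textFile
scala> sc.parallelize(1 to 10).getNumPartitions    // partitions created by parallelize
scala> sc.textFile("data.txt").getNumPartitions    // partitions created by textFile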
Reading SequenceFiles with Spark:
Reference link: https://blog.youkuaiyun.com/pelick/article/details/37650187
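As a hedged sketch of what that link covers, here is one way to write and then read back a SequenceFile in spark-shell; the HDFS path and the IntWritable/Text key-value types are assumptions for illustration:
import org.apache.hadoop.io.{IntWritable, Text}
// write a small SequenceFile of (Int, String) pairs, then read it back (path is hypothetical)
sc.parallelize(Seq((1, "a"), (2, "b"))).saveAsSequenceFile("hdfs:///user/test/seq_demo")
val seqData = sc.sequenceFile("hdfs:///user/test/seq_demo", classOf[IntWritable], classOf[Text])
val pairs = seqData.map { case (k, v) => (k.get, v.toString) }   // convert Hadoop Writables to plain Scala types
pairs.collect().foreach(println)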