How do I get a clear understanding of the concept of an RDD in Spark?
Recently, I spent a lot of time on the paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, but it seems I still haven't grasped some of the RDD concepts, e.g. what does "locality-aware scheduling" mean?
An RDD is a dataset that is distributed, that is, divided into "partitions". Each of these partitions can be present in the memory or on the disk of different machines. If you want Spark to process the RDD, then Spark needs to launch one task per partition of the RDD. It is best that each task be sent to the machine that holds the partition it is supposed to process; in that case, the task can read the data of the partition from the local machine. Otherwise, the task would have to pull the partition data over the network from a different machine, which is less efficient. This scheduling of tasks (that is, the allocation of tasks to machines) such that tasks can read data "locally" is known as "locality-aware scheduling".
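As a minimal sketch of how these location hints surface in the Scala API (the HDFS path is an assumption): an RDD backed by HDFS blocks reports, for each partition, the hosts that hold the corresponding block via preferredLocations, and the scheduler tries to place each task on one of those hosts.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalityDemo extends App {
  // Meant to be launched with spark-submit against a cluster,
  // so that partitions actually live on different machines.
  val sc = new SparkContext(new SparkConf().setAppName("LocalityDemo"))

  // An RDD read from HDFS carries the datanode locations of each block
  // (the path is hypothetical -- substitute one from your cluster).
  val rdd = sc.textFile("hdfs:///data/input.txt")

  rdd.partitions.foreach { p =>
    // preferredLocations lists the hosts where this partition's data lives;
    // locality-aware scheduling tries to run the task on one of them.
    println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
  }

  sc.stop()
}
```

If no executor on a preferred host is free, Spark waits up to spark.locality.wait (3 seconds by default) before falling back to a less local placement.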
An RDD is a representation of a set of records: an immutable collection of objects suited to distributed computing. An RDD is a large collection of data, held as an array of references to partitioned objects. Every dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. they are self-recovering: lost partitions are recomputed in case of failure. The dataset could be data loaded externally by the user, with no specific data structure required, e.g. a JSON file, a CSV file, a text file, or a database accessed via JDBC.
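As a small illustration of the partitioning described above (a minimal sketch; the file path and partition count are assumptions), each partition of an RDD is processed by its own task, and mapPartitionsWithIndex makes the partition boundaries visible:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionsDemo extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("PartitionsDemo").setMaster("local[*]"))

  // Load an external text file as an RDD, asking for at least 8 partitions
  // (both the path and the number 8 are hypothetical).
  val lines = sc.textFile("data/records.txt", minPartitions = 8)
  println(s"number of partitions: ${lines.getNumPartitions}")

  // One task processes each partition; print how many lines each one holds.
  lines
    .mapPartitionsWithIndex((idx, it) => Iterator(s"partition $idx holds ${it.size} lines"))
    .collect()
    .foreach(println)

  sc.stop()
}
```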
RDDs are lazily evaluated, i.e. an RDD is only materialized when it is actually required, which saves a lot of time and improves overall efficiency. An RDD is a read-only, partitioned collection of data. An RDD can be created through deterministic operations on data in stable storage or on other RDDs: it can be generated by parallelizing an existing collection in your driver program or by referencing a dataset in an external storage system. It is also cacheable.
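A minimal sketch of lazy evaluation and caching (the numbers are arbitrary): transformations such as filter only build up the lineage graph, and nothing runs until an action such as count forces evaluation; cache marks the result to be kept in memory for later actions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("LazyEvalDemo").setMaster("local[*]"))

  // Create an RDD by parallelizing a driver-side collection into 4 partitions.
  val nums = sc.parallelize(1 to 100, 4)

  // A transformation: lazy, nothing is computed yet.
  val evens = nums.filter(_ % 2 == 0).cache() // mark as cacheable

  // Actions trigger evaluation of the lineage.
  println(evens.count()) // computes the RDD and populates the cache
  println(evens.sum())   // reuses the cached partitions

  sc.stop()
}
```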
You can read more about RDDs at the link below:
- Resilient Distributed Datasets - RDD in Apache Spark - DataFlair