[Spark Basics] -- RDD Explained

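The RDD is the core concept in Spark: an immutable, distributed dataset. Each RDD is divided into multiple partitions, which can be stored in memory or on disk on different nodes of the cluster. When scheduling tasks, the ideal is to assign each task to a node that holds the data of the corresponding partition, which achieves locality and improves efficiency. RDDs are lazily evaluated (computed only when needed), can be cached, and are created through deterministic operations on stable storage or on other RDDs.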

How do I get a clear understanding of the RDD concept in Spark?

Recently I spent a lot of time on the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", but it seems I still haven't grasped some of the concepts behind RDDs. For example, what does "locality-aware scheduling" mean?

An RDD is a distributed dataset, that is, it is divided into "partitions". Each of these partitions can reside in the memory or on the disk of a different machine. When you want Spark to process the RDD, Spark needs to launch one task per partition of the RDD. It is best for each task to be sent to the machine that holds the partition the task is supposed to process. In that case, the task can read the partition's data from the local machine. Otherwise, the task has to pull the partition data over the network from a different machine, which is less efficient. Scheduling tasks (that is, allocating tasks to machines) so that they can read their data "locally" is known as "locality-aware scheduling".
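
To make the idea concrete, here is a minimal pure-Python sketch of the placement decision described above. It is not Spark's scheduler, which is considerably more elaborate. The function `schedule_tasks`, the machine names, and the "free slots" model are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple


def schedule_tasks(partition_locations: Dict[int, List[str]],
                   free_slots: Dict[str, int]) -> Dict[int, Tuple[str, str]]:
    """Assign one task per partition, preferring a machine that holds its data.

    Returns {partition_id: (machine, "LOCAL" or "REMOTE")}.
    Hypothetical toy model; not Spark's actual scheduling algorithm.
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment: Dict[int, Tuple[str, str]] = {}
    pending: List[int] = []

    # First pass: place every task that can run where its partition lives.
    for pid in sorted(partition_locations):
        local = next((m for m in partition_locations[pid]
                      if slots.get(m, 0) > 0), None)
        if local is not None:
            slots[local] -= 1
            assignment[pid] = (local, "LOCAL")
        else:
            pending.append(pid)

    # Second pass: leftover tasks take any free machine and must
    # fetch their partition over the network.
    for pid in pending:
        remote = next((m for m in sorted(slots) if slots[m] > 0), None)
        if remote is None:
            raise RuntimeError("no free slot for partition %d" % pid)
        slots[remote] -= 1
        assignment[pid] = (remote, "REMOTE")

    return assignment


# Partitions 0 and 1 both live on node-a, which has only one free slot.
plan = schedule_tasks(
    partition_locations={0: ["node-a"], 1: ["node-a"], 2: ["node-b"]},
    free_slots={"node-a": 1, "node-b": 1, "node-c": 1},
)
print(plan)
```

In this example partitions 0 and 2 are processed where they are stored, while partition 1 ends up on `node-c` and has to be read remotely, which is exactly the less efficient case that locality-aware scheduling tries to minimize.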

 

An RDD represents a set of records: an immutable collection of objects that is distributed for parallel computation. It can be viewed as a large collection of data, or as an array of references to partitioned objects. Every dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. lost partitions are recovered by recomputing them after a failure. The dataset can be data loaded externally by the user, for example from a JSON file, a CSV file, a text file, or a database via JDBC, with no specific data structure required.
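
The partitioning and recomputation described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how Spark stores or recovers data: the helper `split`, the `double` step, and the dictionary standing in for three machines are all hypothetical.

```python
from typing import Dict, List


def split(records: List[int], n: int) -> List[List[int]]:
    """Cut records into n contiguous partitions of near-equal size."""
    size, extra = divmod(len(records), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def double(partition: List[int]) -> List[int]:
    """The recorded deterministic step (the partition's lineage)."""
    return [x * 2 for x in partition]


# Stands in for input data kept in stable storage, split into 3 partitions.
source = split(list(range(10)), 3)

# Computed partitions keyed by index, as if each lived on a different machine.
computed: Dict[int, List[int]] = {i: double(p) for i, p in enumerate(source)}

del computed[1]                   # simulate losing the machine holding partition 1
computed[1] = double(source[1])   # rebuild only that partition from its lineage
print(computed[1])
```

Because `double` is deterministic, re-running it on the surviving source partition reproduces exactly the lost data, and the other partitions do not have to be touched.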

RDDs are lazily evaluated, i.e. a transformation is only recorded when it is declared and is computed when a result is actually required, which avoids unnecessary work and improves overall efficiency. An RDD is a read-only, partitioned collection of records. RDDs can be created only through deterministic operations on either data in stable storage or other RDDs. In practice, an RDD is created by parallelizing an existing collection in your driver program or by referencing a dataset in an external storage system. An RDD is also cacheable.
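
Lazy evaluation and caching can be illustrated with a small stand-in class. The sketch below is a hypothetical `ToyRDD` written in plain Python; it only borrows the method names `parallelize`, `map`, `cache`, and `collect` from Spark's RDD API and makes no claim about how Spark implements them.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD; not Spark's API."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute                  # recipe that produces the data
        self._cached: Optional[List[int]] = None
        self._cache_enabled = False

    @staticmethod
    def parallelize(data: Iterable[int]) -> "ToyRDD":
        snapshot = list(data)                    # freeze the input collection
        return ToyRDD(lambda: list(snapshot))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Transformation: only records what to do; nothing runs yet.
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def cache(self) -> "ToyRDD":
        # Ask for the result to be kept once it has been computed.
        self._cache_enabled = True
        return self

    def _materialize(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

    def collect(self) -> List[int]:
        # Action: triggers the recorded computation.
        return list(self._materialize())


calls: List[int] = []


def traced_square(x: int) -> int:
    calls.append(x)                              # record every real invocation
    return x * x


squares = ToyRDD.parallelize([1, 2, 3]).map(traced_square).cache()
calls_before_action = len(calls)   # 0: declaring map() did no work
first = squares.collect()          # runs traced_square on each element
second = squares.collect()         # served from the cache, no new calls
print(calls_before_action, first, second, calls)
```

The function `traced_square` runs only when `collect()` is called, and because the result was marked cacheable, the second `collect()` does not run it again.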

You can read more about RDDs at the links below:
