How do I get a clear understanding of the concept of an RDD in Spark?
Recently, I spent a lot of time on the paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, but it seems I still haven't grasped some of the RDD concepts, e.g. what does "locality-aware scheduling" mean?
An RDD is a dataset that is distributed, that is, divided into "partitions". Each of these partitions can be present in the memory or on the disk of different machines. If you want Spark to process the RDD, then Spark needs to launch one task per partition of the RDD. It is best that each task be sent to the machine that holds the partition it is supposed to process; in that case, the task can read the data of the partition from the local machine. Otherwise, the task would have to pull the partition data over the network from a different machine, which is less efficient. This scheduling of tasks (that is, the allocation of tasks to machines) such that tasks can read data "locally" is known as "locality-aware scheduling".
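As a minimal sketch of how these location hints surface in the Scala API (the HDFS path is an assumption): an RDD backed by HDFS blocks reports, for each partition, the hosts that hold the corresponding block via preferredLocations, and the scheduler tries to place each task on one of those hosts.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalityDemo extends App {
  // Meant to be launched with spark-submit against a cluster,
  // so that partitions actually live on different machines.
  val sc = new SparkContext(new SparkConf().setAppName("LocalityDemo"))

  // An RDD read from HDFS carries the datanode locations of each block
  // (the path is hypothetical -- substitute one from your cluster).
  val rdd = sc.textFile("hdfs:///data/input.txt")

  rdd.partitions.foreach { p =>
    // preferredLocations lists the hosts where this partition's data lives;
    // locality-aware scheduling tries to run the task on one of them.
    println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
  }

  sc.stop()
}
```

If no executor on a preferred host is free, Spark waits up to spark.locality.wait (3 seconds by default) before falling back to a less local placement.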
An RDD is a representation of a set of records: an immutable collection of objects suited to distributed computing. An RDD is a large collection of data, held as an array of references to partitioned objects. Every dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. they are self-recovering: lost partitions are recomputed in case of failure. The dataset could be data loaded externally by the user, with no specific data structure required, e.g. a JSON file, a CSV file, a text file, or a database accessed via JDBC.
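As a small illustration of the partitioning described above (a minimal sketch; the file path and partition count are assumptions), each partition of an RDD is processed by its own task, and mapPartitionsWithIndex makes the partition boundaries visible:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionsDemo extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("PartitionsDemo").setMaster("local[*]"))

  // Load an external text file as an RDD, asking for at least 8 partitions
  // (both the path and the number 8 are hypothetical).
  val lines = sc.textFile("data/records.txt", minPartitions = 8)
  println(s"number of partitions: ${lines.getNumPartitions}")

  // One task processes each partition; print how many lines each one holds.
  lines
    .mapPartitionsWithIndex((idx, it) => Iterator(s"partition $idx holds ${it.size} lines"))
    .collect()
    .foreach(println)

  sc.stop()
}
```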
RDDs are lazily evaluated, i.e. an RDD is only materialized when it is actually required, which saves a lot of time and improves overall efficiency. An RDD is a read-only, partitioned collection of data. An RDD can be created through deterministic operations on data in stable storage or on other RDDs: it can be generated by parallelizing an existing collection in your driver program or by referencing a dataset in an external storage system. It is also cacheable.
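A minimal sketch of lazy evaluation and caching (the numbers are arbitrary): transformations such as filter only build up the lineage graph, and nothing runs until an action such as count forces evaluation; cache marks the result to be kept in memory for later actions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("LazyEvalDemo").setMaster("local[*]"))

  // Create an RDD by parallelizing a driver-side collection into 4 partitions.
  val nums = sc.parallelize(1 to 100, 4)

  // A transformation: lazy, nothing is computed yet.
  val evens = nums.filter(_ % 2 == 0).cache() // mark as cacheable

  // Actions trigger evaluation of the lineage.
  println(evens.count()) // computes the RDD and populates the cache
  println(evens.sum())   // reuses the cached partitions

  sc.stop()
}
```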
You can read more about RDDs at the link below:
- Resilient Distributed Datasets - RDD in Apache Spark - DataFlair