Original post; please cite the source when reposting: http://agilestyle.iteye.com/blog/2335696
Spark Introduction
MapReduce is the primary workhorse at the core of most Hadoop clusters. While highly effective for very large batch-analytic jobs, MapReduce has proven to be suboptimal for applications like graph analysis that require iterative processing and data sharing.
Spark is designed to provide a more flexible model that supports many of the multipass applications that falter in MapReduce. It accomplishes this goal by taking advantage of memory whenever possible in order to reduce the amount of data that is written to and read from disk. Unlike Pig and Hive, Spark is not a tool for making MapReduce easier to use. It is a complete replacement for MapReduce that includes its own work execution engine.
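The memory-reuse point above can be illustrated without Spark itself. The sketch below is a plain-Python analogy (not Spark code): a MapReduce-style pipeline re-reads its input from disk on every pass of an iterative job, while a Spark-style pipeline loads the data once and iterates over it in memory. The file path and the three-pass loop are invented for illustration.

```python
# Hedged analogy: disk I/O per pass (MapReduce-style) vs. load-once (Spark-style).
import os
import tempfile

# Stand-in for input data on HDFS: numbers, one per line.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as f:
    f.write("\n".join(str(n) for n in range(1, 6)))
    path = f.name

def load(path):
    with open(path) as f:
        return [int(line) for line in f]

# MapReduce-style: each of 3 iterations reads the input from disk again.
disk_reads = 0
total = 0
for _ in range(3):
    data = load(path)      # disk I/O on every pass
    disk_reads += 1
    total = sum(data)

# Spark-style: one disk read, then the iterations reuse the in-memory dataset.
cached = load(path)
memory_total = 0
for _ in range(3):
    memory_total = sum(cached)

os.remove(path)
print(disk_reads, total == memory_total)  # 3 True
```

Both approaches compute the same answer; the difference is that the second touches disk once instead of once per iteration, which is exactly the saving Spark exploits for multipass workloads.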
Spark operates with three core ideas:
Resilient Distributed Dataset (RDD)
RDDs contain data that you want to transform or analyze. They can either be read from an external source, such as a file or a database, or they can be created by a transformation.
Transformation
A transformation modifies an existing RDD to create a new RDD. For example, a filter that pulls ERROR messages out of a log file would be a transformation.
Action
An action analyzes an RDD and returns a single result. For example, an action would count the number of results identified by our ERROR filter.
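The ERROR-filter example running through the three definitions above can be sketched in plain Python (an analogy, not the Spark API): the list plays the role of an RDD, `filter` is the transformation producing a new lazy dataset, and counting is the action that returns a single result. The log lines are made up for illustration.

```python
# Plain-Python sketch of RDD / transformation / action.
log_lines = [
    "INFO starting job",
    "ERROR disk full",
    "INFO job finished",
    "ERROR network timeout",
]

# Transformation: build a new dataset of ERROR lines. Like a Spark
# transformation, Python's filter() is lazy -- nothing runs yet.
errors = filter(lambda line: line.startswith("ERROR"), log_lines)

# Action: materialize a single result -- the number of ERROR lines.
error_count = sum(1 for _ in errors)
print(error_count)  # 2
```

The laziness parallel is worth noting: in Spark, too, transformations only describe work, and nothing executes until an action such as `count` demands a result.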
High-level Spark architecture

Reference
Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis. Apress, 2015.