[spark-src] 1-overview


what is

  "Apache Spark™ is a fast and general engine for large-scale data processing....Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk." stated in apache spark 

 

  whether or not this is strictly a fact, i think several key concepts/components support these claims:

a. the Resilient Distributed Dataset (RDD) programming model largely differs from the common one, e.g. mapreduce. spark uses many optimizations (e.g. iterative computation, data locality) to spread the workload across the workers in a cluster, especially by reusing already-computed data.

  RDD: "A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost." [1]
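
as a rough illustration of the model, here is a minimal word-count sketch (assuming spark-shell, where `sc` is the pre-built SparkContext; the input path is made up):

```scala
// classic word count as a chain of RDD transformations plus one action.
// the lineage textFile -> flatMap -> map -> reduceByKey is what lets spark
// rebuild a lost partition: RDDs are read-only, so recomputing is always safe.
val counts = sc.textFile("hdfs:///tmp/input.txt")  // RDD[String], one line per element
  .flatMap(line => line.split("\\s+"))             // RDD[String], one word per element
  .map(word => (word, 1))                          // RDD[(String, Int)]
  .reduceByKey(_ + _)                              // combines per-partition before shuffling

counts.take(10).foreach(println)                   // action: only now does a job actually run
```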

 

b. it uses memory as far as possible. most of spark's intermediate results stay in memory rather than on disk, so it avoids needless i/o and serialization/deserialization overhead.

  in fact we use many tools for similar purposes, like memcached, redis, ...
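
this shows up directly in the API: marking an RDD as cached keeps its partitions in memory across jobs (a sketch; the log path is made up):

```scala
val logs = sc.textFile("hdfs:///tmp/access.log")
val errors = logs.filter(_.contains("ERROR")).cache() // keep partitions in memory once computed

errors.count()                                // 1st action: reads from disk, then caches
errors.filter(_.contains("timeout")).count()  // 2nd action: served from memory, no re-read
```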

c. it emphasizes parallelism: each RDD is split into partitions that are processed concurrently across the cluster.
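
a tiny sketch of that (the partition count is arbitrary):

```scala
val nums = sc.parallelize(1 to 1000, numSlices = 8) // 8 partitions -> up to 8 concurrent tasks
println(nums.getNumPartitions)                      // 8
println(nums.reduce(_ + _))                         // partitions reduced in parallel, then merged: 500500
```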

d. it reduces the jvm supervision overhead, e.g. one long-lived executor holds many tasks, instead of one container per task as in yarn-based mapreduce.
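
a sketch of what that looks like in configuration (the values are arbitrary; `spark.executor.instances` applies when running on yarn):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// a few long-lived executor JVMs host all of the app's tasks, instead of
// spinning up one container/JVM per task as classic mapreduce-on-yarn does
val conf = new SparkConf()
  .setAppName("overview-demo")
  .set("spark.executor.instances", "4") // 4 executor JVMs for the whole app
  .set("spark.executor.cores", "4")     // each runs up to 4 tasks concurrently
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
```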

 

architecture

  (the core component, Spark Core, serves as the platform on which the other components are built)

 

 

usages of spark

1. iterative algorithms, e.g. machine learning, clustering, ... (see the sketch after this list)

2. interactive analytics, e.g. querying a ton of data loaded from disk into memory to cut the latency of i/o

3. batch processing
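
a toy sketch of point 1 (the data and the loop body are made up; the point is that `points` is read from memory on every pass instead of from disk):

```scala
val points = sc.textFile("hdfs:///tmp/points.txt")
  .map(_.toDouble)
  .cache()            // loaded from disk once, then reused by every iteration

var guess = 0.0
for (_ <- 1 to 10) {
  // one distributed pass per iteration; a chain of mapreduce jobs would
  // re-read (and re-write) hdfs here instead
  guess = points.map(p => math.abs(p - guess)).mean()
}
println(guess)
```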

 

programming language

  most of the source code is written in scala (i think many of its functions and ideas were inspired by scala ;), but you can also write applications in java or python.

 

flexible integrations

  many popular frameworks are supported by spark, e.g. hadoop, hbase, mesos, etc.
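
e.g. any hadoop InputFormat can feed an RDD directly (a sketch using the old mapred API; the path is made up):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// hadoopFile wires a hadoop InputFormat straight into an RDD of key/value pairs
val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///tmp/data")
val lines = raw.map { case (_, text) => text.toString } // Text objects are reused, so copy out
```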

 

ref:

[1] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica, "Spark: Cluster Computing with Working Sets", HotCloud 2010.

[spark-src]-source reading
