High-level Spark architecture

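Spark is a big data processing framework designed for iterative processing and data sharing. It improves performance by using memory to reduce disk I/O. This article covers Spark's advantages over MapReduce and its three core concepts: resilient distributed datasets (RDDs), transformations, and actions.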


Original article. Please credit the source when reposting: http://agilestyle.iteye.com/blog/2335696

 

Spark Introduction

MapReduce is the primary workhorse at the core of most Hadoop clusters. While highly effective for very large batch-analytic jobs, MapReduce has proven to be suboptimal for applications like graph analysis that require iterative processing and data sharing.

 

Spark is designed to provide a more flexible model that supports many of the multipass applications that falter in MapReduce. It accomplishes this goal by taking advantage of memory whenever possible in order to reduce the amount of data that is written to and read from disk. Unlike Pig and Hive, Spark is not a tool for making MapReduce easier to use. It is a complete replacement for MapReduce that includes its own work execution engine.

 

Spark operates with three core ideas:

Resilient Distributed Dataset (RDD)

RDDs contain the data that you want to transform or analyze. They can either be read from an external source, such as a file or a database, or be created by a transformation.

Transformation

A transformation derives a new RDD from an existing one; the original RDD is left unchanged. For example, a filter that pulls ERROR messages out of a log file would be a transformation.

Action

An action analyzes an RDD and returns a single result. For example, an action would count the number of results identified by our ERROR filter.
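The three ideas above can be sketched with a toy model in plain Python. This is not Spark's implementation or API: `ToyRDD`, `from_lines`, and the sample log lines are hypothetical names and data invented for illustration. Only the method names `filter` and `count` are chosen to mirror the corresponding operations on Spark RDDs. The sketch assumes a transformation only records a recipe for a new dataset, while an action runs that recipe and returns one value.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark code)."""

    def __init__(self, source: Callable[[], Iterable[str]]):
        # Store a recipe for producing the data, not the data itself.
        self._source = source

    @classmethod
    def from_lines(cls, lines: List[str]) -> "ToyRDD":
        # Creation from an "external source" (here, an in-memory list).
        data = list(lines)
        return cls(lambda: iter(data))

    def filter(self, predicate: Callable[[str], bool]) -> "ToyRDD":
        # Transformation: returns a NEW ToyRDD; nothing is computed yet
        # and the parent dataset is left unchanged.
        parent = self._source
        return ToyRDD(lambda: (x for x in parent() if predicate(x)))

    def count(self) -> int:
        # Action: runs the recipe and returns a single result.
        return sum(1 for _ in self._source())


# Hypothetical log data standing in for a log file.
log_lines = [
    "INFO  service started",
    "ERROR disk full",
    "WARN  slow response",
    "ERROR connection lost",
]

logs = ToyRDD.from_lines(log_lines)                               # RDD
errors = logs.filter(lambda line: line.startswith("ERROR"))       # transformation
error_count = errors.count()                                      # action
print(error_count)
```

In this sketch, `errors` is a separate dataset derived from `logs`, and no filtering work happens until `count()` is called, which is the same division of labor the ERROR-filter example describes.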

 

High-level Spark architecture


 

Reference

Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis. Apress, 2015.
