Apache Spark
Whenever we study a tool that handles data, we must look at how much data it can process and why the tool came into use in the first place. There are several reasons behind the development of Apache Spark, and we will study each of them here. We will also learn about the components and architecture of Apache Spark and how it complements the Hadoop framework. Let's get started.
Why Apache Spark?
There are several reasons Apache Spark has gained so much popularity as a part of the Hadoop ecosystem. Let's study the reasons why Apache Spark is considered an excellent tool for processing data.
Data Velocity
Today, the speed at which data is created is humongous. Data is being generated all around us by machines, sensors and servers. One of the best examples of data velocity is website tracking: when a user clicks through a website, the tracking system records which link was clicked to reach the next page. Now imagine thousands of users browsing the same website at the same time while the tracking system collects all of this data. But what is the use of this data unless we store and process it? Data that arrives in huge volumes and at high speed is what we call Big Data.
Big Data can be used to obtain results after insightful queries and processing have been performed on it. But because this data is highly unstructured, conventional tools cannot perform such analytics. Hadoop is a framework capable of analyzing data of this type and size. Hadoop is an excellent framework for processing this kind of Big Data, but it has some limitations which we will study in the coming sections (you can read about them here as well). Apache Spark was developed to cover these limitations.
Data Processing Speed
Hadoop is an excellent way of analyzing data, but one thing it cannot do is deliver insights about the collected Big Data in a short time. Hadoop MapReduce jobs can take a long time to execute, and by the time the results reach the stakeholders, the business insights they contain may no longer be relevant. Such results are only useful to the business when delivered within a short timeframe. This is another problem solved by Apache Spark.
What is Apache Spark?
Apache Spark is an open-source data processing framework which can perform analytic operations on Big Data in a distributed environment. It started as an academic project by Matei Zaharia at UC Berkeley's AMPLab in 2009. Apache Spark was created on top of a cluster management tool known as Mesos, and was later modified and upgraded so that it could work in a cluster-based environment with distributed processing.
Spark is an answer to Hadoop's limitations around Big Data processing speed, thanks to its ability to keep the intermediate results of its processing in memory. As a result, the execution time of analytics operations can be up to 100 times faster than standard MapReduce jobs. Another big advantage of managing data in memory is that Apache Spark, very smartly, starts writing data to disk when the in-memory data approaches its threshold. Spark follows the concept of lazy evaluation on its RDDs (Resilient Distributed Datasets). This means that transformations we define on an RDD are not executed until an action actually needs the result, which avoids computations that would otherwise consume memory unnecessarily. Here is how we can visualise lazy evaluation:

Apache Spark Lazy Evaluation
This way, Spark avoids executing tasks immediately and instead maintains metadata about the operations it needs to perform in a DAG (Directed Acyclic Graph).
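To make this concrete, here is a minimal sketch of lazy evaluation in Scala (assuming a local Spark installation; the dataset and threshold are made up for illustration). The map and filter calls only add entries to the DAG; no work is scheduled until the count action asks for a result.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-eval-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations only record lineage in the DAG; nothing runs yet.
    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(n => n.toLong * n)   // lazy
    val bigOnes = squares.filter(_ > 500000L)      // lazy

    // The action below is the first point where Spark actually schedules work.
    val count = bigOnes.count()
    println(s"Values above the threshold: $count")

    sc.stop()
  }
}
```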
Some notable features of Apache Spark are:
- Spark supports many more operations on data than the map and reduce functions of the MapReduce framework
- Spark is written in the Scala programming language and runs in a JVM
- Spark APIs are available in various programming languages such as Scala, Java, Python and R, which makes building applications with Spark easy and flexible
- Spark also offers an interactive shell for quick operations and task execution; however, this shell only supports Scala and Python as of now (a short sample session follows this list)
- Because Spark runs on top of a Hadoop cluster, it can easily process data stored in HBase, so it acts like an extension to your current application environment
- It is an excellent processing framework for the iterative tasks used in Machine Learning algorithms
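As a rough illustration of the interactive shell mentioned above, here is a hypothetical spark-shell session; the shell pre-creates the SparkContext as `sc`, and the file path below is purely illustrative:

```scala
// Launched with: ./bin/spark-shell  (the shell provides `sc` out of the box)
// A hypothetical quick word count, typed at the scala> prompt:
val lines  = sc.textFile("README.md")                  // the path is illustrative
val words  = lines.flatMap(_.split("\\s+"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.take(5).foreach(println)                        // action: triggers execution
```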
Components of Apache Spark
Before we start discussing the components of Apache Spark, let's look at how they fit together to form an ecosystem:

Apache Spark Ecosystem
Apache Spark Core APIs
The Apache Spark Core APIs contain the execution engine of the Spark platform, which provides in-memory computing and the ability to reference datasets stored in external storage systems. The core is also responsible for task scheduling, dispatching and other I/O functionality. In our programs, we can use the Core APIs through the Python, Scala, Java and R programming languages.
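Below is a small sketch of using the Core API directly (in Scala, with an illustrative HDFS path and a local master, not taken from any particular deployment). It references a dataset in external storage and caches an RDD so that repeated computations reuse the in-memory copy:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoreApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("core-api-sketch").setMaster("local[*]"))

    // Reference a dataset held in an external storage system
    // (the HDFS path below is purely illustrative).
    val events = sc.textFile("hdfs:///data/clickstream/events.log")

    // Cache the filtered RDD so repeated actions reuse the in-memory copy.
    val errors = events.filter(_.contains("ERROR")).cache()

    println(s"Error lines: ${errors.count()}")
    sc.stop()
  }
}
```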
Spark SQL
Spark SQL provides users with SQL-based APIs to run SQL queries that perform computations on these Big Data datasets. This way, even a business analyst can run Spark jobs by writing simple SQL queries and dive deep into the available data.
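A minimal Spark SQL sketch might look like the following; the JSON file, table name and column names are assumptions made purely for illustration. A dataset is registered as a temporary view and then queried with plain SQL:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()

    // The JSON file and column names are illustrative assumptions.
    val orders = spark.read.json("orders.json")
    orders.createOrReplaceTempView("orders")

    // An analyst-friendly query expressed as plain SQL.
    val revenueByRegion = spark.sql(
      """SELECT region, SUM(amount) AS total_revenue
        |FROM orders
        |GROUP BY region
        |ORDER BY total_revenue DESC""".stripMargin)

    revenueByRegion.show()
    spark.stop()
  }
}
```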
Spark Streaming
Spark Streaming is an extremely useful API which allows us to build high-throughput, fault-tolerant stream processing from various data sources, as shown in the image below:

Apache Spark Streaming
This API makes Spark an excellent tool for processing real-time streaming data. The fundamental stream unit in the API is called a DStream.
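For illustration, here is a minimal DStream sketch. It assumes a text source on localhost:9999 (for example one started with `nc -lk 9999`), and the 5-second batch interval is arbitrary; each micro-batch arrives as one RDD inside the DStream:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Micro-batches of 5 seconds; each batch becomes one RDD inside the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A hypothetical text source on localhost:9999.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()             // begin receiving and processing data
    ssc.awaitTermination()  // block until the job is stopped
  }
}
```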
MLlib (Machine Learning Library)
MLlib is a collection of Machine Learning algorithms which can be used for tasks like data cleaning, classification, regression, feature extraction, dimensionality reduction and so on. It also includes optimization primitives such as SGD and L-BFGS.
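As a small example, the sketch below trains a logistic regression classifier with MLlib's L-BFGS-based implementation; the tiny hand-made training set exists only for illustration, and real data would come from storage:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mllib-sketch").setMaster("local[*]"))

    // A tiny, hand-made training set (label, features).
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
      LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
      LabeledPoint(0.0, Vectors.dense(0.2, 1.3)),
      LabeledPoint(1.0, Vectors.dense(1.9, 0.8))))

    // Logistic regression trained with the L-BFGS optimizer mentioned above.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)

    println(model.predict(Vectors.dense(1.8, 0.9)))  // predicted class for a new point
    sc.stop()
  }
}
```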
GraphX
GraphX is the Spark API for graph-related computations. This API improves upon the Spark RDD by introducing the Resilient Distributed Property Graph.
Just like the Spark Streaming and Spark SQL APIs, GraphX extends the Spark RDD API, exposing a directed multigraph with properties attached to each vertex and edge. It also provides many operators and graph algorithms for manipulating these graphs.
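The following sketch builds a small property graph and runs PageRank, one of the algorithms shipped with GraphX; the users and edges are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // Vertices carry a name property, edges carry a relationship label.
    val users = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

    // The property graph is itself built on top of RDDs.
    val graph = Graph(users, follows)

    // PageRank is one of the graph algorithms shipped with GraphX.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }

    sc.stop()
  }
}
```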
Conclusion
In this lesson, we studied Apache Spark, which consists of many components and is an excellent way to get results faster than with Hadoop alone. Read more Big Data posts to gain deeper knowledge of available Big Data tools and processing frameworks.