Spark Quick Start (Scala), based on the official guide

This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python.

To follow along with this guide, first, download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.

Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs are replaced by Dataset, which is strongly-typed like an RDD, but with richer optimizations under the hood. The RDD interface is still supported, and you can get a more detailed reference at the RDD programming guide. However, we highly recommend you switch to using Dataset, which has better performance than RDD. See the SQL programming guide to get more information about Dataset.

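For code that still relies on the RDD API, a Dataset can be bridged to an RDD and back. The sketch below is illustrative rather than part of the original guide; it assumes a SparkSession named spark, as created in the examples that follow:

// Dataset -> RDD: every Dataset exposes its underlying RDD via .rdd
val ds  = spark.read.textFile("README.md")   // Dataset[String]
val rdd = ds.rdd                             // RDD[String]

// RDD -> Dataset: toDS() needs the session's implicits in scope
import spark.implicits._
val backToDs = rdd.toDS()                    // Dataset[String]
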
Interactive Analysis with the Spark Shell
Basics

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:

./bin/spark-shell

Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Let’s make a new Dataset from the text of the README file in the Spark source directory:

scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

You can get values from a Dataset directly by calling some actions, or transform the Dataset to get a new one. For more details, please read the API doc.

scala> textFile.count() // Number of items in this Dataset
res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs

scala> textFile.first() // First item in this Dataset
res1: String = # Apache Spark

Now let’s transform this Dataset into a new one. We call filter to return a new Dataset with a subset of the items in the file.

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

We can chain together transformations and actions:

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

More on Dataset Operations

Dataset actions and transformations can be used for more complex computations. Let’s say we want to find the line with the most words:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15

This first maps a line to an integer value, creating a new Dataset. reduce is called on that Dataset to find the largest word count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We’ll use the Math.max() function to make this code easier to understand:

scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
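
Because the arguments are ordinary Scala functions, a method defined elsewhere in the shell can be passed in just as easily; a minimal sketch (largest is an illustrative name, not part of the original guide):

scala> def largest(a: Int, b: Int): Int = Math.max(a, b)   // an ordinary method defined in the shell

scala> textFile.map(line => line.split(" ").size).reduce(largest _)   // same result as res5 above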

One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]

Here, we call flatMap to transform a Dataset of lines to a Dataset of words, and then combine groupByKey and count to compute the per-word counts in the file as a Dataset of (String, Long) pairs. To collect the word counts in our shell, we can call collect:

scala> wordCounts.collect()
res6: Array[(String, Long)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
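
To see only the most frequent words rather than the full array, the pair Dataset can be renamed to a DataFrame and sorted before displaying; a minimal sketch (the column names word and count are illustrative, and spark.implicits._ is already imported in the shell):

scala> wordCounts.toDF("word", "count").orderBy($"count".desc).show(5)   // prints the five most frequent words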

Caching

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithSpark dataset to be cached:

scala> linesWithSpark.cache()
res7: linesWithSpark.type = [value: string]

scala> linesWithSpark.count()
res8: Long = 15

scala> linesWithSpark.count()
res9: Long = 15
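
When the cached data is no longer needed, it can be dropped from memory again. unpersist is part of the same Dataset API, though it does not appear in the original guide:

scala> linesWithSpark.unpersist()   // remove linesWithSpark from the in-memory cache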

It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the RDD programming guide.

Self-Contained Applications

Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python (pip).

We’ll create a very simple Spark application in Scala–so simple, in fact, that it’s named SimpleApp.scala:

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}

Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.

This program just counts the number of lines containing "a" and the number containing "b" in the Spark README. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkSession, we initialize a SparkSession as part of the program.

We call SparkSession.builder to construct a SparkSession, then set the application name, and finally call getOrCreate to get the SparkSession instance.

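For quick local testing, the builder can also set the master URL directly instead of relying on the --master option of spark-submit shown below; a minimal sketch (the .master line is not part of the original example, and local[*] simply uses all available local cores):

val spark = SparkSession.builder
  .appName("Simple Application")
  .master("local[*]")   // run locally on all cores; in a real deployment the master usually comes from spark-submit
  .getOrCreate()
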
Our application depends on the Spark API, so we’ll also include an sbt configuration file, build.sbt, which explains that Spark is a dependency. This file also adds a repository that Spark depends on:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

For sbt to work correctly, we’ll need to lay out SimpleApp.scala and build.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application’s code, then use the spark-submit script to run our program.

# Your directory layout should look like this
$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/simple-project_2.11-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23

Where to Go from Here

Congratulations on running your first Spark application!

  • For an in-depth overview of the API, start with the RDD programming guide and the SQL programming guide, or see the “Programming Guides” menu for other components.
  • For running applications on a cluster, head to the deployment overview.
  • Finally, Spark includes several samples in the examples directory (Scala, Java, Python, R). You can run them as follows:
# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R