大数据之spark学习记录二: Spark的安装与上手

最新推荐文章于 2025-05-17 21:23:34 发布

ChanZany

最新推荐文章于 2025-05-17 21:23:34 发布

阅读量487

点赞数

分类专栏：大数据文章标签：大数据 hadoop spark linux java

本文链接：https://blog.youkuaiyun.com/qq_41819729/article/details/107293610

版权

大数据之spark学习记录二: Spark的安装与上手

文章目录

大数据之spark学习记录二: Spark的安装与上手

Spark安装

下载Sparkhttp://spark.apache.org/，注意版本的选择

spark和hadoop的版本对应，我这里是hadoop2.7–spark2.1.1

spark和scala的版本对应,我这里是spark2.1.1–scala2.11

解压下载的.tgz压缩包

tar -zxvf yourspark.tgz -C 指定的目录位置

看个人喜好重命名spark

mv spark-2.1.1-bin-hadoop2.7/ spark

在这里插入图片描述

本地模式

Local 模式就是指的只在一台计算机上来运行 Spark.

通常用于测试的目的来使用 Local 模式, 实际的生产环境中不会使用 Local 模式.

cp -r spark spark-local

为避免多种运行模式之间繁琐的修改，直接拷贝出一个单独的spark软件包命名为spark-local,接下来本地模式的安装就在该安装包下进行

进入该目录可以看到spark的目录结构如下：

在这里插入图片描述

其中conf目录下是一些Spark的配置文件

在这里插入图片描述

local模式下无需对默认配置进行修改

使用官方测试jar包提交计算 $\pi$ 的job，测试本地模式

bin/spark-submit --master local[2] --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.11-2.1.1.jar 100

或者使用官方提供的测试命令

./bin/run-example SparkPi 10

在这里插入图片描述这里对spark发布应用程序的脚本参数做一个说明

语法:

./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

--master 指定 master 的地址，默认为local. 表示在本机运行.
--class 你的应用的启动类 (如 org.apache.spark.examples.SparkPi)
--deploy-mode 是否发布你的驱动到 worker节点(cluster 模式) 或者作为一个本地客户端 (client 模式) (default: client)
--conf: 任意的 Spark 配置属性，格式key=value. 如果值包含空格，可以加引号"key=value"
application-jar: 打包好的应用 jar,包含依赖. 这个 URL 在集群中全局可见。比如hdfs:// 共享存储系统，如果是 file:// path，那么所有的节点的path都包含同样的jar
application-arguments: 传给main()方法的参数
--executor-memory 1G 指定每个executor可用内存为1G
--total-executor-cores 6 指定所有executor使用的cpu核数为6个
--executor-cores 表示每个executor使用的 cpu 的核数

关于 Master URL 的说明

Master URL	Meaning
local	Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K]	Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*]	Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT	Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT	Connect to the given Mesos cluster. The port must be whichever one your is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://… To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn	Connect to a YARNcluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.

spark还为我们提供了一个交互式命令窗口(类似于 Scala 的 REPL)

在 Spark-shell 中使用 Spark 来统计文件中各个单词的数量.
步骤1: 创建 2 个文本文件分并别在 a.txt 和 b.txt 内输入一些单词.

mkdir input
cd input
echo hello spark hello scala hello bigdata spark scala bigdata >> a.txt
echo hello spark hello scala bigdata spark scala >> b.txt

步骤2: 打开 Spark-shell

bin/spark-shell

在这里插入图片描述

步骤3: 运行 wordcount 程序

scala> sc.textFile("./input").flatMap(_.split("\\W+")).map((_,1)).reduceByKey(_ + _).collect
res4: Array[(String, Int)] = Array

最低0.47元/天解锁文章