Setting Up Spark in Pseudo-Distributed Mode

This article walks through installing and configuring Spark 2.1.0 on Linux, covering the key steps: installing the JDK, uploading and extracting the Spark package, and editing the configuration files. By setting parameters such as SPARK_MASTER_HOST and SPARK_MASTER_PORT, a correctly working Spark deployment is obtained.


1. Prerequisites: install Linux, JDK 1.8, and the other basics.
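Before going further it is worth confirming that the JDK is actually installed and on the PATH; a quick check (the version string shown in the comment is just an example for JDK 1.8):

```bash
java -version
# Output along the lines of: java version "1.8.0_191" ... confirms JDK 1.8 is available
```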

2. Upload the Spark package spark-2.1.0-bin-hadoop2.7.tgz to the virtual machine and extract it:

tar -zxvf spark-2.1.0-bin-hadoop2.7.tgz -C /usr/local/src/
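Optionally, the extracted directory can be registered in the shell profile so that Spark's scripts are on the PATH. This is not required for the steps below; it is only a convenience sketch, assuming a bash shell and the extraction path used above:

```bash
# Convenience only: put Spark's bin/ and sbin/ on the PATH (adjust the path if you extracted elsewhere)
echo 'export SPARK_HOME=/usr/local/src/spark-2.1.0-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
```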

Configuration file: spark-env.sh

cd /usr/local/src/spark-2.1.0-bin-hadoop2.7/conf

cp spark-env.sh.template spark-env.sh

cp slaves.template slaves

vim spark-env.sh

At the end of the file, add the following (the commented option descriptions below already exist in the template; only the three export lines at the bottom are new):

# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.

export JAVA_HOME=/usr/local/src/jdk1.8.0_191   # JDK installation path
export SPARK_MASTER_HOST=192.168.200.128       # master hostname or IP address
export SPARK_MASTER_PORT=7077                  # master port

Configuration file: slaves (list one worker hostname or IP address per non-comment line):

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
192.168.200.128
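start-all.sh launches a worker on every host listed in slaves over SSH, so this machine should be able to SSH to itself without a password. A minimal sketch, assuming RSA keys, a running sshd, and that ssh-copy-id is available:

```bash
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # generate a key pair with no passphrase
ssh-copy-id 192.168.200.128                # authorize the key for the host listed in slaves
ssh 192.168.200.128 hostname               # should print the hostname without asking for a password
```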

Start the master and worker from the sbin directory:

cd /usr/local/src/spark-2.1.0-bin-hadoop2.7/sbin
./start-all.sh
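If the startup succeeded, the JDK's jps tool should show one Master and one Worker process on this machine (the PIDs will differ):

```bash
jps
# Expected output, roughly:
# 2345 Master
# 2456 Worker
# 2789 Jps
```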

Spark web console (built-in web UI on port 8080): http://192.168.200.128:8080
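To confirm that jobs actually run on the standalone master, one option is to submit the bundled SparkPi example against the spark:// URL configured above; a sketch, assuming the default examples jar location inside the extracted directory:

```bash
cd /usr/local/src/spark-2.1.0-bin-hadoop2.7
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://192.168.200.128:7077 \
  examples/jars/spark-examples_*.jar 10
# The driver output should contain a line like: Pi is roughly 3.14...
```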

### Configuring and Running Spark in Hadoop Pseudo-Distributed Mode

#### 1. Environment preparation

To configure and run Spark successfully in a Hadoop pseudo-distributed environment, first confirm that the base environment is ready. A recommended version combination is Java 8, Spark 2.4.0, Scala 2.11, and Hadoop 3.1.1[^2].

Make sure the necessary dependencies are installed on the operating system, for example the SSH service used for communication between nodes[^3]. Also verify that the Hadoop pseudo-distributed environment itself works before starting; the HDFS status can be checked by visiting `http://localhost:9870`[^4].

---

#### 2. Downloading and extracting Spark

Download the corresponding Spark binary package from the Apache site or another trusted source and extract it into the target directory:

```bash
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzvf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark
```

Note: if a newer Hadoop release is in use, pick a Spark version that is compatible with it.

---

#### 3. Configuring Spark to work with Hadoop

Edit Spark's configuration files to integrate Hadoop support. The main steps are:

##### (a) Setting the `SPARK_HOME` variable

Modify the system environment variable file (such as `.bashrc` or `/etc/profile`) and add the following:

```bash
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
```

Then apply the change:

```bash
source ~/.bashrc
```

##### (b) Editing `conf/spark-env.sh`

Create or edit this script and specify the Java installation path along with any other required parameters:

```bash
vi $SPARK_HOME/conf/spark-env.sh
```

Add the following:

```bash
export JAVA_HOME=/path/to/java
export HADOOP_CONF_DIR=/path/to/hadoop/etc/hadoop
```

The key point is that `HADOOP_CONF_DIR` must point to Hadoop's configuration directory so that Spark can read Hadoop's configuration[^5].

---

#### 4. Testing the Spark/Hadoop integration

With the configuration above in place, submit a simple job to the YARN resource manager as a test. Before starting YARN, make sure passwordless SSH login has been configured, and initialize the NameNode with:

```bash
hdfs namenode -format
```

Then start the HDFS and YARN services:

```bash
start-dfs.sh
start-yarn.sh
```

Now Spark's bundled example programs can be used to verify that the two systems work together. Assuming the data is stored somewhere under the HDFS root directory, run:

```bash
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  /path/to/examples/jars/spark-examples_*.jar 10
```

This submits a job that computes an approximation of pi and lets YARN schedule its execution (a short sketch of how to check the result with the YARN CLI follows after this section).

---

#### 5. Troubleshooting common problems

- **DataNode fails to start**
  If the DataNode does not come up, the cause is often an ID mismatch caused by formatting the NameNode more than once. Check the log files to locate the exact cause and manually adjust the corresponding configuration so the IDs match.

- **Insufficient permissions**
  When an operation complains about missing permissions, consider granting broader access rights to the relevant user group, or switch to the root account and redo the deployment.

---
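As a follow-up to the YARN test in section 4, the application state and the driver output (which, in cluster deploy mode, goes to the driver container's log rather than to the submitting terminal) can be inspected with the standard YARN CLI; this assumes YARN log aggregation is enabled, and the application ID placeholder must be replaced with the ID reported by spark-submit:

```bash
yarn application -list -appStates ALL        # list submitted applications and their final status
yarn logs -applicationId <application_id>    # fetch the aggregated logs; look for "Pi is roughly ..."
```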