Spark 2017 BigData Update(2)CentOS Cluster

This post describes how to set up Spark 2.2.1 and Hadoop 2.7.5 on a CentOS cluster and install Zeppelin as the remote center server. It covers checking the Java and Maven versions, installing protobuf, building and installing Hadoop, and configuring Spark and Zeppelin.

Check ENV as well
>java -version
java version "1.8.0_60"

>mvn --version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T16:41:47+00:00)

Set up the old protoc version 2.5.0
>git clone https://github.com/google/protobuf.git
>cd protobuf
>git checkout tags/v2.5.0

Follow
http://sillycat.iteye.com/blog/2100276
http://sillycat.iteye.com/blog/2193762

Change the Code in autogen.sh
- curl http://googletest.googlecode.com/files/gtest-1.5.0.tar.bz2 | tar jx
- mv gtest-1.5.0 gtest
+ curl -L https://github.com/google/googletest/archive/release-1.5.0.tar.gz | tar zx
+ mv googletest-release-1.5.0 gtest

>./autogen.sh
>./configure
>make
>sudo make install
>protoc --version
libprotoc 2.5.0
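On CentOS the freshly installed library in /usr/local/lib is not always on the linker path; if protoc fails with a missing libprotoc shared library, adding the path and refreshing the linker cache may help (an extra step I did not need in the original run; the conf file name is only illustrative):
>sudo sh -c 'echo /usr/local/lib > /etc/ld.so.conf.d/protobuf.conf'
>sudo ldconfig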

>cmake --version
cmake version 3.10.1
Follow the link here to install it: http://sillycat.iteye.com/blog/2405875

Build Hadoop
>wget http://mirrors.ocf.berkeley.edu/apache/hadoop/common/hadoop-2.7.5/hadoop-2.7.5-src.tar.gz
>tar zxvf hadoop-2.7.5-src.tar.gz
>cd hadoop-2.7.5-src
>mvn package -Pdist,native -DskipTests -Dtar
It successfully builds. The final file will be hadoop-dist/target/hadoop-2.7.5.tar.gz

Place Hadoop in the working directory
>sudo ln -s /home/ec2-user/tool/hadoop-2.7.5 /opt/hadoop-2.7.5
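The same build has to be unpacked on the other two nodes as well, and the later commands use the version-less path /opt/hadoop, so a link without the version number is assumed too. A rough sketch, with node2 standing in for the other hostnames:
>scp hadoop-dist/target/hadoop-2.7.5.tar.gz ec2-user@node2:/home/ec2-user/tool/
>ssh node2 'cd /home/ec2-user/tool && tar zxvf hadoop-2.7.5.tar.gz'
>sudo ln -s /opt/hadoop-2.7.5 /opt/hadoop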
Configure SSH between the 3 nodes so that they can reach each other without a password.

Something similar to
>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
>cat ~/.ssh/id_dsa.pub
>vi ~/.ssh/authorized_keys
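On each node, the public keys of all 3 nodes end up in authorized_keys; a minimal sketch for appending the local key (the keys from the other nodes are pasted in the same way), with the usual permission fix:
>cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
>chmod 600 ~/.ssh/authorized_keys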


Add Hadoop to the PATH
>vi ~/.profile
PATH="/opt/hadoop/bin:$PATH"

Execute this on all the machines
>hdfs namenode -format
Follow the settings document here to set up slaves, hdfs-site.xml and the other configuration files in /opt/hadoop/etc/hadoop; a short sketch follows the link below.
http://sillycat.iteye.com/blog/2288141
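A minimal sketch of the key entries, assuming fr-stage-api is the NameNode host and the other two hostnames are placeholders (the port and the replication value are only illustrative defaults):
>cat /opt/hadoop/etc/hadoop/slaves
fr-stage-api
node2
node3
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://fr-stage-api:9000</value>
</property>
hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
</property>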

Start HDFS
>sbin/start-dfs.sh

Visit Page http://fr-stage-api:50070/dfshealth.html#tab-overview
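The same information is available from the command line as well (a standard HDFS command, not part of the original steps):
>hdfs dfsadmin -report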

Start YARN
>sbin/start-yarn.sh

Visit Page http://fr-stage-api:8088/cluster/nodes
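Or check the registered NodeManagers from the command line (again a standard command, not in the original post):
>yarn node -list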

Install Spark on the main machine
>wget http://apache.spinellicreations.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
Unpack it and place it in the right directory
>sudo ln -s /home/ec2-user/tool/spark-2.2.1 /opt/spark-2.2.1
>cp conf/spark-env.sh.template conf/spark-env.sh
>cat conf/spark-env.sh
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
>echo $SPARK_HOME
/opt/spark
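Before wiring Zeppelin in, a quick smoke test of Spark on YARN from the main machine may help (my own check, not part of the original steps):
>bin/spark-shell --master yarn --deploy-mode client
scala> sc.parallelize(1 to 100).count()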

Install Zeppelin on the Remote Center Server
>wget http://apache.mirrors.tds.net/zeppelin/zeppelin-0.7.3/zeppelin-0.7.3-bin-all.tgz
Unpack it and place it in the right directory
>sudo ln -s /home/ec2-user/tool/zeppelin-0.7.3 /opt/zeppelin-0.7.3
>cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh

The content of that file is as follows:
export SPARK_HOME="/opt/spark"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"

Start the Notebook
>bin/zeppelin-daemon.sh start

Visit Page http://fr-stage-api:8080
Change the master of Spark in the interpreter settings from 'local[*]' to 'yarn'.
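A one-line paragraph in a new note is enough to confirm that the interpreter change took effect (a hedged example):
%spark
sc.version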

Choose the first easy tutorial
http://fr-stage-api:8080/#/notebook/2A94M5J1Z

You can see the Spark stages here as well
http://fr-stage-api:4040/stages/

But I see it error out, so I go and check the YARN logs:
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Diagnostics: Container [pid=9207,containerID=container_1514501181478_0001_01_000001] is running beyond virtual memory limits. Current usage: 309.7 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.

Solution:
This configuration in yarn-site.xml fixed the problem.
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
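An alternative that keeps the check enabled would be to raise the virtual-to-physical memory ratio instead (the default is 2.1; the value below is only an example):
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>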

Restart the YARN system. It works great this time.


References:
http://sillycat.iteye.com/blog/2405875