This article is reposted from my personal blog: http://www.javali.org/document/dive-into-spark-rdd.html
First, install the Spark cluster
Prerequisites: Hadoop 2 is already installed on the three machines 10.0.18.14-16. Planned cluster layout: 10.0.18.16 is the master, and all three nodes 10.0.18.14-16 are slaves. Make sure the master can ssh to every slave without a password.
Download Scala: http://www.scala-lang.org/download/
Download Spark: http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
Extract both to /home/work/hadoop/ on the master.
|
vi ~/.bashrc

export SCALA_HOME=/home/work/hadoop/scala
export SPARK_HOME=/home/work/hadoop/spark
|
|
vi $SPARK_HOME/conf/spark-env.sh

export SCALA_HOME=/home/work/hadoop/scala
export SPARK_SSH_OPTS="-p22222"
export SPARK_MASTER_IP=10.0.18.16
export SPARK_MASTER_WEBUI_PORT=9088
export SPARK_WORKER_WEBUI_PORT=9099
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g
|
|
vi $SPARK_HOME/conf/slaves

10.0.18.16
10.0.18.15
10.0.18.14
|
scp the configured Scala and Spark directories to the other nodes, then run the following on the master:
|
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh
|
Test it:
./bin/run-example SparkPi
Submitting through spark-submit (note: --master local[6] as written actually runs locally with 6 threads; to submit to the standalone cluster, point --master at spark://10.0.18.16:7077 instead):
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[6] /home/work/hadoop/spark/lib/spark-examples-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar 1000
If all goes well, you should see log output like the following:
15/02/25 14:14:44 INFO SparkContext: Job finished: reduce at SparkPi.scala:35, took 36.014186 s
Pi is roughly 3.14152356
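For context, SparkPi estimates Pi with a Monte Carlo simulation over an RDD: it samples random points in the unit square and counts how many land inside the circle. Below is a minimal sketch of the idea (my own illustration, not the exact example source shipped with Spark; the object name and slice count are arbitrary):
|
import org.apache.spark.{SparkConf, SparkContext}

// Monte Carlo estimate of Pi: sample random points in [-1, 1] x [-1, 1]
// and count how many fall inside the unit circle.
object PiSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PiSketch")
    val sc = new SparkContext(conf)
    val slices = 100
    val n = 100000 * slices
    val inside = sc.parallelize(1 to n, slices).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * inside / n)
    sc.stop()
  }
}
|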
Using the Spark shell:
|
./bin/spark-shell
scala> val data = Array(1, 2, 3, 4, 5)      // create the data
scala> val distData = sc.parallelize(data)  // turn data into an RDD
scala> distData.reduce(_ + _)               // run a computation on the RDD: sum the elements of data
The result:
res2: Int = 15
|
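A couple of follow-up commands in the same session (my own additions, not from the original post) show that transformations such as map are lazy and only build a new RDD, while actions such as collect or reduce actually trigger the computation:
|
scala> val doubled = distData.map(_ * 2)  // transformation: nothing runs yet, just a new RDD
scala> doubled.collect()                  // action: runs the job, returns Array(2, 4, 6, 8, 10)
|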
Read a file from HDFS and count the number of characters:
|
// read a file from HDFS
val distFile = sc.textFile("hdfs://10.0.18.14/tmp/word.txt")
distFile.map(_.size).reduce(_ + _)
The result:
res0: Int = 757
|
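Here _.size is the character count of each line, so the reduce sums characters over all lines. The same computation written without placeholder syntax, plus a line count for comparison (my own additions):
|
distFile.map(line => line.length).reduce((a, b) => a + b)  // same character total as above
distFile.count()                                           // number of lines in the file
|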
Troubleshooting:
Running the following code throws an exception:
|
val file = sc.textFile("hdfs://10.0.18.14/tmp/word.txt")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://10.0.18.14/tmp/word.res")

Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5.3-dc04b01a-c28f-4d99-8906-ab320e360266-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by /tmp/snappy-1.0.5.3-dc04b01a-c28f-4d99-8906-ab320e360266-libsnappyjava.so)
|
The cause: I installed Spark 1.1.0, and its bundled snappy requires GLIBCXX 3.4.9, while my system's libstdc++ only provides up to 3.4.8. I found several suggested fixes online; the one that replaces libstdc++.so.6.0.* is a trap, I tried it and it did not work.
So there are two ways to deal with it:
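(The original post does not list the two options at this point.) One workaround I have seen for this kind of snappy/GLIBCXX mismatch, offered here as an assumption rather than something the post confirms, is to sidestep the native snappy library entirely by switching Spark's compression codec to the pure-Java LZF implementation when building the SparkConf:
|
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: use the pure-Java LZF codec instead of snappy so that
// the native libsnappyjava.so (and its libstdc++ requirement) is never loaded.
val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")
val sc = new SparkContext(conf)
|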
Problem when running Spark Streaming in cluster mode: Initial job has not accepted any resources; check your cluster UI to ensure...
It also comes with errors such as "All masters are unresponsive! Giving up." from spark-submit, "disassociated" Akka messages, and so on.
Some posts online say this is a memory configuration problem; in my case the real fix was export SPARK_MASTER_IP=10.0.18.16 in spark-env.sh, i.e. replacing the original hostname with the IP address.
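The same principle can help on the driver side: point the application at the standalone master by IP rather than hostname. A small sketch (port 7077 is the standalone master's default; this snippet is illustrative, not from the original post):
|
import org.apache.spark.{SparkConf, SparkContext}

// Connect the driver to the standalone master by IP to avoid
// hostname-resolution mismatches between driver, master, and workers.
val conf = new SparkConf()
  .setAppName("ClusterApp")
  .setMaster("spark://10.0.18.16:7077")
val sc = new SparkContext(conf)
|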