Hadoop Study Notes (7): Spark Compilation and Configuration

Building Spark from source:

Maven build command:

./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
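
The Maven build can run out of memory under default JVM settings. A minimal sketch of the usual precaution, with memory values taken from the Spark 2.x build documentation:

# raise Maven's heap and code cache before building (Spark 2.x build docs recommendation)
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package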

Spark's bundled distribution script; the build output can be deployed directly:

# Recommended:
./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0 --skip-java-test

# Variant that resumes a failed build from the spark-sql module (Maven -rf / --resume-from):
./dev/make-distribution.sh --name cdh2.6.0-cdh5.7.0 --skip-java-test --tgz -Pyarn -Dhadoop.version=2.6.0-cdh5.7.0 -Dscala-2.12 -Phive -Phive-thriftserver -rf :spark-sql_2.10

# Variant with verbose debug output (-X) and forced dependency updates (-U):
./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -X -U

Build errors:

(1) spark-launcher module fails: hadoop-client cannot be found
    Add the following repositories to pom.xml:

<repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>

<repository>
      <id>central</id>
      <!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution -->
      <name>Maven Repository</name>
      <url>https://repo.maven.apache.org/maven2</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
 </repository>

(2) "was cached in the local repository" errors
       Append -U to the build command to force Maven to re-check remote repositories; add -X for more detailed debug output.
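
If -U alone does not resolve it, deleting the stale artifact from the local Maven cache and rebuilding usually does. A sketch, assuming the hadoop-client artifact is the one that failed:

# remove the cached (failed) hadoop artifacts, then rebuild with forced updates
rm -rf ~/.m2/repository/org/apache/hadoop
./build/mvn -U -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package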


After the build completes, the distribution tarball is:
spark-$VERSION-bin-$NAME.tgz (e.g. spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz)

1) Configure SPARK_HOME
    export SPARK_HOME=
    export PATH=$SPARK_HOME/bin:$PATH
    
    ./spark-shell --master local[2]    # local mode; local[k] runs with k worker threads
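
A minimal end-to-end sketch of this step, assuming the tarball is unpacked under ~/app (the install path here is only an example):

tar -zxvf spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz -C ~/app
# set the environment (append these two exports to ~/.bash_profile for persistence)
export SPARK_HOME=~/app/spark-2.1.0-bin-2.6.0-cdh5.7.0
export PATH=$SPARK_HOME/bin:$PATH
# verify the installation with a local shell using 2 threads
spark-shell --master local[2]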

The Spark Standalone architecture is very similar to Hadoop HDFS/YARN: 1 master + n workers.

2) spark-env.sh

SPARK_MASTER_HOST=hadoop001    # or localhost on a single machine
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=1       # number of worker instances to start per node
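
With these settings, a single-node standalone cluster can be brought up by hand using the scripts shipped in $SPARK_HOME/sbin (start-slave.sh is the worker launcher in Spark 2.x):

$SPARK_HOME/sbin/start-master.sh                          # starts the Master process
$SPARK_HOME/sbin/start-slave.sh spark://hadoop001:7077    # starts one Worker attached to the master
jps                                                       # should now list Master and Worker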

3) Configuring multiple workers:

hadoop1 : master
hadoop2 : worker
hadoop3 : worker
hadoop4 : worker
...
hadoop10 : worker

conf/slaves (one worker hostname per line):
hadoop2
hadoop3
hadoop4
....
hadoop10

==> start-all.sh (under $SPARK_HOME/sbin) starts the master process on hadoop1 and a worker process on every host listed in the slaves file.
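
A sketch of bringing the cluster up from hadoop1 (assumes passwordless SSH from the master to each worker, which the start scripts rely on):

# on hadoop1: check the worker list, then launch the whole cluster
cat $SPARK_HOME/conf/slaves      # hadoop2 ... hadoop10
$SPARK_HOME/sbin/start-all.sh    # Master on this host, one Worker per slaves entry
jps                              # hadoop1 shows Master; each worker host shows Worker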


4) Start spark-shell against the standalone cluster: spark-shell --master spark://hadoop001:7077 (7077 is the master's default RPC port)

Spark WordCount example (run inside spark-shell):

// read a local text file of comma-separated words
val file = spark.sparkContext.textFile("file:///home/hadoop/data/wc.txt")
// split each line into words, pair each word with 1, then sum the counts per word
val wordCounts = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.collect
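
The same job can also be piped into spark-shell non-interactively. A sketch that first creates a small sample wc.txt (the file contents and comma-separated format are an assumption for illustration):

mkdir -p /home/hadoop/data
echo "hello,world,hello,spark" > /home/hadoop/data/wc.txt   # sample input (assumed format)
spark-shell --master local[2] <<'EOF'
val file = spark.sparkContext.textFile("file:///home/hadoop/data/wc.txt")
file.flatMap(_.split(",")).map(word => (word, 1)).reduceByKey(_ + _).collect.foreach(println)
EOF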


Application monitoring UI: http://localhost:4040 (one per running application; the standalone master's web UI defaults to port 8080)
