Spark(8) Non Fat Jar/Cassandra Cluster Issue and Spark Version 1.3.1
1. Can We Upgrade to Java 8?
Fix the BouncyCastleProvider Problem
Visit https://www.bouncycastle.org/latest_releases.html and download the file bcprov-jdk15on-152.jar.
Place the file in this directory:
/usr/lib/jvm/java-8-oracle/jre/lib/ext
Then go to this directory:
/usr/lib/jvm/java-8-oracle/jre/lib/security
and edit this file:
sudo vi java.security
Add this line:
security.provider.10=org.bouncycastle.jce.provider.BouncyCastleProvider
I should download this file
http://repo1.maven.org/maven2/org/bouncycastle/bcprov-jdk15%2b/1.46/bcprov-jdk15%2b-1.46.jar
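To verify the provider was registered, a quick check in the scala REPL should list BC among the installed security providers (my own sanity check, not part of the original steps):
import java.security.Security
// Print every registered JCE provider; "BC" should appear after editing java.security
Security.getProviders.foreach(p => println(p.getName + " " + p.getVersion))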
Fix the JCE Problem
Download the file from here:
http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html
Unzip the file and place the jars in this directory:
/usr/lib/jvm/java-8-oracle/jre/lib/security
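To confirm the unlimited-strength policy jars are picked up, this one-liner in the scala REPL should print 2147483647 rather than 128 (again just a sanity check, using AES as the example algorithm):
import javax.crypto.Cipher
// 128 with the default policy files, Int.MaxValue once the JCE unlimited policy jars are installed
println(Cipher.getMaxAllowedKeyLength("AES"))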
2. Fat Jar?
https://github.com/apache/spark/pull/288?
https://issues.apache.org/jira/browse/SPARK-1154
http://apache-spark-user-list.1001560.n3.nabble.com/Clean-up-app-folders-in-worker-nodes-td20889.html
https://spark.apache.org/docs/1.0.1/spark-standalone.html
Based on my understanding, we should keep using an assembly (fat) jar in Scala and submit the job to the master, which will distribute the work to the Spark standalone cluster or the YARN cluster. The client machines should not require any extra setup or jar dependencies.
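As a sketch of how such an assembly jar can be produced with sbt-assembly (plugin version and build settings here are illustrative, not copied from sillycat-spark):
In project/plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
In build.sbt:
name := "sillycat-spark"
scalaVersion := "2.10.4"
// Spark itself is provided by the cluster at runtime, so keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"
Running sbt assembly then produces a single jar that can be handed to spark-submit, so the client machine needs nothing else.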
3. Cluster Sync Issue in Cassandra 1.2.13
http://stackoverflow.com/questions/23345045/cassandra-cas-delete-does-not-work
http://wiki.apache.org/cassandra/DistributedDeletes
We need to use ntpd to keep the clocks in sync across the nodes.
https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/
https://ria101.wordpress.com/2011/02/08/cassandra-the-importance-of-system-clocks-avoiding-oom-and-how-to-escape-oom-meltdown/
In a Cassandra cluster every node writes with a timestamp, so if the system clocks differ across the nodes, Cassandra can get into a weird state where deletes and updates sometimes do not take effect.
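On Ubuntu the quick version is to install the ntp package on every node and confirm it is syncing against its peers (package name and time servers may differ per environment):
> sudo apt-get install ntp
> ntpq -p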
4. Upgrade to Version 1.3.1
https://spark.apache.org/docs/latest/
Download the Spark source file:
> wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1.tgz
Unzip it, place the Spark directory in the working directory, and create a symlink:
> sudo ln -s /opt/spark-1.3.1 /opt/spark
My Java version and Scala version are as follows:
> java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
> scala -version
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
Try to build from source:
> build/sbt clean
> build/sbt compile
The compile fails because of missing dependencies. I will not spend time on that; I will download the pre-built binary directly.
> wget http://www.motorlogy.com/apache/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
Unzip it and add it to the classpath.
Then my project sillycat-spark can easily run.
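As a quick smoke test of the pre-built binary, spark-shell can be started against a local master (just my own check, not a required step):
> bin/spark-shell --master local[2]
scala> sc.parallelize(1 to 100).sum()
The result should come back as 5050.0.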
Simple Spark Cluster
Download the source file:
> wget http://apache.cs.utah.edu/spark/spark-1.3.1/spark-1.3.1.tgz
Build the source:
> build/sbt clean
> build/sbt compile
It does not build on Ubuntu either, so I use the pre-built binary instead.
> wget http://www.motorlogy.com/apache/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
Prepare the Configuration
Go to the conf directory.
> cp spark-env.sh.template spark-env.sh
> cp slaves.template slaves
> cat slaves
# A Spark Worker will be started on each of the machines listed below.
ubuntu-dev1
ubuntu-dev2
> cat spark-env.sh
export SPARK_WORKER_MEMORY=768m
export SPARK_JAVA_OPTS="-Dbuild.env=lmm.sparkvm"
export USER=carl
Copy the same settings to all the slaves:
> scp -r ubuntu-master:/home/carl/tool/spark-1.3.1-hadoop2.6 ./
Run the script on the master to start the standalone cluster:
> sbin/start-all.sh
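Once the master and workers are up (the master web UI is on port 8080 by default), a job can be submitted to the standalone cluster along these lines; the class name and jar path are placeholders for whatever the application provides, and 7077 is the default master port:
> bin/spark-submit --class com.example.MyApp --master spark://ubuntu-master:7077 /home/carl/work/myapp-assembly.jar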
How to build
https://spark.apache.org/docs/1.1.0/building-with-maven.html
> mvn -DskipTests clean package
Build successfully.
Build with YARN, Hive, and JDBC support:
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package
Then install it into the local Maven repository:
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package install
Error Message:
[ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project spark-assembly_2.10: Failed during scalastyle execution: Unable to find configuration file at location scalastyle-config.xml -> [Help 1]
Solution:
Copying [spark_root]/scalastyle-config.xml to [spark_root]/examples/scalastyle-config.xml solves the problem.
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Pbigtop-dist -DskipTests clean package
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests clean package install
Changes in Resolver.scala
var mavenLocal = Resolver.mavenLocal
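The same idea in a downstream sbt build looks like this, so that the Spark artifacts installed into the local Maven repository above can be resolved (a sketch of the intent, not the original Resolver.scala change):
// In build.sbt: resolve dependencies from ~/.m2/repository as well
resolvers += Resolver.mavenLocal
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"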
I have it set up and running in batch mode on the Spark standalone cluster and the YARN cluster. I will keep working on streaming mode and dynamic SQL.
All the core code is in the sillycat-spark project now.
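For reference, a minimal batch job against the 1.3.1 API looks roughly like this; it is a generic sketch rather than the actual sillycat-spark code, and the input path is a placeholder:
import org.apache.spark.{SparkConf, SparkContext}

object BatchExample {
  def main(args: Array[String]): Unit = {
    // The master URL (standalone or YARN) is supplied by spark-submit, so it is not hard-coded here
    val conf = new SparkConf().setAppName("batch-example")
    val sc = new SparkContext(conf)
    // Placeholder input path; count the lines and print the result
    val lines = sc.textFile("hdfs:///tmp/input.txt")
    println("line count: " + lines.count())
    sc.stop()
  }
}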
References:
Spark
http://sillycat.iteye.com/blog/1871204
http://sillycat.iteye.com/blog/1872478
http://sillycat.iteye.com/blog/2083193
http://sillycat.iteye.com/blog/2083194
http://sillycat.iteye.com/blog/2103288
http://sillycat.iteye.com/blog/2103457
http://sillycat.iteye.com/blog/2105430
Spark deployment
http://sillycat.iteye.com/blog/2166583
http://sillycat.iteye.com/blog/2167216
http://sillycat.iteye.com/blog/2183932
Spark test
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
http://stackoverflow.com/questions/26170957/using-funsuite-to-test-spark-throws-nullpointerexception
http://blog.quantifind.com/posts/spark-unit-test/
Spark docs
http://www.sparkexpert.com/
https://github.com/sujee81/SparkApps
http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/
http://dataunion.org/category/tech/spark-tech
http://dataunion.org/6308.html
http://endymecy.gitbooks.io/spark-programming-guide-zh-cn/content/spark-sql/README.html
http://zhangyi.farbox.com/post/access-postgresql-based-on-spark-sql
https://github.com/mkuthan/example-spark.git