Getting Spark Setup in Eclipse

http://syndeticlogic.net/?p=311


Spark is a new distributed programming framework for analyzing large data sets. It took me a few steps to get the system set up in Eclipse, so I thought I'd write them down. Hopefully this post saves someone a few minutes.

Fair warning, the Spark project seems to be moving fast, so this could get out of date quickly…

Building from Source

First, download the sources from the Git repository and try to build. Note that the build requires you to specify a profile. Below are the commands I used to accomplish these steps.

$ git clone https://github.com/mesos/spark.git
$ cd spark
$ mvn -U -Phadoop2 clean install -DskipTests
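
If you are not sure which profiles the pom defines (or just want to double-check the profile name before kicking off a long build), the Maven help plugin can list them:

$ mvn help:all-profiles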

Unfortunately, the build didn't just work for me.  I have reason to believe the issue is environmental (see below), so it might work for you.

If this step works for you, then move on to the next section.  Below is the build error I received.

[ERROR] Failed to execute goal on project spark-core: Could not
resolve dependencies for project
org.spark-project:spark-core:jar:0.7.1-SNAPSHOT: The following
artifacts could not be resolved: cc.spray:spray-can:jar:1.0-M2.1,
cc.spray:spray-server:jar:1.0-M2.1,
cc.spray:spray-json_2.9.2:jar:1.1.1: Could not find artifact
cc.spray:spray-can:jar:1.0-M2.1 in jboss-repo
(http://repository.jboss.org/nexus/content/repositories/releases/)

This error is a bit misleading.  The repository.jboss.org repo is just the last one checked that is missing the artifacts.  After inspecting spark/pom.xml, the real problem is that mvn cannot download the jars from repo.spray.cc.  The spark/pom.xml seems to be correct, and, surprisingly, repo.spray.cc seems to be okay too.

The spray docs indicate that repo.spray.io is the Maven repo, but both domains point to the same IP address.  For sanity, I tried it anyway and had the same problem.
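
If you want to rule out DNS or connectivity problems on your end, it is worth poking at the repository directly. This is just a sanity check; the spray-can path below is taken from the error message above and assumes the repo still serves it:

$ host repo.spray.cc
$ host repo.spray.io
$ curl -I http://repo.spray.io/cc/spray/spray-can/1.0-M2.1/spray-can-1.0-M2.1.pom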

The workaround is to put the files into the local .m2 repository manually.  Below is the script I used.

# Grab the spray artifacts Maven could not resolve and drop them directly
# into the local repository layout under ~/.m2.
for k in can io util server base; do
  dir="cc/spray/spray-$k/1.0-M2.1"
  mkdir -p ~/.m2/repository/$dir
  cd ~/.m2/repository/$dir
  wget http://repo.spray.io/$dir/spray-$k-1.0-M2.1.pom
  wget http://repo.spray.io/$dir/spray-$k-1.0-M2.1.jar
done

# spray-json and twirl-api use different artifact ids and versions,
# so handle them separately.
dir="cc/spray/spray-json_2.9.2/1.1.1"
mkdir -p ~/.m2/repository/$dir
cd ~/.m2/repository/$dir
wget http://repo.spray.io/$dir/spray-json_2.9.2-1.1.1.jar
wget http://repo.spray.io/$dir/spray-json_2.9.2-1.1.1.pom

dir="cc/spray/twirl-api_2.9.2/0.5.2"
mkdir -p ~/.m2/repository/$dir
cd ~/.m2/repository/$dir
wget http://repo.spray.io/$dir/twirl-api_2.9.2-0.5.2.jar
wget http://repo.spray.io/$dir/twirl-api_2.9.2-0.5.2.pom
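
If hand-placing files under ~/.m2 feels too fragile, the Maven install plugin can do the same thing and also write the repository metadata. Here is a sketch for one of the artifacts (repeat for each jar/pom pair); the coordinates come from the error above:

$ wget http://repo.spray.io/cc/spray/spray-can/1.0-M2.1/spray-can-1.0-M2.1.jar
$ wget http://repo.spray.io/cc/spray/spray-can/1.0-M2.1/spray-can-1.0-M2.1.pom
$ mvn install:install-file -Dfile=spray-can-1.0-M2.1.jar -DpomFile=spray-can-1.0-M2.1.pom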

Either way, manually populating the local repository really sucks, but it works around this error.  I found a Stack Overflow thread about a similar mvn issue; one poster claimed that downgrading to Java 6 fixed it.  It seems strange that this would be a Java 7 issue, but I've encountered stranger things.  I didn't test downgrading.

For reference, below is my environment.

james@minerva:~/spark$ mvn -version
Apache Maven 3.0.4
Maven home: /usr/share/maven
Java version: 1.7.0_17, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.2.0-38-generic", arch: "amd64",
family: "unix"

Eclipse Setup

The Eclipse setup is pretty straightforward, but if you've never done a Java/Scala Eclipse setup before it can take a couple of hours to figure out what needs to happen.

From within Eclipse, install EGit and the Scala IDE plugin.  Pay attention to the versions of Eclipse and Scala: at the time of this writing Spark is based on Scala 2.9.2 and I was running Juno.

I never, ever use the m2eclipse plugin.  Some people I know use it successfully, but not me.  I use mvn to generate the .project and .classpath files.  I don't know anyone who mixes these approaches.

Below is the command that I used to generate the project files.

$ mvn -Phadoop2 eclipse:clean eclipse:eclipse
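
If you also want source and javadoc attachments for the dependencies inside Eclipse, the eclipse plugin can fetch them during the same run; this is optional and makes the run a bit slower:

$ mvn -Phadoop2 eclipse:clean eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true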

Next, import the projects from Git (at this time that includes spark-core, spark-bagel, spark-repl, spark-streaming and spark-examples). To do this, select File->Import->Projects from Git.

Next, we need to connect the Scala IDE plugin with each project that has Scala source files (spark-core, spark-bagel, spark-repl and spark-streaming).  To do so, right-click on the project and select Configure->Add Scala Nature.

Next, we need to add the Scala source folders to the build path (each src/main/scala and src/test/scala folder).  To accomplish this, right-click on the folder and select Build Path->Use as Source Folder.

Spark mixes .java and .scala files in a non-standard way that can confuse the Scala IDE plugin, so we need to make sure that all the source folders include .scala files in the classpath.  To check whether this is the case, look at the project's .classpath file.  It should have an entry like the following for each Scala source folder.

 

 <classpathentry including="**/*.java|**/*.scala" kind="src"
  path="src/main/scala"/>

If there is no **/*.scala in the classpathentry for a source folder with Scala code in it, then we need to add it.  It can be added through the Eclipse GUI, or we can edit the .classpath file directly.

Inclusion filters can be added from the Eclipse GUI by right-clicking on the source folder, selecting Build Path->Configure Inclusion/Exclusion Filters, and adding **/*.scala.
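
A quick way to verify the filters across all the projects at once is to grep the generated .classpath files. Run this from the top-level spark directory; every Scala source entry should show the including attribute from above:

$ grep scala */.classpath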

Finally, add spark-core to the build path of spark-repl and spark-streaming.  To do this, right-click on the project, select Build Path->Configure Build Path, and add spark-core on the Projects tab.
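
If that worked, the .classpath of spark-repl and spark-streaming should pick up a project reference entry.  It should look roughly like the following (the exact attributes can vary):

 <classpathentry combineaccessrules="false" kind="src" path="/spark-core"/>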

 
