OS version: Red Hat 6.4
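1. Download Hadoop 2.3.0 from the Apache website and configure its parameters correctly (not covered in detail here).
2. Download the spark-0.9.1 source code (the project is hosted on GitHub as apache/spark); the release tarball is available at: http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.1/spark-0.9.1.tgz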
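Note that, as of this writing, the officially provided Spark builds are compiled against CDH5/Hadoop 2.2.0, and spark-0.9.1 still has a minor problem on Hadoop 2.3.0.
At startup, Spark reads yarn.application.classpath from yarn-site.xml. If this parameter is not configured explicitly, its default value is empty, and startup fails with the NullPointerException whose stack trace is shown below.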
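As an illustration of configuring the property explicitly, here is a minimal sketch of a yarn-site.xml entry. The classpath entries are assumptions based on a conventional Hadoop 2.x directory layout and the usual HADOOP_*_HOME environment variables; they are not taken from this setup and should be adjusted to match the actual installation.

```xml
<!-- yarn-site.xml: set yarn.application.classpath explicitly so that it is not empty. -->
<!-- The entries are assumed paths for a standard Hadoop 2.x layout; adapt them to your cluster. -->
<!-- The value is kept on a single line so that no entry picks up stray whitespace. -->
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
</property>
```

With the property left unset, startup fails with this trace: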
Exception in thread "main" java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.deploy.yarn.Client$.populateHadoopClasspath(Client.scala:498)
at org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:519)
at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:333)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:94)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:78)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:125)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:200)
at shark.SharkContext.<init>(SharkContext.scala:42)
at shark.SharkContext.<init>(SharkContext.scala:61)
at shark.SharkEnv$.initWithSharkContext(SharkEnv.scala:78)
at shark.SharkEnv$.init(SharkEnv.scala:38)
at shark.SharkCliDriver.<init>(SharkCliDriver.scala:278)
at shark.SharkCliDriver$