Spark(2) SBT and Spark Shell in Quick Start

1. Make SBT Do What MVN INSTALL Does
When I run the command
>sudo sbt publish-local
SBT publishes all the jars and other files to a location like:
/Users/carl/.ivy2/local/com.sillycat/easycassandraserver_2.10/1.0/jars/easycassandraserver_2.10-assembly.jar

If we plan to customize the path, we can do it like this:
publishTo := Some(Resolver.file("file", new File(Path.userHome.absolutePath+"/.m2/repository")))

But I will not do that here; I just want to verify that my dependency works.
The base project is configured as:
name := "easycassandraserver"
organization := "com.sillycat"

version := "1.0"

The dependent project then declares the dependency:
"com.sillycat" %% "easycassandraserver" % "1.0"

2. Spark Shell in Quick Start
Start the shell by running this command under the Spark directory:
>./spark-shell

Error Message:
java.lang.ClassNotFoundException: scala.tools.nsc.settings.MutableSettings$SettingValue

Solution:
My Scala version was 2.10.0, so I used a soft link to switch it to 2.9.2; then it works fine.

scala>:help

2.1 Basics
RDDs can be created from Hadoop InputFormats (such as HDFS files), by transforming other RDDs, or from other storage systems supported by Hadoop, including the local file system, Amazon S3, Hypertable, HBase, etc.

Some Simple Operations
scala>val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = MappedRDD[1] at textFile at <console>:12

sc is short for SparkContext. textFile takes the URI of the file, which can be an hdfs://, s3n://, kfs://, etc. URI.
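For example (the host, port, and path here are hypothetical, assuming an HDFS cluster is running):
scala> val hdfsFile = sc.textFile("hdfs://localhost:9000/user/carl/README.md")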

scala>textFile.count()
res0: Long = 68

scala> textFile.first()
res1: String = # Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = FilteredRDD[2] at filter at <console>:14

scala> linesWithSpark.count()
res2: Long = 12

2.2 More Operations on RDD
scala>textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res5: Int = 15
map(line => line.split(" ").size) maps each line to the number of words it contains.

reduce is one of the actions.
Syntax: reduce(func)
The function takes two arguments and returns one.
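For example, the same longest-line computation can be written with Scala's math.max, which may be easier to read:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => math.max(a, b))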

scala>val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

flatMap flattens all the per-line arrays into a single collection of words, as I understand it:
textFile.flatMap(line => line.split(" "))

Then, on that collection, map pairs each word with the number 1:
map(word => (word, 1))

reduceByKey is one of the transformations.
That means it transforms one RDD into another RDD.
syntax reduceByKey(func, [numTasks])
(K, V) ----> (K, V)
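As a hypothetical illustration of the reduceByKey semantics, using a tiny in-memory RDD built with sc.parallelize:
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
scala> pairs.reduceByKey((a, b) => a + b).collect()   // e.g. Array((a,2), (b,1))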

scala>wordCounts.collect()

collect is one of the actions.
Syntax: collect()
It returns all the elements of the dataset as an array.

count is one of the actions.
Syntax: count()
It returns the number of elements in the dataset.
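For example, applying count to the wordCounts RDD built above returns the number of distinct words (the exact number depends on the contents of README.md):
scala> wordCounts.count()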

2.3 Caching
I do not quite understand caching yet, so I will just note the syntax:
scala>linesWithSpark.cache()
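A minimal sketch of the effect, reusing the linesWithSpark RDD from above: the first action after cache() computes the RDD and keeps it in memory, so later actions can reuse the cached data.
scala> linesWithSpark.count()   // first action computes the RDD and populates the cache
scala> linesWithSpark.count()   // later actions can read from the in-memory cache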

3. Standalone Job in Scala
First of all, I believe the stable version of Spark is 0.7.0, so I am trying to use Scala 2.9.2 in my projects.
The sbt file in the base project is as follows:
import AssemblyKeys._
name := "easycassandraserver"
organization := "com.sillycat"
version := "1.0"
scalaVersion := "2.9.2"
scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8")
resolvers ++= Seq(
  "sonatype releases" at "https://oss.sonatype.org/content/repositories/releases/",
  "sonatype snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
  "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/",
  "spray repo" at "http://repo.spray.io/"
)

libraryDependencies ++= Seq(
  "ch.qos.logback" % "logback-classic" % "1.0.3"
)

seq(Revolver.settings: _*)


The Spark project, which is built on top of the base project, is as follows:


import AssemblyKeys._
name := "easysparkserver"
organization := "com.sillycat"
version := "1.0"
scalaVersion := "2.9.2"
scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8")
resolvers ++= Seq(
  "sonatype releases" at "https://oss.sonatype.org/content/repositories/releases/",
  "sonatype snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
  "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/",
  "spray repo" at "http://repo.spray.io/",
  "Spray repo second" at "http://repo.spray.cc/",
  "Akka repo" at "http://repo.akka.io/releases/"
)

libraryDependencies ++= Seq(
  "com.sillycat" %% "easycassandraserver" % "1.0",
  "org.spark-project" %% "spark-core" % "0.7.0"
)

seq(Revolver.settings: _*)

assemblySettings

mainClass in assembly := Some("com.sillycat.easysparkserver.Boot")

artifact in (Compile, assembly) ~= { art =>
  art.copy(`classifier` = Some("assembly"))
}

excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  cp filter {_.data.getName == "parboiled-scala_2.10.0-RC1-1.1.3.jar"}
}

addArtifact(artifact in (Compile, assembly), assembly)
The implementation, following the example, is as follows:
package com.sillycat.easysparkserver.temp

import spark.SparkContext
import spark.SparkContext._

object SimpleJob extends App {
  val logFile = "/var/log/apache2" // Should be some file on your system
  // args: master, application name, Spark home, jars to ship, environment variables
  val sc = new SparkContext("local",
    "Simple Job",
    "/opt/spark",
    List("target/scala-2.9.2/easysparkserver_2.9.2-1.0.jar"),
    Map())
  val logData = sc.textFile(logFile, 2).cache()
  val numAs = logData.filter(line => line.contains("a")).count()
  val numBs = logData.filter(line => line.contains("b")).count()
  println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}


When we want to run this test sample, we do the following:
>sudo sbt package
>sudo sbt run

4. Change the Version to Scala 2.10.0
Relink Scala from 2.9.2 to 2.10.0.
From the Spark git source code, check out the branch for Scala 2.10:
>git status
# On branch scala-2.10

>sudo sbt/sbt clean
>sudo sbt/sbt compile
>sudo sbt/sbt package
>sudo ./run spark.examples.SparkPi local

It works.

Publish this 2.10 version locally:
>sudo sbt/sbt publish-local
[info] published spark-core_2.10 to /Users/carl/.m2/repository/org/spark-project/spark-core_2.10/0.7.0-SNAPSHOT/spark-core_2.10-0.7.0-SNAPSHOT.jar
[info] published spark-core_2.10 to /Users/carl/.ivy2/local/org.spark-project/spark-core_2.10/0.7.0-SNAPSHOT/jars/spark-core_2.10.jar

I changed the versions in build.sbt:
scalaVersion := "2.10.0"
"org.spark-project" %% "spark-core" % "0.7.0-SNAPSHOT"

And changed the implementation in the class as follows:
List("target/scala-2.10/easysparkserver_2.10-1.0.jar"),

But when I run the command
>sudo sbt run
I get what I think is a dead loop.

Error Message:
17:16:48.156 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - Checking for newly runnable parent stages
17:16:48.156 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - running: Set(Stage 0)
17:16:48.156 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - waiting: Set()
17:16:48.156 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - failed: Set()
17:16:48.167 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - Checking for newly runnable parent stages
17:16:48.168 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - running: Set(Stage 0)
17:16:48.168 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - waiting: Set()
17:16:48.168 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - failed: Set()

Solution:
At first, I had no idea how to solve this problem.
I changed the Scala version to 2.10.1, the same as my colleagues use, but that did not seem to help.

In the end, I found it was really a silly mistake; I changed the implementation and added one line to fix the problem.
https://groups.google.com/forum/?fromgroups#!topic/spark-users/hNgvlwAKlAE

println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

sc.stop()

Just stop the SparkContext right after getting the result.

I will try to understand more examples based on our projects.

References:
sbt publish
http://www.scala-sbt.org/release/docs/Detailed-Topics/Publishing

spark shell
http://spark-project.org/docs/latest/quick-start.html

Spark Grammar and Examples
http://spark-project.org/examples/
https://github.com/mesos/spark/wiki/Spark-Programming-Guide
http://spark-project.org/docs/latest/java-programming-guide.html
http://spark-project.org/docs/latest/quick-start.html

http://rdc.taobao.com/team/jm/archives/1823