Spark(2) SBT and Spark Shell in Quick Start

1. Make SBT Do What MVN INSTALL Does
When I run the command
>sudo sbt publish-local
SBT publishes all the jars and other files to a location like:
/Users/carl/.ivy2/local/com.sillycat/easycassandraserver_2.10/1.0/jars/easycassandraserver_2.10-assembly.jar

If we plan to customize the path, we can do it like this:
publishTo := Some(Resolver.file("file", new File(Path.userHome.absolutePath+"/.m2/repository")))

But I will not do that here; I just want to verify that my dependency works.
The base project is configured as:
name := "easycassandraserver"
organization := "com.sillycat"

version := "1.0"

The dependent project then declares the dependency:
"com.sillycat" %% "easycassandraserver" % "1.0"

2. Spark Shell in Quick Start
Start the shell by running this command under the Spark directory:
>./spark-shell

Error Message:
java.lang.ClassNotFoundException: scala.tools.nsc.settings.MutableSettings$SettingValue

Solution:
My Scala version was 2.10.0, so I used a soft link to switch it to 2.9.2; then it works fine.

scala>:help

2.1 Basics
RDDs can be created from Hadoop InputFormats (such as HDFS files), by transforming other RDDs, or from other storage systems supported by Hadoop, including the local file system, Amazon S3, Hypertable, HBase, etc.

Some Simple Operations
scala>val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = MappedRDD[1] at textFile at <console>:12

sc is short for SparkContext. textFile takes the URI of the file, which can be an hdfs://, s3n://, kfs://, etc. URI.
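For example (the host, port, and path here are hypothetical, assuming an HDFS cluster is running):
scala> val hdfsFile = sc.textFile("hdfs://localhost:9000/user/carl/README.md")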

scala>textFile.count()
res0: Long = 68

scala> textFile.first()
res1: String = # Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = FilteredRDD[2] at filter at <console>:14

scala> linesWithSpark.count()
res2: Long = 12

2.2 More Operations on RDD
scala>textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res5: Int = 15
map(line => line.split(" ").size) maps each line to the number of words it contains.

reduce is one of the actions.
Syntax: reduce(func)
The function takes two arguments and returns one.
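For example, the same longest-line computation can be written with Scala's math.max, which may be easier to read:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => math.max(a, b))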

scala>val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

flatMap flattens all the per-line arrays into a single collection of words, as I understand it:
textFile.flatMap(line => line.split(" "))

Then, on that collection, map pairs each word with the number 1:
map(word => (word, 1))

reduceByKey is one of the transformations.
That means it transforms one RDD into another RDD.
syntax reduceByKey(func, [numTasks])
(K, V) ----> (K, V)
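As a hypothetical illustration of the reduceByKey semantics, using a tiny in-memory RDD built with sc.parallelize:
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
scala> pairs.reduceByKey((a, b) => a + b).collect()   // e.g. Array((a,2), (b,1))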

scala>wordCounts.collect()

collect is one of the actions.
Syntax: collect()
It returns all the elements of the dataset as an array.

count is one of the actions.
Syntax: count()
It returns the number of elements in the dataset.
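For example, applying count to the wordCounts RDD built above returns the number of distinct words (the exact number depends on the contents of README.md):
scala> wordCounts.count()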

2.3 Caching
I do not quite understand caching yet, so I will just note the syntax:
scala>linesWithSpark.cache()
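A minimal sketch of the effect, reusing the linesWithSpark RDD from above: the first action after cache() computes the RDD and keeps it in memory, so later actions can reuse the cached data.
scala> linesWithSpark.count()   // first action computes the RDD and populates the cache
scala> linesWithSpark.count()   // later actions can read from the in-memory cache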

3. Standalone Job in Scala
First of all, I believe the stable version of Spark is 0.7.0, so I am trying to use Scala 2.9.2 in my projects.
The sbt file in the base project is as follows:
import AssemblyKeys._
name := "easycassandraserver"
organization := "com.sillycat"
version := "1.0"
scalaVersion := "2.9.2"
scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8")
resolvers ++= Seq(
  "sonatype releases" at "https://oss.sonatype.org/content/repositories/releases/",
  "sonatype snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
  "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/",
  "spray repo" at "http://repo.spray.io/"
)

libraryDependencies ++= Seq(
  "ch.qos.logback" % "logback-classic" % "1.0.3"
)

seq(Revolver.settings: _*)


The Spark project, which is built on top of the base project, is as follows:


import AssemblyKeys._
name := "easysparkserver"
organization := "com.sillycat"
version := "1.0"
scalaVersion := "2.9.2"
scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8")
resolvers ++= Seq(
  "sonatype releases" at "https://oss.sonatype.org/content/repositories/releases/",
  "sonatype snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
  "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/",
  "spray repo" at "http://repo.spray.io/",
  "Spray repo second" at "http://repo.spray.cc/",
  "Akka repo" at "http://repo.akka.io/releases/"
)

libraryDependencies ++= Seq(
  "com.sillycat" %% "easycassandraserver" % "1.0",
  "org.spark-project" %% "spark-core" % "0.7.0"
)

seq(Revolver.settings: _*)

assemblySettings

mainClass in assembly := Some("com.sillycat.easysparkserver.Boot")

artifact in (Compile, assembly) ~= { art =>
  art.copy(`classifier` = Some("assembly"))
}

excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  cp filter {_.data.getName == "parboiled-scala_2.10.0-RC1-1.1.3.jar"}
}

addArtifact(artifact in (Compile, assembly), assembly)
The implementation, following the example, is as follows:
package com.sillycat.easysparkserver.temp

import spark.SparkContext
import spark.SparkContext._

object SimpleJob extends App {
  val logFile = "/var/log/apache2" // Should be some file on your system
  // args: master, application name, Spark home, jars to ship, environment variables
  val sc = new SparkContext("local",
    "Simple Job",
    "/opt/spark",
    List("target/scala-2.9.2/easysparkserver_2.9.2-1.0.jar"),
    Map())
  val logData = sc.textFile(logFile, 2).cache()
  val numAs = logData.filter(line => line.contains("a")).count()
  val numBs = logData.filter(line => line.contains("b")).count()
  println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}


When we want to run this test sample, we do the following:
>sudo sbt package
>sudo sbt run

4. Change the Version to Scala 2.10.0
Relink Scala from 2.9.2 to 2.10.0.
From the Spark git source code, check out the branch for Scala 2.10:
>git status
# On branch scala-2.10

>sudo sbt/sbt clean
>sudo sbt/sbt compile
>sudo sbt/sbt package
>sudo ./run spark.examples.SparkPi local

It works.

Publish this 2.10 version locally:
>sudo sbt/sbt publish-local
[info] published spark-core_2.10 to /Users/carl/.m2/repository/org/spark-project/spark-core_2.10/0.7.0-SNAPSHOT/spark-core_2.10-0.7.0-SNAPSHOT.jar
[info] published spark-core_2.10 to /Users/carl/.ivy2/local/org.spark-project/spark-core_2.10/0.7.0-SNAPSHOT/jars/spark-core_2.10.jar

I changed the versions in build.sbt:
scalaVersion := "2.10.0"
"org.spark-project" %% "spark-core" % "0.7.0-SNAPSHOT"

And changed the implementation in the class as follows:
List("target/scala-2.10/easysparkserver_2.10-1.0.jar"),

But when I run the command
>sudo sbt run
I get what I think is a dead loop.

Error Message:
17:16:48.156 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - Checking for newly runnable parent stages
17:16:48.156 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - running: Set(Stage 0)
17:16:48.156 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - waiting: Set()
17:16:48.156 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - failed: Set()
17:16:48.167 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - Checking for newly runnable parent stages
17:16:48.168 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - running: Set(Stage 0)
17:16:48.168 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - waiting: Set()
17:16:48.168 [DAGScheduler] DEBUG spark.scheduler.DAGScheduler - failed: Set()

Solution:
At first, I had no idea how to solve this problem.
I changed the Scala version to 2.10.1, the same as my colleagues use, but that did not seem to help.

In the end, I found it was really a silly mistake; I changed the implementation and added one line to fix the problem.
https://groups.google.com/forum/?fromgroups#!topic/spark-users/hNgvlwAKlAE

println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

sc.stop()

Just stop the SparkContext right after getting the result.

I will try to understand more examples based on our projects.

References:
sbt publish
http://www.scala-sbt.org/release/docs/Detailed-Topics/Publishing

spark shell
http://spark-project.org/docs/latest/quick-start.html

Spark Grammar and Examples
http://spark-project.org/examples/
https://github.com/mesos/spark/wiki/Spark-Programming-Guide
http://spark-project.org/docs/latest/java-programming-guide.html
http://spark-project.org/docs/latest/quick-start.html

http://rdc.taobao.com/team/jm/archives/1823