For the past few weeks we’ve shown some simple examples on how to use Hive and Impala with different file formats along with partitioning. These approaches have exercised two very popular interfaces into Hadoop – namely MapReduce (via Hive) and Impala.
We’re now shifting gears to introduce a new up-and-comer – namely Spark. The goal of this post is to get you up and running with a very simple self-contained Spark application. (“Self-contained” is an important concept in this post: instead of simply pulling open a Scala shell, we are showing you how to use an IDE locally and produce an artifact you can publish to a Spark cluster.)
Before you get started, make sure to go through the first two steps of loading some CSV data into Hadoop. Just the unzip/load/add-to-HDFS step and the initial data-load step will do. You technically don’t even need the second step, but the example code references the data from Hive so we just used the same source.
Oh, and while you could do this in Java or Python, we picked Scala as it is something a bit new to many folks, so it’s good to get some run time using a new language – if only to play.
Step 1) Install a Scala IDE. There are a bunch of them, but here is a very common one: http://scala-ide.org/
Step 2) Install a Scala build tool. Here is a good one: http://www.scala-sbt.org/
Step 3) Open the IDE, create a Scala Project, and create a /src/main/scala folder structure. (See here for the recommended directory structure from Spark themselves.) Make the “scala” folder a Source Folder. Then add a Scala class file called Temp.scala along with one called TempsUtil.scala, and copy and paste the contents below into them. Replace the “<namenode>” placeholder with the value that fits your environment. Your Hadoop administrator should know this.
Here is the TempsUtil.scala file:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
object TempsUtil {
  def main(args: Array[String]): Unit = {
    // Point at the CSV file we loaded into the Hive warehouse earlier.
    val tempFile = "hdfs://<namenode>:8020/user/hive/warehouse/temps_txt/hourly_TEMP_2014.csv"
    val conf = new SparkConf().setAppName("Get Average Temperatures")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Brings in the implicit conversion from an RDD of case classes to a SchemaRDD.
    import sqlContext.createSchemaRDD
    val temps = sc.textFile(tempFile, 2).cache()
    // Split each CSV line into columns and map it onto the Temp case class;
    // column 13 is the temperature reading, so it is converted to a Double.
    val tempsrdd = temps.map(_.split(",")).map(t => Temp(
      t(0), t(1), t(2), t(3), t(4), t(5), t(6), t(7), t(8), t(9), t(10),
      t(11), t(12), t(13).toDouble, t(14), t(15), t(16), t(17), t(18),
      t(19), t(20), t(21)))
    tempsrdd.registerTempTable("temps_txt")
    val avgTemps = sqlContext.sql("select avg(degrees) from temps_txt")
    avgTemps.map(t => "Average Temp is " + t(0)).collect().foreach(println)
  }
}
Here is the Temp.scala file:
case class Temp(
  statecode: String,
  countycode: String,
  sitenum: String,
  paramcode: String,
  poc: String,
  latitude: String,
  longitude: String,
  datum: String,
  param: String,
  datelocal: String,
  timelocal: String,
  dategmt: String,
  timegmt: String,
  degrees: Double,
  uom: String,
  mdl: String,
  uncert: String,
  qual: String,
  method: String,
  methodname: String,
  state: String,
  county: String
)
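If you want a feel for what the job computes before you have a cluster handy, here is a minimal sketch that mimics the same split-and-map pipeline on plain Scala collections instead of an RDD. The two sample rows and the two-field MiniTemp case class are made up for illustration – the real Temp class above carries all 22 columns.

```scala
// Simplified stand-in for the full 22-field Temp case class (illustration only).
case class MiniTemp(statecode: String, degrees: Double)

object MiniTempsDemo {
  def main(args: Array[String]): Unit = {
    // Two made-up CSV rows with just two columns: state code, temperature.
    val lines = Seq("01,10.0", "02,20.0")

    // Same shape as the Spark job: split each line, map it onto a case class.
    val temps = lines.map(_.split(",")).map(t => MiniTemp(t(0), t(1).toDouble))

    // Plain-collections equivalent of "select avg(degrees) from temps_txt".
    val avg = temps.map(_.degrees).sum / temps.size
    println("Average Temp is " + avg) // prints "Average Temp is 15.0"
  }
}
```

Once the parsing and aggregation make sense in miniature, the Spark version is the same logic distributed across partitions.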
Step 4) In the IDE, create a file called temp.sbt with the following contents. This file is used by the Scala build tool you installed in step 2. Think of it as a Maven POM file, just much simpler.
name := "Temperature Application"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.2.1"
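One thing worth knowing about the lines above: the `%%` operator tells sbt to append your project’s Scala binary version to the artifact name, so with scalaVersion 2.10.4 these resolve to spark-core_2.10 and spark-sql_2.10. Spelled out with a single `%`, the equivalent would be:

```scala
// Explicit form: the Scala binary version is written into the artifact name.
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.2.1"
```

This is also why the built jar ends up named temperature-application_2.10-1.0.jar in the later steps.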
Step 5) Convert the project to a Maven project (Right Click | Configure | Convert to Maven Project) and add the following two dependencies. If you’re not familiar with Maven, simply right click on the project and select Maven | Add Dependencies. Then fill out the Group ID, Artifact ID and Version information that you get from these two links.
http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10/1.2.1
http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.2.1
Step 6) Now go out to the command line and get to the root directory of your project, e.g. C:\Users\Admin\workspace_scala\MyScalaProject, and then issue “sbt.bat package”. You might have to fully qualify the path to “sbt.bat”. Under the “target” folder (possibly one level deeper, e.g. target\scala-2.10) you should find your jar file. (Don’t laugh that we’re using Windows for this example. We’re doing that on purpose. Or at least that is the story we’re sticking with…)
Step 7) Move this jar file out to your Spark cluster and issue the following command to kick off the job.
spark-submit --master yarn --class TempsUtil temperature-application_2.10-1.0.jar
Or, to run the driver locally in yarn-client mode:
spark-submit --master yarn-client --class TempsUtil temperature-application_2.10-1.0.jar
If you run into any issues, feel free to add a comment and we’ll try to resolve them for you. The goal is to make this as simple as possible so you can quickly get a perspective into what writing a standalone Spark job entails.