Example Self-Contained Spark Application

For the past few weeks we’ve shown some simple examples of how to use Hive and Impala with different file formats, along with partitioning. Those approaches exercised two very popular interfaces into Hadoop – namely Map/Reduce with Hive, and Impala.

We’re now shifting gears to introduce a new up-and-comer – namely Spark. The goal of this post is to get you up and running with a very simple self-contained Spark application. (“Self-contained” is the important concept in this post: instead of simply pulling open a Scala shell, we show you how to use an IDE locally and produce an artifact you can publish to a Spark cluster.)

Before you get started, make sure to go through the first two steps of loading some CSV data into Hadoop. Just the unzip/load/add-to-HDFS step and the load-some-initial-data step will do. You technically don’t even need the second step, but the example code references the data from Hive, so we used the same source.
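If you just need the quick version, the HDFS part of that prep boils down to a couple of commands along these lines (the local and HDFS paths are assumptions – use whatever matches your environment and wherever your temps_txt Hive table points):

hdfs dfs -mkdir -p /user/hive/warehouse/temps_txt
hdfs dfs -put hourly_TEMP_2014.csv /user/hive/warehouse/temps_txt/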

Oh, and while you could do this in Java or Python, we picked Scala since it’s a bit new to many folks, and it’s good to get some run time with a new language – if only to play.

Step 1) Install a Scala IDE. There are a bunch of them, but here is a very common one: http://scala-ide.org/

Step 2) Install a Scala build tool. Here is a good one: http://www.scala-sbt.org/

Step 3) Open the IDE, create a Scala project, and create a /src/main/scala folder structure. (See here for the recommended directory structure from Spark themselves.) Make the “scala” folder a source folder. Then add a Scala class file called Temp.scala along with one called TempsUtil.scala, and copy and paste the contents below into them. Replace the “<namenode>” value with the one that fits your environment; your Hadoop administrator should know this. The resulting layout is sketched just below.
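For reference, the layout we’re aiming for looks roughly like this (the project name is whatever you called yours; temp.sbt comes in Step 4 below):

MyScalaProject/
  temp.sbt
  src/
    main/
      scala/
        Temp.scala
        TempsUtil.scala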

Here is the TempsUtil.scala file


import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext

object TempsUtil {
    def main(args: Array[String]): Unit = {
      // Replace <namenode> with the host name of your HDFS NameNode.
      val tempFile = "hdfs://<namenode>:8020/user/hive/warehouse/temps_txt/hourly_TEMP_2014.csv"
      val conf = new SparkConf().setAppName("Get Average Temperatures")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
      // Brings the implicit conversion from an RDD of case classes to a SchemaRDD into scope (Spark 1.2).
      import sqlContext.createSchemaRDD
      // Read the CSV from HDFS as 2 partitions and cache it for reuse.
      val temps = sc.textFile(tempFile, 2).cache()
      // Split each line on commas and map the 22 positional fields onto the Temp case class.
      val tempsrdd = temps.map(_.split(",")).map(t => Temp(
          t(0),
          t(1),
          t(2),
          t(3),
          t(4),
          t(5),
          t(6),
          t(7),
          t(8),
          t(9),
          t(10),
          t(11),
          t(12),
          t(13).toDouble,
          t(14),
          t(15),
          t(16),
          t(17),
          t(18),
          t(19),
          t(20),
          t(21)
      ))
      
      // Register the RDD as a temporary table so it can be queried with SQL.
      tempsrdd.registerTempTable("temps_txt")
      val avgTemps = sqlContext.sql("select avg(degrees) from temps_txt")
      avgTemps.map(t => "Average Temp is " + t(0)).collect().foreach(println)
      sc.stop()
  }
}
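One thing to watch with the positional mapping above: String.split(",") drops trailing empty fields, so a row whose last columns are blank can come back with fewer than 22 elements and throw an ArrayIndexOutOfBoundsException. A quick, optional check you could run against the same temps RDD looks something like this:

      // Count rows that do not split into exactly the 22 fields Temp expects.
      // split(",", -1) keeps trailing empty strings so blank columns still count.
      val badRows = temps.map(_.split(",", -1)).filter(_.length != 22)
      println("Rows with unexpected field count: " + badRows.count())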

Here is the Temp.scala file

case class Temp(
  statecode:String,
  countrycode: String,
  sitenum: String,
  paramcode: String,
  poc: String,
  latitude: String,
  longitude: String,
  datum: String,
  param: String,
  datelocal: String,
  timelocal: String,
  dategmt: String,
  timegmt: String,
  degrees: Double,
  uom: String,
  mdl: String,
  uncert: String,
  qual: String,
  method: String,
  methodname: String,
  state: String,
  county: String
)
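If you’d like to sanity-check things from the IDE before building the jar, one option (purely a local convenience, not part of the final artifact – the local file path below is hypothetical) is to point the SparkConf at a local master and a local copy of the CSV:

      // Local smoke test only – remove before packaging for the cluster.
      val conf = new SparkConf()
        .setAppName("Get Average Temperatures")
        .setMaster("local[2]")  // run Spark in-process with two worker threads
      val sc = new SparkContext(conf)
      // Hypothetical local path to the same CSV file.
      val temps = sc.textFile("C:/data/hourly_TEMP_2014.csv", 2)
      println("Rows read: " + temps.count())
      sc.stop()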

Step 4) In the IDE, create a file called temp.sbt with the following contents. This file is used by the Scala build tool you installed in Step 2 – think of it as a Maven POM file, just much simpler.

name := "Temperature Application"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.2.1"

Step 5) Convert the project to a Maven project (Right Click | Configure | Convert to Maven Project) and add the following two dependencies. If you’re not familiar with Maven, simply right-click on the project and select Maven | Add Dependencies, then fill in the Group ID, Artifact ID, and Version information from these two links:

http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10/1.2.1

http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.2.1

Step 6) Now go out to the command line and change to the root directory of your project, e.g. C:\Users\Admin\workspace_scala\MyScalaProject, then issue “sbt.bat package”. You might have to fully qualify the “sbt.bat” file. Under the “target” directory (possibly one level deeper than that) you should find your jar file. (Don’t laugh that we’re using Windows for this example. We’re doing that on purpose. Or at least that is the story we’re sticking with…)
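Concretely, that step looks something like the following; the scala-2.10 subdirectory is where sbt typically drops the jar given the scalaVersion above, and the jar name is derived from the name and version in temp.sbt:

cd C:\Users\Admin\workspace_scala\MyScalaProject
sbt.bat package
dir target\scala-2.10\temperature-application_2.10-1.0.jar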

Step 7) Move this jar file out to your Spark cluster and issue the following command to kick off the job.

spark-submit --master yarn --class TempsUtil temperature-application_2.10-1.0.jar

If you run into any issues, feel free to add a comment and we’ll try to resolve them for you. The goal is to make this as simple as possible so you can quickly get a perspective on what writing a standalone Spark job entails.

  

As an alternative to the command above, you can run the driver in yarn-client mode:

spark-submit --master yarn-client --class TempsUtil temperature-application_2.10-1.0.jar
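A small footnote for anyone following along on a newer Spark release: the yarn-client / yarn-cluster shorthands were later deprecated, and the equivalent submission is expressed with an explicit deploy mode, roughly:

spark-submit --master yarn --deploy-mode client --class TempsUtil temperature-application_2.10-1.0.jar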
