Setting up a Spark 1.2.1 development environment on Windows

This article describes how to set up a Spark 1.2.1 development environment on Windows with IntelliJ IDEA, covering installing Scala, configuring IDEA, creating a Scala project, adding the Spark dependencies, and deploying the Spark application.



1. Environment preparation

Download and install Scala; on Windows the easiest option is the MSI installer, which you can simply double-click to install. Spark 1.2.1 is built against Scala 2.10, so a 2.10.x release such as 2.10.5 is the safe choice.

2. Installing IntelliJ IDEA

Download IntelliJ IDEA from the JetBrains website (jetbrains.com). There are two editions, Community and Ultimate; the former is free, so pick whichever suits you.
After installing IDEA by following the setup wizard, you also need the Scala plugin. There are two ways to install it:

  • Start IDEA -> Welcome to IntelliJ IDEA -> Configure -> Plugins -> Install JetBrains plugin… -> search for Scala and install it.
  • Start IDEA -> Welcome to IntelliJ IDEA -> Open Project -> File -> Settings -> Plugins -> Install JetBrains plugin… -> search for Scala and install it.

If you prefer the dark UI, choose Darcula under File -> Settings -> Appearance -> Theme. You will also need to change the default font, otherwise Chinese text in the menus will not display correctly.



3. Building the Scala application

A. Creating a new project

  • Create a project named sparkWordCountTest: start IDEA -> Welcome to IntelliJ IDEA -> Create New Project -> Scala -> Non-SBT -> name the project sparkWordCountTest (be sure to select the JDK and Scala compiler you installed) -> Finish.
  • Add Maven support: right-click the project -> Add Framework Support… -> select Maven. This prepares the project for building the jar with Maven later.


(Screenshot: adding Maven support)

Then edit pom.xml and add the Spark and Hadoop dependencies:

<properties>
  <!-- Must match the Scala binary version of spark-core_2.10 (Scala 2.10.x); 2.10.5 is also the version used on the cluster. -->
  <scala.version>2.10.5</scala.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
</dependencies>

Note: this article uses Spark 1.2.1 and Hadoop 2.6.0; adjust the versions to match your own environment.
If you prefer not to use Maven, you can add the dependency jars manually instead.
Add the libraries via File -> Project Structure -> Libraries -> + -> Java -> and select the following jars from wherever you placed them:

  • spark-assembly-1.2.1-hadoop2.6.0.jar
  • scala-library.jar

The project directory structure then looks like this:

(Screenshot: project directory structure)

B. Developing the Spark program

We use a simple WordCount program as the test application:
import org.apache.spark.{SparkConf, SparkContext}
// Needed in Spark 1.2.x for the implicit conversions that provide reduceByKey on pair RDDs.
import org.apache.spark.SparkContext._

object WordCount {

  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      println("usage is org.test.WordCount <input> <output>")
      return
    }

    // Point the application at the standalone master and give it a name.
    val conf = new SparkConf()
      .setMaster("spark://192.168.246.107:7077")
      .setAppName("My WordCount")
    val sc = new SparkContext(conf)

    // Split each line on whitespace, count each word, and write the result out.
    val textFile = sc.textFile(args(0))
    val result = textFile.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    result.saveAsTextFile(args(1))
    sc.stop()
  }
}
Notes:

  • setMaster: the master URL. Set it to local to run everything locally (which requires a local Spark/Hadoop environment), or point it at a remote standalone master such as spark://hadoop:7077.
  • setAppName: the name of the application.
  • setSparkHome: the Spark installation directory.
  • setJars: the location of the application jar(s), i.e. the jar built and packaged in step C (a short configuration sketch follows these notes).
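
As a minimal sketch of how these settings fit together when submitting straight from the IDE (the master URL, Spark home, and jar path below are the example values used elsewhere in this article; replace them with your own):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: run the driver inside the IDE and ship the packaged jar to a remote standalone cluster.
val conf = new SparkConf()
  .setMaster("spark://192.168.246.107:7077")                              // remote standalone master
  .setAppName("My WordCount")                                             // name shown in the Spark web UI
  .setSparkHome("/opt/app/spark-1.2.1-bin-2.6.0")                         // Spark installation directory on the cluster
  .setJars(List("D:\\SparkWordCountTest\\target\\Test-1.0-SNAPSHOT.jar")) // jar produced in step C below
val sc = new SparkContext(conf)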

C. Building the application jar

To package the project, open the Maven tool window via the button in the lower-left corner of IDEA; it lists the Maven lifecycle commands. Double-click package to build the jar (this runs the Maven package phase).

(Screenshots: Maven tool window and lifecycle commands)

The line Building jar: D:\SparkWordCountTest\target\Test-1.0-SNAPSHOT.jar in the build output is the path of the generated jar.

4. Deploying the Spark application

Upload the packaged Spark application to the cluster and deploy it with spark-submit. The available options are listed below:
[hadoop@bigdata bin]$ ./spark-submit --help
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Usage: spark-submit [options] <app jar | python file> [app options]
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).
  --supervise                 If given, restarts the driver on failure.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 YARN-only:
  --executor-cores NUM        Number of cores per executor (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
[hadoop@bigdata bin]$

The command used to run this example:
[hadoop@bigdata spark-1.2.1-bin-2.6.0]$ ./bin/spark-submit --class Test.WordCount /opt/app/spark-1.2.1-bin-2.6.0/test/Test-1.0-SNAPSHOT.jar /user/hadoop/test/input/test1.txt /user/hadoop/test/output00001

The last two arguments are the input file and the output directory (HDFS paths in this example).

Options such as --master spark://hadoop108:7077 and --executor-memory 300m can also be passed on the command line, or their defaults can be configured in spark-env.sh:

export JAVA_HOME=/opt/java/jdk1.7
export HADOOP_CONF_DIR=/opt/app/hadoop-2.6.0/etc/hadoop
export HIVE_CONF_DIR=/opt/app/hive-0.13.1/conf
export SCALA_HOME=/opt/app/scala-2.10.5
export HADOOP_HOME=/opt/app/hadoop-2.6.0
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=192.168.246.107
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_MEMORY=1g
export SPARK_JAVA_OPTS=-Dspark.executor.memory=1g
export SPARK_HOME=/opt/app/spark-1.2.1-bin-2.6.0
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.2.1-hadoop2.6.0.jar
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_CLASSPATH=$SPARK_CLASSPATH:


Closing remarks:

The setup tested here develops the Spark program with IntelliJ on Windows and submits it from the IDE to a separate Spark cluster. This suits anyone who needs a Windows development environment without Spark and Hadoop installed locally (developing on a machine that already has Spark and Hadoop is, of course, even better). All that is required is:

val conf = new SparkConf().setAppName("Word Count").setMaster("spark://hadoop:7077").setJars(List("out\\sparkTest_jar\\sparkTest.jar"))

i.e. set the master to your Spark cluster's master URL and the jar path to where your packaged jar lives; or, equivalently:

val spark = new SparkContext("spark://hadoop:7077", "Word Count", "F:\\soft\\spark\\spark-1.2.1-bin-hadoop2.6", List("out\\sparkTest_jar\\sparkTest.jar"))
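
For quick iteration without a cluster, the same logic can also be run entirely inside the IDE in local mode. This is only a minimal sketch, assuming Spark and Hadoop libraries are usable on the local machine (on Windows this usually also requires the Hadoop native binaries); the object name LocalWordCount and the file input.txt are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
// Needed in Spark 1.2.x for the implicit conversions that provide reduceByKey on pair RDDs.
import org.apache.spark.SparkContext._

object LocalWordCount {
  def main(args: Array[String]): Unit = {
    // "local[2]" runs the driver and two worker threads in one JVM; no cluster or setJars needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("Local Word Count")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("input.txt")          // placeholder: any local text file
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}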
