Setting Up a Spark 1.2.1 Development Environment (for Windows)
1. Environment Preparation
Download and install Scala; the easiest option is the MSI installer, which you can simply double-click to install.
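To confirm the installation, you can start the Scala REPL from a command prompt (run scala) and print the library version; a quick, optional check:
// inside the Scala REPL: prints something like "Scala library version 2.10.5 -- Copyright ..."
println(scala.util.Properties.versionMsg)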
2. Installing IDEA
Download IntelliJ IDEA from the official site, jetbrains.com. There are Community and Ultimate editions; the former is free. Choose whichever edition suits you.
After installing IDEA according to the setup wizard, you need to install the Scala plugin. There are two ways to do so:
- Start IDEA -> Welcome to IntelliJ IDEA -> Configure -> Plugins -> Install JetBrains plugin… -> find Scala and install it.
- Start IDEA -> Welcome to IntelliJ IDEA -> Open Project -> File -> Settings -> Plugins -> Install JetBrains plugin… -> find Scala and install it.
If you want the dark UI, choose Darcula under File -> Settings -> Appearance -> Theme. You will also need to change the default font, otherwise Chinese text in the menus will not display correctly.
3. Creating a Scala Application
A: Create a new project
- Create a project named sparkTest: start IDEA -> Welcome to IntelliJ IDEA -> Create New Project -> Scala -> Non-SBT -> create the project named sparkTest (be sure to select your installed JDK and Scala compiler here) -> Finish.
- Add Maven support: right-click the project -> Add Framework Support… -> select Maven. This prepares the project for building the JAR automatically with Maven later.
Then configure pom.xml, adding the Scala version property and the Spark and Hadoop dependencies:
<properties>
  <scala.version>2.10.5</scala.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
</dependencies>
Note: this article uses Spark 1.2.1, Hadoop 2.6.0, and Scala 2.10.5 (a 2.10.x Scala library is required to match the spark-core_2.10 artifact); adjust the versions to fit your own environment.
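Once the dependencies resolve, a minimal local smoke test can confirm that spark-core is usable from the project. This is just an illustrative sketch (the object name DependencyCheck is arbitrary); it runs entirely in local mode and needs no cluster:
import org.apache.spark.{SparkConf, SparkContext}

object DependencyCheck {
  def main(args: Array[String]) {
    // local[2] runs Spark inside the current JVM with 2 threads, so no cluster is needed
    val conf = new SparkConf().setMaster("local[2]").setAppName("DependencyCheck")
    val sc = new SparkContext(conf)
    // parallelize a small collection and count it, exercising the core RDD API
    println(sc.parallelize(1 to 100).count()) // should print 100
    sc.stop()
  }
}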
If you prefer not to use Maven, you can add the dependency JARs manually instead.
Add the libraries: File -> Project Structure -> Libraries -> + -> Java -> select the following JARs from wherever you placed them:
- spark-assembly-1.2.1-hadoop2.6.0.jar
- scala-library.jar
Project directory structure: (screenshot omitted)
B: Write the application code, a simple WordCount:
package Test

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 2) {
      println("usage is Test.WordCount <input> <output>")
      return
    }
    val conf = new SparkConf()
    conf.setMaster("spark://192.168.246.107:7077").setAppName("My WordCount")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(args(0))
    val result = textFile.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    result.saveAsTextFile(args(1))
    sc.stop()
  }
}
- setMaster: the master URL. Set it to local to run locally (a local Spark/Hadoop environment is still required), or point it at a remote standalone master, e.g. spark://hadoop:7077.
- setAppName: the application name.
- setSparkHome: the Spark installation directory.
- setJars: the location of the JAR(s), i.e. the application JAR produced when the project is compiled and packaged. A short sketch combining these settings follows this list.
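Here is a sketch of the two typical configurations; the master URL, Spark home, and JAR path below are the ones used elsewhere in this article, so substitute your own values:
import org.apache.spark.SparkConf

// Local mode: everything runs inside the IDE's JVM; handy for debugging
val localConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("My WordCount")

// Standalone cluster from the IDE: point at the master and ship the packaged JAR
val clusterConf = new SparkConf()
  .setMaster("spark://192.168.246.107:7077")            // master URL of this article's cluster
  .setAppName("My WordCount")
  .setSparkHome("/opt/app/spark-1.2.1-bin-2.6.0")       // Spark installation directory
  .setJars(List("out\\sparkTest_jar\\sparkTest.jar"))   // JAR built from this project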
C: Build the application package
Configure packaging for sparkTest. With Maven, open IDEA's Maven Projects panel and double-click the package goal (the equivalent of running mvn package); this builds the JAR.
4. Deploying the Spark Application
The full set of spark-submit options is listed below:
[hadoop@bigdata bin]$ ./spark-submit --help
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Usage: spark-submit [options] <app jar | python file> [app options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--help, -h Show this help message and exit
--verbose, -v Print additional debug output
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
--supervise If given, restarts the driver on failure.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
YARN-only:
--executor-cores NUM Number of cores per executor (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
[hadoop@bigdata bin]$
The command used to run this example:
[hadoop@bigdata spark-1.2.1-bin-2.6.0]$ ./bin/spark-submit --class Test.WordCount /opt/app/spark-1.2.1-bin-2.6.0/test/Test-1.0-SNAPSHOT.jar /user/hadoop/test/input/test1.txt /user/hadoop/test/output00001
The last two arguments are the input file and the output directory, respectively.
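saveAsTextFile writes the result as a directory of part-* files. One optional way to inspect them is from spark-shell on the cluster, a quick check using the output path from the command above:
// in spark-shell, sc is the pre-created SparkContext
val out = sc.textFile("/user/hadoop/test/output00001")   // output directory written by the job
out.take(10).foreach(println)                            // print the first few (word,count) lines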
Other options can also be passed on the command line, for example --master spark://hadoop108:7077 or --executor-memory 300m.
These settings can also be configured in spark-env.sh:
export JAVA_HOME=/opt/java/jdk1.7
export HADOOP_CONF_DIR=/opt/app/hadoop-2.6.0/etc/hadoop
export HIVE_CONF_DIR=/opt/app/hive-0.13.1/conf
export SCALA_HOME=/opt/app/scala-2.10.5
export HADOOP_HOME=/opt/app/hadoop-2.6.0
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=192.168.246.107
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_MEMORY=1g
export SPARK_JAVA_OPTS=-Dspark.executor.memory=1g
export SPARK_HOME=/opt/app/spark-1.2.1-bin-2.6.0
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.2.1-hadoop2.6.0.jar
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_CLASSPATH=$SPARK_CLASSPATH:
Closing remarks: if you run the application directly from IDEA against a remote standalone master instead of submitting it with spark-submit, you must tell Spark where the packaged application JAR (and, optionally, the Spark installation) lives, for example:
val conf = new SparkConf()
  .setAppName("Word Count")
  .setMaster("spark://hadoop:7077")
  .setJars(List("out\\sparkTest_jar\\sparkTest.jar"))

val spark = new SparkContext("spark://hadoop:7077", "Word Count",
  "F:\\soft\\spark\\spark-1.2.1-bin-hadoop2.6",
  List("out\\sparkTest_jar\\sparkTest.jar"))