Introduction to Spark Job Server
Date      | Version | Revised by | Approved by | Revision notes
2017.2.27 | 1.0     | 章鑫       |             | Initial version
1 Introduction
Spark-jobserver provides a RESTful interface for submitting and managing Spark jobs, jars, and job contexts. The repository contains the complete Spark job server project, including unit tests and deployment scripts.
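To give a feel for what managing jobs, jars, and job contexts over REST looks like in practice, here is a small sketch of typical calls against a running server. The host and port (localhost:8090) are illustrative assumptions; the paths follow the project's documented REST API:

# list uploaded jars, running contexts, and known jobs
curl localhost:8090/jars
curl localhost:8090/contexts
curl localhost:8090/jobs

# create a long-running context, then tear it down (which also stops any jobs running in it)
curl -X POST "localhost:8090/contexts/test-context?num-cpu-cores=2&memory-per-node=512m"
curl -X DELETE localhost:8090/contexts/test-context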
2 Features and Versions
2.1 Features
- "Spark as a Service": REST-style APIs for managing every aspect of jobs and contexts
- Supports SparkSQL, Hive, StreamingContext/jobs as well as custom job contexts
- LDAP authentication through integration with Apache Shiro
- Sub-second, low-latency jobs via long-running job contexts
- A running job can be stopped by killing its context
- Separate jar-upload step for faster job startup
- Asynchronous and synchronous job APIs; the synchronous API is particularly useful for low-latency jobs (see the sketch after this list)
- Works with standalone Spark as well as Mesos and YARN
- Job and jar information is persisted through a pluggable DAO interface
- RDDs or DataFrames can be named and cached, then retrieved by name, so they can be shared and reused across jobs
- Supports Scala 2.10 and 2.11
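To make the synchronous/asynchronous distinction concrete, here is a sketch of the two submission styles, assuming a jar containing spark.jobserver.WordCountExample has already been uploaded under the app name test (the upload itself is shown in section 3.1.1) and that the server listens on the default port 8090:

# asynchronous: returns a job ID immediately; poll /jobs/<job-id> for status and result
curl -d "input.string = a b c a b" \
  "localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample"
curl localhost:8090/jobs/<job-id>

# synchronous: block until the job finishes and return the result in the response body
curl -d "input.string = a b c a b" \
  "localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&sync=true"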
2.2 Versions
See the notes/ directory for more information about each version.
3 Deployment and Usage
Following the version mapping mentioned above, download the archive (spark-jobserver-0.6.2.tar.gz) and unpack it. Note that an SBT build environment is required.
The directory layout looks like this:
[root@node8 spark-jobserver-0.6.2]# ll
total 72
drwxrwxr-x 5 root root    56 Dec 19 14:00 akka-app
drwxrwxr-x 3 root root  4096 Feb 23 15:58 bin
-rw-rw-r-- 1 root root    27 Apr 28  2016 build.sbt
lrwxrwxrwx 1 root root    17 Apr 28  2016 config -> job-server/config
drwxrwxr-x 3 root root  4096 Apr 28  2016 doc
drwxrwxr-x 6 root root    53 Dec 19 14:00 job-server
drwxrwxr-x 4 root root    29 Dec 19 14:00 job-server-api
drwxrwxr-x 5 root root    40 Dec 19 16:47 job-server-extras
drwxrwxr-x 4 root root    29 Dec 19 14:14 job-server-tests
-rw-rw-r-- 1 root root   579 Apr 28  2016 LICENSE.md
drwxrwxr-x 2 root root  4096 Apr 28  2016 notes
drwxrwxr-x 4 root root  4096 Dec 20 09:44 project
-rw-rw-r-- 1 root root 35345 Apr 28  2016 README.md
-rw-rw-r-- 1 root root  5939 Apr 28  2016 scalastyle-config.xml
drwxrwxr-x 3 root root    17 Apr 28  2016 src
drwxr-xr-x 5 root root    88 Dec 19 14:15 target
-rw-rw-r-- 1 root root    31 Apr 28  2016 version.sbt
The simplest way to get started is the Docker container package, which makes it easy to bring up a distributed Spark cluster together with the job server. All deployment scripts live in the bin directory.
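As a sketch only: the upstream project publishes Docker images, so a single-node job server can usually be started with one command. The image name and tag below are assumptions; check which published image matches your spark-jobserver version before relying on it.

# start the job server in a container and expose its REST port
docker run -d -p 8090:8090 velvia/spark-jobserver:0.6.2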
Alternatively:
- Build and run the job server locally with SBT. Note: YARN mode is not supported this way, and spark.master should be set to local[*]; for anything real, deploy on YARN or another actual cluster.
- Deploy the job server on a cluster. There are two options:
  - server_deploy.sh deploys the job server to a directory on a remote machine.
  - server_package.sh packages the job server into a local directory, from which you can deploy it as a directory, or as a .tar.gz for YARN or Mesos.
- EC2 deploy scripts: follow the EC2 instructions to stand up a cluster with Spark, the job server, and an example application.
- EMR deploy instructions: see the EMR notes.
Note: Spark Job Server can optionally run each SparkContext in its own forked JVM process when the configuration option spark.jobserver.context-per-jvm is set to true. This feature is not yet available in SBT/local mode. See the next subsection for more information.
1. Copy config/local.sh.template to <environment>.sh and edit it as needed. Note: be sure to set SPARK_VERSION correctly if you need to build against a different Spark version. An example:
# Environment and deploy file
# For use with bin/server_deploy, bin/server_package etc.
DEPLOY_HOSTS="node6.dcom node7.dcom node8.dcom"
APP_USER=root
APP_GROUP=root
# optional SSH Key to login to deploy server
#SSH_KEY=/root/.ssh/id_rsa.pub
INSTALL_DIR=/home/spark/job-server
LOG_DIR=/var/log/job-server
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=1.6.1
MAX_DIRECT_MEMORY=512M
SPARK_HOME=/usr/hdp/current/spark-client
SPARK_CONF_DIR=$SPARK_HOME/conf
# Only needed for Mesos deploys
#SPARK_EXECUTOR_URI=/home/spark/spark-1.6.1.tar.gz
# Only needed for YARN running outside of the cluster
# You will need to COPY these files from your cluster to the remote machine
# Normally these are kept on the cluster in /etc/hadoop/conf
YARN_CONF_DIR=/usr/hdp/current/hadoop-client/conf
HADOOP_CONF_DIR=/usr/hdp/current/hadoop-client/conf
#
# Also optional: extra JVM args for spark-submit
# export SPARK_SUBMIT_OPTS+="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5433"
SCALA_VERSION=2.10.4  # or 2.11.6
2. Copy config/shiro.ini.template to shiro.ini and edit it as needed. Note: this is only required when authentication = on.
3. Copy config/local.conf.template to <environment>.conf and edit it as needed. An example:
# Template for a Spark Job Server configuration file
# When deployed these settings are loaded when job server starts
#
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  # master = "local[4]"
  # master = "mesos://vm28-hulk-pub:5050"
  master = "yarn-client"

  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4

  jobserver {
    port = 8090
    jar-store-rootdir = /tmp/jobserver/jars

    context-per-jvm = false

    jobdao = spark.jobserver.io.JobFileDAO

    job-binary-paths {
      WL = /home/spark-jobserver/VehicleFootHold-assembly-1.0.jar
    }

    filedao {
      rootdir = /tmp/spark-job-server/filedao/data
    }

    # When using chunked transfer encoding with scala Stream job results, this is the size of each chunk
    result-chunk-size = 1m
  }

  # predefined Spark contexts
  # contexts {
  #   my-low-latency-context {
  #     num-cpu-cores = 1        # Number of cores to allocate. Required.
  #     memory-per-node = 512m   # Executor memory per node, -Xmx style eg 512m, 1G, etc.
  #   }
  #   # define additional contexts here
  # }

  # universal context configuration. These settings can be overridden, see README.md
  context-settings {
    num-cpu-cores = 2        # Number of cores to allocate. Required.
    memory-per-node = 512m   # Executor memory per node, -Xmx style eg 512m, 1G, etc.

    # in case spark distribution should be accessed from HDFS (as opposed to being installed on every mesos slave)
    # spark.executor.uri = "hdfs://namenode:8020/apps/spark/spark.tgz"

    # uris of jars to be loaded into the classpath for this context. Uris is a string list, or a string separated by commas ','
    # dependent-jar-uris = ["file:///some/path/present/in/each/mesos/slave/somepackage.jar"]

    # If you wish to pass any settings directly to the sparkConf as-is, add them here in passthrough,
    # such as hadoop connection settings that don't use the "spark." prefix
    passthrough {
      #es.nodes = "192.1.1.1"
    }
  }

  # This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
  home = "/home/spark/spark"
}

# Note that you can use this file to define settings not only for job server,
# but for your Spark jobs as well. Spark job configuration merges with this configuration file as defaults.

akka {
  remote.netty.tcp {
    # This controls the maximum message size, including job results, that can be sent
    # maximum-frame-size = 10 MiB
  }
}
4. bin/server_deploy.sh <environment> -- this packages the job server along with the configuration files and pushes it to the remote servers you configured in <environment>.sh.
5. On the remote servers, start and stop the job server with server_start.sh and server_stop.sh in the deployment directory.
server_start.sh invokes spark-submit, so you can pass it any additional arguments that spark-submit supports.
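Putting steps 4 and 5 together, a typical deploy/start cycle looks roughly like the sketch below. The <environment> name and install directory come from your <environment>.sh, and the pass-through of extra spark-submit arguments is assumed from the note above:

# on the build machine: package the job server and push it to every host in DEPLOY_HOSTS
bin/server_deploy.sh <environment>

# on each remote host, inside INSTALL_DIR (e.g. /home/spark/job-server)
./server_start.sh
./server_stop.sh

# extra arguments are handed on to spark-submit, e.g.
./server_start.sh --conf spark.ui.port=4041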
Note: by default this relies on the assembly jar built from job-server-extras, which includes SQLContext and HiveContext, so you inherit every issue that the extras dependencies bring with them. Consider changing the install scripts to run sbt job-server/assembly instead, which leaves out the extras.
3.1 Development Mode
The walk-through below is not how the job server is used in a real environment; it runs the job server in local mode with SBT to help you understand how it works.
SBT must already be installed correctly on the system.
Get the current SBT version with:
export VER=`sbt version | tail -1 | cut -f2`
In the SBT shell, simply type "reStart" to start the job server with the default configuration file. You may optionally pass the path to a configuration file as an argument, and JVM arguments can be given after "---". A command using all of these options looks like this:
job-server/reStart /path/to/my.conf --- -Xmx8g
Note that reStart (SBT Revolver) forks the job server into a separate process. If you change the code, simply type reStart again at the SBT shell prompt; it recompiles the code and restarts the job server, which makes for very fast edit-and-rerun cycles.
Note: you cannot run sbt reStart from the operating-system shell; SBT would start the job server and then immediately kill it. Example jobs can be found in the job-server-tests project.
When you use reStart, the log file is written to job-server/job-server-local.log. You can also add jars to the classpath by setting the environment variable EXTRA_JAR.
3.1.1 WordCountExample
(1) Upload the jar to the server
First, compile and package the code containing WordCountExample into a jar with sbt job-server-tests/package.
Upload the jar to the job server with curl; a sketch of the call follows.
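This sketch assumes the server listens on localhost:8090, a Scala 2.10 build, and VER set as in section 3.1; adjust the artifact path to your actual build output:

curl --data-binary @job-server-tests/target/scala-2.10/job-server-tests_2.10-$VER.jar localhost:8090/jars/test

On success the server responds with OK, and the jar can afterwards be referenced as appName=test when submitting jobs.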