【Spark】No.3 Spark shell

This post introduces the basics of the Spark shell, including startup configuration and the difference between local and cluster modes, and walks through the classic word-count example to show how to write and run a Spark program.

Part 1: The Spark shell

1 A brief introduction to the Spark shell

The Spark shell works by compiling each line of Scala code into a class, which is then handed to Spark for execution.

2 Starting the Spark shell (concept)

From the Spark installation directory, run spark-shell --master <master-url> to start a shell from which Spark jobs can be submitted.

3 The master address can be set in any of the following ways (example invocations follow the list)

3.1 local[N]

Run locally with N worker threads (local[*] uses as many threads as there are CPU cores).

3.2 spark://host:port

Run on a Spark standalone cluster, pointing at the cluster's master address; the port defaults to 7077.

3.3 mesos://host:port

Run on Apache Mesos, pointing at the Mesos master address.

3.4 yarn

Run on YARN; the YARN address is taken from the configuration directory named by the HADOOP_CONF_DIR environment variable.
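
For example, the modes above could be started like this (a sketch: mini-01 is the master host used later in this post, the thread count is arbitrary, and the Mesos host is a placeholder shown with the usual default port):

spark-shell --master local[2]
spark-shell --master spark://mini-01:7077
spark-shell --master mesos://mesos-master:5050
spark-shell --master yarn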

Part 2: A classic introductory example (word count)

1 Create a wordCount.txt file under the /usr/local/apps directory and put some strings in it, as in the example below.
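
For instance (the contents are arbitrary; any words separated by spaces will do):

echo "hello spark hello hadoop hello scala" > /usr/local/apps/wordCount.txt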

2 Then go to the bin directory on the Spark master node (mini-01). Note that this is bin, not the sbin directory used to start the Spark cluster.

cd spark/bin/

3 Start the Spark shell

spark-shell --master local[6]

When the Spark welcome banner and the scala> prompt appear, spark-shell has started successfully.

Explanation:

The sc shown in the startup output is the SparkContext that spark-shell creates for us automatically; if we write code in an IDE such as IDEA, we have to create the SparkContext ourselves, as in the sketch below.
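
A minimal sketch of creating it by hand, assuming a standard Spark project (the application name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// local[6] mirrors the shell invocation above
val conf = new SparkConf().setAppName("wordCount").setMaster("local[6]")
val sc = new SparkContext(conf)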

In spark-shell --master local[6], the 6 specifies that six threads will run our Spark program.

4 Run the first Spark program

The Array printed at the end of the output is the collected result: a count for each of the words we wrote into wordCount.txt earlier.

Code explanation

1 Read the contents of the wordCount.txt file (conceptually similar to Java IO streams)

val rdd1 = sc.textFile("file:///usr/local/apps/wordCount.txt")

2 Split the lines that were read on spaces and flatten the result; flatMap is a Spark RDD operator (a small illustration follows the code)

val rdd2 = rdd1.flatMap(item=>item.split(" "))
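
On toy data (not the article's file), flatMap behaves like this:

sc.parallelize(Seq("hello spark", "hello scala")).flatMap(item => item.split(" ")).collect()
// Array(hello, spark, hello, scala)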

3 Here each item becomes a two-element tuple, i.e. the key-value pair (word, 1); map is a Spark RDD operator (illustrated after the code)

val rdd3 = rdd2.map(item=>(item,1))
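
Again on toy data:

sc.parallelize(Seq("hello", "spark", "hello")).map(item => (item, 1)).collect()
// Array((hello,1), (spark,1), (hello,1))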

4 reduceByKey computes over key-value (KV) data, merging the values of all entries that share a key; reduceByKey is a Spark RDD operator (illustrated after the code)

val rdd4 = rdd3.reduceByKey((curr,agg)=>curr+agg)
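
For example, summing the values per key on toy pairs:

sc.parallelize(Seq(("hello", 1), ("hello", 1), ("spark", 1))).reduceByKey((curr, agg) => curr + agg).collect()
// Array((hello,2), (spark,1))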

5 Collect the results back to the driver; collect is a Spark RDD operator (an action that triggers the computation)

rdd4.collect()
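
The whole pipeline can also be written as a single chained expression; the commented output is only illustrative, since it depends on what wordCount.txt contains (here, the echo example above):

sc.textFile("file:///usr/local/apps/wordCount.txt")
  .flatMap(item => item.split(" "))
  .map(item => (item, 1))
  .reduceByKey((curr, agg) => curr + agg)
  .collect()
// e.g. Array((hello,3), (spark,1), (hadoop,1), (scala,1))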

That's all. Thanks!

 
