Summary of Frequently Asked Spark Interview Questions

License: CC 4.0 BY-SA

Tags: #Spark #InterviewQuestions

1. Summary of Frequently Asked Spark Interview Questions

1.1 What deployment modes does Spark have? Briefly describe each

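Local: runs on a single machine; typically used for practice or test environments.
Standalone: a resource-scheduling cluster built on a Master plus Slaves architecture, with Spark jobs submitted to the Master. This is Spark's own built-in scheduling system.
Yarn: the Spark client connects directly to Yarn, so no separate Spark cluster needs to be built. There are two modes, yarn-client and yarn-cluster; the main difference between them is the node on which the Driver program runs.
Mesos: rarely used in China.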
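As a rough illustration of how these deployment modes show up in practice, the sketch below maps each mode to the kind of value passed to spark-submit via --master. The function name master_for_mode and the host names and ports are placeholders assumed for this example, not values from a real cluster.

```shell
#!/bin/sh
# Map a deployment mode name to a --master value for spark-submit.
# master_for_mode, master-host and mesos-host are hypothetical names.
master_for_mode() {
  case "$1" in
    local)      echo "local[*]" ;;                  # one machine, all available cores
    standalone) echo "spark://master-host:7077" ;;  # Spark's own Master
    yarn)       echo "yarn" ;;                      # cluster found via the Hadoop config
    mesos)      echo "mesos://mesos-host:5050" ;;
    *)          echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Print the master value for a YARN deployment
master_for_mode yarn
```

With YARN the master value carries no host at all, because the ResourceManager address is read from the Hadoop configuration directory rather than from the command line.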

1.2 How are Spark jobs submitted: through a JavaEE interface or a script?

Through a shell script that calls spark-submit.
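A minimal sketch of what such a submission script might look like is given here. The function submit_job, the DRY_RUN switch, the default SPARK_HOME path, and the class and jar names are all assumptions made for illustration. By default the script only prints the command it would run, so it can be tried without a Spark installation.

```shell
#!/bin/sh
# Hypothetical wrapper around spark-submit; every name and path is illustrative.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"   # assumed install location
DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print the command

submit_job() {
  # $1 = main class, $2 = application jar, the rest are application arguments
  main_class="$1"; app_jar="$2"; shift 2
  # Rebuild the positional parameters as the full spark-submit command line
  set -- "$SPARK_HOME/bin/spark-submit" \
    --master yarn \
    --deploy-mode cluster \
    --class "$main_class" \
    "$app_jar" "$@"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"     # show what would be executed
  else
    "$@"          # actually run spark-submit
  fi
}

submit_job com.example.WordCount /path/to/app.jar /input /output
```

Setting DRY_RUN=0 in the environment makes the same script execute the command instead of printing it.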

1.3 Spark job submission parameters

1) Several important parameters when submitting a job

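- executor-cores: number of cores used by each executor. Defaults to 1 on Yarn; 2 to 5 is commonly recommended, and our company uses 4.

- num-executors: number of executors to launch; defaults to 2.

- executor-memory: memory per executor; defaults to 1G.

- driver-cores: number of cores used by the driver; defaults to 1.

- driver-memory: memory for the driver; defaults to 1G in current Spark releases (512M in early 1.x releases).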
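To make the effect of these parameters concrete, the short calculation below estimates the resources a job requests. The numbers are example values only, and the task count assumes each task uses one core (the default for spark.task.cpus). Memory overhead added by the cluster manager on top of the executor heap is not included.

```shell
#!/bin/sh
# Rough resource footprint implied by the submit parameters.
# Example values, not recommendations.
NUM_EXECUTORS=10        # --num-executors
EXECUTOR_CORES=4        # --executor-cores
EXECUTOR_MEMORY_GB=8    # --executor-memory 8g

# Each core runs one task at a time when spark.task.cpus is 1
total_cores=$((NUM_EXECUTORS * EXECUTOR_CORES))
# Heap memory only; cluster-manager overhead is excluded
total_memory_gb=$((NUM_EXECUTORS * EXECUTOR_MEMORY_GB))

echo "max concurrent tasks: $total_cores"
echo "total executor heap: ${total_memory_gb}G"
```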

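2) Spark on local (local mode)

local[5] runs Spark in a single JVM with 5 worker threads, so the executor options below have no effect in this mode. All options must come before the application jar; anything after the jar is passed to the application as arguments.

  ./bin/spark-submit \
  --master local[5] \
  --name "Spark Job Name" \
  --driver-cores 2 \
  --driver-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --executor-memory 8g \
  --class PackageName.ClassName \
  XXXX.jar \
  InputPath \
  OutputPath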


3) Spark Standalone mode

  ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
  
4) Spark on Yarn Cluster mode (the mode commonly used in production)

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  100

5) Spark on Yarn Client mode

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  100
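
The two Yarn modes differ only in the --deploy-mode value, which decides where the Driver runs. The sketch below spells out that difference; the function name driver_location is a hypothetical helper introduced for this example.

```shell
#!/bin/sh
# Where the Driver runs for each Yarn deploy mode.
# driver_location is a hypothetical helper, not part of Spark.
driver_location() {
  case "$1" in
    client)  echo "submitting machine" ;;                # driver lives in the spark-submit process
    cluster) echo "YARN ApplicationMaster container" ;;  # driver is managed by Yarn
    *)       echo "unknown deploy mode: $1" >&2; return 1 ;;
  esac
}

for mode in client cluster; do
  echo "--master yarn --deploy-mode $mode -> driver on: $(driver_location "$mode")"
done
```

Since the Driver stays on the submitting machine in client mode, its output appears directly in the terminal, which is convenient for debugging. In cluster mode the job keeps running even after the submitting session closes.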