Spark Notes - Studying the Official Docs: Configuration (Part 2)

This article walks through Apache Spark's configuration parameters, covering the options for the Spark SQL, Spark Streaming, SparkR, and GraphX modules, as well as deploy settings, cluster manager configuration, and environment variables.

### Spark SQL

Running the `SET -v` command will show the entire list of the SQL configuration.

**Scala**

```scala
// spark is an existing SparkSession
spark.sql("SET -v").show(numRows = 200, truncate = false)
```

**Java**

```java
// spark is an existing SparkSession
spark.sql("SET -v").show(200, false);
```

**Python**

```python
# spark is an existing SparkSession
spark.sql("SET -v").show(n=200, truncate=False)
```

**R**

```r
sparkR.session()
properties <- sql("SET -v")
showDF(properties, numRows = 200, truncate = FALSE)
```
### Spark Streaming
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.streaming.backpressure.enabled | false | Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as it can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values `spark.streaming.receiver.maxRate` and `spark.streaming.kafka.maxRatePerPartition` if they are set (see below). |
| spark.streaming.backpressure.initialRate | not set | The initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled. |
| spark.streaming.blockInterval | 200ms | Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark. Minimum recommended - 50 ms. See the performance tuning section in the Spark Streaming programming guide for more details. |
| spark.streaming.receiver.maxRate | not set | Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. See the deployment guide in the Spark Streaming programming guide for more details. |
| spark.streaming.receiver.writeAheadLog.enable | false | Enable write-ahead logs for receivers. All the input data received through receivers will be saved to write-ahead logs that will allow it to be recovered after driver failures. See the deployment guide in the Spark Streaming programming guide for more details. |
| spark.streaming.unpersist | true | Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from Spark's memory. The raw input data received by Spark Streaming is also automatically cleared. Setting this to false will allow the raw data and persisted RDDs to be accessible outside the streaming application, as they will not be cleared automatically. But it comes at the cost of higher memory usage in Spark. |
| spark.streaming.stopGracefullyOnShutdown | false | If true, Spark shuts down the StreamingContext gracefully on JVM shutdown rather than immediately. |
| spark.streaming.kafka.maxRatePerPartition | not set | Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the new Kafka direct stream API. See the Kafka Integration guide for more details. |
| spark.streaming.kafka.maxRetries | 1 | Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means that the driver will make a maximum of 2 attempts). Only applies to the new Kafka direct stream API. |
| spark.streaming.ui.retainedBatches | 1000 | How many batches the Spark Streaming UI and status APIs remember before garbage collecting. |
| spark.streaming.driver.writeAheadLog.closeFileAfterWrite | false | Whether to close the file after writing a write-ahead log record on the driver. Set this to 'true' when you want to use S3 (or any file system that does not support flushing) for the metadata WAL on the driver. |
| spark.streaming.receiver.writeAheadLog.closeFileAfterWrite | false | Whether to close the file after writing a write-ahead log record on the receivers. Set this to 'true' when you want to use S3 (or any file system that does not support flushing) for the data WAL on the receivers. |
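As a quick illustration of how these properties are applied, here is a minimal sketch that enables backpressure and caps the receiver rate when building a StreamingContext. The application name, batch interval, and rate values are placeholders, not recommendations; the same properties can also be passed via `--conf` at submission time.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative values only; tune the rates to your workload.
val conf = new SparkConf()
  .setAppName("BackpressureSketch")
  .set("spark.streaming.backpressure.enabled", "true")      // adapt the receiving rate to processing speed
  .set("spark.streaming.backpressure.initialRate", "1000")  // cap for the first batch, per receiver
  .set("spark.streaming.receiver.maxRate", "5000")          // hard upper bound per receiver

val ssc = new StreamingContext(conf, Seconds(5))            // 5-second batch interval (placeholder)
```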
### SparkR
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.r.numRBackendThreads | 2 | Number of threads used by RBackend to handle RPC calls from the SparkR package. |
| spark.r.command | Rscript | Executable for executing R scripts in cluster modes for both driver and workers. |
| spark.r.driver.command | spark.r.command | Executable for executing R scripts in client modes for the driver. Ignored in cluster modes. |
| spark.r.shell.command | R | Executable for executing the sparkR shell in client modes for the driver. Ignored in cluster modes. It is the same as the environment variable SPARKR_DRIVER_R, but takes precedence over it. spark.r.shell.command is used for the sparkR shell while spark.r.driver.command is used for running R scripts. |
| spark.r.backendConnectionTimeout | 6000 | Connection timeout set by the R process on its connection to RBackend, in seconds. |
| spark.r.heartBeatInterval | 100 | Interval for heartbeats sent from the SparkR backend to the R process to prevent connection timeout. |
### GraphX
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.graphx.pregel.checkpointInterval | -1 | Checkpoint interval for the graph and messages in Pregel. It is used to avoid a StackOverflowError caused by long lineage chains after many iterations. Checkpointing is disabled by default. |
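For reference, a minimal sketch of turning this on; the interval value and checkpoint directory path below are placeholders. Periodic checkpointing also expects a checkpoint directory to be configured on the SparkContext.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Checkpoint the Pregel graph and messages every 10 iterations (placeholder value).
val conf = new SparkConf()
  .setAppName("PregelCheckpointSketch")
  .set("spark.graphx.pregel.checkpointInterval", "10")

val sc = new SparkContext(conf)
// The checkpoint directory path is a placeholder; point it at reliable storage.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
```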
### Deploy
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.deploy.recoveryMode | NONE | The recovery mode used to recover and relaunch submitted Spark jobs after a failure. This is only applicable for cluster mode when running with Standalone or Mesos. |
| spark.deploy.zookeeper.url | None | When `spark.deploy.recoveryMode` is set to ZOOKEEPER, this configuration is used to set the ZooKeeper URL to connect to. |
| spark.deploy.zookeeper.dir | None | When `spark.deploy.recoveryMode` is set to ZOOKEEPER, this configuration is used to set the ZooKeeper directory to store recovery state. |
### Cluster Managers

Each cluster manager in Spark has additional configuration options. Configurations can be found on the pages for each mode:

#### [YARN](running-on-yarn.html#configuration)
#### [Mesos](running-on-mesos.html#configuration)
#### [Standalone Mode](spark-standalone.html#cluster-launch-scripts)

### Environment Variables

Certain Spark settings can be configured through environment variables, which are read from the `conf/spark-env.sh` script in the directory where Spark is installed (or `conf/spark-env.cmd` on Windows). In Standalone and Mesos modes, this file can give machine-specific information such as hostnames. It is also sourced when running local Spark applications or submission scripts. Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can copy `conf/spark-env.sh.template` to create it. Make sure you make the copy executable.

The following variables can be set in `spark-env.sh`:
| Environment Variable | Meaning |
| --- | --- |
| JAVA_HOME | Location where Java is installed (if it's not on your default PATH). |
| PYSPARK_PYTHON | Python binary executable to use for PySpark in both driver and workers (default is python2.7 if available, otherwise python). The property spark.pyspark.python takes precedence if it is set. |
| PYSPARK_DRIVER_PYTHON | Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON). The property spark.pyspark.driver.python takes precedence if it is set. |
| SPARKR_DRIVER_R | R binary executable to use for the SparkR shell (default is R). The property spark.r.shell.command takes precedence if it is set. |
| SPARK_LOCAL_IP | IP address of the machine to bind to. |
| SPARK_PUBLIC_DNS | Hostname your Spark program will advertise to other machines. |
In addition to the above, there are also options for configuring the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as the number of cores to use on each machine and the maximum memory. Since `spark-env.sh` is a shell script, some of these can be set programmatically - for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.

Note: when running Spark on YARN in `cluster` mode, environment variables need to be set using the `spark.yarn.appMasterEnv.[EnvironmentVariableName]` property in your `conf/spark-defaults.conf` file. Environment variables that are set in `spark-env.sh` will not be reflected in the YARN Application Master process in `cluster` mode. See the [YARN-related Spark properties](running-on-yarn.html#spark-properties) for more information.

### Configuring Logging

Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to get started is to copy the existing `log4j.properties.template`.

### Overriding configuration directory

To specify a configuration directory other than the default `SPARK_HOME/conf`, you can set `SPARK_CONF_DIR`. Spark will use the configuration files (`spark-defaults.conf`, `spark-env.sh`, `log4j.properties`, etc.) from that directory.

### Inheriting Hadoop Cluster Configuration

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark's classpath:

* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
* `core-site.xml`, which sets the default filesystem name.

The location of these configuration files varies across Hadoop versions, but a common location is inside `/etc/hadoop/conf`. Some tools create configurations on-the-fly, but offer a mechanism to download copies of them. To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/spark-env.sh` to a location containing the configuration files.
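One hedged way to verify that the Hadoop configuration was actually picked up, assuming `HADOOP_CONF_DIR` is already exported in `spark-env.sh`, is to read `fs.defaultFS` from the Hadoop configuration that Spark exposes:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HadoopConfCheck").getOrCreate()

// fs.defaultFS comes from core-site.xml; a value of "file:///" suggests
// the Hadoop configuration directory was not picked up.
println(spark.sparkContext.hadoopConfiguration.get("fs.defaultFS"))
```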

Reposted from: https://www.cnblogs.com/xinfang520/p/7797428.html
