Hive on Spark: Getting Started

This page gives step-by-step instructions for installing and configuring Spark and Hive so that they work together in YARN mode, covering the whole process from installing Spark, through configuring YARN and Hive, to resolving common issues.

Hive on Spark has been part of Hive since the Hive 1.1 release. It is still under active development in the "spark" branch, which is periodically merged into Hive's "master" branch. See HIVE-7292 and its subtasks and linked issues.

Spark Installation

Follow the instructions to install Spark: http://spark.apache.org/docs/latest/running-on-yarn.html (or https://spark.apache.org/docs/latest/spark-standalone.html if you are running Spark Standalone mode). Hive on Spark supports Spark on YARN mode by default. In particular, for the installation you'll need to:

  1. Install Spark (either download pre-built Spark, or build the assembly from source).
    • Install/build a compatible version.  Hive root pom.xml's <spark.version> defines what version of Spark it was built/tested with.
    • Install/build a compatible distribution.  Each version of Spark has several distributions, corresponding to different versions of Hadoop.
    • Once Spark is installed, find and keep note of the <spark-assembly-*.jar> location.
    • Note that you must use a version of Spark that does not include the Hive jars, i.e., one that was not built with the Hive profile. To remove the Hive jars from the build, run the following command from your Spark repository (a quick sanity check is sketched after this list):

      ./make-distribution.sh --name "hadoop2-without-hive" --tgz -Pyarn,hadoop-provided,hadoop-2.4
  2. Start Spark cluster (both standalone and Spark on YARN are supported).
    • Keep note of the <Spark Master URL>.  This can be found in Spark master WebUI.
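As a quick sanity check for step 1, the following sketch lists the assembly jar and confirms that no Hive classes are bundled. It assumes a single spark-assembly jar under $SPARK_HOME/lib (a pre-built download); for a source build the jar typically sits under dist/lib after make-distribution.sh, so adjust the path to your layout.

  # Hedged sketch: confirm the Spark assembly was built without the Hive profile.
  # Assumes exactly one spark-assembly jar under $SPARK_HOME/lib.
  ls "$SPARK_HOME"/lib/spark-assembly-*.jar
  if jar tf "$SPARK_HOME"/lib/spark-assembly-*.jar | grep -q 'org/apache/hive'; then
    echo "WARNING: Hive classes found -- rebuild Spark without the Hive profile"
  else
    echo "OK: no Hive classes bundled in the assembly"
  fi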

Configuring Yarn

Instead of the capacity scheduler, the fair scheduler is required so that jobs in the YARN cluster receive a fair share of resources:

yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
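This property belongs in yarn-site.xml on the ResourceManager host (restart the ResourceManager after changing it). A minimal sketch of the entry, assuming the usual Hadoop configuration directory layout:

  <!-- yarn-site.xml: switch YARN to the fair scheduler -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>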

Configuring Hive

  1. There are several ways to add the Spark dependency to Hive:
    1. Set the property 'spark.home' to point to the Spark installation:

      hive> set spark.home=/location/to/sparkHome;
    2. Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

      export SPARK_HOME=/usr/lib/spark ...
    3. Link the spark-assembly jar to HIVE_HOME/lib.

  2. Configure Hive to use Spark as its execution engine:

    hive> set hive.execution.engine=spark;

    See the Spark section of Hive Configuration Properties for other properties for configuring Hive and the Remote Spark Driver.

     

  3. Configure Spark-application configs for Hive.  See: http://spark.apache.org/docs/latest/configuration.html.  This can be done either by adding a file "spark-defaults.conf" with these properties to the Hive classpath (a sketch of such a file follows this list), or by setting them in the Hive configuration (hive-site.xml). For instance:

    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.eventLog.dir=<Spark event log folder (must exist)>;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;

    A little explanation for some of the configuration properties:

    • spark.executor.memory: Amount of memory to use per executor process.
    • spark.executor.cores: Number of cores per executor.
    • spark.yarn.executor.memoryOverhead: The amount of off-heap memory (in megabytes) to be allocated per executor, when running Spark on YARN. This is memory that accounts for things like VM overheads, interned strings, and other native overheads. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, which is what this overhead is for.
    • spark.executor.instances: The number of executors assigned to each application.
    • spark.driver.memory: The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
    • spark.yarn.driver.memoryOverhead: We recommend 400 (MB).
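As an alternative to the interactive set commands above, the same Spark properties can be collected into a spark-defaults.conf file on the Hive classpath, as mentioned in Step 3. The sketch below is only an illustration; the master URL and event-log directory are placeholders, not values prescribed by this guide:

  # spark-defaults.conf -- a hedged sketch; place it on the Hive classpath (e.g. the Hive conf directory).
  # The master and event-log directory are placeholders; substitute your own values.
  spark.master                 yarn-cluster
  spark.eventLog.enabled       true
  spark.eventLog.dir           hdfs:///tmp/spark-events
  spark.executor.memory        512m
  spark.serializer             org.apache.spark.serializer.KryoSerializer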

Configuring Spark

Setting executor memory size is more complicated than simply setting it to be as large as possible. There are several things that need to be taken into consideration:

  • More executor memory means mapjoin optimization can be enabled for more queries.

  • More executor memory, on the other hand, becomes unwieldy from the GC perspective.

  • Some experiments show that the HDFS client doesn't handle concurrent writers well, so it may face race conditions if there are too many executor cores.

When running Spark on YARN mode, we generally recommend setting spark.executor.cores to 5, 6 or 7, depending on what the typical node is divisible by. For instance, if yarn.nodemanager.resource.cpu-vcores is 19, then 6 is a better choice (all executors must have the same number of cores; with 5 cores per executor only 3 executors fit and 4 cores are wasted, and with 7 only 2 executors fit and 5 cores are wasted). If it's 20, then 5 is a better choice (since this way you'll get 4 executors, and no core is wasted).

For spark.executor.memory, we recommend calculating yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then splitting that between spark.executor.memory and spark.yarn.executor.memoryOverhead. According to our experiments, we recommend setting spark.yarn.executor.memoryOverhead to around 15-20% of that total.

After you've decided how much memory each executor receives, you need to decide how many executors will be allocated to queries. Spark dynamic executor allocation will be supported in the GA release; however, for this beta only static resource allocation can be used. Based on the physical memory in each node and the configuration of spark.executor.memory and spark.yarn.executor.memoryOverhead, you will need to choose the number of instances and set spark.executor.instances.

Now a real-world example. Assume 10 nodes with 64GB of memory and 12 virtual cores per node, i.e., yarn.nodemanager.resource.cpu-vcores=12. One node will be used as the master, so the cluster has 9 slave nodes. We'll configure spark.executor.cores to 6. Given 64GB of RAM, yarn.nodemanager.resource.memory-mb will be 50GB. We'll determine the amount of memory for each executor as follows: 50GB * (6/12) = 25GB. We'll assign 20% to spark.yarn.executor.memoryOverhead, or 5120 (MB), and 80% to spark.executor.memory, or 20GB.

On this 9 node cluster we’ll have two executors per host. As such we can configure spark.executor.instances somewhere between 2 and 18. A value of 18 would utilize the entire cluster.
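For reference, the arithmetic in this example can be reproduced with a small shell calculation; the values below simply restate the example above (bash integer math) rather than adding new recommendations:

  # Sizing sketch using the numbers from the example above.
  NODE_MEM_MB=51200      # yarn.nodemanager.resource.memory-mb (50 GB)
  NODE_VCORES=12         # yarn.nodemanager.resource.cpu-vcores
  EXECUTOR_CORES=6       # chosen spark.executor.cores
  PER_EXECUTOR_MB=$(( NODE_MEM_MB * EXECUTOR_CORES / NODE_VCORES ))  # 25600 MB per executor container
  OVERHEAD_MB=$(( PER_EXECUTOR_MB * 20 / 100 ))                      # 20% -> spark.yarn.executor.memoryOverhead = 5120
  EXECUTOR_MEM_MB=$(( PER_EXECUTOR_MB - OVERHEAD_MB ))               # 80% -> spark.executor.memory = 20480m (20 GB)
  echo "spark.executor.memory=${EXECUTOR_MEM_MB}m spark.yarn.executor.memoryOverhead=${OVERHEAD_MB}"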

Common Issues

Issue: Error: Could not find or load main class org.apache.spark.deploy.SparkSubmit
Cause: Spark dependency not correctly set.
Resolution: Add the Spark dependency to Hive, see Step 1 above.

Issue: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
Cause: Spark serializer not set to Kryo.
Resolution: Set spark.serializer to org.apache.spark.serializer.KryoSerializer, see Step 3 above.

Issue: [ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Cause: Hive has upgraded to JLine2 but jline 0.94 exists in the Hadoop lib.
Resolution:
  1. Delete jline from the Hadoop lib directory (it's only pulled in transitively from ZooKeeper).
  2. export HADOOP_USER_CLASSPATH_FIRST=true
  3. If this error occurs during mvn test, perform a mvn clean install on the root project and the itests directory.

Issue: Spark executors get killed all the time and Spark keeps retrying the failed stage; you may find similar information in the YARN NodeManager log:
WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=217989,containerID=container_1421717252700_0716_01_50767235] is running beyond physical memory limits. Current usage: 43.1 GB of 43 GB physical memory used; 43.9 GB of 90.3 GB virtual memory used. Killing container.
Cause: For Spark on YARN, the NodeManager kills the Spark executor if it uses more memory than the configured size of "spark.executor.memory" + "spark.yarn.executor.memoryOverhead".
Resolution: Increase "spark.yarn.executor.memoryOverhead" to make sure it covers the executor's off-heap memory usage (an example follows this list of issues).

Issue: Running a query returns an error like:
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
and the Hive logs show:
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
  at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
Cause: Happens on Mac (not officially supported). This is a general Snappy issue with Mac and is not unique to Hive on Spark, but the workaround is noted here because it is needed to start the Spark client.
Resolution: Run this command before starting Hive or HiveServer2:

  export HADOOP_OPTS="-Dorg.xerial.snappy.tempdir=/tmp -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib $HADOOP_OPTS"

Issue: Stack trace: ExitCodeException exitCode=1: .../launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR.../usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*: bad substitution
Cause: The key mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml contains a variable which is invalid in bash.
Resolution: Remove :/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar from mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml.

Issue: Exception in thread "Driver" scala.MatchError: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/TaskAttemptContext (of class java.lang.NoClassDefFoundError)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:432)
Cause: MapReduce is not on the YARN classpath.
Resolution: If on HDP, change
  /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework
to
  /hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework
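For the executor-kill issue above, the overhead can be raised from Hive, since spark.* properties are passed through to the Spark configuration (see Step 3 of Configuring Hive). The value here is only a hypothetical starting point and should be sized against your executors:

  hive> set spark.yarn.executor.memoryOverhead=2048;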

Recommended Configuration

# see HIVE-9153
mapreduce.input.fileinputformat.split.maxsize=750000000
hive.vectorized.execution.enabled=true

hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.kryo.referenceTracking=false
spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
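As with the properties in Step 3 of Configuring Hive, these recommendations can be placed in hive-site.xml (or in spark-defaults.conf for the spark.* keys), or tried per session first, for example:

  hive> set hive.vectorized.execution.enabled=true;
  hive> set spark.kryo.referenceTracking=false;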

Design documents
