hive/spark-sql : Cannot find DistCp

探讨Spark-SQL执行insert overwrite操作时效率低下问题及解决方案。通过调整hive.exec.stagingdir参数避免使用DistCp,提高move文件操作效率。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

最近发现spark-sql执行insert overwrite等操作时 最后move文件是一个一个的操作,效率较低而且还会存在bug(具体bug其余文章讲解)。因此进行了修改,修改后发现如下报错。

Caused by: java.io.IOException: Cannot find DistCp class package: org.apache.hadoop.tools.DistCp
    at org.apache.hadoop.hive.shims.Hadoop23Shims.runDistCp(Hadoop23Shims.java:1158)
    at org.apache.hadoop.hive.common.FileUtils.copy(FileUtils.java:553)
    at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2625)

网上遇到同样问题的人很多,一般的解决办法都是找到hadoop-distcp-*.jar放到Spark/hive的CLASSPATH里。但这样其实治标不治本,move文件操作在不跨ns的情况下就不应该有distcp操作。否则效率很低。
分析Hive.java源码,发现如下一段, FileUtils.copy操作是产生distcp的原因。而之所以走进这段逻辑是因为destIsSubDir是true。

boolean destIsSubDir = isSubDir(srcf, destf, fs, isSrcLocal);
...
...
if (destIsSubDir) {
            FileStatus[] srcs = fs.listStatus(srcf, FileUtils.HIDDEN_FILES_PATH_FILTER);
            if (srcs.length == 0) {
              success = true; // Nothing to move.
            }
            for (FileStatus status : srcs) {
              success = FileUtils.copy(srcf.getFileSystem(conf), status.getPath(), destf.getFileSystem(conf), destf,
                  true,     // delete source
                  replace,  // overwrite destination
                  conf);

              if (!success) {
                throw new HiveException("Unable to move source " + status.getPath() + " to destination " + destf);
              }
            }
          } else {
            success = fs.rename(srcf, destf);
          }

顺着根往上分析,发现是hive.exec.stagingdir参数惹的祸。如果临时目录在目标表下就会触发,因此如果不想copy hadoop-distcp-*.jar的话,那就修改临时路径吧!

[root@hadoop01 apache-hive-3.1.3-bin]# hive which: no hbase in (/export/servers/hadoop-3.3.5/bin::/export/servers/apache-hive-3.1.3-bin/bin:/export/servers/flume-1.9.0/bin::/export/servers/apache-hive-3.1.3-bin/bin:/export/servers/flume-1.9.0/bin:/export/servers/flume-1.9.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/export/servers/jdk1.8.0_161/bin:/export/servers/hadoop-3.3.5/bin:/export/servers/hadoop-3.3.5/sbin:/export/servers/scala-2.12.10/bin:/root/bin:/export/servers/jdk1.8.0_161/bin:/export/servers/hadoop-3.3.5/bin:/export/servers/hadoop-3.3.5/sbin:/export/servers/scala-2.12.10/bin:/export/servers/jdk1.8.0_161/bin:/export/servers/hadoop-3.3.5/bin:/export/servers/hadoop-3.3.5/sbin:/export/servers/scala-2.12.10/bin:/root/bin) SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/export/servers/hadoop-3.3.5/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/export/servers/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory] 2025-06-16 17:53:36,956 INFO conf.HiveConf: Found configuration file file:/export/servers/apache-hive-3.1.3-bin/conf/hive-site.xml Hive Session ID = 30846036-47a7-480e-81e3-48f09d764412 2025-06-16 17:53:40,291 INFO SessionState: Hive Session ID = 30846036-47a7-480e-81e3-48f09d764412 Logging initialized using configuration in jar:file:/export/servers/apache-hive-3.1.3-bin/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true 2025-06-16 17:53:40,414 INFO SessionState: Logging initialized using configuration in jar:file:/export/servers/apache-hive-3.1.3-bin/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true 2025-06-16 17:53:43,041 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/30846036-47a7-480e-81e3-48f09d764412 2025-06-16 17:53:43,099 INFO session.SessionState: Created local directory: /tmp/root/30846036-47a7-480e-81e3-48f09d764412 2025-06-16 17:53:43,119 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/30846036-47a7-480e-81e3-48f09d764412/_tmp_space.db 2025-06-16 17:53:43,154 INFO conf.HiveConf: Using the default value passed in for log id: 30846036-47a7-480e-81e3-48f09d764412 2025-06-16 17:53:43,154 INFO session.SessionState: Updating thread name to 30846036-47a7-480e-81e3-48f09d764412 main 2025-06-16 17:53:45,040 INFO metastore.HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore 2025-06-16 17:53:45,120 WARN metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 2025-06-16 17:53:45,133 INFO metastore.ObjectStore: ObjectStore, initialize called 2025-06-16 17:53:45,139 INFO conf.MetastoreConf: Found configuration file file:/export/servers/apache-hive-3.1.3-bin/conf/hive-site.xml 2025-06-16 17:53:45,141 INFO conf.MetastoreConf: Unable to find config file hivemetastore-site.xml 2025-06-16 17:53:45,141 INFO conf.MetastoreConf: Found configuration file null 2025-06-16 17:53:45,143 INFO conf.MetastoreConf: Unable to find config file metastore-site.xml 2025-06-16 17:53:45,143 INFO conf.MetastoreConf: Found configuration file null 2025-06-16 17:53:45,603 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored 2025-06-16 17:53:46,052 INFO hikari.HikariDataSource: HikariPool-1 - Starting... 2025-06-16 17:53:46,556 INFO hikari.HikariDataSource: HikariPool-1 - Start completed. 2025-06-16 17:53:46,645 INFO hikari.HikariDataSource: HikariPool-2 - Starting... 2025-06-16 17:53:46,677 INFO hikari.HikariDataSource: HikariPool-2 - Start completed. 2025-06-16 17:53:47,494 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" 2025-06-16 17:53:47,815 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL 2025-06-16 17:53:47,820 INFO metastore.ObjectStore: Initialized ObjectStore 2025-06-16 17:53:48,285 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:48,286 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:48,287 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:48,287 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:48,288 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:48,288 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:51,987 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:51,988 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:51,988 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:51,988 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:51,989 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:51,989 WARN DataNucleus.MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored 2025-06-16 17:53:57,252 WARN metastore.ObjectStore: Version information not found in metastore. metastore.schema.verification is not enabled so recording the schema version 3.1.0 2025-06-16 17:53:57,253 WARN metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 3.1.0, comment = Set by MetaStore root@192.168.245.131 2025-06-16 17:53:57,550 INFO metastore.HiveMetaStore: Added admin role in metastore 2025-06-16 17:53:57,560 INFO metastore.HiveMetaStore: Added public role in metastore 2025-06-16 17:53:57,673 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty 2025-06-16 17:53:58,030 INFO metastore.RetryingMetaStoreClient: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=root (auth:SIMPLE) retries=1 delay=1 lifetime=0 2025-06-16 17:53:58,085 INFO metastore.HiveMetaStore: 0: get_all_functions 2025-06-16 17:53:58,089 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_all_functions Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. 2025-06-16 17:53:58,263 INFO CliDriver: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Hive Session ID = c998401c-9255-4863-8e6f-4932f9f591fa 2025-06-16 17:53:58,266 INFO SessionState: Hive Session ID = c998401c-9255-4863-8e6f-4932f9f591fa 2025-06-16 17:53:58,341 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/c998401c-9255-4863-8e6f-4932f9f591fa 2025-06-16 17:53:58,350 INFO session.SessionState: Created local directory: /tmp/root/c998401c-9255-4863-8e6f-4932f9f591fa 2025-06-16 17:53:58,365 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/c998401c-9255-4863-8e6f-4932f9f591fa/_tmp_space.db 2025-06-16 17:53:58,372 INFO metastore.HiveMetaStore: 1: get_databases: @hive# 2025-06-16 17:53:58,373 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_databases: @hive# 2025-06-16 17:53:58,377 INFO metastore.HiveMetaStore: 1: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore 2025-06-16 17:53:58,383 INFO metastore.ObjectStore: ObjectStore, initialize called 2025-06-16 17:53:58,462 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL 2025-06-16 17:53:58,466 INFO metastore.ObjectStore: Initialized ObjectStore 2025-06-16 17:53:58,494 INFO metastore.HiveMetaStore: 1: get_tables_by_type: db=@hive#db_hive1 pat=.*,type=MATERIALIZED_VIEW 2025-06-16 17:53:58,495 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_tables_by_type: db=@hive#db_hive1 pat=.*,type=MATERIALIZED_VIEW 2025-06-16 17:53:58,525 INFO metastore.HiveMetaStore: 1: get_multi_table : db=db_hive1 tbls= 2025-06-16 17:53:58,526 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_multi_table : db=db_hive1 tbls= 2025-06-16 17:53:58,530 INFO metastore.HiveMetaStore: 1: get_tables_by_type: db=@hive#default pat=.*,type=MATERIALIZED_VIEW 2025-06-16 17:53:58,530 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_tables_by_type: db=@hive#default pat=.*,type=MATERIALIZED_VIEW 2025-06-16 17:53:58,539 INFO metastore.HiveMetaStore: 1: get_multi_table : db=default tbls= 2025-06-16 17:53:58,539 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_multi_table : db=default tbls= 2025-06-16 17:53:58,539 INFO metadata.HiveMaterializedViewsRegistry: Materialized views registry has been initialized
06-17
[root@hadoop1 bin]# hive SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/export/servers/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/export/servers/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] /usr/bin/which: no hbase in (/export/servers/hive-3.1.3/bin:/export/servers/flume-1.9.0/bin:/root/.local/bin:/root/bin:/export/servers/flume-1.9.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/export/servers/jdk1.8.0_241/bin:/export/servers/hadoop-3.3.0/bin:/export/servers/hadoop-3.3.0/sbin:/export/servers/jdk1.8.0_241/bin:/export/servers/hadoop-3.3.0/bin:/export/servers/hadoop-3.3.0/sbin) SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/export/servers/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/export/servers/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] Hive Session ID = c684811f-5c32-43ba-b621-c41d4ecf957a Logging initialized using configuration in jar:file:/export/servers/hive-3.1.3/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. hive (default)>
03-08
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值