Problems Encountered Running Spark SQL


Runtime environment (originally shown as a diagram):

Spark SQL is run in two modes, on a Spark standalone cluster and on a YARN cluster, operating on data stored in Hive. Hive itself is independent and can also be run directly to process the data.

The Spark SQL program itself is straightforward to write; Spark's bundled example HiveFromSpark is easy to follow.
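For reference, here is a minimal sketch of such a program, modeled on HiveFromSpark (the database and table names below are placeholders, not the ones from the original job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HotelHiveSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("HotelHiveSketch"))
        // HiveContext picks up hive-site.xml from the classpath and talks to the metastore
        val hiveContext = new HiveContext(sc)

        // Run HiveQL against a table registered in the Hive metastore
        hiveContext.sql("SELECT * FROM some_db.some_table LIMIT 10")
          .collect()
          .foreach(println)

        sc.stop()
      }
    }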

First, running on the Spark standalone cluster:

Copy Hive's hive-site.xml configuration file into the ${SPARK_HOME}/conf directory.
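For example (the Hive conf path here is an assumption; use wherever your hive-site.xml actually lives):

    cp /home/q/hive/conf/hive-site.xml ${SPARK_HOME}/conf/

Then submit with a script like this: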

    #!/bin/bash
    cd $SPARK_HOME
    ./bin/spark-submit \
      --class com.datateam.spark.sql.HotelHive \
      --master spark://192.168.44.80:8070 \
      --executor-memory 2G \
      --total-executor-cores 10 \
      /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/jobs/spark-jobs-20141023.jar

Running the script hit the following error:

    Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table dw_hotel_price_log
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
        ...
    Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error :
    The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
    Please check your CLASSPATH specification, and the name of the driver.
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:237)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:110)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:82)
        ... 127 more
    Caused by: org.datanucleus.store.rdbms.datasource.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
        at org.datanucleus.store.rdbms.datasource.AbstractDataSourceFactory.loadDriver(AbstractDataSourceFactory.java:58)
        at org.datanucleus.store.rdbms.datasource.BoneCPDataSourceFactory.makePooledDataSource(BoneCPDataSourceFactory.java:61)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:217)

This means the MySQL JDBC connector jar cannot be found on the classpath. The fix is to add the following option to the submit script:

    --driver-class-path /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/lib/mysql-connector-java-5.1.22-bin.jar \
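With the flag in place, the complete standalone-cluster script looks like this (the same script as above, plus the new option):

    #!/bin/bash
    cd $SPARK_HOME
    ./bin/spark-submit \
      --class com.datateam.spark.sql.HotelHive \
      --master spark://192.168.44.80:8070 \
      --executor-memory 2G \
      --total-executor-cores 10 \
      --driver-class-path /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/lib/mysql-connector-java-5.1.22-bin.jar \
      /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/jobs/spark-jobs-20141023.jar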

The Spark standalone cluster posed few problems; most of the trouble came on the YARN cluster.

On the Spark cluster, reading Hive data requires placing hive-site.xml in Spark's conf directory. So where should Hive's configuration file go for Spark SQL to pick it up when running on YARN?

Pass it along when submitting the job:

--files /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/conf/hive-site.xml \

Note that the option used here is --files, not --conf.

Here is the full submit script:

    cd $SPARK_HOME
    ./bin/spark-submit --class com.qunar.datateam.spark.sql.HotelHive \
      --master yarn-cluster \
      --num-executors 10 \
      --driver-memory 4g \
      --executor-memory 2g \
      --executor-cores 2 \
      --files /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/conf/hive-site.xml \
      /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/jobs/spark-jobs-20141023.jar

OK. As on the standalone cluster, the MySQL connector jar still has to be added. How? The first attempt was:

    --jars /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/lib/mysql-connector-java-5.1.22-bin.jar \

But this produces the following error:

    Exception in thread "Driver" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
        ...
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table tablename
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
    Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2372)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2383)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
        ... 68 more
    Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1210)
        ... 73 more
    Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
    NestedThrowables:
    java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
        ...
    Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018)
        at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
        at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
        ... 97 more

Searching around turned up a related discussion: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-JDBC-td11369.html


So datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, and datanucleus-rdbms-3.2.1.jar were all added to --jars as well, but the job still failed:

    Exception in thread "Driver" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure:
    Lost task 6.3 in stage 2.0 (TID 34, l-hbase72.data.cn8): java.io.FileNotFoundException: ./datanucleus-core-3.2.2.jar (Permission denied)
        java.io.FileOutputStream.open(Native Method)
        java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
        com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
        com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
        com.google.common.io.Files.copy(Files.java:436)

After repeated attempts, the fix was to ship all the jars previously listed under --jars with --archives instead, so they are distributed alongside the running jar:

    --archives mysql-connector.jar,datanucleus-api-jdo-3.2.1.jar,datanucleus-core-3.2.2.jar,datanucleus-rdbms-3.2.1.jar \
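(Note that the comma-separated list must not contain spaces.) Putting it all together, the final yarn-cluster submit script is sketched below; the jar directory is an assumption, so point the paths at wherever these jars actually live on the submitting machine:

    #!/bin/bash
    cd $SPARK_HOME
    # Assumed location of the extra jars; adjust to your environment
    LIB=/home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/lib
    ./bin/spark-submit --class com.qunar.datateam.spark.sql.HotelHive \
      --master yarn-cluster \
      --num-executors 10 \
      --driver-memory 4g \
      --executor-memory 2g \
      --executor-cores 2 \
      --files /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/conf/hive-site.xml \
      --archives $LIB/mysql-connector-java-5.1.22-bin.jar,$LIB/datanucleus-api-jdo-3.2.1.jar,$LIB/datanucleus-core-3.2.2.jar,$LIB/datanucleus-rdbms-3.2.1.jar \
      /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/jobs/spark-jobs-20141023.jar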

With that, the job ran successfully.

One more caveat:

Spark SQL does not accept ";" as a statement separator, so the database must be specified inside the SQL statement itself; a Hive statement such as "use database;" cannot be used to switch databases.
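In code, that means qualifying the table with its database inside a single statement. A sketch using the hiveContext from the earlier example (the database name dw is a placeholder):

    // Does NOT work: Spark SQL rejects the ';' and the separate USE statement
    // hiveContext.sql("use dw; select * from dw_hotel_price_log")

    // Works: reference the table as database.table within one statement
    hiveContext.sql("SELECT * FROM dw.dw_hotel_price_log LIMIT 10")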
