Spark SQL and Hive: Testing with 3 GB of Data

This post walks through uploading a file to HDFS with Hadoop, creating an external table in Hive and loading the data into it, and then querying and aggregating the data with both Hive and Spark SQL.

1: Upload the file to HDFS

[jifeng@feng01 hadoop]$ hadoop fs -mkdir /user/jifeng/fire
15/03/05 13:29:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[jifeng@feng01 hadoop]$ hadoop fs -put /home/jifeng/hadoop/firewallb-week2-origin.sql /user/jifeng/fire/
15/03/05 13:30:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[jifeng@feng01 hadoop]$ 
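
To confirm the upload landed, the target directory can be listed (a quick sanity check; this command was not captured in the original session):

[jifeng@feng01 hadoop]$ hadoop fs -ls /user/jifeng/fire/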

2: Create the table in Hive


[jifeng@feng01 ~]$ hive
15/03/05 13:31:08 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead

Logging initialized using configuration in jar:file:/home/jifeng/hive/lib/hive-common-0.13.1.jar!/hive-log4j.properties
hive> CREATE EXTERNAL TABLE firewall (         idauto INT,         dt TIMESTAMP,         dtstr STRING,         timeSeconds INT,         priority STRING,         operation STRING,         messageCode STRING,         protocol STRING,         srcIp STRING,         destIp STRING,         srcIPNumber bigINT,         destIPNumber bigINT,         srcPort INT,         destPort INT,         destService STRING,         direction STRING,         flags STRING,         command STRING,         slot1min INT,         slot5min INT,         slot30min INT,         slot60min INT        ) row format delimited fields terminated by '\t'; 
OK
Time taken: 3.414 seconds
hive> load data inpath '/user/jifeng/fire/firewallb-week2-origin.sql' into table firewall;
Loading data to table default.firewall
Table default.firewall stats: [numFiles=1, numRows=0, totalSize=3415284428, rawDataSize=0]
OK
Time taken: 1.441 seconds

The CREATE TABLE statement:

CREATE EXTERNAL TABLE firewall (  
       idauto INT,  
       dt TIMESTAMP,  
       dtstr STRING,  
       timeSeconds INT,  
       priority STRING,  
       operation STRING,  
       messageCode STRING,  
       protocol STRING,  
       srcIp STRING,  
       destIp STRING,  
       srcIPNumber BIGINT,  
       destIPNumber BIGINT,  
       srcPort INT,  
       destPort INT,  
       destService STRING,  
       direction STRING,  
       flags STRING,  
       command STRING,  
       slot1min INT,  
       slot5min INT,  
       slot30min INT,  
       slot60min INT   
     ) row format delimited fields terminated by '\t'; 
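
To double-check the schema after creation, DESCRIBE can be run from the same hive shell (a sketch; not captured in the original session):

hive> DESCRIBE firewall;

Note that LOAD DATA INPATH moves the HDFS file into the table's storage location rather than copying it, so /user/jifeng/fire/ will be empty after the load completes.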


3: Test in Hive

hive> select * from firewall order by dt desc limit 10; 
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1423061535604_0004, Tracking URL = http://feng01:8088/proxy/application_1423061535604_0004/
Kill Command = /home/jifeng/hadoop/hadoop-2.4.1/bin/hadoop job  -kill job_1423061535604_0004
Hadoop job information for Stage-1: number of mappers: 13; number of reducers: 1
2015-03-05 13:35:57,706 Stage-1 map = 0%,  reduce = 0%
2015-03-05 13:36:58,247 Stage-1 map = 0%,  reduce = 0%
2015-03-05 13:37:26,248 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 5.38 sec
2015-03-05 13:37:34,542 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 7.63 sec
2015-03-05 13:38:40,840 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 13.89 sec
2015-03-05 13:39:24,926 Stage-1 map = 10%,  reduce = 0%, Cumulative CPU 79.08 sec
2015-03-05 13:40:00,669 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 85.37 sec
2015-03-05 13:40:25,540 Stage-1 map = 15%,  reduce = 0%, Cumulative CPU 86.27 sec
2015-03-05 13:40:45,643 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 107.73 sec
2015-03-05 13:40:46,742 Stage-1 map = 31%,  reduce = 0%, Cumulative CPU 109.79 sec
2015-03-05 13:40:59,426 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 116.75 sec
2015-03-05 13:41:03,706 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU 120.27 sec
2015-03-05 13:41:07,921 Stage-1 map = 44%,  reduce = 0%, Cumulative CPU 125.38 sec
2015-03-05 13:41:15,830 Stage-1 map = 49%,  reduce = 0%, Cumulative CPU 130.26 sec
2015-03-05 13:41:18,002 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 131.86 sec
2015-03-05 13:42:18,780 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 170.98 sec
2015-03-05 13:43:19,428 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 194.07 sec
2015-03-05 13:44:20,028 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 197.32 sec
2015-03-05 13:45:20,494 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 200.93 sec
2015-03-05 13:46:21,209 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 223.6 sec
2015-03-05 13:46:42,075 Stage-1 map = 54%,  reduce = 15%, Cumulative CPU 230.03 sec
2015-03-05 13:46:49,284 Stage-1 map = 56%,  reduce = 15%, Cumulative CPU 218.59 sec
2015-03-05 13:46:51,358 Stage-1 map = 60%,  reduce = 15%, Cumulative CPU 219.07 sec
2015-03-05 13:46:54,448 Stage-1 map = 63%,  reduce = 15%, Cumulative CPU 191.15 sec
2015-03-05 13:46:58,609 Stage-1 map = 64%,  reduce = 15%, Cumulative CPU 193.67 sec
2015-03-05 13:47:05,823 Stage-1 map = 67%,  reduce = 15%, Cumulative CPU 180.53 sec
2015-03-05 13:47:09,938 Stage-1 map = 69%,  reduce = 15%, Cumulative CPU 194.45 sec
2015-03-05 13:47:11,986 Stage-1 map = 69%,  reduce = 18%, Cumulative CPU 194.85 sec
2015-03-05 13:47:47,100 Stage-1 map = 72%,  reduce = 18%, Cumulative CPU 209.86 sec
2015-03-05 13:48:31,364 Stage-1 map = 77%,  reduce = 18%, Cumulative CPU 210.55 sec
2015-03-05 13:48:37,549 Stage-1 map = 77%,  reduce = 21%, Cumulative CPU 210.74 sec
2015-03-05 13:48:53,998 Stage-1 map = 82%,  reduce = 21%, Cumulative CPU 177.87 sec
2015-03-05 13:48:55,021 Stage-1 map = 82%,  reduce = 23%, Cumulative CPU 178.51 sec
2015-03-05 13:49:06,447 Stage-1 map = 90%,  reduce = 23%, Cumulative CPU 143.53 sec
2015-03-05 13:49:11,591 Stage-1 map = 90%,  reduce = 26%, Cumulative CPU 144.64 sec
2015-03-05 13:49:12,612 Stage-1 map = 95%,  reduce = 26%, Cumulative CPU 145.78 sec
2015-03-05 13:49:14,665 Stage-1 map = 95%,  reduce = 28%, Cumulative CPU 145.81 sec
2015-03-05 13:49:28,043 Stage-1 map = 97%,  reduce = 28%, Cumulative CPU 147.61 sec
2015-03-05 13:49:29,075 Stage-1 map = 100%,  reduce = 28%, Cumulative CPU 147.67 sec
2015-03-05 13:50:06,153 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 148.22 sec
2015-03-05 13:50:09,249 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 149.17 sec
MapReduce Total cumulative CPU time: 2 minutes 29 seconds 170 msec
Ended Job = job_1423061535604_0004
MapReduce Jobs Launched: 
Job 0: Map: 13  Reduce: 1   Cumulative CPU: 149.68 sec   HDFS Read: 3418565375 HDFS Write: 2076 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 29 seconds 680 msec
OK
16600916        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.6.6.7 10.0.2.2        168166919       167772674       42465   52372   52372_tcp       (empty) (empty) (empty) 22767000 4553400 758900  379450
16600915        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.13.77.49      10.0.3.2        168643889       167772930       58200   32179   32179_tcp       (empty) (empty) (empty) 22767000 4553400 758900  379450
16600931        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.6.6.7 10.0.2.2        168166919       167772674       42464   33004   33004_tcp       (empty) (empty) (empty) 22767000 4553400 758900  379450
16600919        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.6.6.7 10.0.2.2        168166919       167772674       42465   38300   38300_tcp       (empty) (empty) (empty) 22767000 4553400 758900  379450
16600918        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.13.77.49      10.0.3.2        168643889       167772930       58199   25087   25087_tcp       (empty) (empty) (empty) 22767000 4553400 758900  379450
16600912        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.6.6.7 10.0.2.2        168166919       167772674       42465   556     remotefs        (empty) (empty) (empty) 22767000 4553400 758900  379450
16600913        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.13.77.49      10.0.3.2        168643889       167772930       58199   32179   32179_tcp       (empty) (empty) (empty) 22767000 4553400 758900  379450
16600920        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Info    Teardown        ASA-6-302014    TCP      10.6.6.7        172.10.0.2      168166919       2886336514      42706   3389    3389_tcp        inbound SYN Timeout      (empty) 22767000        4553400 758900  379450
16600930        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.13.77.49      10.0.3.2        168643889       167772930       58200   13817   13817_tcp       (empty) (empty) (empty) 22767000 4553400 758900  379450
16600911        2013-04-15 10:00:00     15/Apr/2013 10:00:00    1366020000      Warning Deny    ASA-4-106023    TCP     10.13.77.49      10.0.3.2        168643889       167772930       58200   1993    1993_tcp        (empty) (empty) (empty) 22767000 4553400 758900  379450
Time taken: 879.31 seconds, Fetched: 10 row(s)
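
Most of the 879 seconds goes into the global sort, which forces all 16.6 million rows through a single reducer. For a quick sanity check that does not need ordering, dropping the ORDER BY lets Hive fetch the first rows directly without a MapReduce job (a sketch; not captured in the original session):

hive> select * from firewall limit 10;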

4: Start the Spark SQL Thrift server

[jifeng@feng02 spark-1.2.0-bin-2.4.1]$ ./sbin/start-thriftserver.sh  --hiveconf hive.server2.thrift.port=10000  --hiveconf hive.server2.thrift.bind.host=feng02 --master spark://feng02:7077
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /home/jifeng/hadoop/spark-1.2.0-bin-2.4.1/sbin/../logs/spark-jifeng-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-feng02.out
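
If the server does not come up, the log file named in the message above is the first place to look (a suggested check, not from the original session):

[jifeng@feng02 spark-1.2.0-bin-2.4.1]$ tail -f logs/spark-jifeng-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-feng02.out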

Run beeline:

[jifeng@feng02 spark-1.2.0-bin-2.4.1]$ ./bin/beeline
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Beeline version ??? by Apache Hive

Connect:

beeline> !connect jdbc:hive2://feng02:10000 scott tiger org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://feng02:10000
log4j:WARN No appenders could be found for logger (org.apache.thrift.transport.TSaslTransport).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connected to: Spark SQL (version 1.2.0)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://feng02:10000> show tables;
+----------------+
|     result     |
+----------------+
| course         |
| firewall       |
| hbase_table_1  |
| pokes          |
| student        |
+----------------+
5 rows selected (0.489 seconds)
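
beeline can also run a single statement non-interactively, which is handy for scripting; the -u/-n/-p/-e flags are standard beeline options, though this invocation is a sketch not taken from the original session:

[jifeng@feng02 spark-1.2.0-bin-2.4.1]$ ./bin/beeline -u jdbc:hive2://feng02:10000 -n scott -p tiger -e 'show tables;'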


5: Test with Spark SQL

0: jdbc:hive2://feng02:10000> select * from firewall order by dt desc limit 10; 
+-----------+------------------------+-----------------------+--------------+-----------+------------+---------------+--+
|  idauto   |           dt           |         dtstr         | timeseconds  | priority  | operation  |  messagecode  |  |
+-----------+------------------------+-----------------------+--------------+-----------+------------+---------------+--+
| 16600918  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
| 16600914  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
| 16600917  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
| 16600912  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
| 16600913  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
| 16600915  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
| 16600916  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
| 16600911  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
| 16600920  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Info      | Teardown   | ASA-6-302014  |  |
| 16600919  | 2013-04-15 10:00:00.0  | 15/Apr/2013 10:00:00  | 1366020000   | Warning   | Deny       | ASA-4-106023  |  |
+-----------+------------------------+-----------------------+--------------+-----------+------------+---------------+--+
10 rows selected (119.692 seconds)


Count the occurrences of each source IP:

0: jdbc:hive2://feng02:10000> select srcIp,count(*) as num  from firewall  group by srcIp order by  num desc limit 100; 
+-----------------+----------+
|      srcIp      |   num    |
+-----------------+----------+
| 10.12.15.152    | 3875019  |
| 10.13.77.49     | 1954742  |
| 10.6.6.7        | 1769857  |
| 10.138.235.111  | 1630920  |
| 10.12.14.15     | 756475   |
| 10.17.15.10     | 720936   |
| 172.10.0.4      | 482568   |
| 10.78.100.150   | 300530   |
| 10.247.58.182   | 292384   |
| 10.38.217.48    | 282588   |
| 172.30.0.4      | 232850   |
| 10.10.11.102    | 205579   |
| 10.170.32.110   | 202735   |
| 10.170.32.181   | 200539   |
| 10.138.214.18   | 197864   |
| 10.247.106.27   | 193912   |
| 10.156.165.120  | 183929   |
| 10.15.7.85      | 172422   |
| 10.0.0.42       | 165063   |
| 10.200.20.2     | 164519   |
| 10.70.68.127    | 161735   |
| 172.20.0.15     | 145924   |
| 10.250.178.101  | 140053   |
| 172.20.0.4      | 122768   |
| 172.30.1.223    | 51329    |
| 172.20.1.81     | 51328    |
| 172.10.2.66     | 51304    |
| 172.20.1.47     | 51298    |
| 172.30.1.218    | 51236    |
| 172.10.2.135    | 51233    |
| 172.20.1.23     | 51199    |
| 172.10.2.106    | 51174    |
| 172.20.0.3      | 20795    |
| 172.10.0.2      | 16296    |
| 10.13.77.4      | 15622    |
| 10.77.101.2     | 15307    |
| 10.77.24.4      | 15274    |
| 10.0.2.181      | 15264    |
| 10.0.3.77       | 15086    |
| 10.58.99.97     | 15021    |
| 10.83.101.25    | 14997    |
| 10.62.15.101    | 14930    |
| 10.1.55.43      | 14783    |
| 172.20.0.2      | 14754    |
| 10.27.200.49    | 14690    |
| 10.18.37.65     | 14645    |
| 10.43.51.79     | 14457    |
| 172.30.0.3      | 11113    |
| 172.30.0.2      | 10129    |
| 172.10.0.3      | 9598     |
| 172.10.0.9      | 3899     |
| 172.10.0.8      | 3852     |
| 172.20.0.6      | 3833     |
| 172.10.0.7      | 3820     |
| 172.10.0.5      | 3805     |
| 10.57.84.113    | 3420     |
| 10.57.105.222   | 3329     |
| 10.170.32.239   | 3231     |
| 172.30.1.215    | 2973     |
| 172.30.1.248    | 2908     |
| 172.30.1.213    | 2890     |
| 172.30.1.217    | 2713     |
| 172.30.1.210    | 2686     |
| 172.20.0.5      | 2668     |
| 172.30.1.205    | 2639     |
| 10.76.187.67    | 2609     |
| 172.30.1.204    | 2556     |
| 172.30.1.227    | 2553     |
| 172.30.1.242    | 2525     |
| 172.30.1.236    | 2521     |
| 10.156.165.212  | 2513     |
| 10.88.52.24     | 2513     |
| 172.30.1.247    | 2513     |
| 10.41.9.89      | 2509     |
| 10.89.101.53    | 2503     |
| 10.88.25.4      | 2494     |
| 172.30.1.243    | 2473     |
| 172.30.1.245    | 2457     |
| 172.30.1.241    | 2442     |
| 10.51.88.167    | 2418     |
| 172.30.1.202    | 2415     |
| 172.30.1.237    | 2414     |
| 10.0.3.178      | 2414     |
| 10.36.156.99    | 2411     |
| 10.32.42.116    | 2410     |
| 10.78.90.9      | 2408     |
| 172.30.1.229    | 2408     |
| 10.33.0.15      | 2406     |
| 10.0.125.6      | 2400     |
| 10.138.78.75    | 2400     |
| 10.51.88.253    | 2400     |
| 10.38.37.13     | 2398     |
| 10.76.187.68    | 2397     |
| 10.76.193.186   | 2397     |
| 10.149.59.54    | 2396     |
| 10.170.148.52   | 2395     |
| 10.179.19.15    | 2395     |
| 10.57.105.44    | 2395     |
| 172.30.1.224    | 2394     |
| 10.38.37.229    | 2393     |
+-----------------+----------+
100 rows selected (65.642 seconds)



0: jdbc:hive2://feng02:10000> select count(*) from firewall;                                                
+-----------+
|    _c0    |
+-----------+
| 16600931  |
+-----------+
1 row selected (47.636 seconds)
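
For reference, the full-table ORDER BY that took 879 seconds in Hive completed in about 120 seconds through Spark SQL. When the same table will be queried repeatedly, Spark SQL can also pin it in its in-memory columnar cache first (a sketch; caching a ~3 GB table requires adequate executor memory across the cluster):

0: jdbc:hive2://feng02:10000> cache table firewall;

Subsequent queries against firewall are then served from memory until the table is uncached or the server restarts.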