Hive 7. Buckets

This article introduces bucketed tables in Hive and their advantages, including more efficient queries and more efficient sampling. It shows, through concrete examples, how to create bucketed tables and how they are used in practice.


Buckets

Understanding

Bucketing is another way, besides partitioning, of organizing Hive tables.
Bucketing works much like the map-side partitioning in a MapReduce shuffle: some key piece of each record is hashed (in MapReduce it is the key; for buckets it is a column you designate), the hash value is taken modulo some number (in MapReduce, the number of reduce tasks; for buckets, a count you specify), and the row is then written to the file corresponding to the result.
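
As a rough illustration of this mechanism, you can compute the bucket a row would land in yourself. This is only a sketch, assuming Hive's built-in hash() and pmod() functions (pmod keeps the result non-negative); for primitive integers the hash value is simply the value itself:

-- Which of 4 buckets would each row of users fall into?
SELECT id, name, pmod(hash(id), 4) AS bucket
FROM users;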

Advantages

Bucketing brings two benefits:

  1. Queries become more efficient.
    Buckets impose extra structure on the table, which Hive can exploit when answering exact-match queries. Also, two tables that are to be joined can be bucketed on the same (join) columns, which makes map-side joins much more efficient, because rows with equal values are hashed and taken modulo into identically numbered files (a join sketch follows this list).
  2. Sampling becomes more efficient. When you are developing or tuning a program against a very large dataset, being able to validate it on a small, representative sample of the data is very convenient.
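
To actually get the efficient map-side join from the first point, both tables must be bucketed on the join column with compatible bucket counts, and the bucket map join optimization must be switched on. A hedged sketch, assuming a second table bucketed_orders that is CLUSTERED BY (user_id) INTO 4 BUCKETS (hypothetical; it does not appear elsewhere in this article):

SET hive.optimize.bucketmapjoin = true;

SELECT u.name, o.order_id
FROM bucketed_users u
JOIN bucketed_orders o    -- hypothetical table bucketed the same way
  ON u.id = o.user_id;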

Notes

  • Set hive.enforce.bucketing to true, so that Hive actually populates the number of buckets you declared (see the settings snippet after this list).

  • You also need to raise, in advance, the JVM heap that Hive allocates when it launches MapReduce tasks. Hive's default is 200 MB, which is too small.
    Set the mapred.child.java.opts property to
    -Xmx1024m, i.e. one gigabyte.
    Otherwise the tasks may run out of memory.
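
Putting the two notes together, a typical session would start with the following settings (a sketch; the 1 GB heap is only a suggested value):

SET hive.enforce.bucketing = true;        -- honor the bucket count declared in the table DDL
SET mapred.child.java.opts = -Xmx1024m;   -- raise the task JVM heap from the 200 MB default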

Creating a bucketed table

Use the following DDL to tell Hive that you want the table to be bucketed.

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

The CLUSTERED BY clause names the column that bucketing is based on, and INTO ... BUCKETS gives the number of buckets.

Sorting the data within buckets

This makes joins even more efficient.

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

Note
Hive does not check, at insert time, that the data files written to disk conform to the table's bucket definition; a mismatch is only detected when the data is read. This is the so-called schema-on-read approach.

Start with an unbucketed users table.

hive> SELECT * FROM users;
0   Nat
2   Joe
3   Kay
4   Ann

Then insert the data into the bucketed_users table.

INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;
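
On older Hive releases, where hive.enforce.bucketing is not honored, the same effect can be achieved manually by matching the reducer count to the bucket count and clustering the insert yourself. A sketch, under that assumption:

SET mapred.reduce.tasks = 4;    -- one reducer per bucket

INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users
CLUSTER BY id;                  -- routes each row to reducer hash(id) mod 4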

Physically, each bucket is simply a file in the table (or partition) directory. The file name is not significant, but bucket n is the nth file in lexicographic order. In fact, buckets correspond to MapReduce output file partitions: a job produces as many buckets (output files) as it has reduce tasks. We can see this by inspecting the layout of the bucketed_users table we just created. Run the following command:

hive> dfs -ls /your-hive-data-dir/bucketed_users;

You will then find the following files:

000000_0
000001_0
000002_0
000003_0

Looking at the first file, the contents are:

4   Ann
0   Nat

Because hashing an integer yields the integer itself, the first file holds exactly the rows whose id is 0 modulo 4.

Sampling

hive> SELECT * FROM bucketed_users
    > TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);

That is, this samples roughly a quarter of the data (one bucket out of four).

hive> SELECT * FROM bucketed_users
    > TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);

This samples roughly half: with four buckets, Hive reads every second bucket, i.e. buckets 1 and 3.

Note: sampling is not an exact operation; you will not get precisely the stated fraction, and how representative the sample is depends on how your data is distributed over the bucketing column.
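
TABLESAMPLE also works on a table that is not bucketed at all, by bucketing on rand(); in that case the whole table has to be scanned, whereas sampling a bucketed table on its bucketing column only reads the bucket files it needs. A sketch on the unbucketed users table:

hive> SELECT * FROM users
    > TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());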


End!!!
