Test 1
When the catalog key is not spark_catalog
spark-sql --master local \
--driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5060" \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.type=hive
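If the Hive metastore is not picked up from a hive-site.xml on the classpath, the catalog can also be pointed at it explicitly. This extra flag is an assumption, not part of the original command, and the thrift host below is a placeholder:

--conf spark.sql.catalog.my_catalog.uri=thrift://metastore-host:9083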
create table IF NOT EXISTS employee_partition (
  id string,
  name string,
  age int,
  tel string
)
using iceberg
PARTITIONED BY (
  city_id string
);


When the catalog key is not spark_catalog, inserting data runs into problems.
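A likely cause (my reading, not stated in the original): with the catalog registered under the name my_catalog, an unqualified table name resolves against the default spark_catalog, so the Iceberg table has to be addressed through its own catalog. A minimal sketch, assuming the table was created in the default namespace:

-- qualify the identifier with the catalog name; city_id is appended to the
-- schema by the typed PARTITIONED BY clause, so it is the last column
INSERT INTO my_catalog.default.employee_partition
VALUES ('1', 'wang', 17, '130', 'beijing');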
Test 2
When the key is spark_catalog, this is the standard setup.
spark-sql --master local \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive
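SparkSessionCatalog wraps the built-in session catalog, so both Iceberg and non-Iceberg tables resolve under spark_catalog without qualification. The DDL below reconstructs the creation step the original omits, inferred from the table name and HDFS paths that appear later; treat it as an assumption:

CREATE DATABASE IF NOT EXISTS iceberg_hive;
CREATE TABLE IF NOT EXISTS iceberg_hive.employee_partition1 (
  id string,
  name string,
  age int,
  tel string
)
USING iceberg
PARTITIONED BY (city_id string);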


This is the correct case: right after the empty table is created, TABLE_PARAMS holds 7 rows for it:
mysql> select * from TABLE_PARAMS;
| TBL_ID | PARAM_KEY             | PARAM_VALUE |
| 15     | EXTERNAL              | TRUE |
| 15     | metadata_location     | hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00000-5525ef8b-bc94-48f0-8dd3-5aeccb80ea0d.metadata.json |
| 15     | numFiles              | 1 |
| 15     | owner                 | hadoop |
| 15     | table_type            | ICEBERG |
| 15     | totalSize             | 2008 |
| 15     | transient_lastDdlTime | 1632798925 |
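Note that the Hive metastore stores only a pointer: metadata_location names the current metadata file, and each Iceberg commit atomically swaps this value. To watch the pointer from MySQL, a join like the following works against the standard metastore schema (TBLS and TABLE_PARAMS are the stock Hive tables; the WHERE clause is just for this example):

SELECT t.TBL_NAME, p.PARAM_VALUE
FROM TBLS t
JOIN TABLE_PARAMS p ON t.TBL_ID = p.TBL_ID
WHERE p.PARAM_KEY = 'metadata_location';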
(1) First data insert
spark-sql> INSERT OVERWRITE TABLE iceberg_hive.employee_partition1 PARTITION (city_id="beijing")
> VALUES
> ("1","wang",17,"130"),
> ("2","hong",18,"150"),
> ("3","bing",19,"170");
Time taken: 8.769 seconds
After inserting the partitioned data, the metastore's PARTITIONS table is still empty; Iceberg tracks partition information in its own manifest files rather than in the Hive metastore. TABLE_PARAMS, however, did change (7 rows -> 9 rows): numRows and previous_metadata_location appeared, and metadata_location now points at the new metadata file.
mysql> select * from PARTITIONS;
Empty set (0.00 sec)
mysql> select * from TABLE_PARAMS;
| TBL_ID | PARAM_KEY                  | PARAM_VALUE |
| 15     | EXTERNAL                   | TRUE |
| 15     | metadata_location          | hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00001-c02eda34-acc3-48ff-aa9d-d435734dbf6d.metadata.json |
| 15     | numFiles                   | 1 |
| 15     | numRows                    | 3 |
| 15     | owner                      | hadoop |
| 15     | previous_metadata_location | hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00000-5525ef8b-bc94-48f0-8dd3-5aeccb80ea0d.metadata.json |
| 15     | table_type                 | ICEBERG |
| 15     | totalSize                  | 1406 |
| 15     | transient_lastDdlTime      | 1632798925 |
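Since the metastore knows nothing about Iceberg partitions or snapshots, that state is queried through Iceberg's metadata tables in Spark rather than through MySQL. These table suffixes are standard Iceberg metadata tables:

SELECT * FROM iceberg_hive.employee_partition1.partitions;  -- per-partition stats
SELECT * FROM iceberg_hive.employee_partition1.snapshots;   -- one snapshot after this insert
SELECT * FROM iceberg_hive.employee_partition1.history;     -- current-snapshot lineage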
Changes in HDFS storage:
employee_partition1/
|-- data
| `-- city_id=beijing
| `-- 00000-0-c38fea92-1cdc-492b-9f09-a713d8e97d7f-00001.parquet
`-- metadata
|-- 00000-5525ef8b-bc94-48f0-8dd3-5aeccb80ea0d.metadata.json
|-- 00001-c02eda34-acc3-48ff-aa9d-d435734dbf6d.metadata.json
|-- e656716d-5e76-425a-abc9-d7b5a1e855b4-m0.avro
`-- snap-8454781130181280343-1-e656716d-5e76-425a-abc9-d7b5a1e855b4.avro
3 directories, 5 files
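The two .avro files are the snapshot's manifest list (the snap-*.avro file, whose name embeds the snapshot id 8454781130181280343) and a manifest (*-m0.avro) listing the new parquet data file. They can be dumped with Avro's CLI; the avro-tools version below is illustrative:

java -jar avro-tools-1.10.2.jar tojson \
  snap-8454781130181280343-1-e656716d-5e76-425a-abc9-d7b5a1e855b4.avro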

Diffing the two metadata JSON files shows that for the empty table the current snapshot id is -1 and the snapshot list is empty.
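A sketch of the relevant part of that diff; the snapshot id comes from the snap-* file name above, and the surrounding fields are abbreviated:

-  "current-snapshot-id" : -1,
-  "snapshots" : [ ],
+  "current-snapshot-id" : 8454781130181280343,
+  "snapshots" : [ {
+    "snapshot-id" : 8454781130181280343,
+    ...
+  } ],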

Inspecting the metadata JSON file (excerpt):
{
  "format-version" : 1,
  "table-uuid" : "22560303-6d75-42b3-b6ae-c71a38147a1e",
  "location" : "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1",
  "last-updated-ms"

This post has walked through table creation, partitioned writes, metadata changes, and snapshot management with Spark SQL and Iceberg, covering table parameter changes, data file additions and removals, and avro file parsing.