Test 1
When the catalog key is not spark_catalog:
spark-sql --master local \
--driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5060" \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.type=hive
CREATE TABLE IF NOT EXISTS employee_partition (
    id string,
    name string,
    age int,
    tel string
)
USING iceberg
PARTITIONED BY (
    city_id string
);
When the catalog key is not spark_catalog, inserting data into the table runs into problems; see the sketch below.
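A likely cause (an assumption, not verified here) is name resolution: with a separately named catalog, unqualified table names still resolve through the default spark_catalog, so DDL and DML generally need the catalog prefix. A minimal sketch, assuming a database named db exists in my_catalog, and declaring the partition column in the column list as the Iceberg docs do:

-- Qualify the table with the catalog name so the statement is routed to the
-- Iceberg SparkCatalog rather than the default session catalog.
-- (the database name "db" is an assumption for illustration)
CREATE TABLE IF NOT EXISTS my_catalog.db.employee_partition (
    id string,
    name string,
    age int,
    tel string,
    city_id string
)
USING iceberg
PARTITIONED BY (city_id);

INSERT INTO my_catalog.db.employee_partition
VALUES ('1', 'wang', 17, '130', 'beijing');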
Test 2
When the catalog key is spark_catalog (the standard setup):
spark-sql --master local \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive
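The table inspected below, iceberg_hive.employee_partition1, is not created in the original transcript; judging from the schema in the metadata JSON further down, its DDL would have looked roughly like this sketch:

CREATE TABLE IF NOT EXISTS iceberg_hive.employee_partition1 (
    id string,
    name string,
    age int,
    tel string,
    city_id string
)
USING iceberg
PARTITIONED BY (city_id);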
This is the correct setup. For a newly created, still-empty table, the TABLE_PARAMS table in the Hive metastore holds 7 rows:
mysql> select * from TABLE_PARAMS;
| 15 | EXTERNAL | TRUE |
| 15 | metadata_location | hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00000-5525ef8b-bc94-48f0-8dd3-5aeccb80ea0d.metadata.json |
| 15 | numFiles | 1 |
| 15 | owner | hadoop |
| 15 | table_type | ICEBERG |
| 15 | totalSize | 2008 |
| 15 | transient_lastDdlTime | 1632798925 |
(1) First write
spark-sql> INSERT OVERWRITE TABLE iceberg_hive.employee_partition1 PARTITION (city_id="beijing")
> VALUES
> ("1","wang",17,"130"),
> ("2","hong",18,"150"),
> ("3","bing",19,"170");
Time taken: 8.769 seconds
After inserting the partition's data, the metastore's PARTITIONS table is still empty. However, the rows in TABLE_PARAMS change (7 rows -> 9 rows).
mysql> select * from PARTITIONS;
Empty set (0.00 sec)
mysql> select * from TABLE_PARAMS;
| 15 | EXTERNAL | TRUE |
| 15 | metadata_location | hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00001-c02eda34-acc3-48ff-aa9d-d435734dbf6d.metadata.json |
| 15 | numFiles | 1 |
| 15 | numRows | 3 |
| 15 | owner | hadoop |
| 15 | previous_metadata_location | hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00000-5525ef8b-bc94-48f0-8dd3-5aeccb80ea0d.metadata.json |
| 15 | table_type | ICEBERG |
| 15 | totalSize | 1406 |
| 15 | transient_lastDdlTime | 1632798925 |
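The same bookkeeping can also be read from spark-sql itself through Iceberg's snapshots metadata table, without touching the metastore's MySQL backend (a sketch using the standard .snapshots suffix):

-- Every committed write shows up as one row; after the first insert there
-- should be a single snapshot with operation = 'overwrite'.
SELECT snapshot_id, operation, summary
FROM iceberg_hive.employee_partition1.snapshots;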
Storage changes on HDFS:
employee_partition1/
|-- data
| `-- city_id=beijing
| `-- 00000-0-c38fea92-1cdc-492b-9f09-a713d8e97d7f-00001.parquet
`-- metadata
|-- 00000-5525ef8b-bc94-48f0-8dd3-5aeccb80ea0d.metadata.json
|-- 00001-c02eda34-acc3-48ff-aa9d-d435734dbf6d.metadata.json
|-- e656716d-5e76-425a-abc9-d7b5a1e855b4-m0.avro
`-- snap-8454781130181280343-1-e656716d-5e76-425a-abc9-d7b5a1e855b4.avro
3 directories, 5 files
Diffing the two JSON files shows that while the table was empty, current-snapshot-id was -1 and the snapshot list was empty.
Contents of the new JSON file:
{
"format-version" : 1,
"table-uuid" : "22560303-6d75-42b3-b6ae-c71a38147a1e",
"location" : "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1",
"last-updated-ms" : 1632802184610,
"last-column-id" : 5,
"schema" : {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "id",
"required" : false,
"type" : "string"
}, {
"id" : 2,
"name" : "name",
"required" : false,
"type" : "string"
}, {
"id" : 3,
"name" : "age",
"required" : false,
"type" : "int"
}, {
"id" : 4,
"name" : "tel",
"required" : false,
"type" : "string"
}, {
"id" : 5,
"name" : "city_id",
"required" : false,
"type" : "string"
} ]
},
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "id",
"required" : false,
"type" : "string"
}, {
"id" : 2,
"name" : "name",
"required" : false,
"type" : "string"
}, {
"id" : 3,
"name" : "age",
"required" : false,
"type" : "int"
}, {
"id" : 4,
"name" : "tel",
"required" : false,
"type" : "string"
}, {
"id" : 5,
"name" : "city_id",
"required" : false,
"type" : "string"
} ]
} ],
"partition-spec" : [ {
"name" : "city_id",
"transform" : "identity",
"source-id" : 5,
"field-id" : 1000
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ {
"name" : "city_id",
"transform" : "identity",
"source-id" : 5,
"field-id" : 1000
} ]
} ],
"last-partition-id" : 1000,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"owner" : "hadoop"
},
"current-snapshot-id" : 8454781130181280343,
"snapshots" : [ {
"snapshot-id" : 8454781130181280343,
"timestamp-ms" : 1632802184610,
"summary" : {
"operation" : "overwrite",
"spark.app.id" : "local-1632800483195",
"added-data-files" : "1",
"added-records" : "3",
"added-files-size" : "1406",
"changed-partition-count" : "1",
"total-records" : "3",
"total-files-size" : "1406",
"total-data-files" : "1",
"total-delete-files" : "0",
"total-position-deletes" : "0",
"total-equality-deletes" : "0"
},
"manifest-list" : "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/snap-8454781130181280343-1-e656716d-5e76-425a-abc9-d7b5a1e855b4.avro",
"schema-id" : 0
} ],
"snapshot-log" : [ {
"timestamp-ms" : 1632802184610,
"snapshot-id" : 8454781130181280343
} ],
"metadata-log" : [ {
"timestamp-ms" : 1632798924774,
"metadata-file" : "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00000-5525ef8b-bc94-48f0-8dd3-5aeccb80ea0d.metadata.json"
} ]
}
The JSON file records:
- manifest-list (an avro file)
- metadata-file (the previous JSON file)
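The snapshot-log is also what backs Iceberg's history metadata table, so the same lineage is queryable with plain SQL (sketch):

-- One row per snapshot-log entry; parent_id is null for the first snapshot.
SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
FROM iceberg_hive.employee_partition1.history;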
(2) Second write
spark-sql> INSERT OVERWRITE TABLE iceberg_hive.employee_partition1 PARTITION (city_id="wuhan")
VALUES
("4","w",17,"130"),
("5","h",18,"150"),
("6","b",19,"170");
mysql> select * from TABLE_PARAMS;
| 15 | EXTERNAL | TRUE |
| 15 | metadata_location | hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00002-99d06b27-2ff0-4f22-a40f-5ce5cbe6c6ec.metadata.json |
| 15 | numFiles | 2 |
| 15 | numRows | 6 |
| 15 | owner | hadoop |
| 15 | previous_metadata_location | hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/00001-c02eda34-acc3-48ff-aa9d-d435734dbf6d.metadata.json |
| 15 | table_type | ICEBERG |
| 15 | totalSize | 2779 |
| 15 | transient_lastDdlTime | 1632798925 |
The rows in TABLE_PARAMS change:
- numFiles: the number of data files; goes from 1 to 2
- numRows: the total row count across all partitions; updated whenever the data changes
- previous_metadata_location: the previous metadata JSON file
- totalSize: the total size in bytes
Accordingly, HDFS gains one new data file, one new JSON file, and two new avro files. The avro and JSON files are analyzed separately below.
avro:
The second manifest file, 2cfc4831-b799-40eb-913d-548eede5c9ec-m0.avro:
{
"status": 1,
"snapshot_id": {
"long": 6656578434519846789
},
"data_file": {
"file_path": "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/data/city_id=wuhan/00000-2-53fed580-6321-4c03-bf95-1609ae4e287d-00001.parquet",
"file_format": "PARQUET",
"partition": {
"city_id": {
"string": "wuhan"
}
},
"record_count": 3,
"file_size_in_bytes": 1373,
"block_size_in_bytes": 67108864,
"column_sizes": {
"array": [{
"key": 1,
"value": 54
}, {
"key": 2,
"value": 54
}, {
"key": 3,
"value": 52
}, {
"key": 4,
"value": 57
}, {
"key": 5,
"value": 93
}]
},
"value_counts": {
"array": [{
"key": 1,
"value": 3
}, {
"key": 2,
"value": 3
}, {
"key": 3,
"value": 3
}, {
"key": 4,
"value": 3
}, {
"key": 5,
"value": 3
}]
},
"null_value_counts": {
"array": [{
"key": 1,
"value": 0
}, {
"key": 2,
"value": 0
}, {
"key": 3,
"value": 0
}, {
"key": 4,
"value": 0
}, {
"key": 5,
"value": 0
}]
},
"nan_value_counts": {
"array": []
},
"lower_bounds": {
"array": [{
"key": 1,
"value": "4"
}, {
"key": 2,
"value": "b"
}, {
"key": 3,
"value": "\u0011\u0000\u0000\u0000"
}, {
"key": 4,
"value": "130"
}, {
"key": 5,
"value": "wuhan"
}]
},
"upper_bounds": {
"array": [{
"key": 1,
"value": "6"
}, {
"key": 2,
"value": "w"
}, {
"key": 3,
"value": "\u0013\u0000\u0000\u0000"
}, {
"key": 4,
"value": "170"
}, {
"key": 5,
"value": "wuhan"
}]
},
"key_metadata": null,
"split_offsets": {
"array": [4]
},
"sort_order_id": {
"int": 0
}
}
}
This manifest contains only information from the second write, i.e. the wuhan partition; it holds nothing about the first write's partition.
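This one-manifest-per-write layout is visible in the files metadata table, where each data file carries its own partition tuple (a sketch; partition is a struct column, so its fields can be filtered directly):

SELECT file_path, partition.city_id, record_count
FROM iceberg_hive.employee_partition1.files
WHERE partition.city_id = 'wuhan';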
The corresponding second manifest list, snap-6656578434519846789-1-2cfc4831-b799-40eb-913d-548eede5c9ec.avro:
{
"manifest_path": "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/2cfc4831-b799-40eb-913d-548eede5c9ec-m0.avro",
"manifest_length": 6233,
"partition_spec_id": 0,
"added_snapshot_id": {
"long": 6656578434519846789
},
"added_data_files_count": {
"int": 1
},
"existing_data_files_count": {
"int": 0
},
"deleted_data_files_count": {
"int": 0
},
"partitions": {
"array": [{
"contains_null": false,
"contains_nan": {
"boolean": false
},
"lower_bound": {
"bytes": "wuhan"
},
"upper_bound": {
"bytes": "wuhan"
}
}]
},
"added_rows_count": {
"long": 3
},
"existing_rows_count": {
"long": 0
},
"deleted_rows_count": {
"long": 0
}
} {
"manifest_path": "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/e656716d-5e76-425a-abc9-d7b5a1e855b4-m0.avro",
"manifest_length": 6241,
"partition_spec_id": 0,
"added_snapshot_id": {
"long": 8454781130181280343
},
"added_data_files_count": {
"int": 1
},
"existing_data_files_count": {
"int": 0
},
"deleted_data_files_count": {
"int": 0
},
"partitions": {
"array": [{
"contains_null": false,
"contains_nan": {
"boolean": false
},
"lower_bound": {
"bytes": "beijing"
},
"upper_bound": {
"bytes": "beijing"
}
}]
},
"added_rows_count": {
"long": 3
},
"existing_rows_count": {
"long": 0
},
"deleted_rows_count": {
"long": 0
}
}
This manifest list actually covers both partitions: it references the manifests from the first write (beijing) and the second write (wuhan).
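The manifests metadata table shows the same two entries, including the per-partition bounds that enable pruning (sketch):

-- One row per manifest of the current snapshot; partition_summaries holds
-- the lower/upper bounds seen in the dump above (beijing and wuhan).
SELECT path, added_data_files_count, partition_summaries
FROM iceberg_hive.employee_partition1.manifests;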
(3) Third write (an UPDATE)
UPDATE iceberg_hive.employee_partition1 SET tel = 'newNum', age = 1 WHERE name = 'b';
Look at the new manifest list first, snap-2173158888179895109-1-c701372c-0d5f-47ed-aacd-7064c94f61d9.avro. It contains 3 manifest entries, recording each partition's state and the most recent change: the update rewrote the wuhan partition (one manifest adds the rewritten file, another marks the old file deleted), while the beijing manifest from the first write is carried over unchanged.
{
"manifest_path": "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/c701372c-0d5f-47ed-aacd-7064c94f61d9-m1.avro",
"manifest_length": 6237,
"partition_spec_id": 0,
"added_snapshot_id": {
"long": 2173158888179895109
},
"added_data_files_count": {
"int": 1
},
"existing_data_files_count": {
"int": 0
},
"deleted_data_files_count": {
"int": 0
},
"partitions": {
"array": [{
"contains_null": false,
"contains_nan": {
"boolean": false
},
"lower_bound": {
"bytes": "wuhan"
},
"upper_bound": {
"bytes": "wuhan"
}
}]
},
"added_rows_count": {
"long": 3
},
"existing_rows_count": {
"long": 0
},
"deleted_rows_count": {
"long": 0
}
} {
"manifest_path": "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/c701372c-0d5f-47ed-aacd-7064c94f61d9-m0.avro",
"manifest_length": 6232,
"partition_spec_id": 0,
"added_snapshot_id": {
"long": 2173158888179895109
},
"added_data_files_count": {
"int": 0
},
"existing_data_files_count": {
"int": 0
},
"deleted_data_files_count": {
"int": 1
},
"partitions": {
"array": [{
"contains_null": false,
"contains_nan": {
"boolean": false
},
"lower_bound": {
"bytes": "wuhan"
},
"upper_bound": {
"bytes": "wuhan"
}
}]
},
"added_rows_count": {
"long": 0
},
"existing_rows_count": {
"long": 0
},
"deleted_rows_count": {
"long": 3
}
} {
"manifest_path": "hdfs://ns1/user/wanghongbing/db/iceberg_hive/employee_partition1/metadata/e656716d-5e76-425a-abc9-d7b5a1e855b4-m0.avro",
"manifest_length": 6241,
"partition_spec_id": 0,
"added_snapshot_id": {
"long": 8454781130181280343
},
"added_data_files_count": {
"int": 1
},
"existing_data_files_count": {
"int": 0
},
"deleted_data_files_count": {
"int": 0
},
"partitions": {
"array": [{
"contains_null": false,
"contains_nan": {
"boolean": false
},
"lower_bound": {
"bytes": "beijing"
},
"upper_bound": {
"bytes": "beijing"
}
}]
},
"added_rows_count": {
"long": 3
},
"existing_rows_count": {
"long": 0
},
"deleted_rows_count": {
"long": 0
}
}
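Because every snapshot is retained, the UPDATE can be undone by rolling the table back to the previous snapshot. A sketch using Iceberg's rollback_to_snapshot stored procedure (available when the same SQL extensions are loaded; the snapshot id is the second write's, taken from the manifest dump above):

-- Restores current-snapshot-id to the pre-update snapshot; no data files
-- are deleted, only the table's metadata pointer moves.
CALL spark_catalog.system.rollback_to_snapshot('iceberg_hive.employee_partition1', 6656578434519846789);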
File structure
1. snapshot: the JSON files under the metadata directory. Each metadata .json file corresponds to one snapshot and contains that version's schema, information about the job that produced it, and the location of its manifest-list file.
2. manifest-list file: the snap-xxx.avro file. It holds metadata about all manifests belonging to a snapshot: each manifest's path, its partition value range, and some statistics.
3. manifest file: the xxx-m0.avro files. They record the path of every data file, so reads never need to list directories. They also carry partition information and per-column statistics: stored size, row count, null-value count, and lower/upper bounds for each column. Upper-layer engines can use these for join cache optimizations and file-level pushdown filtering, and the precise partition information greatly speeds up query planning.
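The column-level statistics described in point 3 can be pulled per data file from the files metadata table (sketch):

-- value_counts, null_value_counts, lower_bounds and upper_bounds mirror the
-- fields seen in the manifest avro dump above.
SELECT file_path, column_sizes, value_counts, null_value_counts,
       lower_bounds, upper_bounds
FROM iceberg_hive.employee_partition1.files;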