Merging small Parquet files

Example of one of the small Parquet files to be merged:
./2024-7-26/0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq

hadoop jar ./parquet-tools-1.9.0.jar --help
WARNING: Use "yarn jar" to launch YARN applications.
usage: parquet-tools cat [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools head [option...] <input>
where option is one of:
       --debug          Enable debug output
    -h,--help           Show this help string
    -n,--records <arg>  The number of records to show (default: 5)
       --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools schema [option...] <input>
where option is one of:
    -d,--detailed  Show detailed information about the schema.
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file containing the schema to show

usage: parquet-tools meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools dump [option...] <input>
where option is one of:
    -c,--column <arg>  Dump only the given column, can be specified more than once
    -d,--disable-data  Do not dump column data
       --debug         Enable debug output
    -h,--help          Show this help string
    -m,--disable-meta  Do not dump row group and page metadata
    -n,--disable-crop  Do not crop the output based on console width
       --no-color      Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools merge [option...] <input> [<input> ...] <output>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
      <output> is the destination parquet file

View the schema:
hadoop jar ./parquet-tools-1.9.0.jar schema ./0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq
message schema {
  optional binary id;
  optional binary sn;
  optional binary mes_sn;
  optional binary line_code;
  optional binary section_code;
  optional binary station_code;
  optional binary station_slot;
  optional binary test_software_version;
  optional binary test_time;
  optional double elapsed_time;
  optional binary test_result;
  optional binary failitem;
  optional binary failitems;
  optional binary bg;
  optional binary bu;
  optional binary project_code;
  optional binary project_name;
}

View the contents (first 10 records):
hadoop jar ./parquet-tools-1.9.0.jar head -n 10 ./0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq

Merge the small Parquet files. The original files are not deleted; a new merged file is created:
hadoop jar ./parquet-tools-1.9.0.jar merge ./2024-7-26/ /tmp/all.parquet
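
Because the merge writes a new file and leaves the sources in place, it can be convenient to wrap the command in a small helper. The following is only a sketch under stated assumptions: the names merge_parquet_dir, PARQUET_TOOLS_JAR and DRY_RUN are hypothetical and are not parquet-tools options, and the check with hdfs dfs -test -e is a precaution that assumes the destination lives on HDFS.

```shell
# Sketch only: wraps the merge command shown above.
# merge_parquet_dir, PARQUET_TOOLS_JAR and DRY_RUN are hypothetical names,
# not part of parquet-tools.
PARQUET_TOOLS_JAR="${PARQUET_TOOLS_JAR:-./parquet-tools-1.9.0.jar}"

merge_parquet_dir() {
    src_dir="${1:-}"   # directory holding the small files, e.g. ./2024-7-26/
    dest="${2:-}"      # destination file, e.g. /tmp/all.parquet

    if [ -z "$src_dir" ] || [ -z "$dest" ]; then
        echo "usage: merge_parquet_dir <src_dir> <dest_file>" >&2
        return 2
    fi

    # With DRY_RUN=1, print the command instead of running it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "hadoop jar $PARQUET_TOOLS_JAR merge $src_dir $dest"
        return 0
    fi

    # Precaution (assumes an HDFS destination): stop if the file already exists.
    if hdfs dfs -test -e "$dest"; then
        echo "destination already exists: $dest" >&2
        return 1
    fi

    hadoop jar "$PARQUET_TOOLS_JAR" merge "$src_dir" "$dest"
}
```

Running DRY_RUN=1 merge_parquet_dir ./2024-7-26/ /tmp/all.parquet prints the command without executing it, which makes it easy to check the paths first.
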
Merge result:
hdfs dfs -du -h /tmp/all.parquet
280.6 M 841.7 M /tmp/all.parquet
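
In recent Hadoop versions, the first column of hdfs dfs -du is the file size and the second is the disk space consumed across all replicas, so the ratio of the two gives the effective replication factor. A minimal sketch of that arithmetic follows; the function name du_replication_ratio is hypothetical, and it assumes both sizes carry the same unit suffix, as in the line above.

```shell
# Sketch: replication factor implied by one line of `hdfs dfs -du -h` output.
# Assumes the layout "<size> <unit> <consumed> <unit> <path>" with equal units.
du_replication_ratio() {
    # $1 is the file size, $3 the space consumed by all replicas.
    awk '$1 > 0 { printf "%.1f\n", $3 / $1 }'
}

echo "280.6 M 841.7 M /tmp/all.parquet" | du_replication_ratio
# prints 3.0
```

A ratio of 3.0 here is consistent with the merged file being stored with three replicas.
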
