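I came across this news on IT之家 (ITHome), then downloaded the source code from the OpenZL GitHub repository.
Building is simple: extract the archive to /par/openzl and run make. To speed things up I used make -j 6.
Usage: the compiled command-line tool ends up in the cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b/ directory, which is a bit odd but does not affect testing. Following the steps on the official page https://openzl.org/getting-started/quick-start/, I compressed a CSV file.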
compiling single-threaded static library 1.5.7
make[1]: Leaving directory '/par/openzl/deps/zstd/lib'
LD cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b/zli
root@6ae32a5ffcde:/par/openzl# cd cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ./zli list-profiles
Available profiles:
-| csv = CSV. Pass optional non-comma separator with --profile-arg <char>.
-| le-i16 = Little-endian signed 16-bit data
-| le-i32 = Little-endian signed 32-bit data
-| le-i64 = Little-endian signed 64-bit data
-| le-u16 = Little-endian unsigned 16-bit data
-| le-u32 = Little-endian unsigned 32-bit data
-| le-u64 = Little-endian unsigned 64-bit data
-| parquet = Parquet in the canonical format (no compression, plain encoding)
-| pytorch = Pytorch model generated from torch.save(). Training is not supported.
-| sao = SAO format from the Silesia corpus
-| sddl = Data that can be parsed using the Simple Data Description Language. Pass a path to the data description file with --profile-arg.
-| serial = Serial data (aka raw bytes)
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ./zli compress --profile csv /par/fgduck1000w.csv --output /par/fgduck1000w.csv.zl
src/openzl/codecs/dispatch_string/encode_dispatch_string_binding.c:74: EI_dispatch_string: splitting 150000001 strings into 8 outputs
Compressed 485554801 -> 26960807 (18.01x) in 4176.355 ms, 110.88 MiB/s
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# head -n 3 /par/fgduck1000w.csv
passenger_id,departure_station,arrival_station,train_id,COACH_NUMBER,SEAT_NUMBER
P00000001,城市1,城市1,G101,11,19F
P00000002,城市1,城市2,G2,3,7A
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# time zstd -T0 /par/fgduck1000w.csv
/par/fgduck1000w.csv : 22.10% ( 463 MiB => 102 MiB, /par/fgduck1000w.csv.zst)
real 0m2.217s
user 0m1.346s
sys 0m0.257s
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ./zli decompress /par/fgduck1000w.csv.zl --output /par/fgduck1000w.csv.zl.decompressed
Decompressed: 5.55% ( 25.71 MiB -> 463.06 MiB) in 2148.338 ms, 215.54 MiB/s
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# md5sum /par/fgduck1000w.csv /par/fgduck1000w.csv.zl.decompressed
493100bbc45d59304c78e72bc30ee34d /par/fgduck1000w.csv
493100bbc45d59304c78e72bc30ee34d /par/fgduck1000w.csv.zl.decompressed
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b#
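The md5sum comparison above is what confirms the round trip is lossless. For anyone who wants to script the same check, here is a minimal sketch in Python using only the standard library. The helper names `file_md5` and `same_content` are my own and are not part of OpenZL.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have the same size and the same MD5 digest."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different sizes can never match, skip hashing
    return file_md5(path_a) == file_md5(path_b)
```

Calling `same_content("/par/fgduck1000w.csv", "/par/fgduck1000w.csv.zl.decompressed")` should agree with the md5sum result shown in the transcript.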
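Although the compression time of about 4.2 seconds is roughly twice that of zstd (about 2.2 seconds), the compression ratio is striking: about 18x, versus only about 4.5x for zstd.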
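The ratios quoted here can be recomputed from the numbers the two tools printed. The short sketch below does the arithmetic. The input values are copied from the transcript above; the variable and function names are just for illustration.

```python
MIB = 1024 * 1024

# Figures copied from the zli and zstd output in the transcript
original_bytes = 485554801
openzl_bytes = 26960807
openzl_seconds = 4176.355 / 1000  # zli reports milliseconds
zstd_percent = 22.10              # zstd reports compressed size as % of original


def ratio(original, compressed):
    """Compression ratio: how many times smaller the output is."""
    return original / compressed


def throughput_mib_s(n_bytes, seconds):
    """Throughput in MiB/s, measured against the uncompressed size."""
    return n_bytes / MIB / seconds


openzl_ratio = ratio(original_bytes, openzl_bytes)
zstd_ratio = 100.0 / zstd_percent  # a percentage p corresponds to a ratio of 100/p
openzl_speed = throughput_mib_s(original_bytes, openzl_seconds)

print(f"OpenZL ratio: {openzl_ratio:.2f}x")
print(f"zstd ratio:   {zstd_ratio:.2f}x")
print(f"OpenZL speed: {openzl_speed:.2f} MiB/s")
```

Note that zstd was run with -T0 (all cores) and timed with the shell's time builtin, while the zli figure is the tool's own report, so the timing comparison is only approximate.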
Next I used DuckDB to generate Parquet files for testing, but compression failed with an error:
D load tpch;
D call dbgen(sf=1);
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows │
└─────────┘
D copy lineitem to '/par/nocomplineitem.parquet' (FORMAT parquet, COMPRESSION uncompressed);
D copy lineitem to '/par/zstdlineitem.parquet' (FORMAT parquet, COMPRESSION zstd);
D .exit
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ls -l /par/nocomplineitem.parquet /par/zstdlineitem.parquet
-rwxrwxrwx 1 root root 664336490 Oct 7 12:11 /par/nocomplineitem.parquet
-rwxrwxrwx 1 root root 163604034 Oct 7 12:12 /par/zstdlineitem.parquet
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ./zli compress --profile parquet /par/nocomplineitem.parquet --output /par/nocomplineitem.parquet.zl
OpenZL Library Exception:
OpenZL error code: 1
OpenZL error string: Generic
OpenZL error context: Code: Generic
Message: Attaching to pre-existing error:
Graph ID: 55
Stack Trace:
#0 ZL_ParquetLexer_lex (custom_parsers/parquet/parquet_lexer.cpp:357): Attaching to pre-existing error:
#1 parquetGraphInner (custom_parsers/parquet/parquet_graph.c:103): Forwarding error:
#2 CCTX_runGraph_internal (src/openzl/compress/cctx.c:770): Forwarding error:
#3 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1148): Forwarding error:
#4 CCTX_startCompression (src/openzl/compress/cctx.c:1274): Forwarding error:
#5 CCTX_compressInputs_withGraphSet_stage2 (src/openzl/compress/compress2.c:116): Forwarding error:
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b#
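The stack trace does not say what the lexer rejected. One cheap thing to rule out is a truncated or malformed file: a regular (non-encrypted) Parquet file starts with the 4-byte magic PAR1 and ends with a 4-byte little-endian footer length followed by PAR1 again. The sketch below checks only that framing; the function name `parquet_footer_length` is my own. It does not diagnose the actual failure. The profile description says it expects "no compression, plain encoding", and this check cannot tell whether the DuckDB output meets that.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def parquet_footer_length(path):
    """Return the footer metadata length if the file has valid Parquet framing.

    Raises ValueError when the leading/trailing magic is missing or the
    recorded footer length cannot fit in the file.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)          # jump to end of file to learn its size
        size = f.tell()
        if size < 12:         # 4 (magic) + 4 (length) + 4 (magic) at minimum
            raise ValueError("file too small to be Parquet")
        f.seek(-8, 2)         # last 8 bytes: footer length + trailing magic
        tail = f.read(8)
    if head != PARQUET_MAGIC or tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at start or end")
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        raise ValueError("footer length larger than file")
    return footer_len
```

If `parquet_footer_length("/par/nocomplineitem.parquet")` returns without raising, the file framing is intact and the problem lies in what the lexer supports rather than in a damaged file.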





