Testing OpenZL, Meta's open-source lossless compression framework

I saw this news on IT之家 (ITHome) and then downloaded the source code from the OpenZL GitHub repository.

The build is simple: extract the source to /par/openzl and run make. To speed it up I used make -j 6.

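Usage: the compiled command-line tool, zli, ends up in the cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b/ directory, which is a bit odd but does not affect testing. Following the steps on the official page https://openzl.org/getting-started/quick-start/, I compressed a CSV file.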

compiling single-threaded static library 1.5.7
make[1]: Leaving directory '/par/openzl/deps/zstd/lib'
LD cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b/zli


root@6ae32a5ffcde:/par/openzl# cd cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ./zli  list-profiles
Available profiles:
  -| csv        = CSV. Pass optional non-comma separator with --profile-arg <char>.
  -| le-i16     = Little-endian signed 16-bit data
  -| le-i32     = Little-endian signed 32-bit data
  -| le-i64     = Little-endian signed 64-bit data
  -| le-u16     = Little-endian unsigned 16-bit data
  -| le-u32     = Little-endian unsigned 32-bit data
  -| le-u64     = Little-endian unsigned 64-bit data
  -| parquet    = Parquet in the canonical format (no compression, plain encoding)
  -| pytorch    = Pytorch model generated from torch.save(). Training is not supported.
  -| sao        = SAO format from the Silesia corpus
  -| sddl       = Data that can be parsed using the Simple Data Description Language. Pass a path to the data description file with --profile-arg.
  -| serial     = Serial data (aka raw bytes)

root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ./zli compress --profile csv /par/fgduck1000w.csv --output /par/fgduck1000w.csv.zl
src/openzl/codecs/dispatch_string/encode_dispatch_string_binding.c:74: EI_dispatch_string: splitting 150000001 strings into 8 outputs
Compressed 485554801 -> 26960807 (18.01x) in 4176.355 ms, 110.88 MiB/s


root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# head -n 3  /par/fgduck1000w.csv
passenger_id,departure_station,arrival_station,train_id,COACH_NUMBER,SEAT_NUMBER
P00000001,城市1,城市1,G101,11,19F
P00000002,城市1,城市2,G2,3,7A


root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# time zstd -T0 /par/fgduck1000w.csv
/par/fgduck1000w.csv : 22.10%   (   463 MiB =>    102 MiB, /par/fgduck1000w.csv.zst)

real    0m2.217s
user    0m1.346s
sys     0m0.257s
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ./zli decompress /par/fgduck1000w.csv.zl --output /par/fgduck1000w.csv.zl.decompressed
Decompressed: 5.55% (  25.71 MiB ->  463.06 MiB) in 2148.338 ms, 215.54 MiB/s
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# md5sum /par/fgduck1000w.csv /par/fgduck1000w.csv.zl.decompressed
493100bbc45d59304c78e72bc30ee34d  /par/fgduck1000w.csv
493100bbc45d59304c78e72bc30ee34d  /par/fgduck1000w.csv.zl.decompressed
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b#
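
The md5sum comparison above is what shows the round trip was lossless. A minimal sketch of the same check in Python follows, for cases where it needs to be scripted. The function names file_md5 and round_trip_ok are hypothetical helpers made up for this illustration, not part of OpenZL or zli.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file,
        # so a file of several hundred MiB is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_ok(original_path, decompressed_path):
    """True when both files produce the same MD5 digest."""
    return file_md5(original_path) == file_md5(decompressed_path)
```

Given the matching digests in the transcript, round_trip_ok("/par/fgduck1000w.csv", "/par/fgduck1000w.csv.zl.decompressed") would be expected to return True on the same files.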

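Although the compression time of about 4.2 seconds is roughly twice that of zstd (about 2.2 seconds), the compression ratio is remarkable: about 18x, versus only about 4.5x for zstd.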
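
The figures in this comparison can be recomputed from the numbers printed in the transcript. The sketch below does that arithmetic; the helper names are hypothetical and exist only for this illustration. One caveat: the zli time is the tool's self-reported 4176.355 ms, while the zstd time is the wall-clock "real" value from time, so the two are only roughly comparable.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Ratio as original size divided by compressed size."""
    return original_bytes / compressed_bytes


def ratio_from_percent(percent):
    """Convert zstd's 'compressed is N% of original' figure into an Nx ratio."""
    return 100.0 / percent


def throughput_mib_s(num_bytes, millis):
    """Throughput in MiB/s for num_bytes processed in millis milliseconds."""
    return (num_bytes / (1024 * 1024)) / (millis / 1000.0)


# Numbers copied from the transcript above.
openzl_ratio = compression_ratio(485554801, 26960807)
zstd_ratio = ratio_from_percent(22.10)
openzl_speed = throughput_mib_s(485554801, 4176.355)
time_factor = 4.176355 / 2.217  # zli self-reported seconds / zstd wall-clock seconds

print(f"OpenZL ratio:  {openzl_ratio:.2f}x")        # 18.01x, matches zli's output
print(f"zstd ratio:    {zstd_ratio:.2f}x")          # 4.52x
print(f"OpenZL speed:  {openzl_speed:.2f} MiB/s")   # 110.88 MiB/s, matches zli's output
print(f"time factor:   {time_factor:.2f}x")         # 1.88x
```

So "twice as long" is closer to 1.9x, and the OpenZL output is about a quarter the size of the zstd output (26960807 bytes versus roughly 102 MiB).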
Next I used duckdb to generate Parquet files for another test, but compression failed with an error:

D load tpch;
D call dbgen(sf=1);
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
D copy lineitem to '/par/nocomplineitem.parquet' (FORMAT parquet, COMPRESSION uncompressed);
D copy lineitem to '/par/zstdlineitem.parquet' (FORMAT parquet, COMPRESSION zstd);
D .exit
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ls -l /par/nocomplineitem.parquet /par/zstdlineitem.parquet
-rwxrwxrwx 1 root root 664336490 Oct  7 12:11 /par/nocomplineitem.parquet
-rwxrwxrwx 1 root root 163604034 Oct  7 12:12 /par/zstdlineitem.parquet
root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b# ./zli compress --profile parquet /par/nocomplineitem.parquet --output /par/nocomplineitem.parquet.zl
OpenZL Library Exception:
        OpenZL error code: 1
OpenZL error string: Generic
OpenZL error context: Code: Generic
Message: Attaching to pre-existing error:
Graph ID: 55
Stack Trace:
        #0 ZL_ParquetLexer_lex (custom_parsers/parquet/parquet_lexer.cpp:357): Attaching to pre-existing error:
        #1 parquetGraphInner (custom_parsers/parquet/parquet_graph.c:103): Forwarding error:
        #2 CCTX_runGraph_internal (src/openzl/compress/cctx.c:770): Forwarding error:
        #3 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1148): Forwarding error:
        #4 CCTX_startCompression (src/openzl/compress/cctx.c:1274): Forwarding error:
        #5 CCTX_compressInputs_withGraphSet_stage2 (src/openzl/compress/compress2.c:116): Forwarding error:


root@6ae32a5ffcde:/par/openzl/cachedObjs/72e0b48f22c972c6cc2a1df9ed3b4b2b#