核心思想:利用duckdb tpch插件内置的tpch_queries()表函数输出查询Sql语句到qs.txt,然后读入生成的qs.txt, 将结果输出到res.txt, 在控制台输出计时。
autotpch.txt脚本如下:
LOAD tpch;
PRAGMA disable_progress_bar;
CALL dbgen(sf = 0.3);
.output qs.txt
.mode list
.header off
select query FROM tpch_queries() limit 3;
.output
.timer on
.output res.txt
.read qs.txt
.output
再在操作系统命令行执行
# windows
type autotpch.txt|duckdb
# Linux
cat autotpch.txt|duckdb
或者在duckdb命令提示符下输入.read autotpch.txt
运行结果如下:
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows │
└─────────┘
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows │
└─────────┘
Run Time (s): real 0.013 user 0.093750 sys 0.093750
Run Time (s): real 0.014 user 0.000000 sys 0.000000
Run Time (s): real 0.011 user 0.015625 sys 0.000000
如果需要对外部parquet文件,而不是数据库内的表执行查询,可以利用如下的脚本, 假定所有的parquet文件都位于C:/d/fire/data/tables_pyarrow/scale-0.1/目录,不需要load tpch, 只要用前一步得到的qs.txt文件。
PRAGMA disable_progress_bar;
.mode list
.header off
.output creaview.txt
with ta(path) as (select 'C:/d/fire/data/tables_pyarrow/scale-0.1/'),
tb(tname) as (from unnest(['nation','region','customer','supplier','lineitem','orders','partsupp','part']) )
select 'create view '||tname||' as from '''||path||tname||'.parquet'';' from ta, tb;
.output
.read creaview.txt
.timer on
.output res2.txt
.read qs.txt
.output
执行方法和前面的类似
duckdb < parquetpch.txt
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows │
└─────────┘
Run Time (s): real 0.047 user 0.031250 sys 0.015625
Run Time (s): real 0.023 user 0.031250 sys 0.000000
Run Time (s): real 0.043 user 0.062500 sys 0.000000
如果需要生成parquet文件,可以修改脚本,用自动生成的copy 表 to parquet文件名
命令来批量生成parquet文件。具体选项参阅官方文档,
对于超过内存的tpch规模数据,可以分步生成parquet文件,例如:
import duckdb
import pathlib
for x in range(0, sf) :
con=duckdb.connect()
con.sql('PRAGMA disable_progress_bar;SET preserve_insertion_order=false')
con.sql(f"CALL dbgen(sf={sf} , children ={sf}, step = {x})")
for tbl in ['nation','region','customer','supplier','lineitem','orders','partsupp','part'] :
pathlib.Path(f'{sf}/{tbl}').mkdir(parents=True, exist_ok=True)
con.sql(f"COPY (SELECT * FROM {tbl}) TO '{sf}/{tbl}/{x:02d}.parquet' ")
con.close()