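Data source: the lineitem table of a TPC-H sf=0.1 database.
The methods below are ranked from fastest to slowest by total time (conversion plus write):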
import duckdb
import pandas
import polars
import pyarrow
from pyarrow import csv
import time
con = duckdb.connect("tpch_duckdb")
#sql copy to
t=time.time();con.sql("copy lineitem to 'copyto.csv'");print(time.time()-t)
0.7691347599029541
# pyarrow.csv.write_csv
t=time.time();ar=con.sql("from lineitem").arrow();print(time.time()-t)
0.28513407707214355
t=time.time();pyarrow.csv.write_csv(ar,"arout.csv");print(time.time()-t)
0.591275691986084
#sql write_csv
t=time.time();con.sql("from lineitem").write_csv("out.csv");print(time.time()-t)
0.9410018920898438
# polars write_csv
t=time.time();pl=con.sql("from lineitem").pl();print(time.time()-t)
0.5547764301300049
t=time.time();pl.write_csv("pl.csv");print(time.time()-t)
0.6685678958892822
# pandas to_csv
t=time.time();df=con.sql("from lineitem").df();print(time.time()-t)
0.8767569065093994
t=time.time();df.to_csv("df.csv",index=None);print(time.time()-t)
6.194610118865967
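The ranking can be checked by adding the conversion and write time of each method. The sketch below only re-uses the numbers printed above; the dictionary keys are labels chosen here for illustration.

```python
# Seconds, copied from the measurements above (DuckDB default thread count).
# Methods with two entries need a conversion step before the write.
timings = {
    "sql copy to": [0.7691347599029541],
    "pyarrow.csv.write_csv": [0.28513407707214355, 0.591275691986084],
    "sql write_csv": [0.9410018920898438],
    "polars write_csv": [0.5547764301300049, 0.6685678958892822],
    "pandas to_csv": [0.8767569065093994, 6.194610118865967],
}

# Total time per method, then sort ascending (fastest first).
totals = {name: sum(steps) for name, steps in timings.items()}
ranking = sorted(totals, key=totals.get)

for name in ranking:
    print(f"{name:<24}{totals[name]:.3f}s")
```

On these single runs the first four methods all finish within about 1.3 seconds, while pandas takes roughly 7 seconds, almost all of it in to_csv.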
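I assumed the fast tools were fast mainly because they use multiple threads. However, with DuckDB set to a single thread, most steps ran just as fast or faster: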
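Every number in this post comes from a single run, so small differences may be noise. A minimal sketch of a more robust measurement, assuming a hypothetical helper named best_of (not part of any library used here), repeats a step and keeps the shortest time:

```python
import time

def best_of(fn, repeats=3):
    """Call fn() `repeats` times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload so the sketch runs without a database.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.6f}s")
```

With the connection used in this post, a call such as best_of(lambda: con.sql("copy lineitem to 'copyto.csv'")) would time the same statement as the transcript does.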
con = duckdb.connect("tpch_duckdb",config = {'threads': 1})
t=time.time();con.sql("copy lineitem to 'copyto.csv'");print(time.time()-t)
0.7604503631591797
t=time.time();con.sql("from lineitem").write_csv("out.csv");print(time.time()-t)
0.6541898250579834
t=time.time();ar=con.sql("from lineitem").arrow();print(time.time()-t)
0.14452648162841797
t=time.time();pyarrow.csv.write_csv(ar,"arout.csv");print(time.time()-t)
0.5979902744293213
t=time.time();pl=con.sql("from lineitem").pl();print(time.time()-t)
0.1778249740600586
t=time.time();pl.write_csv("pl.csv");print(time.time()-t)
0.37970709800720215
t=time.time();df=con.sql("from lineitem").df();print(time.time()-t)
0.43950891494750977
t=time.time();df.to_csv("df.csv",index=None);print(time.time()-t)
6.099819183349609
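To see which steps actually changed, the two runs can be compared step by step. The sketch below only re-uses the numbers printed above; the step labels are chosen here for illustration. Note that the pyarrow, polars and pandas writers run outside DuckDB and are not governed by its threads setting, so changes in those steps presumably reflect run-to-run variation or other effects.

```python
# (default threads, threads=1) in seconds, copied from the two runs above.
runs = {
    "copy to": (0.7691347599029541, 0.7604503631591797),
    "write_csv": (0.9410018920898438, 0.6541898250579834),
    ".arrow()": (0.28513407707214355, 0.14452648162841797),
    "pyarrow write_csv": (0.591275691986084, 0.5979902744293213),
    ".pl()": (0.5547764301300049, 0.1778249740600586),
    "polars write_csv": (0.6685678958892822, 0.37970709800720215),
    ".df()": (0.8767569065093994, 0.43950891494750977),
    "pandas to_csv": (6.194610118865967, 6.099819183349609),
}

# Ratio above 1 means the single-thread run was faster.
speedup = {step: default / single for step, (default, single) in runs.items()}
for step, ratio in speedup.items():
    print(f"{step:<20}{ratio:.2f}x")

# Steps that were slower with threads=1.
slower = [step for step, ratio in speedup.items() if ratio < 1]
print("slower with one thread:", slower)
```

The conversions out of DuckDB (.arrow(), .pl(), .df()) show the largest gains, while copy to and pandas to_csv are essentially unchanged.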
Addendum:
fireducks supports multithreading and is compatible with pandas, but it has no arm64 Linux build, so I tested it on amd64 and compared it with pandas.
import time
import duckdb
con = duckdb.connect("/par/tpch_duckdb")
t=time.time();con.sql("copy lineitem to 'copyto.csv'");print(time.time()-t)
0.5453505516052246
import fireducks.pandas
t=time.time();fd=fireducks.pandas.read_csv('copyto.csv');print(time.time()-t)
0.014633893966674805
t=time.time();fd.to_csv('fireduck.csv');print(time.time()-t)
1.0539100170135498
import pandas
t=time.time();pd=pandas.read_csv('copyto.csv');print(time.time()-t)
2.0135209560394287
t=time.time();pd.to_csv('pandas.csv');print(time.time()-t)
4.81012487411499
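FireDucks documents a lazy execution model, so the near-instant read_csv above likely defers the actual parsing until to_csv needs the data. Under that assumption, comparing read plus write totals is probably fairer than comparing the steps one by one. The sketch below only re-uses the numbers printed above.

```python
# Seconds, copied from the amd64 measurements above.
fireducks_times = {"read_csv": 0.014633893966674805, "to_csv": 1.0539100170135498}
pandas_times = {"read_csv": 2.0135209560394287, "to_csv": 4.81012487411499}

# End-to-end time: read the CSV, then write it back out.
fd_total = sum(fireducks_times.values())
pd_total = sum(pandas_times.values())
ratio = pd_total / fd_total

print(f"fireducks total: {fd_total:.3f}s")
print(f"pandas total:    {pd_total:.3f}s")
print(f"pandas / fireducks: {ratio:.1f}x")
```

On this single run, the fireducks round trip is a little over six times faster than pandas end to end.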




