前面测试了数种单机版分析型数据库,
databend是rust编写的分析型数据库,它不支持访问指定parquet文件,而要用stage访问系统生成的文件。如下所示,有些繁琐。
create stage lake;
copy into @lake from(select i::int i from range(1,1000000000)t(i))file_format=(type=parquet);
SELECT avg(i) FROM @lake (pattern => '.*parquet') group by round(log10(i)) ;
所以改用databend python模块来测试。它可以把指定的parquet文件注册成一个别名,以后都用此别名来访问。
编写/par/config2.toml.txt,内容如下:
[meta]
embedded_dir = "/par/"
# Storage config.
[storage]
# fs | s3 | azblob | obs | oss
type = "fs"
allow_insecure = true
[storage.fs]
data_path = "/par/"
allow_insecure = true
用pip install databend
安装,然后编写python脚本q1.py, 如下:(注册部分只能运行一次,否则报错RuntimeError: DataFrame collect error: ViewAlreadyExists. Code: 2306, Text = 'lineitem' as view Already Exists.
)
import databend
databend.init_service( config = "/par/config2.toml.txt" )
from databend import SessionContext
ctx = SessionContext()
ctx.register_parquet("lineitem", "/par/fire/data/tables_pyarrow/scale-10.0/", pattern = "lineitem.parquet") #only once
#df = ctx.sql("select count(*) from lineitem")
#df.show()
q1="""
SELECT
l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM
lineitem
WHERE
l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
"""
import time
t=time.time();df = ctx.sql(q1);df.show();print(time.time()-t)
q3="""
select
l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from
customer c,
orders o,
lineitem l
where c_mktsegment = 'BUILDING' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date'1995-03-15'
and l_shipdate > date'1995-03-15' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10;
"""
ctx.register_parquet("customer", "/par/fire/data/tables_pyarrow/scale-10.0/", pattern = "customer.parquet") #only once
ctx.register_parquet("orders", "/par/fire/data/tables_pyarrow/scale-10.0/", pattern = "orders.parquet") #only once
t=time.time();df = ctx.sql(q3);df.show();print(time.time()-t)
然后这样执行
RUST_BACKTRACE=1 python /par/q1.py
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ l_returnflag │ l_linestatus │ sum_qty │ sum_base_price │ sum_disc_price │ sum_charge │ avg_qty │ avg_price │ avg_disc │ count_order │
│ String │ String │ Decimal(18, 2) NULL │ Decimal(18, 2) NULL │ Decimal(38, 4) NULL │ Decimal(38, 6) NULL │ Decimal(24, 8) NULL │ Decimal(24, 8) NULL │ Decimal(24, 8) NULL │ UInt64 │
├──────────────┼──────────────┼─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼─────────────┤
│ 'A' │ 'F' │ 113243256.00 │ 169777698133.48 │ 161288698452.0115 │ 167740893756.439349 │ 25.50474957 │ 38237.48827635 │ 0.05000161 │ 4440085 │
│ 'N' │ 'F' │ 2953690.00 │ 4424757845.20 │ 4203689870.8100 │ 4372096041.072415 │ 25.53924239 │ 38258.91109785 │ 0.04997985 │ 115653 │
│ 'N' │ 'O' │ 222980543.00 │ 334387558404.46 │ 317670374370.3902 │ 330383821608.062345 │ 25.50002630 │ 38240.51829775 │ 0.04999051 │ 8744326 │
│ 'R' │ 'F' │ 113357470.00 │ 170002054868.07 │ 161505271120.8840 │ 167964089577.052314 │ 25.51100682 │ 38258.82476794 │ 0.04998878 │ 4443473 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
0.8978915214538574
┌──────────────────────────────────────────────────────────────────┐
│ l_orderkey │ revenue │ o_orderdate │ o_shippriority │
│ Int64 │ Decimal(38, 4) NULL │ Date │ Int32 │
├────────────┼─────────────────────┼──────────────┼────────────────┤
│ 2113221 │ 453119.3884 │ '1995-02-19' │ 0 │
│ 9210272 │ 444593.0049 │ '1995-02-18' │ 0 │
│ 8443557 │ 435105.0808 │ '1995-03-12' │ 0 │
│ 15619200 │ 431264.0142 │ '1995-02-23' │ 0 │
│ 12255266 │ 424725.6722 │ '1995-02-28' │ 0 │
│ 11568359 │ 417991.5783 │ '1995-03-13' │ 0 │
│ 8823430 │ 411455.5585 │ '1995-03-11' │ 0 │
│ 12096992 │ 404105.0420 │ '1995-03-12' │ 0 │
│ 8058883 │ 403450.3528 │ '1995-03-05' │ 0 │
│ 7862021 │ 401060.8511 │ '1995-03-09' │ 0 │
└──────────────────────────────────────────────────────────────────┘
1.282097339630127
我的数据实际上是sf=3,lineitem约1800万行。
┌──────────┐
│ COUNT(*) │
│ UInt64 │
├──────────┤
│ 17996609 │
└──────────┘
这个速度和前几种数据库处于同一水平。