Experiment: Multi-Process COPY of PostgreSQL Tables with Python

Published 2017-04-28 10:27:00

Licensed under CC 4.0 BY-SA

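This post runs Python multi-process COPY experiments against PostgreSQL tables on a 32-core i5 server, to find out whether multiple processes speed up COPY. The experiments are: 1) two processes each COPY one of two tables, 2) one process COPYs both tables, 3) multiple processes read a single table. The results show that when several processes query a single table, the gain comes mainly from the stage that transfers data into memory, while the database query itself gains little. Multi-process queries across multiple tables do help somewhat, but too many processes waste resources; no more than 3 processes per single-table query is recommended.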

Background

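As part of a code optimization effort, I needed to COPY entire tables.
I had long been curious whether multiple processes would speed up COPY, so I ran a few experiments.
Test environment: a 32-core i5 server with 200 GB of RAM.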
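The setup under test can be sketched roughly as follows. This is a minimal sketch, not the code used in the experiments: the helper names `build_copy_sql`, `export_table`, and `run_parallel` are hypothetical, it is written for Python 3, and it assumes `psycopg2` is installed and that each worker process opens its own connection (connections should not be shared across processes).

```python
import multiprocessing
import time


def build_copy_sql(table, order_by):
    """Build a COPY ... TO STDOUT statement producing pipe-delimited CSV."""
    return ("copy (select * from {0} order by {1}) "
            "to STDOUT delimiter '|' csv header").format(table, order_by)


def export_table(dsn, table, order_by, path):
    """Worker: open a private connection and stream one table into a file."""
    import psycopg2  # imported lazily so the module loads without the driver
    conn = psycopg2.connect(dsn=dsn)
    try:
        cur = conn.cursor()
        with open(path, 'w') as out:
            # copy_expert streams the COPY output directly into the file object.
            cur.copy_expert(build_copy_sql(table, order_by), out)
        cur.close()
    finally:
        conn.close()


def run_parallel(dsn, jobs):
    """Start one process per (table, order_by, path) job; return elapsed seconds."""
    start = time.time()
    procs = [multiprocessing.Process(target=export_table,
                                     args=(dsn,) + tuple(job))
             for job in jobs]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.time() - start


# Example call (hypothetical DSN; requires a reachable database):
# jobs = [('rdb_node', 'node_id_t, node_id', 'rdb_node.csv')]
# print(run_parallel('postgresql://user:password@host/dbname', jobs))
```

Timing the whole batch from the parent process, rather than inside each worker, includes connection setup and file writes, which matches the question of whether the end-to-end COPY gets faster.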

Experiment 1: two processes, each copying one of the two tables

The Python code is as follows:

import psycopg2

dsn = 'postgresql://postgres:pset123456@192.168.10.10/CHN_NAVINFO_2016Spr_0082_0002_108'

conn1 = psycopg2.connect(dsn=dsn)
conn2 = psycopg2.connect(dsn=dsn)

io1 = open('rdb_node.csv', 'w')
io2 = open('rdb_node_with_all_attri_view.csv', 'w')


sql1 = """copy (select * from rdb_node order by node_id_t, node_id) to STDOUT delimiter '|' csv header"""
sql2 = """copy (select * from rdb_node_with_all_attri_view order by node_id_t, node_id) to STDOUT delimiter '|' csv header"""


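def table_size(table_name, c):
    # Return the size of a relation as a human-readable string.
    cur = c.cursor()
    # Pass the name as a bound parameter instead of formatting it into the SQL.
    cur.execute("select pg_size_pretty(pg_relation_size(%s));", (table_name,))
    s = cur.fetchone()[0]
    cur.close()
    return s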

print 'rdb_node size:', table_size('rdb_node', conn1)
print 'rdb_node_with_all_attri_view:', table_size('rdb_node_with_all_attr