Retrieving millions of rows from MySQL

There are times when your query returns a very large number of rows. If you use the default cursor, chances are your process will be killed while retrieving them. The reason is that, by default, MySQL clients (e.g. the Java connector or the Python driver) retrieve all rows and buffer them in memory before passing the result set to your code. If the process runs out of memory while doing that, it will almost certainly be killed.

The fix is to use a streaming result set. In Python, you can use MySQLdb.cursors.SSCursor for this purpose.

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(...)              # connection parameters omitted
cursor = MySQLdb.cursors.SSCursor(conn)  # unbuffered (streaming) cursor
cursor.execute(...)                      # the query that returns millions of rows
while True:
    row = cursor.fetchone()              # fetch one row at a time
    if not row:
        break
    ...                                  # process the row

There are two important things to remember here:
  1. Use an SSCursor instead of the default cursor. This can be done as shown above, or by passing the class to the cursor() call, e.g. conn.cursor(MySQLdb.cursors.SSCursor); see the sketch after this list.
  2. Use fetchone to fetch rows from the result set, one row at a time. Do not use fetchall. You can use fetchmany, but with an SSCursor it is effectively the same as calling fetchone that many times.
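
For concreteness, here is a minimal sketch of both points, using the cursor() construction style together with a fetchmany() loop; the connection parameters, table, and column names are made up for illustration:

import MySQLdb
import MySQLdb.cursors

# Hypothetical connection parameters and query, for illustration only.
conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="warehouse")

# Passing the class to cursor() is equivalent to MySQLdb.cursors.SSCursor(conn).
cursor = conn.cursor(MySQLdb.cursors.SSCursor)
cursor.execute("SELECT id, payload FROM big_table")

# fetchmany() on an SSCursor still pulls rows from the stream one by one;
# it only changes how many rows you get back per call.
while True:
    rows = cursor.fetchmany(1000)
    if not rows:
        break
    for row in rows:
        pass  # process each row quickly

cursor.close()
conn.close()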
One common misconception is to treat SSCursor as a server-side cursor. It is not! This class is in fact only an unbuffered cursor. It does not read the whole result set into memory the way the default cursor does (hence a buffered cursor). Instead, it reads from the response stream in chunks and returns records to you one at a time. A more appropriate name for this is a streaming result set.

Because SSCursor is only an unbuffered cursor (I repeat, not a real server-side cursor), several restrictions apply to it:
  1. You must read ALL records. The rationale is that you send one query, and the server replies with one answer, albeit a really long one. Therefore, before you can do anything else, even a simple ping, you must completely consume this response.
  2. This brings another restriction: you must process each row quickly. If your processing takes even half a second per row, you will find your connection dropped unexpectedly with error 2013, "Lost connection to MySQL server during query." The reason is that, by default, MySQL waits at most 60 seconds for a socket write to finish. The server is trying to push a large amount of data down the wire, yet the client is taking its time processing chunk by chunk, so the server is likely to just give up. You can increase this timeout by issuing the query SET NET_WRITE_TIMEOUT = xx, where xx is the number of seconds MySQL will wait for a socket write to complete. But please do not rely on that as a workable remedy. Fix your processing instead. Or, if you cannot reduce processing time any further, quickly dump the rows somewhere local to finish the query first, and then read them back later at a more leisurely pace (see the sketch after this list).
  3. The first restriction also means that your connection is completely tied up while you are retrieving the rows. There is no way around it. If you need to run another query in parallel, do it on another connection. Otherwise, you will get error 2014, "Commands out of sync; you can't run this command now."
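
To make the second point concrete, here is a rough sketch of the spill-to-disk workaround: it raises net_write_timeout as a safety margin and drains the result set into a local CSV file so the slow per-row processing can happen afterwards, off the connection. The timeout value, file name, connection parameters, and query are assumptions made for the example:

import csv

import MySQLdb
import MySQLdb.cursors

# Hypothetical connection parameters and query, for illustration only.
conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="warehouse")
cursor = conn.cursor(MySQLdb.cursors.SSCursor)

# Give the server more time to wait on slow socket writes (seconds).
# This is only a safety margin, not a substitute for consuming rows quickly.
cursor.execute("SET NET_WRITE_TIMEOUT = 300")

cursor.execute("SELECT id, payload FROM big_table")

# Drain the result set as fast as possible into a local file, then do the
# slow per-row processing from that file without holding up the connection.
with open("rows.csv", "w", newline="") as f:
    writer = csv.writer(f)
    while True:
        row = cursor.fetchone()
        if not row:
            break
        writer.writerow(row)

cursor.close()
conn.close()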

I hope this post will help some of you.

Reposted from: https://techualization.blogspot.sg/2011/12/retrieving-million-of-rows-from-mysql.html (using Python to iterate over a MySQL table with tens of millions of rows)
