Python and zlib: Very slow decompression of concatenated streams

The poster has an 833 MB zip file containing a large number of compressed XML streams. The existing code, which decompressed each stream in turn and wrote it to a file, took days to run; analysis showed that the original algorithm's substring handling was quadratic and therefore inefficient. The solution is to use Python's zlib Decompress class: read a fixed amount of data at a time, check the unused_data attribute to detect the end of each stream, and write each complete stream to the file once it is found. This processes the file far more efficiently and avoids the suspected read-decompress-write bottleneck.


I've been supplied with a zipped file containing multiple individual streams of compressed XML. The compressed file is 833 MB.

If I try to decompress it as a single object, I only get the first stream (about 19 KB).
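(A quick illustration with made-up data of why that happens: a decompression object stops at the first stream's end marker and parks everything after it in unused_data.)

import zlib

first = zlib.compress(b'<first/>')
second = zlib.compress(b'<second/>')
blob = first + second               # two zlib streams back to back

d = zlib.decompressobj()
print(d.decompress(blob))           # b'<first/>'; only the first stream
print(d.unused_data == second)      # True: the second stream is left over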

I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:

import zlib

outfile = open('output.xml', 'wb')

def zipstreams(filename):
    """Decompress every zlib stream found in `filename` and write it to outfile."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print("got it")
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            dat = zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1
    outfile.close()

zipstreams('payload')

This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to work!

Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833 MB (an estimated 3 GB raw) suggests that I'm doing something very wrong.

Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write-repeat bottleneck that I'm stuck with?

Thanks for any pointers or suggestions you have!

Solution

It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic: every data[i:] slice copies the entire remaining buffer, and that is not a good thing when you've got tens of thousands of streams. So let's see what we know:
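To see where the time goes, here is a stripped-down version of that pattern (my illustration, not the poster's code):

# Each data[i:] slice copies the whole remaining tail of the buffer, so
# walking an n-byte buffer this way touches O(n**2) bytes in copies alone,
# before any decompression work is done.
data = bytes(10**7)
i = 0
while i < len(data):
    tail = data[i:]      # copies len(data) - i bytes on every iteration
    i += 4096            # even though we only advance a little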

You say that the vendor states that they are

using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.

From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: http://tools.ietf.org/html/rfc1950
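As a quick sanity check of that format (my own snippet, not part of the original answer): RFC 1950 says a zlib stream opens with a CMF byte and a FLG byte whose big-endian 16-bit value is a multiple of 31, and CPython's zlib emits 0x78 0x9C at the default compression level.

import zlib

blob = zlib.compress(b'example')      # default compression level
cmf, flg = blob[0], blob[1]
print(hex(cmf), hex(flg))             # 0x78 0x9c
assert (cmf * 256 + flg) % 31 == 0    # RFC 1950: FCHECK makes the pair divisible by 31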

So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).

Python's zlib module provides a Decompress class (returned by zlib.decompressobj()) that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:

The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.

So, this is what you can do: write a loop that reads through the data, say, one block at a time (no need to read the entire 800 MB file into memory). Push each block to the Decompress object and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object, and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness); a minimal sketch follows.
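Here is that sketch (my code, not from the original answer; it assumes the file really is nothing but back-to-back raw zlib streams, with no 32-byte header and no junk in between, and the streams() helper name is made up):

import zlib

def streams(source, blocksize=2**12):
    """Yield each decompressed stream from a file of concatenated zlib streams."""
    d = zlib.decompressobj()
    out = b''
    while True:
        block = source.read(blocksize)
        if not block:                  # EOF: emit whatever we have collected
            yield out
            return
        out += d.decompress(block)
        while d.unused_data:           # a stream ended somewhere in this block
            leftover = d.unused_data
            yield out
            d = zlib.decompressobj()   # fresh object, primed with the leftovers
            out = d.decompress(leftover)

with open('payload', 'rb') as fh, open('output.xml', 'wb') as outfile:
    for xml in streams(fh):
        outfile.write(xml)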

Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.

import zlib

# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF + FLG

def findstart(header, buf, source):
    """Find `header` in bytes `buf`, reading more from `source` if necessary."""
    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return b''
        buf += more
    offset = buf.find(header)
    return buf[offset:]
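One way to discover ZHEAD (my suggestion; the answer only says to reuse the old code for this) is to scan for the first offset at which a decompression object accepts the data and take the two bytes there. Note that an unlucky byte pair can pass the header check by chance, so treat the result as a guess:

import zlib

def guess_zhead(data):
    """Hypothetical helper: return the first byte pair that parses as a zlib
    header, e.g. b'\x78\x9c' for default settings, or None if none is found."""
    for i in range(len(data) - 1):
        try:
            zlib.decompressobj().decompress(data[i:i + 64])
            return data[i:i + 2]
        except zlib.error:
            continue
    return None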

You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:

datafile = 'payload'                     # the input file named in the question
source = open(datafile, 'rb')
outfile = open('output.xml', 'wb')

skip_ = source.read(32)                  # Skip non-zlib header

buf = b''

while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print("Spurious match(?) at output offset %d." % outfile.tell(), end=' ')
        print("Skipping 2 bytes")
        buf = buf[2:]
        continue

    # Read until zlib decides it's seen a complete file
    block = buf                          # keeps the EOF test below well defined
                                         # even if the stream ended inside `buf`
    while not decomp.unused_data:
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break                        # We've reached EOF

    outfile.write(stream)
    buf = decomp.unused_data             # Save for the next stream
    if len(block) == 0:
        break                            # EOF

outfile.close()
source.close()

PS 1. If I were you I'd write each XML stream into a separate file.
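A minimal way to do that (my sketch; the file-name pattern is made up) is to swap the single outfile.write(stream) call in the loop above for a per-stream writer:

import itertools

counter = itertools.count()

def write_stream(stream):
    """Write one decompressed stream to its own numbered XML file."""
    with open('stream_%05d.xml' % next(counter), 'wb') as f:
        f.write(stream)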

PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.
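For example (hypothetical file names):

# Carve off the first megabyte of the input for quick performance tests.
with open('payload', 'rb') as src, open('payload.head', 'wb') as dst:
    dst.write(src.read(2**20))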
