zip Compression Algorithm: Source Code Analysis (1)

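This article walks through the main idea of the LZ77 compression algorithm, including the replacement of repeated phrases with triples and the optimizations gzip applies on top of it. By analyzing the gzip source, in particular the key functions lm_init, fill_window and longest_match in deflate.c, it shows how the algorithm works internally: the unusual hash table implementation, the lazy match technique, and the match search strategy.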

The LZ77 Compression Algorithm

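Reference 1 (fairly academic)
Reference 2 (fairly comprehensive)
gzip source download link
All the figures here are my own work, so apologies if they are hard to read on the article page; to see the details,
open a figure in a new tab, where it will be much clearer.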

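I. Main idea
LZ77 compresses phrases that occur repeatedly in the input. Each time a new character is read, the algorithm scans back through the text already seen to check whether the phrase starting there has occurred before. If it has, the matched string is replaced by a triple of the form
(character, distance, length), as shown in the figure. (In the classic LZ77 formulation the character is the literal that follows the match; deflate itself emits either a literal byte or a (length, distance) pair.)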

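To keep compression fast, LZ77 limits how far back it searches: all match finding takes place inside a "sliding window".
gzip adds a number of optimizations on top of LZ77. Most of this happens in the C file deflate.c (the algorithm as a whole is called deflate, as in letting the air out of a tire until it goes flat; I am not sure whether that reading is right).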
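
The search described above can be sketched in a few lines of C. This is a brute-force illustration only, not gzip's code: the function name lz77_find and its parameters are assumptions made for this example, and gzip replaces the inner scan with the hash chains discussed later.

```c
#include <stddef.h>

/* Hypothetical helper, not from gzip: brute-force search of the `window`
 * bytes before `pos` for the longest repeat of the text starting at `pos`.
 * Returns the match length and stores the backward distance in *dist. */
size_t lz77_find(const unsigned char *buf, size_t n, size_t pos,
                 size_t window, size_t *dist)
{
    size_t best = 0;
    size_t start = pos > window ? pos - window : 0;
    *dist = 0;
    for (size_t cand = start; cand < pos; cand++) {
        size_t len = 0;
        /* A match may run past pos (overlap), which LZ77 allows. */
        while (pos + len < n && buf[cand + len] == buf[pos + len])
            len++;
        if (len > best) {
            best = len;
            *dist = pos - cand;
        }
    }
    return best;
}
```

For the input "abcabcabcd" at position 3 this reports a match of length 6 at distance 3; the match overlaps the position being encoded.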

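II. Analysis of deflate.c
I will go through the source in sections and, for the sake of a clear narrative, in a different order from the source file.
(i.) First, the comment at the top of the file (I was in an all-English class after all, so consider this English practice):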

/*
 *  PURPOSE
 *
 *      Identify new text as repetitions of old text within a fixed-
 *      length sliding window trailing behind the new text.
 *      [Note: the main job here is string matching inside a fixed-length
 *      sliding window, producing the first stage of compressed output.]
 *
 *  DISCUSSION
 *
 *      The "deflation" process depends on being able to identify portions
 *      of the input text which are identical to earlier input (within a
 *      sliding window trailing behind the input currently being processed).
 *      [Note: deflation compresses by exploiting strings that repeat
 *      earlier text.]
 * 
 *      The most straightforward technique turns out to be the fastest for
 *      most input files: try all possible matches and select the longest.
 *      [Note: finding every candidate match and keeping the longest, the most
 *      direct brute-force approach, turns out to be the fastest on most inputs.]
 *      The key feature of this algorithm is that insertions into the string
 *      dictionary are very simple and thus fast, and deletions are avoided
 *      completely. 
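 *      [Note: the key property is that inserting a new entry into the
 *      dictionary (the already-processed part of the sliding window, indexed
 *      by a hash table) is simple and fast, and deletions are avoided entirely.]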
 *      Insertions are performed at each input character, whereas
 *      string matches are performed only when the previous match ends. So it
 *      is preferable to spend more time in matches to allow very fast string
 *      insertions and avoid deletions. 
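 *      [Note: an insertion happens for every input character, whereas a match
 *      search starts only when the previous match has ended. So it pays to
 *      spend more time matching in exchange for very fast insertions.]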
 *     The matching algorithm for small
 *      strings is inspired from that of Rabin & Karp. A brute force approach
 *      is used to find longer strings when a small match has been found.
 *      A similar algorithm is used in comic (by Jan-Mark Wams) and freeze
 *      (by Leonid Broukhis).
 *      A previous version of this file used a more sophisticated algorithm
 *      (by Fiala and Greene) which is guaranteed to run in linear amortized
 *      time, but has a larger average cost, uses more memory and is patented.
 *      However the F&G algorithm may be faster for some highly redundant
 *      files if the parameter max_chain_length (described below) is too large.
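 *      [Note: the passage above covers where the techniques come from and how
 *      this version compares with earlier ones, so it is left without comment.]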
 *
 *  ACKNOWLEDGEMENTS
 *
 *      The idea of lazy evaluation of matches is due to Jan-Mark Wams, and
 *      I found it in 'freeze' written by Leonid Broukhis.
 *      Thanks to many info-zippers for bug reports and testing.
 *
 *  REFERENCES
 *
 *      APPNOTE.TXT documentation file in PKZIP 1.93a distribution.
 *
 *      A description of the Rabin and Karp algorithm is given in the book
 *         "Algorithms" by R. Sedgewick, Addison-Wesley, p252.
 *
 *      Fiala,E.R., and Greene,D.H.
 *         Data Compression with Finite Windows, Comm.ACM, 32,4 (1989) 490-595
 *      [Note: likewise, the references above are left without comment.]
 * 
 *  INTERFACE 
 *      [Note: two interface functions]
 *      void lm_init (int pack_level, ush *flags)
 *          Initialize the "longest match" routines for a new file
 *        [Note: initialization; starts the LZ77 compression process.]
 *      ulg deflate (void)
 *          Processes a new input file and return its compressed length. Sets
 *          the compressed length, crc, deflate flags and internal file
 *          attributes.
 *        [Note: the main LZ77 compression loop lives in this function.]
 */

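(ii.) Next, the function void lm_init (int pack_level, ush *flags):
The initialization of the hash table may look puzzling at first. The table is not built from separately allocated list nodes, nor does it use open addressing;
it consists of two arrays, which saves space and avoids allocating and freeing nodes.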

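As the figure shows, the hash table consists of two arrays, head[] and prev[]. prev[] has WSIZE = 0x8000 = 32K entries, and head[] is sized by HASH_SIZE in the code below.
Together the two arrays form many chains: all strings that a given hash function maps to the same hash code make up one chain.
head[] holds the window position of the string at the head of each chain, and prev[] records the links between strings. How the chains are built is explained later.
Incidentally, this way of building chains closely resembles the memory pool in the Boost library.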
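
To make the two-array layout concrete, here is a minimal sketch of inserting positions into a chain and walking it. The array names and sizes follow the description above; the helper functions chains_reset, chain_insert and chain_length are hypothetical names introduced for this example and do not appear in gzip.

```c
#include <string.h>

#define WSIZE     0x8000u          /* window size, as in the article */
#define WMASK     (WSIZE - 1)
#define HASH_BITS 15
#define HASH_SIZE (1u << HASH_BITS)
#define NIL       0                /* end-of-chain marker */

typedef unsigned short Pos;

static Pos head[HASH_SIZE];  /* hash code -> most recent position */
static Pos prev[WSIZE];      /* position -> previous position, same hash */

/* Hypothetical helper: empty every chain. */
void chains_reset(void) { memset(head, 0, sizeof head); }

/* Hypothetical helper: link position s into the chain for hash h.
 * Returns the old chain head (the most recent earlier position). */
unsigned chain_insert(unsigned h, unsigned s)
{
    unsigned old = head[h];
    prev[s & WMASK] = (Pos)old;
    head[h] = (Pos)s;
    return old;
}

/* Hypothetical helper: count the entries reachable from the head of chain h. */
unsigned chain_length(unsigned h)
{
    unsigned n = 0;
    for (unsigned p = head[h]; p != NIL; p = prev[p & WMASK]) n++;
    return n;
}
```

Because NIL is 0, window position 0 can never sit on a chain, which is why the gzip comments mention avoiding matches with the string at window index 0. No node is ever allocated or freed: a new entry simply overwrites head[h] and one slot of prev[].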

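Second, the match search uses the lazy match technique: if the current match is not long enough, the
strstart pointer is advanced one more character to see whether the match found there is better.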

/* ===========================================================================
 * Initialize the "longest match" routines for a new file
 */
void lm_init (pack_level, flags)
    int pack_level; /* 0: store, 1: best speed, 9: best compression */
    ush *flags;     /* general purpose bit flag */
{
    register unsigned j;

    if (pack_level < 1 || pack_level > 9) error("bad pack level");
    compr_level = pack_level;

    /* Initialize the hash table. 
    *  (clears the head array)
    */
#if defined(MAXSEG_64K) && HASH_BITS == 15
    for (j = 0;  j < HASH_SIZE; j++) head[j] = NIL;
#else
    memzero((char*)head, HASH_SIZE*sizeof(*head));
#endif
    /* prev will be initialized on the fly 
    *  (prev is filled in later, as strings are inserted)
    */

   // Four configuration parameters; see Note 1 below the function.
    max_lazy_match   = configuration_table[pack_level].max_lazy;
    good_match       = configuration_table[pack_level].good_length;
#ifndef FULL_SEARCH
    nice_match       = configuration_table[pack_level].nice_length;
#endif
    max_chain_length = configuration_table[pack_level].max_chain;
    if (pack_level == 1) {
       *flags |= FAST;
    } else if (pack_level == 9) {
       *flags |= SLOW;
    }
    /* ??? reduce max_chain_length for binary files */
    // Initialize the strstart and block_start positions
    strstart = 0;
    block_start = 0L;
#ifdef ASMV
    match_init(); /* initialize the asm code */
#endif
    // Fill the lookahead area; for read_buf see Note 2 below.
    lookahead = read_buf((char*)window,
			 sizeof(int) <= 2 ? (unsigned)WSIZE : 2*WSIZE);
    // If nothing could be read (empty input or a read error), read_buf returns 0 or EOF
    if (lookahead == 0 || lookahead == (unsigned)EOF) {
       eofile = 1, lookahead = 0;
       return;
    }
    // Input is available
    eofile = 0;
    /* Make sure that we always have enough lookahead. This is important
     * if input comes from a device such as a tty.
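     * [Note: for input from a device such as a tty, it is important that the
     * lookahead area always holds enough data. fill_window() is what writes
     * data into the sliding window; it is one of the longer and more important
     * functions in this file and is analyzed separately later.
     * For the constant MIN_LOOKAHEAD see Note 3.]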
     */
    while (lookahead < MIN_LOOKAHEAD && !eofile) fill_window();
	
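    // Initialize the hash code using the UPDATE_HASH macro. The hashing scheme is interesting
    // and matters for understanding the whole algorithm; see Note 4.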
    ins_h = 0;
    for (j=0; j<MIN_MATCH-1; j++) UPDATE_HASH(ins_h, window[j]);
    /* If lookahead < MIN_MATCH, ins_h is garbage, but this is
     * not important since only literal bytes will be emitted.
     */
}

@Note 1.

 Four global configuration parameters that trade speed against compression:
  1. max_lazy_match 
  Definition: local unsigned int max_lazy_match
  Attempt to find a better match only when the current match is strictly
  smaller than this value. This mechanism is used only for compression
  levels >= 4.
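  Decides whether to attempt a lazy match: if the current match is shorter than this value it is considered unsatisfactory and a lazy match is tried; otherwise it is considered good enough and
  the attempt is skipped. This mechanism only applies at compression levels >= 4.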
  2. good_match
  unsigned near good_match
  Use a faster search when the previous match is longer than this
  When the previous match is longer than this value, the search is sped up by shortening the chain limit (see longest_match for
  the details).
  3. nice_match
  Its definition depends on whether the macro FULL_SEARCH is defined:
  #ifdef  FULL_SEARCH
  # define nice_match MAX_MATCH
  (MAX_MATCH = 258, MIN_MATCH = 3; see gzip.h)
  #else
  int near nice_match; 
  #endif
  Stop searching when current match exceeds this
  Once a match exceeds this length it is considered good enough and the search stops.
  4. max_chain_length
  unsigned near max_chain_length;
  To speed up deflation, hash chains are never searched beyond this length.
  A higher limit improves compression ratio but degrades the speed.
  Speeds up deflation: a search never visits more chain entries than this value.
  
  There is also a configuration_table array, indexed by compression level:
  local config configuration_table[10] = {
  /*      good lazy nice chain */
  /* 0 */ {0,    0,   0,    0},   /* store only */
  /* 1 */ {4,    4,   8,    4},   /* maximum speed, no lazy matches */
  /* 2 */ {4,    5,  16,    8},
  /* 3 */ {4,    6,  32,   32},
  /* 4 */ {4,    4,  16,   16},   /* lazy matches */
  /* 5 */ {8,   16,  32,   32},
  /* 6 */ {8,   16, 128,  128},
  /* 7 */ {8,   32, 128,  256},
  /* 8 */ {32, 128, 258, 1024},
  /* 9 */ {32, 258, 258, 4096}};  /* maximum compression */

@Note 2.

The real body of read_buf is in fact the function file_read in zip.c.
/* ===========================================================================
 * Read a new buffer from the current input file, perform end-of-line
 * translation, and update the crc and input file size.
 * IN assertion: size >= 2 (for end-of-line translation)
 * [Note: reads a new buffer from the current input and updates the CRC (cyclic redundancy check,
 * a common error-detecting code) and the input size.]
 */
int file_read(buf, size)
    char *buf;
    unsigned size;
{
    unsigned len;

    Assert(insize == 0, "inbuf not empty");

    // Try to read up to size bytes from the file behind descriptor ifd into buf
    len = read(ifd, buf, size);
    if (len == (unsigned)(-1) || len == 0) return (int)len;
    // crc is a global long holding the running CRC of the uncompressed input
    crc = updcrc((uch*)buf, len);
    isize += (ulg)len;
    return (int)len;
}

@Note 3

The constant MIN_LOOKAHEAD is defined in gzip.h as follows:
#define MIN_LOOKAHEAD (MAX_MATCH+MIN_MATCH+1)
/* Minimum amount of lookahead, except at the end of the input file.
 * See deflate.c for comments about the MIN_MATCH+1.
 */

#define MAX_DIST  (WSIZE-MIN_LOOKAHEAD)
/* In order to simplify the code, particularly on 16 bit machines, match
 * distances are limited to MAX_DIST instead of WSIZE.
 */
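 Here MAX_MATCH = 258 and MIN_MATCH = 3.
 A pair of strings that agree on fewer than 3 characters does not count as a match, because the triple
 (character, distance, length)
 itself takes up space comparable to such a short string, so forcing the replacement would make the data larger rather than smaller.
Sliding window layout (figure)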


As the figure shows, MIN_LOOKAHEAD is the minimum amount of data the lookahead area should hold; when it holds less, fill_window is called.
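
Plugging the numbers into the definitions quoted above gives MIN_LOOKAHEAD = 258 + 3 + 1 = 262 and MAX_DIST = 32768 - 262 = 32506. The short sketch below restates the macros and wraps the refill condition in a function; needs_fill is a hypothetical name used only for this illustration.

```c
#define WSIZE         0x8000
#define MIN_MATCH     3
#define MAX_MATCH     258
#define MIN_LOOKAHEAD (MAX_MATCH + MIN_MATCH + 1)   /* 262 */
#define MAX_DIST      (WSIZE - MIN_LOOKAHEAD)       /* 32506 */

/* Hypothetical helper: the condition under which the code calls
 * fill_window(), i.e. too little lookahead and input not yet exhausted. */
int needs_fill(unsigned lookahead, int eofile)
{
    return lookahead < MIN_LOOKAHEAD && !eofile;
}
```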

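@Note 4.
The macros UPDATE_HASH and INSERT_STRING together form the hash function. The hash value is computed incrementally:
as strstart advances, the hash of the new string is obtained by updating the previous value.
The hash of any string is determined by its first 3 characters. Why 3? Because MIN_MATCH = 3.
Any two strings that share their first 3 characters are a match, so matching strings always hash to the same chain, although strings in the same chain are not necessarily
matches of one another.
The code follows (I have swapped the order of the two macro definitions to suit my explanation).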

/* ===========================================================================
 * Insert string s in the dictionary and set match_head to the previous head
 * of the hash chain (the most recent string with same hash key). Return
 * the previous length of the hash chain.
 * IN  assertion: all calls to to INSERT_STRING are made with consecutive
 *    input characters and the first MIN_MATCH bytes of s are valid
 *    (except for the last MIN_MATCH-1 bytes of the input file).
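 * [Note] Parameters:
 * s is the position of the current character c, i.e. strstart;
 * match_head receives the position of the previous chain head.
 * Summary:
 * the mask WMASK = WSIZE - 1 maps position s to its slot in the prev array.
 * The macro maintains this mapping:
 * prev[(position s of c) & WMASK] = window position of the preceding string in the same chain
 * head[(hash code ins_h of the string starting at c)] = position of c in the window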
 */
#define INSERT_STRING(s, match_head) \
   (UPDATE_HASH(ins_h, window[(s) + MIN_MATCH-1]), \
    prev[(s) & WMASK] = match_head = head[ins_h], \
    head[ins_h] = (s))
    
/* ===========================================================================
 * Update a hash value with the given input byte
 * IN  assertion: all calls to to UPDATE_HASH are made with consecutive
 *    input characters, so that a running hash key can be computed from the
 *    previous key instead of complete recalculation each time.
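 *    [Note] Parameters: h is the old hash code; c is the character window[s + 2] passed in from INSERT_STRING.
 *    Summary: let the hash code be HASH_BITS bits wide. H_SHIFT is defined as
 *    (HASH_BITS + MIN_MATCH - 1) / MIN_MATCH,
 *    i.e. HASH_BITS divided by 3 and rounded up, for the following reason.
 *    With strstart = s, the hash code must depend only on
 *    window[s], window[s + 1] and window[s + 2], so the contribution of every older byte
 *    window[s - 1], window[s - 2], ... has to be shifted out of the HASH_BITS-bit value.
 *    Suppose each update shifts left by r bits. A byte is shifted once by each later update,
 *    so after MIN_MATCH further updates it has moved MIN_MATCH * r bits and must be gone:
 *    MIN_MATCH * r >= HASH_BITS.
 *    The smallest such r is HASH_BITS / MIN_MATCH rounded up, which is
 *    r = (HASH_BITS + MIN_MATCH - 1) / MIN_MATCH.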
 */
#define UPDATE_HASH(h,c) (h = (((h)<<H_SHIFT) ^ (c)) & HASH_MASK)
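
The claim that the running hash depends only on the last three bytes can be checked directly. The sketch below assumes HASH_BITS = 15 (so H_SHIFT = 5) and rewrites UPDATE_HASH as a function; hash3 and rolling_hash are hypothetical helpers added for this illustration.

```c
#define HASH_BITS 15
#define HASH_MASK ((1u << HASH_BITS) - 1)
#define MIN_MATCH 3
#define H_SHIFT   ((HASH_BITS + MIN_MATCH - 1) / MIN_MATCH)   /* = 5 */

/* Same formula as UPDATE_HASH, written as a function. */
unsigned update_hash(unsigned h, unsigned char c)
{
    return ((h << H_SHIFT) ^ c) & HASH_MASK;
}

/* Hypothetical helper: hash of exactly the three bytes at p, from scratch. */
unsigned hash3(const unsigned char *p)
{
    unsigned h = 0;
    for (int i = 0; i < MIN_MATCH; i++) h = update_hash(h, p[i]);
    return h;
}

/* Hypothetical helper: roll the hash across buf[0..n-1] and return the
 * value reached after the last byte (n >= MIN_MATCH). */
unsigned rolling_hash(const unsigned char *buf, unsigned n)
{
    unsigned h = 0;
    for (unsigned i = 0; i < n; i++) h = update_hash(h, buf[i]);
    return h;
}
```

Rolling over any prefix of a buffer yields the same value as hashing its final three bytes from scratch, because each older byte has been shifted left by at least 15 bits and masked away.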

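(iii.) Next, the function fill_window():
Compared with the other functions in this article, fill_window() is fairly easy to follow.
Code first; the figure after it walks through the process.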

/* ===========================================================================
 * Fill the window when the lookahead becomes insufficient.
 * Updates strstart and lookahead, and sets eofile if end of input file.
 * [Note: when the lookahead runs short, load new data into the sliding window, update strstart and lookahead, and detect end of input.]
 * IN assertion: lookahead < MIN_LOOKAHEAD && strstart + lookahead > 0
 * OUT assertions: at least one byte has been read, or eofile is set;
 *    file reads are performed for at least two bytes (required for the
 *    translate_eol option).
 */
local void fill_window()
{
    // loop counters / scratch values
    register unsigned n, m;
    unsigned more = (unsigned)(window_size - (ulg)lookahead - (ulg)strstart);
    /* Amount of free space at the end of the window. 
    *  more: free space at the end of the window; window_size = 2 * WSIZE, the total window length.
    */

    /* If the window is almost full and there is insufficient lookahead,
     * move the upper half to the lower one to make room in the upper half.
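     * [Note: if the window is nearly full and the lookahead is too short, copy the upper 32K half down to the lower half, then read new data into the upper half.]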
     */
    if (more == (unsigned)EOF) {
        /* Very unlikely, but possible on 16 bit machine if strstart == 0
         * and lookahead == 1 (input done one byte at time)
         * [Note: a corner case on 16-bit machines; no need to dig into it.]
         */
        more--;
    } else if (strstart >= WSIZE+MAX_DIST) {
        /* By the IN assertion, the window is not empty so we can't confuse
         * more == 0 with more == 64K on a 16 bit machine.
         */
        Assert(window_size == (ulg)2*WSIZE, "no sliding with BIG_MEM");
        // Copy the upper half to the lower half and rebase the positions
        memcpy((char*)window, (char*)window+WSIZE, (unsigned)WSIZE);
        match_start -= WSIZE;
        strstart    -= WSIZE; /* we now have strstart >= MAX_DIST: */

        block_start -= (long) WSIZE;
		
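        // Update the hash table: an entry that pointed into the old lower half is dropped by resetting it to NIL.
        // Note that entries nearer the head of a chain lie further right in the window (they are more recent),
        // so even if dropping an entry breaks a chain, everything after the break is garbage anyway.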
        for (n = 0; n < HASH_SIZE; n++) {
            m = head[n];
            head[n] = (Pos)(m >= WSIZE ? m-WSIZE : NIL);
        }
        for (n = 0; n < WSIZE; n++) {
            m = prev[n];
            prev[n] = (Pos)(m >= WSIZE ? m-WSIZE : NIL);
            /* If n is not on any hash chain, prev[n] is garbage but
             * its value will never be used.
             */
        }
        // The free space grows by WSIZE
        more += WSIZE;
    }
    /* At this point, more >= 2 */
    if (!eofile) {
        // Read up to more bytes of new data;
        // read_buf was covered in Note 2 of lm_init above
        n = read_buf((char*)window+strstart+lookahead, more);
        if (n == 0 || n == (unsigned)EOF) {
            eofile = 1;
        } else {
            lookahead += n;
        }
    }
}

The process is illustrated in the figure.
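
The slide itself is easiest to see on a toy window. The sketch below shrinks WSIZE to 8 bytes and HASH_SIZE to 4 purely for illustration (the real values are 32K) and isolates the copy-and-rebase step; slide_window is a hypothetical name, and reading new input is left out.

```c
#include <string.h>

#define WSIZE     8u     /* tiny window, for illustration only */
#define HASH_SIZE 4u     /* tiny hash table, for illustration only */
#define NIL       0

typedef unsigned short Pos;

unsigned char window[2 * WSIZE];
Pos head[HASH_SIZE];
Pos prev[WSIZE];

/* Hypothetical helper: move the upper half of the window down and rebase
 * every stored position; positions that fall off the left edge become NIL. */
void slide_window(unsigned *strstart)
{
    memcpy(window, window + WSIZE, WSIZE);
    *strstart -= WSIZE;
    for (unsigned n = 0; n < HASH_SIZE; n++)
        head[n] = (Pos)(head[n] >= WSIZE ? head[n] - WSIZE : NIL);
    for (unsigned n = 0; n < WSIZE; n++)
        prev[n] = (Pos)(prev[n] >= WSIZE ? prev[n] - WSIZE : NIL);
}
```

No entry is deleted one by one: a single pass over head[] and prev[] subtracts WSIZE from every position that survives and clears the rest.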

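(iv.) Next, the function longest_match():
longest_match() walks the hash chain looking for a long match for the string at the current position (because of the configuration parameters, the result is not necessarily the longest one).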

/* ===========================================================================
 * Set match_start to the longest match starting at the given string and
 * return its length. Matches shorter or equal to prev_length are discarded,
 * in which case the result is equal to prev_length and match_start is
 * garbage.
 * [Note: sets match_start to the position of the longest match for the given string and returns its length.]
 * IN assertions: cur_match is the head of the hash chain for the current
 *   string (strstart) and its distance is <= MAX_DIST, and prev_length >= 1
 */
#ifndef ASMV
/* For MSDOS, OS/2 and 386 Unix, an optimized version is in match.asm or
 * match.s. The code is functionally equivalent, so you can use the C version
 * if desired.
 * [Note: for MSDOS, OS/2 and 386 Unix there is an assembly version with the same functionality.]
 */
int longest_match(cur_match)
    /* IPos is defined as: typedef unsigned IPos */
    IPos cur_match;                             /* current match */
{
    unsigned chain_length = max_chain_length;   /* max hash chain length */
    /* uch is defined as: typedef unsigned char uch */
    register uch *scan = window + strstart;     /* current string*/
    register uch *match;                        /* matched string */
    register int len;                           /* length of current match */
    /* prev_length is a global variable: the best length found by the previous search; see the analysis of deflate below */
    int best_len = prev_length;                 /* best match length so far */
    /* Limit the search distance: only strings within MAX_DIST to the left of strstart are considered */
    IPos limit = strstart > (IPos)MAX_DIST ? strstart - (IPos)MAX_DIST : NIL;
    /* Stop when cur_match becomes <= limit. To simplify the code,
     * we prevent matches with the string of window index 0.
     */

    /* Compare two bytes at a time. Note: this is not always beneficial.
     * Try with and without -DUNALIGNED_OK to check.
     */
    /* Pointers and cached bytes of the string being matched; see Note 1. */
    register uch *strend = window + strstart + MAX_MATCH - 1;
    register ush scan_start = *(ush*)scan;
    register ush scan_end   = *(ush*)(scan+best_len-1);

    /* Do not waste too much time if we already have a good match: 
    *  [Note: if prev_length is already long enough, cut the chain limit to a quarter]
    */
    if (prev_length >= good_match) {
        chain_length >>= 2;
    }
    Assert(strstart <= window_size-MIN_LOOKAHEAD, "insufficient lookahead");

    do {
        Assert(cur_match < strstart, "no future");
        /* Turn the window offset cur_match into the address of the candidate string */
        match = window + cur_match;

        /* Skip to next match if the match length cannot increase
         * or if the match length is less than 2:
         */
#if (defined(UNALIGNED_OK) && MAX_MATCH == 258)
        /* This code assumes sizeof(unsigned short) == 2. Do not use
         * UNALIGNED_OK if your compiler uses a different size.
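         * [Note: this branch relies on unsigned short being 2 bytes; casting the
         * match pointer to unsigned short * compares 2 bytes at a time.]
         */
        /* Quick reject: if the candidate's first 2 bytes differ from those at strstart, or
        * it does not agree at the best_len position, skip to the next candidate.
        */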
        if (*(ush*)(match+best_len-1) != scan_end || *(ush*)match != scan_start) continue;
			
        /* It is not necessary to compare scan[2] and match[2] since they are
         * always equal when the other bytes match, given that the hash keys
         * are equal and that HASH_BITS >= 8. Compare 2 bytes at a time at
         * strstart+3, +5, ... up to strstart+257. We check for insufficient
         * lookahead only every 4th comparison; the 128th check will be made
         * at strstart+257. If MAX_MATCH-2 is not a multiple of 8, it is
         * necessary to put more guard bytes at the end of the window, or
         * to check more often for insufficient lookahead.
         * [Note: unsigned short is 2 bytes, so each comparison advances the pointers by 2.
         * See Note 1.]
         */
        scan++, match++;
        do {
        } while (*(ush*)(scan+=2) == *(ush*)(match+=2) &&
                 *(ush*)(scan+=2) == *(ush*)(match+=2) &&
                 *(ush*)(scan+=2) == *(ush*)(match+=2) &&
                 *(ush*)(scan+=2) == *(ush*)(match+=2) &&
                 scan < strend);
        /* The funny "do {}" generates better code on most compilers */

        /* Here, scan <= window+strstart+257 */
        Assert(scan <= window+(unsigned)(window_size-1), "wild scan");
        if (*scan == *match) scan++;

        len = (MAX_MATCH - 1) - (int)(strend-scan);
        scan = strend - (MAX_MATCH-1);

#else /* UNALIGNED_OK */
/* This branch does the same job as the one above, so it is omitted here. */
       /* omitted */
#endif /* UNALIGNED_OK */
/* end of conditional compilation */
        /* Record the new best match */
        if (len > best_len) {
            match_start = cur_match;
            best_len = len;
            if (len >= nice_match) break;
#ifdef UNALIGNED_OK
            scan_end = *(ush*)(scan+best_len-1);
#else
            scan_end1  = scan[best_len-1];
            scan_end   = scan[best_len];
#endif
        }/* Follow the hash chain toward its tail */
    } while ((cur_match = prev[cur_match & WMASK]) > limit
	     && --chain_length != 0);

    return best_len;
}
#endif /* ASMV */

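@Note 1.


As the figure shows, best_len is the best match length found so far and
MAX_MATCH is the upper bound on the match length.
A short * is used to compare 2 bytes at a time, and the candidate position must always stay above limit.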
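
Since the byte-wise branch is omitted above, here is a simplified sketch of the same search loop that compares one byte at a time. It is not gzip's code: the function longest_match_sketch and its explicit parameters are assumptions made for this example (gzip keeps most of this state in globals), and the quick-reject test checks a single byte at offset best_len.

```c
#define MAX_MATCH 258
#define WMASK     0x7fffu
#define NIL       0

/* Simplified, byte-at-a-time sketch of the search loop; not gzip's code.
 * win: window bytes; prev: chain links; cur: chain head for strstart;
 * avail: valid bytes from strstart on. Returns the best length found, or
 * prev_length if nothing longer exists; *match_start gets its position. */
int longest_match_sketch(const unsigned char *win, const unsigned short *prev,
                         unsigned strstart, unsigned cur, unsigned limit,
                         unsigned avail, unsigned max_chain, int prev_length,
                         int nice_match, unsigned *match_start)
{
    const unsigned char *scan = win + strstart;
    int best_len = prev_length;
    int max_len = avail < MAX_MATCH ? (int)avail : MAX_MATCH;

    do {
        const unsigned char *match = win + cur;
        int len = 0;
        /* Quick reject: a longer match must agree at offset best_len. */
        if (best_len < max_len && match[best_len] == scan[best_len]) {
            while (len < max_len && match[len] == scan[len]) len++;
        }
        if (len > best_len) {
            *match_start = cur;
            best_len = len;
            if (len >= nice_match) break;   /* good enough, stop early */
        }
    } while ((cur = prev[cur & WMASK]) > limit && --max_chain != 0);

    return best_len;
}
```

The quick reject is what makes long chains affordable: a candidate that cannot beat the current best is discarded after one comparison instead of a full scan.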

(v.) Next, the function deflate()
deflate() drives the whole LZ77 process and is very important.

ulg deflate()
{
    IPos hash_head;          /* head of hash chain */
    IPos prev_match;         /* previous match */
    int flush;               /* set if current block must be flushed */
    int match_available = 0; /* set if previous match exists */
    register unsigned match_length = MIN_MATCH-1; /* length of best match */
#ifdef DEBUG
    extern long isize;        /* byte length of input file, for debug only */
#endif

    // If compr_level <= 3, call the simplified version of deflate
    if (compr_level <= 3) return deflate_fast(); /* optimized for speed */

    /* Process the input block. */
    while (lookahead != 0) { // loop while the lookahead area is not empty
        /* Insert the string window[strstart .. strstart+2] in the
         * dictionary, and set hash_head to the head of the hash chain:
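         * [Note: for each character read, the 3-character string starting at it is
         * inserted into the hash table, and the position of the next string in the same chain
         * (the old chain head) is stored in hash_head. If there is none, hash_head is set to NIL = 0.
         * For the INSERT_STRING macro see Note 4 of (ii.) lm_init.]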
         */
        INSERT_STRING(strstart, hash_head);

        /* Find the longest match, discarding those <= prev_length.
         * unsigned near prev_length: best match length at the previous byte, used for lazy matching
         * register unsigned match_length: best match length at the current byte
         */
        prev_length = match_length, prev_match = match_start;
        match_length = MIN_MATCH-1;

        // hash_head != NIL: the chain holds a candidate
        // prev_length < max_lazy_match: the previous match is not already long enough
        // strstart - hash_head <= MAX_DIST: the candidate is within the allowed distance
        if (hash_head != NIL && prev_length < max_lazy_match &&
            strstart - hash_head <= MAX_DIST) {
            /* To simplify the code, we prevent matches with the string
             * of window index 0 (in particular we have to avoid a match
             * of the string with itself at the start of the input file).
             */
            // longest_match: analyzed in detail above
            match_length = longest_match (hash_head);
            /* longest_match() sets match_start */
            // Clamp the length: a match cannot extend beyond the lookahead area.
            if (match_length > lookahead) match_length = lookahead;
            
            /* Ignore a length 3 match if it is too distant: 
             * [Note: a match of exactly 3 bytes that lies too far from strstart is not worth
             * encoding: the distance costs too many bits for such a short match.]
             */
            if (match_length == MIN_MATCH && strstart-match_start > TOO_FAR){
                /* If prev_match is also MIN_MATCH, match_start is garbage
                 * but we will ignore the current match anyway.
                 */
                match_length--;
            }
        }
        /* If there was a match at the previous step and the current
         * match is not better, output the previous match:
         * [Note: prev_length >= 3 means the previous position found a match; comparing its length
         * with the current one decides which to output. This is the lazy match.]
         */
        if (prev_length >= MIN_MATCH && match_length <= prev_length) {
            // The previous match is at least as good, so output it

            check_match(strstart-1, prev_match, prev_length);
            // ct_tally lives in another file and is also an important function; it will be analyzed in the next article.
            flush = ct_tally(strstart-1-prev_match, prev_length - MIN_MATCH);

            /* Insert in hash table all strings up to the end of the match.
             * strstart-1 and strstart are already inserted.
             */
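            // The match of length prev_length has been encoded and moves into the dictionary, so lookahead shrinks.
            // It shrinks by prev_length - 1 because the match began at the previous character.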
            lookahead -= prev_length-1;
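            // prev_length now counts the remaining positions to insert into the hash table;
            // the previous and current positions are already inserted, hence minus two.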
            prev_length -= 2;
            do {
                strstart++;
                INSERT_STRING(strstart, hash_head);
                /* strstart never exceeds WSIZE-MAX_MATCH, so there are
                 * always MIN_MATCH bytes ahead. If lookahead < MIN_MATCH
                 * these bytes are garbage, but it does not matter since the
                 * next lookahead bytes will always be emitted as literals.
                 */
            } while (--prev_length != 0);
            // Reset the flag: match_available = 0, no previous match to compare with
            match_available = 0;
            match_length = MIN_MATCH-1;
            strstart++;
            if (flush) FLUSH_BLOCK(0), block_start = strstart;

        } else if (match_available) {
            /* If there was no match at the previous position, output a
             * single literal. If there was a match but the current match
             * is longer, truncate the previous match to a single literal.
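             * [Note: 1. If the previous position had no match (prev_length < MIN_MATCH), its character
             * is output as a single literal; LZ77 emits two kinds of output, see Note 1.
             * 2. If the previous position had a match but the current one is longer, the previous
             * character is also output as a single literal.]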
             */
            Tracevv((stderr,"%c",window[strstart-1]));
            if (ct_tally (0, window[strstart-1])) {
                FLUSH_BLOCK(0), block_start = strstart;
            }
            strstart++;
            lookahead--;
        } else {
            /* There is no previous match to compare with, wait for
             * the next step to decide.
             */
            match_available = 1;
            strstart++;
            lookahead--;
        }
        Assert (strstart <= isize && lookahead <= isize, "a bit too far");

        /* Make sure that we always have enough lookahead, except
         * at the end of the input file. We need MAX_MATCH bytes
         * for the next match, plus MIN_MATCH bytes to insert the
         * string following the next match.
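         * [Note: when the lookahead runs short, fill_window() is called to refill it.
         * This also explains MIN_LOOKAHEAD = MAX_MATCH + MIN_MATCH + 1:
         * MAX_MATCH bytes for the match and MIN_MATCH bytes to compute the hash code.]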
         */
        while (lookahead < MIN_LOOKAHEAD && !eofile) fill_window();
    }
    if (match_available) ct_tally (0, window[strstart-1]);

    return FLUSH_BLOCK(1); /* eof */
}
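
The effect of lazy matching can be seen on a small input. The sketch below is a deliberately simplified model, not gzip's loop: it uses a brute-force search instead of hash chains, ignores max_lazy_match and TOO_FAR, and only counts output tokens. The names best_match and count_tokens are hypothetical.

```c
#include <stddef.h>

#define MIN_MATCH 3

/* Hypothetical helper: length of the longest earlier repeat of the text
 * at pos, found by brute force. */
static size_t best_match(const unsigned char *b, size_t n, size_t pos)
{
    size_t best = 0;
    for (size_t c = 0; c < pos; c++) {
        size_t len = 0;
        while (pos + len < n && b[c + len] == b[pos + len]) len++;
        if (len > best) best = len;
    }
    return best;
}

/* Count output tokens (literals plus matches). With lazy != 0, a match at
 * pos is demoted to a literal when pos + 1 offers a strictly longer one. */
size_t count_tokens(const unsigned char *b, size_t n, int lazy)
{
    size_t pos = 0, tokens = 0;
    while (pos < n) {
        size_t len = best_match(b, n, pos);
        if (len >= MIN_MATCH && lazy && pos + 1 < n &&
            best_match(b, n, pos + 1) > len)
            len = 0;                      /* emit a literal, retry at pos+1 */
        pos += (len >= MIN_MATCH) ? len : 1;
        tokens++;
    }
    return tokens;
}
```

On the input "abc_bcde_abcde", the greedy strategy takes the 3-byte match "abc" at position 9 and is then left with two literals, for 12 tokens in total. The lazy strategy emits 'a' as a literal and takes the 4-byte match "bcde" at position 10, for 11 tokens.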