The rsync algorithm


ref : http://rsync.samba.org/tech_report/

The problem

Imagine you have two files, A and B, and you wish to update B to be the same as A. The obvious method is to copy A onto B.

Now imagine that the two files are on machines connected by a slow communications link, for example a dial up IP link. If A is large, copying A onto B will be slow. To make it faster you could compress A before sending it, but that will usually only gain a factor of 2 to 4.

Now assume that A and B are quite similar, perhaps both derived from the same original file. To really speed things up you would need to take advantage of this similarity. A common method is to send just the differences between A and B down the link and then use this list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used (once you had copied the file over, you wouldn't need the differences). This is the problem that rsync addresses.

The rsync algorithm efficiently computes which parts of a source file match some part of an existing destination file. These parts need not be sent across the link; all that is needed is a reference to the part of the destination file. Only parts of the source file which are not matched in this way need to be sent verbatim. The receiver can then construct a copy of the source file using the references to parts of the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any of a range of common compression algorithms, for further speed improvements.

The rsync algorithm

Suppose we have two general purpose computers $\alpha$ and $\beta$. Computer $\alpha$ has access to a file A and $\beta$ has access to file B, where A and B are ``similar''. There is a slow communications link between $\alpha$ and $\beta$.

The rsync algorithm consists of the following steps:

1. $\beta$ splits the file B into a series of non-overlapping fixed-sized blocks of size S bytes. The last block may be shorter than S bytes.

2. For each of these blocks $\beta$ calculates two checksums: a weak ``rolling'' 32-bit checksum (described below) and a strong 128-bit MD4 checksum.

3. $\beta$ sends these checksums to $\alpha$.

4. $\alpha$ searches through A to find all blocks of length S bytes (at any offset, not just multiples of S) that have the same weak and strong checksum as one of the blocks of B. This can be done in a single pass very quickly using a special property of the rolling checksum described below.

5. $\alpha$ sends $\beta$ a sequence of instructions for constructing a copy of A. Each instruction is either a reference to a block of B, or literal data. Literal data is sent only for those sections of A which did not match any of the blocks of B.

The end result is that $\beta$ gets a copy of A, but only the pieces of A that are not found in B (plus a small amount of data for checksums and block indexes) are sent over the link. The algorithm also only requires one round trip, which minimises the impact of the link latency.
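As a rough illustration of steps 1-3, the following Python sketch shows how $\beta$ might generate the per-block signatures. It is not rsync's actual code: MD5 stands in for the 128-bit MD4 checksum (Python's hashlib does not reliably provide MD4), the function and variable names are invented for this sketch, and weak_checksum is the rolling checksum defined under ``Rolling checksum'' below.

```python
import hashlib

def block_signatures(path, block_size=700):
    """Steps 1 and 2: split B into fixed-size blocks and checksum each one.

    Returns a list of (block index, weak checksum, strong checksum) tuples,
    which is what beta would send to alpha in step 3.  The default block size
    of 700 bytes is arbitrary; the report's tests use 300-1100 bytes.
    """
    signatures = []
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(block_size)   # the last block may be shorter than S
            if not block:
                break
            weak, _, _ = weak_checksum(block)        # rolling checksum, defined below
            strong = hashlib.md5(block).digest()     # stand-in for the 128-bit MD4
            signatures.append((index, weak, strong))
            index += 1
    return signatures
```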

The most important details of the algorithm are the rolling checksum and the associated multi-alternate search mechanism which allows the all-offsets checksum search to proceed very quickly. These will be discussed in greater detail below.

 

Rolling checksum

The weak rolling checksum used in the rsync algorithm needs to have the property that it is very cheap to calculate the checksum of a buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer $X_1 \ldots X_n$ and the values of the bytes $X_1$ and $X_{n+1}$.

The weak checksum algorithm we used in our implementation was inspired by Mark Adler's adler-32 checksum. Our checksum is defined by

\begin{displaymath}a(k,l) = (\sum_{i=k}^l X_i) \bmod M \end{displaymath}

\begin{displaymath}b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M \end{displaymath}

\begin{displaymath}s(k,l) = a(k,l) + 2^{16} b(k,l) \end{displaymath}

where s(k,l) is the rolling checksum of the bytes $X_k \ldots X_l$. For simplicity and speed, we use $M = 2^{16}$.
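Translated directly into code, the definition looks like the following Python sketch (assuming 8-bit bytes and $M = 2^{16}$ as above; the name weak_checksum is ours, not rsync's):

```python
def weak_checksum(block, M=1 << 16):
    """Compute a(k,l), b(k,l) and s(k,l) for one block, from scratch.

    `block` holds the bytes X_k .. X_l.  Returns (s, a, b) so that a and b
    can be reused by the rolling update shown below; (b << 16) is the
    2^16 * b(k,l) term in the definition of s(k,l).
    """
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a + (b << 16), a, b
```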

The important property of this checksum is that successive values can be computed very efficiently using the recurrence relations


\begin{displaymath}a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M \end{displaymath}

 

\begin{displaymath}b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M \end{displaymath}


 

Thus the checksum can be calculated for blocks of length S at all possible offsets within a file in a ``rolling'' fashion, with very little computation at each point.
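Continuing the sketch above, the recurrences give a constant-time update per offset. The roll helper below is an illustrative name, and the loop simply checks that rolling agrees with recomputing the checksum from scratch:

```python
def roll(a, b, x_out, x_in, block_len, M=1 << 16):
    """Slide the window one byte: from X_k..X_l to X_{k+1}..X_{l+1}.

    x_out is X_k (the byte leaving the window), x_in is X_{l+1} (the byte
    entering it), and block_len is l - k + 1.
    """
    a = (a - x_out + x_in) % M
    b = (b - block_len * x_out + a) % M
    return a, b

# Sanity check: rolling over some test data matches from-scratch computation.
data, S = bytes(range(1, 64)), 8
_, a, b = weak_checksum(data[:S])
for k in range(1, len(data) - S + 1):
    a, b = roll(a, b, data[k - 1], data[k + S - 1], S)
    assert a + (b << 16) == weak_checksum(data[k:k + S])[0]
```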

Despite its simplicity, this checksum was found to be quite adequate as a first level check for a match of two file blocks. We have found in practice that the probability of this checksum matching when the blocks are not equal is quite low. This is important because the much more expensive strong checksum must be calculated for each block where the weak checksum matches.

Checksum searching

Once $\alpha$ has received the list of checksums of the blocks of B, it must search A for any blocks at any offset that match the checksum of some block of B. The basic strategy is to compute the 32-bit rolling checksum for a block of length S starting at each byte of A in turn, and for each checksum, search the list for a match. To do this our implementation uses a simple 3 level searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and a $2^{16}$ entry hash table. The list of checksum values (i.e., the checksums from the blocks of B) is sorted according to the 16-bit hash of the 32-bit rolling checksum. Each entry in the hash table points to the first element of the list for that hash value, or contains a null value if no element of the list has that hash value.

At each offset in the file the 32-bit rolling checksum and its 16-bit hash are calculated. If the hash table entry for that hash value is not a null value, the second level check is invoked.

The second level check involves scanning the sorted checksum list starting with the entry pointed to by the hash table entry, looking for an entry whose 32-bit rolling checksum matches the current value. The scan terminates when it reaches an entry whose 16-bit hash differs. If this search finds a match, the third level check is invoked.

The third level check involves calculating the strong checksum for the current offset in the file and comparing it with the strong checksum value in the current list entry. If the two strong checksums match, we assume that we have found a block of A which matches a block of B. In fact the blocks could be different, but the probability of this is microscopic, and in practice this is a reasonable assumption.

When a match is found, $\alpha$ sends $\beta$ the data in A between the current file offset and the end of the previous match, followed by the index of the block in B that matched. This data is sent immediately a match is found, which allows us to overlap the communication with further computation.

If no match is found at a given offset in the file, the rolling checksum is updated to the next offset and the search proceeds. If a match is found, the search is restarted at the end of the matched block. This strategy saves a considerable amount of computation for the common case where the two files are nearly identical. In addition, it would be a simple matter to encode the block indexes as runs, for the common case where a portion of A matches a series of blocks of B in order.
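Putting the pieces together, a simplified version of $\alpha$'s search loop might look like the sketch below. It reuses weak_checksum and roll from the previous section, uses the low 16 bits of the weak checksum as the first-level hash, and uses a Python dict in place of the sorted list plus $2^{16}$-entry pointer table; compute_delta and apply_delta are invented names for this sketch, not rsync's API, and MD5 again stands in for MD4.

```python
import hashlib

def compute_delta(a_data, signatures, block_size):
    """Alpha's side: turn A into a list of ("literal", bytes) and
    ("block", index) instructions, given beta's (index, weak, strong) list."""
    table = {}                                   # first level: 16-bit hash of the weak checksum
    for index, weak, strong in signatures:
        table.setdefault(weak & 0xFFFF, []).append((weak, strong, index))

    out, literal = [], bytearray()
    pos, a, b = 0, None, None
    while pos + block_size <= len(a_data):
        if a is None:                            # fresh window: compute from scratch
            _, a, b = weak_checksum(a_data[pos:pos + block_size])
        weak = a + (b << 16)
        match = None
        # second level: compare the full 32-bit checksum; third level: strong checksum
        for cand_weak, cand_strong, index in table.get(weak & 0xFFFF, ()):
            if cand_weak == weak and \
               hashlib.md5(a_data[pos:pos + block_size]).digest() == cand_strong:
                match = index
                break
        if match is not None:
            if literal:
                out.append(("literal", bytes(literal)))
                literal.clear()
            out.append(("block", match))
            pos += block_size                    # restart the search after the matched block
            a = b = None
        else:
            literal.append(a_data[pos])
            if pos + block_size < len(a_data):   # roll the checksum to the next offset
                a, b = roll(a, b, a_data[pos], a_data[pos + block_size], block_size)
            pos += 1
    if literal or pos < len(a_data):             # trailing bytes shorter than one block
        out.append(("literal", bytes(literal) + a_data[pos:]))
    return out

def apply_delta(b_data, delta, block_size):
    """Beta's side: rebuild a copy of A from the old file B plus the instructions."""
    parts = []
    for kind, value in delta:
        if kind == "literal":
            parts.append(value)
        else:
            parts.append(b_data[value * block_size:(value + 1) * block_size])
    return b"".join(parts)
```

In this sketch, apply_delta(b_data, compute_delta(a_data, signatures, S), S) reproduces a_data, with only the unmatched regions of A appearing as literal data.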

Pipelining

The above sections describe the process for constructing a copy of one file on a remote system. If we have several files to copy, we can gain a considerable latency advantage by pipelining the process.

This involves $\beta$ initiating two independent processes. One of the processes generates and sends the checksums to $\alpha$ while the other receives the difference information from $\alpha$ and reconstructs the files.

If the communications link is buffered then these two processes can proceed independently and the link should be kept fully utilised in both directions for most of the time.
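A rough structural sketch of this arrangement, assuming Python threads standing in for the two processes and a pair of queue-like objects standing in for the two directions of the buffered link (block_signatures is the signature generator sketched earlier; none of this reflects rsync's actual process model):

```python
import threading

def beta_pipeline(files, send_to_alpha, recv_from_alpha):
    """Beta runs two independent workers, so generating checksums for later
    files overlaps with reconstructing earlier ones."""

    def generator():
        for name in files:
            send_to_alpha.put((name, block_signatures(name)))  # steps 1-3, per file
        send_to_alpha.put(None)                                # end-of-stream marker

    def receiver():
        while True:
            item = recv_from_alpha.get()                       # (name, delta) from alpha
            if item is None:
                break
            # ... rebuild the file from its old contents plus the delta ...

    workers = [threading.Thread(target=generator), threading.Thread(target=receiver)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```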

Results

To test the algorithm, tar files were created of the Linux kernel sources for two versions of the kernel. The two kernel versions were 1.99.10 and 2.0.0. These tar files are approximately 24MB in size and are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in the 2.0.0 release, 19 files had been removed and 25 files had been added.

A ``diff'' of the two tar files using the standard GNU diff utility produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files with a varying block size.

block size | matches | tag hits | false alarms |    data | written |    read
       300 |   64247 |  3817434 |          948 | 5312200 | 5629158 | 1632284
       500 |   46989 |   620013 |           64 | 1091900 | 1283906 |  979384
       700 |   33255 |   571970 |           22 | 1307800 | 1444346 |  699564
       900 |   25686 |   525058 |           24 | 1469500 | 1575438 |  544124
      1100 |   20848 |   496844 |           21 | 1654500 | 1740838 |  445204

In each case, the CPU time taken was less than the time it takes to run ``diff'' on the two files.

The columns in the table are as follows:

block size
The size in bytes of the checksummed blocks.
matches
The number of times a block of B was found in A.
tag hits
The number of times the 16-bit hash of the rolling checksum matched a hash of one of the checksums from B.
false alarms
The number of times the 32-bit rolling checksum matched but the strong checksum didn't.
data
The amount of file data transferred verbatim, in bytes.
written
The total number of bytes written by $\alpha$ including protocol overheads. This is almost all file data.
read
The total number of bytes read by $\alpha$ including protocol overheads. This is almost all checksum information.

The results demonstrate that for block sizes above 300 bytes, only a small fraction (around 5%) of the file was transferred. The amount transferred was also considerably less than the size of the diff file that would have been transferred if the diff/patch method of updating a remote file was used.

The checksums themselves took up a considerable amount of space, although much less than the size of the data transferred in each case. Each pair of checksums consumes 20 bytes: 4 bytes for the rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
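As a rough consistency check, taking the 24 MB tar file at face value and a block size of 300 bytes:

\begin{displaymath} \frac{24 \times 2^{20}}{300} \approx 84000 \mbox{ blocks}, \qquad 84000 \times 20 \mbox{ bytes} \approx 1.7 \times 10^{6} \mbox{ bytes}, \end{displaymath}

which agrees well with the 1632284 bytes in the ``read'' column for that block size.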

The number of false alarms was less than 1/1000 of the number of true matches, indicating that the 32 bit rolling checksum is quite good at screening out false matches.

The number of tag hits indicates that the second level of the checksum search algorithm was invoked about once every 50 characters. This is quite high because the total number of blocks in the file is a large fraction of the size of the tag hash table. For smaller files we would expect the tag hit rate to be much closer to the number of matches. For extremely large files, we should probably increase the size of the hash table.

The next table shows similar results for a much smaller set of files. In this case the files were not packed into a tar file first. Rather, rsync was invoked with an option to recursively descend the directory tree. The files used were from two source releases of another software package called Samba. The total source code size is 1.7 MB and the diff between the two releases is 4155 lines long totalling 120 kB.

 

Availability

An implementation of rsync which provides a convenient interface similar to the common UNIX command rcp has been written and is available for download from ftp://rsync.samba.org/pub/rsync.

 


 

 

 

 

 
