MD5 Algorithm Implementation
The main ideas behind my implementation come from Wikipedia, RFC 1321, and my instructor's lecture slides on MD5.
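1. MD5 Overview
MD5 stands for Message-Digest Algorithm 5.
MD4 (1990) and MD5 (1992, RFC 1321) were both designed by Ron Rivest. MD5 is a widely used hash algorithm, often used to check the integrity and consistency of transmitted data.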
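MD5 uses little-endian byte order. It takes a message of arbitrary length, processes it in 512-bit blocks, and maintains four 32-bit words that are finally concatenated into a fixed 128-bit message digest. The basic procedure is: pad the message so that its length is congruent to 448 modulo 512, append the message length, run the compression loop over each block together with the chaining variables, and output the result. MD5 is not considered sufficiently secure.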
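In 1996 Hans Dobbertin found two different 512-bit blocks that collide under the MD5 compression function (using a non-standard initial value). That was not yet a collision between two complete messages, but full MD5 collisions were demonstrated in 2004 by Xiaoyun Wang and co-authors, so MD5 should not be relied on where collision resistance matters.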
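2. The MD5 Algorithm (procedure taken from RFC 1321)
2.1 Input
Suppose the input message is b bits long, written as:
m_0 m_1 ... m_{b-1}
There is no limit on the length of the input message.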
2.2 Step 1 : Decode and Append Padding Bits
The message is “padded” (extended) so that its length (in bits) is
congruent to 448, modulo 512. That is, the message is extended so
that it is just 64 bits shy of being a multiple of 512 bits long.
Padding is always performed, even if the length of the message is
already congruent to 448, modulo 512.
Padding is performed as follows: a single “1” bit is appended to the
message, and then “0” bits are appended so that the length in bits of
the padded message becomes congruent to 448, modulo 512. In all, at
least one bit and at most 512 bits are appended.
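In this step we convert the input char array into a bit string (stored as an int array in the implementation) and pad it with a single 1 bit followed by as many 0 bits as needed, so that
length % 512 == 448 (in bits)
that is, length % 64 == 56 (in bytes).
Note that padding is always performed, so at least one bit is added.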
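In the RFC 1321 reference implementation these steps are not executed strictly in order. The buffer is initialized once (Step 3), each complete 512-bit block is compressed as soon as it is available (Step 4), and only the final partial block is padded (Step 1) and given its length field (Step 2) before being compressed in turn. Because the code does not follow the linear order Step 1 -> Step 2 -> Step 3 -> Step 4, it can be hard to follow at first.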
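The number of padding bytes follows directly from the message length. The following is a minimal sketch, assuming the message is a whole number of bytes; the helper name `md5_pad_len` is hypothetical, and the arithmetic mirrors the expression used in the RFC 1321 reference code.

```c
#include <stddef.h>

/* Hypothetical helper: number of padding bytes (the 0x80 byte plus zero
 * bytes) needed so that (msg_len_bytes + pad) % 64 == 56.
 * The result is always between 1 and 64. */
size_t md5_pad_len(size_t msg_len_bytes) {
    size_t index = msg_len_bytes % 64;  /* bytes already in the last block */
    return (index < 56) ? (56 - index) : (120 - index);
}
```

For a 3-byte message this gives 53 padding bytes (3 + 53 = 56). For a 56-byte message it gives 64, which is the case where padding is performed even though the length is already congruent to 448 modulo 512.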
2.3 Step 2 : Append Length
A 64-bit representation of b (the length of the message before the
padding bits were added) is appended to the result of the previous
step. In the unlikely event that b is greater than 2^64, then only
the low-order 64 bits of b are used. (These bits are appended as two
32-bit words and appended low-order word first in accordance with the
previous conventions.) At this point the resulting message (after padding with bits and with
b) has a length that is an exact multiple of 512 bits. Equivalently,
this message has a length that is an exact multiple of 16 (32-bit)
words. Let M[0 … N-1] denote the words of the resulting message,
where N is a multiple of 16.
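This step completes the final block. To the bit string padded in Step 1 we append the length of the original message in bits, encoded as a 64-bit value with the low-order word first (the length itself, not the last 64 bits of the message text), so that the total length becomes a multiple of 512 bits, that is,
length % 512 == 0
Once the length is appended, the tail consists of complete 512-bit blocks just like the ones taken earlier, so it can go through Step 4 to finish compressing the message.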
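As an illustration of the length field, the sketch below writes a bit count as 8 bytes, low-order byte first. It assumes a C99 compiler that provides `uint64_t`; `md5_encode_length` is a hypothetical name, not a function from the RFC code.

```c
#include <stdint.h>

/* Hypothetical helper: store bit_len (the message length in bits, before
 * padding) as 8 bytes in little-endian order. */
void md5_encode_length(unsigned char out[8], uint64_t bit_len) {
    int i;
    for (i = 0; i < 8; i++) {
        out[i] = (unsigned char)((bit_len >> (8 * i)) & 0xff);
    }
}
```

For the 3-byte message "abc" the bit count is 24, so the field would be the byte 0x18 followed by seven zero bytes.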
2.4 Step 3 : Initialize MD Buffer
A four-word buffer (A,B,C,D) is used to compute the message digest. Here each of A, B, C, D is a 32-bit register. These registers are initialized to the following values (in hexadecimal, low-order bytes first):
word A: 01 23 45 67
word B: 89 ab cd ef
word C: fe dc ba 98
word D: 76 54 32 10
#define A 0x67452301
#define B 0xEFCDAB89
#define C 0x98BADCFE
#define D 0x10325476
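- These macros define the initial values of the four buffer words. Each later update compresses one block and adds the result of the compression back into the buffer. See the code walkthrough in Section 3 for details.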
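The relationship between the byte listing (01 23 45 67) and the macro value (0x67452301) is just little-endian assembly of four bytes into one word. A minimal sketch, assuming `uint32_t` from C99 is available; `md5_load_le32` is a hypothetical name:

```c
#include <stdint.h>

/* Hypothetical helper: assemble 4 bytes, low-order byte first, into one
 * 32-bit word. */
uint32_t md5_load_le32(const unsigned char bytes[4]) {
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}
```

Feeding it the bytes 01 23 45 67 yields 0x67452301, the value used for word A.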
2.5 Step 4 : Process Message in 16-Word Blocks
We first define four auxiliary functions that each take as input three 32-bit words and produce as output one 32-bit word.
F(X,Y,Z) = XY v not(X) Z
G(X,Y,Z) = XZ v Y not(Z)
H(X,Y,Z) = X xor Y xor Z
I(X,Y,Z) = Y xor (X v not(Z))
In each bit position F acts as a conditional: if X then Y else Z.
(The function F could have been defined using + instead of v since XY
and not(X)Z will never have 1’s in the same bit position.) It is
interesting to note that if the bits of X, Y, and Z are independent
and unbiased, then each bit of F(X,Y,Z) will be independent and
unbiased.
The functions G, H, and I are similar to the function F, in that they
act in “bitwise parallel” to produce their output from the bits of X,
Y, and Z, in such a manner that if the corresponding bits of X, Y,
and Z are independent and unbiased, then each bit of G(X,Y,Z),
H(X,Y,Z), and I(X,Y,Z) will be independent and unbiased. Note that
the function H is the bit-wise “xor” or “parity” function of its
inputs.
This step uses a 64-element table T[1 … 64] constructed from the
sine function. Let T[i] denote the i-th element of the table, which
is equal to the integer part of 4294967296 times abs(sin(i)), where i
is in radians. The elements of the table are given in the appendix.
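- This is the core of the algorithm: all of the compression logic lives here. The T table and the left-rotation amounts are fixed constants given by the RFC. The structure itself is easy to follow: there are four rounds, and each round uses one of the functions FF, GG, HH, II. Each operation combines a decoded 32-bit message word with the given parameters (a T-table constant and a rotation amount), rotates the sum left, and stores the result back into one of the four registers.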
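The auxiliary functions and a single round-1 operation can be sketched as plain C functions. This is an illustrative sketch, not the RFC's macro-based code: the names `md5_f`, `md5_g`, `md5_h`, `md5_i`, `md5_rotl`, and `md5_ff` are hypothetical, and `uint32_t` from C99 is assumed.

```c
#include <stdint.h>

/* The four auxiliary functions defined in RFC 1321, applied bitwise. */
static uint32_t md5_f(uint32_t x, uint32_t y, uint32_t z) { return (x & y) | (~x & z); }
static uint32_t md5_g(uint32_t x, uint32_t y, uint32_t z) { return (x & z) | (y & ~z); }
static uint32_t md5_h(uint32_t x, uint32_t y, uint32_t z) { return x ^ y ^ z; }
static uint32_t md5_i(uint32_t x, uint32_t y, uint32_t z) { return y ^ (x | ~z); }

/* Rotate x left by n bits; valid for 0 < n < 32. */
static uint32_t md5_rotl(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

/* One round-1 operation: returns b + ((a + F(b,c,d) + x + t) <<< s),
 * where x is a message word, t a T-table constant, s a rotation amount. */
static uint32_t md5_ff(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                       uint32_t x, unsigned s, uint32_t t) {
    return b + md5_rotl(a + md5_f(b, c, d) + x + t, s);
}
```

Rounds 2 to 4 have the same shape with `md5_g`, `md5_h`, and `md5_i` in place of `md5_f`. All additions wrap modulo 2^32, which unsigned arithmetic provides for free.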
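3. Selected Code Walkthrough
3.1 Variable Setup
Because MD5 operates entirely on binary data, sign has no meaning for it, so the implementation uses unsigned variables throughout. The unsigned int type is used only to hold the binary data. The roles of these variables follow RFC 1321, but some of them are not obvious. The one worth pointing out is count[2] (declared below as _count): C90 has no 64-bit integer type, so the 64-bit bit count is kept in an array of two 32-bit values, one holding the low 32 bits and one holding the high 32 bits.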
#define A 0x67452301
#define B 0xEFCDAB89
#define C 0x98BADCFE
#define D 0x10325476
// the four chaining registers a, b, c, d
unsigned int _reg[4];
// length of the original message in bits, kept as a 64-bit count split
// into a low 32-bit word and a high 32-bit word
unsigned int _count[2];
// the 512-bit (64-byte) input block buffer
unsigned char _buffer[64];
// the final 128-bit (16-byte) MD5 digest
unsigned char _digest[16];
// padding bytes appended to the tail of the message: 0x80, then zeros
const unsigned char padding[64] = {
0x80};
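To illustrate how the two-word counter works, the sketch below adds a byte length to it with a manual carry, in the spirit of the update step in the RFC 1321 reference code. It assumes unsigned int is exactly 32 bits wide, as the listing above does; `add_length` is a hypothetical name, and it takes the counter as a parameter rather than using the global.

```c
/* Hypothetical helper: add len_bytes * 8 bits to a 64-bit bit counter
 * stored as count[0] (low 32 bits) and count[1] (high 32 bits).
 * Assumes unsigned int is exactly 32 bits wide. */
void add_length(unsigned int count[2], unsigned int len_bytes) {
    unsigned int low_bits = len_bytes << 3;  /* low 32 bits of len_bytes * 8 */
    count[0] += low_bits;
    if (count[0] < low_bits) {               /* low word wrapped: carry */
        count[1]++;
    }
    count[1] += len_bytes >> 29;             /* bits shifted out of the low word */
}
```

The comparison after the addition detects wraparound of the low word, which is the only way to observe a carry without a wider integer type.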
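- decode and encode convert between char and int, that is, between the byte stream and its 32-bit word form: encode splits each word into 4 bytes, low-order byte first, and decode does the reverse.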
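/**
* Initialize the private state: reset the bit count, load the four
* registers with their initial values, and clear the input buffer.
* (memset requires <string.h>.)
*/
void init() {
_count[0] = _count[1] = 0;
_reg[0] = A;
_reg[1] = B;
_reg[2] = C;
_reg[3] = D;
memset(_buffer, 0, sizeof(_buffer));
}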
/**
* Convert 32-bit words to bytes: each int becomes 4 chars, low-order byte first.
* @param {output} the byte stream after conversion
* @param {input} the input int array
* @param {length} the number of output bytes to write (a multiple of 4)
*/
void encode(unsigned char* output, const unsigned int* input, const unsigned int length) {
for (unsigned int i = 0, j = 0; j < length; i++, j += 4) {
output[j] = (unsigned char)(input[i] & 0xff);
output[j + 1] = (unsigned char)((input[i] >> 8) & 0xff);
output[j + 2] = (unsigne