Huffman Coding in Practice

Licensed under CC 4.0 BY-SA

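This was the major assignment for my university's Data Structures and Algorithms course.
The problem statement is as follows:

Huffman tree

Construct a Huffman code for the following English text: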

“Effificient and robust facial landmark localisation is crucial for the deployment of real-time face analysis systems. This paper presents a new loss function, namely Rectifified Wing (RWing) loss, for regression-based facial landmark localisation with Convolutional Neural Networks (CNNs). We fifirst systemically analyse different loss functions, including L2, L1 and smooth L1. The analysis suggests that the training of a network should pay more attention to small-medium errors. Motivated by this finding, we design a piece-wise loss that amplififies the impact of the samples with small-medium errors. Besides, we rectify the loss function for very small errors to mitigate the impact of inaccuracy of manual annotation”

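The requirements are:

1) Compute the probability of each character and build a Huffman tree using the probabilities as weights. Write out the construction process, draw the final Huffman tree, and give the Huffman code of each character.

Solution: the following code computes the probability of each character and lists the characters in ascending order of probability: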

#include <iostream>
#include <string>

using namespace std;

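// Huffman tree node
typedef struct HFnode {
    char data;
    float weight;// node weight, i.e. the character's probability
    HFnode *lchild;
    HFnode *rchild;
} *HFtree, HFnode;
// Linked queue (originally a circular singly linked list; a circular list is unnecessary, so a plain singly linked list is used)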
typedef struct QueueNode {
    HFnode data;
    QueueNode *next;
} QueueNode, *QueueNodePtr;

typedef struct LinkQueue {
//    The queue has a head node that stores no data
//    Empty condition: rear == front; there is no full condition
    QueueNode *front;
    QueueNode *rear;
} LinkQueue;

bool initLinkQueue(LinkQueue &Q) {
    QueueNode *p = new QueueNode;
    p->next = NULL;
    Q.rear = Q.front = p;
    return true;
}

// Enqueue
bool EnQueue(LinkQueue &Q, HFnode e) {
//    Enqueue at the tail, dequeue from the head
    QueueNode *p = new QueueNode;
    p->data = e;
    p->next = NULL;
    Q.rear->next = p;
    Q.rear = Q.rear->next;
    return true;
}

bool DeQueue(LinkQueue &Q, HFnode &e) {
//    Enqueue at the tail, dequeue from the head; check for an empty queue before dequeuing
    if (Q.front == Q.rear)
        return false;
    QueueNode *p = Q.front->next;
    e = p->data;
    Q.front->next = Q.front->next->next;
    if (Q.front->next == NULL) {
        Q.rear = Q.front;// when the last element leaves, point both pointers at the head node
    }
    delete p;
    return true;
}
int getFreq(char c, const string &str);

HFnode merge(HFnode HFnode1, HFnode HFnode2);

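double probabilities[128]; // probabilities[c] is the probability of the character whose ASCII code is c
bool flags[128]; // marks whether an entry has already been selected: true = used, false = unused; must be initialised!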
void test1_PrintData();

void initFlags();

int getMin();

void PreOrder(HFtree root, string code);

int main() {
    string str = "Effificient and robust facial landmark localisation is crucial for the deployment of real-time face analysis systems. This paper presents a new loss function, namely Rectifified Wing (RWing) loss, for regression-based facial landmark localisation with Convolutional Neural Networks (CNNs). We fifirst systemically analyse different loss functions, including L2, L1 and smooth L1. The analysis suggests that the training of a network should pay more attention to small-medium errors. Motivated by this finding, we design a piece-wise loss that amplififies the impact of the samples with small-medium errors. Besides, we rectify the loss function for very small errors to mitigate the impact of inaccuracy of manual annotation";
    double len = str.length();// length of the text, stored as a double to avoid integer division
    // Compute the probability of each character in the text
    for (int i = 0; i < 128; i++) {
        probabilities[i] = getFreq(char(i), str) / len;
    }

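//    test1_PrintData();// Checkpoint 1: print each character's probability and check that they sum to 1

    // Build the Huffman tree
    // Initialise flags
    initFlags();
    // Set up the queue
    LinkQueue Q;
    initLinkQueue(Q);
    while (true) {    // find the unused character with the smallest non-zero probability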
        int index = getMin();
        if (index == -1) {
            break;
        }
        // Build the leaf node (EnQueue stores a copy, so a local object is enough)
        HFnode node;
        node.data = char(index);
        node.weight = probabilities[index];
        node.rchild = node.lchild = NULL;
        // Enqueue it
        EnQueue(Q, node);
    }

    HFnode e;
    while(DeQueue(Q,e)){
        cout<<e.data<<": "<<e.weight<<endl;
    }
}
/// Counts how many times character c occurs in str
int getFreq(char c, const string &str) {
    int freq = 0;
    for (char i : str) {
        if (i == c)
            freq++;
    }
    return freq;
}

/// Initialise flags: set every entry to false
void initFlags() {
    for (bool &flag : flags) {
        flag = false;
    }
}

/// Finds the smallest unused non-zero entry of probabilities and returns its index; -1 means every entry has been selected
int getMin() {
    int index = -1;
    for (int i = 0; i < 128; i++) {
        if (probabilities[i] > 1e-9 && !flags[i]) {
            index = i;
            break;
        }
    }
    if (index == -1)
        return index;// every available entry has already been selected

    for (int i = 0; i < 128; i++) {
        if (probabilities[i] < probabilities[index] && probabilities[i] > 1e-9 && !flags[i])
            index = i;
    }
    flags[index] = true;
    return index;
}

Output of the run:

C:\Users\1\Desktop\DS\cmake-build-debug\DS.exe
2: 0.00138122
B: 0.00138122
E: 0.00138122
M: 0.00138122
(: 0.00276243
): 0.00276243
1: 0.00276243
C: 0.00276243
R: 0.00276243
T: 0.00276243
L: 0.00414365
W: 0.00414365
b: 0.00414365
v: 0.00414365
N: 0.00552486
k: 0.00552486
-: 0.00690608
.: 0.00690608
,: 0.00828729
w: 0.0110497
g: 0.0138122
p: 0.0138122
y: 0.0179558
u: 0.019337
h: 0.0207182
d: 0.0220994
c: 0.0290055
m: 0.0303867
f: 0.0372928
r: 0.0428177
l: 0.0497238
n: 0.0566298
o: 0.0566298
t: 0.0662983
a: 0.0718232
s: 0.0732044
e: 0.0745856
i: 0.0773481
 : 0.143646

Process finished with exit code 0

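The Huffman tree is built as follows:

(1) From the n given weights {w1, w2, w3, …, wn}, form a set F = {T1, T2, T3, …, Tn} of n binary trees, where each tree Ti consists of a single root node with weight wi and empty left and right subtrees.
(2) Select the two trees in F whose roots have the smallest weights and make them the left and right subtrees of a new binary tree; the weight of the new root is the sum of the weights of the two subtree roots.
(3) Remove those two trees from F and add the new tree to F.
(4) Repeat steps (2) and (3) until F contains a single tree; that tree is the Huffman tree.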
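
The steps above can be sketched compactly with a min-heap standing in for the set F. This is only an illustration under my own assumptions, not the queue-based program developed later in this post: `SketchNode` and `buildHuffmanCodes` are hypothetical names, the forest is a `std::priority_queue`, and labelling left edges '0' and right edges '1' is an arbitrary choice.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for this sketch; children are indices into a pool.
struct SketchNode {
    double weight;
    char symbol;  // only meaningful for leaves
    int left;     // -1 when the node is a leaf
    int right;
};

// Repeatedly merges the two lightest trees until one remains, then reads
// the codes off the tree (left edge = '0', right edge = '1').
std::map<char, std::string> buildHuffmanCodes(const std::map<char, double>& weights) {
    assert(!weights.empty());
    std::vector<SketchNode> pool;
    typedef std::pair<double, int> Item;  // (weight, index into pool)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > forest;

    // Step (1): one single-node tree per symbol.
    for (const auto& kv : weights) {
        SketchNode leaf = {kv.second, kv.first, -1, -1};
        pool.push_back(leaf);
        forest.push(Item(kv.second, static_cast<int>(pool.size()) - 1));
    }
    // Steps (2)-(4): merge the two lightest roots until one tree is left.
    while (forest.size() > 1) {
        Item a = forest.top(); forest.pop();
        Item b = forest.top(); forest.pop();
        SketchNode parent = {a.first + b.first, '\0', a.second, b.second};
        pool.push_back(parent);
        forest.push(Item(parent.weight, static_cast<int>(pool.size()) - 1));
    }
    // Depth-first walk from the root, extending the code on each edge.
    std::map<char, std::string> codes;
    std::vector<std::pair<int, std::string> > todo;
    todo.push_back(std::make_pair(forest.top().second, std::string()));
    while (!todo.empty()) {
        std::pair<int, std::string> cur = todo.back();
        todo.pop_back();
        const SketchNode& node = pool[cur.first];
        if (node.left < 0) {
            // A text with a single symbol would otherwise get an empty code.
            codes[node.symbol] = cur.second.empty() ? std::string("0") : cur.second;
        } else {
            todo.push_back(std::make_pair(node.left, cur.second + "0"));
            todo.push_back(std::make_pair(node.right, cur.second + "1"));
        }
    }
    return codes;
}
```

Because ties between equal weights can be broken in more than one way, different implementations may produce different code words; the code lengths are what matter for the total cost.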

Applying these rules gives the following Huffman tree:

[Figure: the resulting Huffman tree; the image failed to transfer from the original site]

' ': 100
'n': 0000
'o': 0010
'r': 0101
'l': 0111
't': 1010
'a': 1011
's': 1100
'e': 1110
'i': 1111
'c': 00011
'm': 00110
'h': 01001
'd': 01100
'f': 11011
'p': 000101
'w': 011010
'g': 011011
'y': 110100
'u': 110101
'-': 0001000
'.': 0001001
',': 0011101
'b': 0100000
'v': 0100001
'N': 0100010
'k': 0100011
'L': 00111110
'W': 00111111
'(': 001110000
')': 001110001
'1': 001110010
'C': 001110011
'R': 001111000
'T': 001111001
'E': 0011110100
'B': 0011110101
'M': 0011110110
'2': 0011110111
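
A hand-written table like this is easy to get wrong, so it can be worth checking that no code word is a prefix of another. The following is a minimal sketch of such a check; `isPrefixFree` is a hypothetical helper name of my own, and it assumes the code words are strings of '0' and '1'.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true when no code word is a prefix of (or equal to) another one.
// After lexicographic sorting, a word that is a prefix of some other word is
// also a prefix of its immediate successor, so checking neighbours suffices.
bool isPrefixFree(std::vector<std::string> codes) {
    for (const std::string& code : codes)
        for (char bit : code)
            assert(bit == '0' || bit == '1');  // assumption: binary code words

    std::sort(codes.begin(), codes.end());
    for (std::size_t i = 0; i + 1 < codes.size(); ++i) {
        const std::string& current = codes[i];
        const std::string& next = codes[i + 1];
        // Compare the first current.size() characters of next with current.
        if (next.compare(0, current.size(), current) == 0)
            return false;
    }
    return true;
}
```

For example, `isPrefixFree({"0", "10", "11"})` holds, while `{"0", "01", "11"}` fails because "0" is a prefix of "01".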

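2) Implement the Huffman coding procedure described above in code and print the Huffman code of each character. (Include the code and a screenshot of the output.)

The source code is: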

#include <iostream>
#include <string>
#include <cmath>   // abs for floating-point values
#include <cstdio>  // printf

using namespace std;

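// Huffman tree node
typedef struct HFnode {
    char data;
    float weight;// node weight, i.e. the character's probability
    HFnode *lchild;
    HFnode *rchild;
} *HFtree, HFnode;
// Linked queue (originally a circular singly linked list; a circular list is unnecessary, so a plain singly linked list is used)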
typedef struct QueueNode {
    HFnode data;
    QueueNode *next;
} QueueNode, *QueueNodePtr;

typedef struct LinkQueue {
//    The queue has a head node that stores no data
//    Empty condition: rear == front; there is no full condition
    QueueNode *front;
    QueueNode *rear;
} LinkQueue;

bool initLinkQueue(LinkQueue &Q) {
    QueueNode *p = new QueueNode;
    p->next = NULL;
    Q.rear = Q.front = p;
    return true;
}

// Enqueue
bool EnQueue(LinkQueue &Q, HFnode e) {
//    Enqueue at the tail, dequeue from the head
    QueueNode *p = new QueueNode;
    p->data = e;
    p->next = NULL;
    Q.rear->next = p;
    Q.rear = Q.rear->next;
    return true;
}

bool DeQueue(LinkQueue &Q, HFnode &e) {
//    Enqueue at the tail, dequeue from the head; check for an empty queue before dequeuing
    if (Q.front == Q.rear)
        return false;
    QueueNode *p = Q.front->next;
    e = p->data;
    Q.front->next = Q.front->next->next;
    if (Q.front->next == NULL) {
        Q.rear = Q.front;// when the last element leaves, point both pointers at the head node
    }
    delete p;
    return true;
}

// Insert e so that the queue stays sorted by ascending weight
bool JumpQueue(LinkQueue &Q, HFnode &e) {
    /* q trails p: after the loop, p is the first node whose weight is >= e.weight
     * (or NULL), and q is the node just before it (possibly the head node).
     */
    QueueNodePtr q = Q.front;        // the head node holds no data, so the scan starts behind it
    QueueNodePtr p = Q.front->next;
    QueueNodePtr newQNode = new QueueNode;
    newQNode->data = e;
    while (p != NULL && p->data.weight < e.weight) {// p != NULL must be tested first, otherwise p->data would cause a segmentation fault
        p = p->next;
        q = q->next;
    }
    newQNode->next = p;
    q->next = newQNode;
    if (p == NULL)
        Q.rear = newQNode;// inserted at the tail, so rear has to follow
    return true;
}

int getFreq(char c, const string &str);

HFnode merge(HFnode HFnode1, HFnode HFnode2);

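double probabilities[128]; // probabilities[c] is the probability of the character whose ASCII code is c
bool flags[128]; // marks whether an entry has already been selected: true = used, false = unused; must be initialised!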
void test1_PrintData();

void initFlags();

int getMin();

void PreOrder(HFtree root, string code);

int main() {
    string str = "Effificient and robust facial landmark localisation is crucial for the deployment of real-time face analysis systems. This paper presents a new loss function, namely Rectifified Wing (RWing) loss, for regression-based facial landmark localisation with Convolutional Neural Networks (CNNs). We fifirst systemically analyse different loss functions, including L2, L1 and smooth L1. The analysis suggests that the training of a network should pay more attention to small-medium errors. Motivated by this finding, we design a piece-wise loss that amplififies the impact of the samples with small-medium errors. Besides, we rectify the loss function for very small errors to mitigate the impact of inaccuracy of manual annotation";
    double len = str.length();// length of the text, stored as a double to avoid integer division
    // Compute the probability of each character in the text
    for (int i = 0; i < 128; i++) {
        probabilities[i] = getFreq(char(i), str) / len;
    }

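//    test1_PrintData();// Checkpoint 1: print each character's probability and check that they sum to 1

    // Build the Huffman tree
    // Initialise flags
    initFlags();
    // Set up the queue
    LinkQueue Q;
    initLinkQueue(Q);
    while (true) {    // find the unused character with the smallest non-zero probability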
        int index = getMin();
        if (index == -1) {
            break;
        }
        // Build the leaf node (EnQueue stores a copy, so a local object is enough)
        HFnode node;
        node.data = char(index);
        node.weight = probabilities[index];
        node.rchild = node.lchild = NULL;
        // Enqueue it
        EnQueue(Q, node);
    }

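    HFnode HFnode1, HFnode2, newHFnode;
    HFtree root = NULL;
    while (DeQueue(Q, HFnode1) && DeQueue(Q, HFnode2)) {
        newHFnode = merge(HFnode1, HFnode2);
        // Once the last two trees have been merged the queue is empty and the new node is the root.
        // Testing the queue avoids comparing the float weight against 1 with a tight tolerance.
        if (Q.front == Q.rear) {
            root = &newHFnode;
            break;
        }
        JumpQueue(Q, newHFnode);
    }
    // The Huffman tree rooted at root is built; traverse it to obtain the codes
    if (root != NULL)
        PreOrder(root, "");
    printf("finish!");
}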

/// Counts how many times character c occurs in str
int getFreq(char c, const string &str) {
    int freq = 0;
    for (char i : str) {
        if (i == c)
            freq++;
    }
    return freq;
}

/// Initialise flags: set every entry to false
void initFlags() {
    for (bool &flag : flags) {
        flag = false;
    }
}

/// Finds the smallest unused non-zero entry of probabilities and returns its index; -1 means every entry has been selected
int getMin() {
    int index = -1;
    for (int i = 0; i < 128; i++) {
        if (probabilities[i] > 1e-9 && !flags[i]) {
            index = i;
            break;
        }
    }
    if (index == -1)
        return index;// every available entry has already been selected

    for (int i = 0; i < 128; i++) {
        if (probabilities[i] < probabilities[index] && probabilities[i] > 1e-9 && !flags[i])
            index = i;
    }
    flags[index] = true;
    return index;
}

/// Merge two nodes into a new parent node
HFnode merge(HFnode HFnode1, HFnode HFnode2) {
    // Copy both children to the heap so that they outlive the queue entries
    HFnode *node1 = new HFnode(HFnode1);
    HFnode *node2 = new HFnode(HFnode2);
    HFnode newNode;// returned by value, so no heap allocation is needed for the parent
    newNode.data = '\0';// internal nodes carry no character
    newNode.weight = HFnode1.weight + HFnode2.weight;
    newNode.lchild = node1;
    newNode.rchild = node2;
    return newNode;
}

/// Preorder traversal of the binary tree to obtain the codes: left edge = 1, right edge = 0.
void PreOrder(HFtree root, string code) {
    if (root->lchild == NULL && root->rchild == NULL) {
        cout<<root->data<<": "+code<<endl;
    }
    if (root->lchild != NULL)
        PreOrder(root->lchild, code + "1");
    if (root->rchild != NULL)
        PreOrder(root->rchild, code + "0");
}

/// Checkpoint 1: print each character's probability and check that they sum to 1
void test1_PrintData() {
    double sum = 0;
    for (int i = 0; i < 128; i++) {
        if (probabilities[i] != 0) {
            cout << char(i) << ": " << probabilities[i] << endl;
            sum += probabilities[i];
        }
    }
    if (abs(sum - 1) < 0.00001)
        printf("Checkpoint 1: Pass!");
    else
        printf("Checkpoint 1: Fail!");
}

Output:

C:\Users\1\Desktop\DS\cmake-build-debug\DS.exe
r: 1111
h: 11101
(: 11100111
): 11100110
E: 111001011
M: 111001010
2: 111001001
B: 111001000
R: 11100011
T: 11100010
1: 11100001
C: 11100000
d: 11011
w: 110101
-: 1101001
.: 1101000
l: 1100
n: 1011
o: 1010
g: 100111
p: 100110
c: 10010
m: 10001
b: 10000111
v: 10000110
L: 10000101
W: 10000100
y: 100000
t: 0111
a: 0110
 : 010
s: 0011
e: 0010
f: 00011
,: 0001011
N: 00010101
k: 00010100
u: 000100
i: 0000
finish!
Process finished with exit code 0
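
One common way to judge a table like the one printed above is its expected code length, the sum over all symbols of probability times code length. The sketch below computes it; `averageCodeLength` and its two map parameters are hypothetical names introduced here, not part of the program above. For comparison, a fixed-length code for the 39 distinct characters of this text would need 6 bits per character.

```cpp
#include <cassert>
#include <map>
#include <string>

// Expected number of bits per symbol: sum of p(symbol) * length(code(symbol)).
// Assumes every symbol that has a probability also has a code word.
double averageCodeLength(const std::map<char, double>& probs,
                         const std::map<char, std::string>& codes) {
    double bits = 0.0;
    for (const auto& entry : probs) {
        std::map<char, std::string>::const_iterator it = codes.find(entry.first);
        assert(it != codes.end());  // a symbol without a code is a usage error
        bits += entry.second * static_cast<double>(it->second.size());
    }
    return bits;
}
```

To apply it to this assignment, one would fill `probs` from the probability listing and `codes` from the code listing; multiplying the result by the text length gives the size of the encoded text in bits.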

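3) Analyse the efficiency of the algorithm, covering at least its time complexity and space complexity.

① Time complexity

Starting the analysis from the entry point, main: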

int main() {
    string str = "Effificient and robust facial landmark localisation is crucial for the deployment of real-time face analysis systems. This paper presents a new loss function, namely Rectifified Wing (RWing) loss, for regression-based facial landmark localisation with Convolutional Neural Networks (CNNs). We fifirst systemically analyse different loss functions, including L2, L1 and smooth L1. The analysis suggests that the training of a network should pay more attention to small-medium errors. Motivated by this finding, we design a piece-wise loss that amplififies the impact of the samples with small-medium errors. Besides, we rectify the loss function for very small errors to mitigate the impact of inaccuracy of manual annotation";
    double len = str.length();// length of the text, stored as a double to avoid integer division
    // Compute the probability of each character in the text
    for (int i = 0; i < 128; i++) {
        probabilities[i] = getFreq(char(i), str) / len;
    }

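//    test1_PrintData();// Checkpoint 1: print each character's probability and check that they sum to 1

    // Build the Huffman tree
    // Initialise flags
    initFlags();
    // Set up the queue
    LinkQueue Q;
    initLinkQueue(Q);
    while (true) {    // find the unused character with the smallest non-zero probability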
        int index = getMin();
        if (index == -1) {
            break;
        }
        // Build the leaf node (EnQueue stores a copy, so a local object is enough)
        HFnode node;
        node.data = char(index);
        node.weight = probabilities[index];
        node.rchild = node.lchild = NULL;
        // Enqueue it
        EnQueue(Q, node);
    }

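    HFnode HFnode1, HFnode2, newHFnode;
    HFtree root = NULL;
    while (DeQueue(Q, HFnode1) && DeQueue(Q, HFnode2)) {
        newHFnode = merge(HFnode1, HFnode2);
        // Once the last two trees have been merged the queue is empty and the new node is the root.
        // Testing the queue avoids comparing the float weight against 1 with a tight tolerance.
        if (Q.front == Q.rear) {
            root = &newHFnode;
            break;
        }
        JumpQueue(Q, newHFnode);
    }
    // The Huffman tree rooted at root is built; traverse it to obtain the codes
    if (root != NULL)
        PreOrder(root, "");
    printf("finish!");
}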

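Two loops in main dominate the running time. The first loop enqueues the characters that occur in the text in ascending order of probability. The getMin() function it calls finds the not-yet-selected character that occurs in the text with the smallest probability. Its code is: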

int getMin() {
    int index = -1;
    for (int i = 0; i < 128; i++) {
        if (probabilities[i] > 1e-9 && !flags[i]) {
            index = i;
            break;
        }
    }
    if (index == -1)
        return index;// every available entry has already been selected

    for (int i = 0; i < 128; i++) {
        if (probabilities[i] < probabilities[index] && probabilities[i] > 1e-9 && !flags[i])
            index = i;
    }
    flags[index] = true;
    return index;
}

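Let the length of the text be n. getMin() makes two passes over the probability table, so its cost is linear in the table size (fixed at 128 here). In the worst case, where every character of the text is distinct and appears exactly once, the table and the number of symbols both grow with n, so a single call is $O(n)$ and the loop that calls it once per symbol has a worst-case time complexity of $O(n^2)$.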
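
As an aside, the repeated minimum search is effectively a selection sort. A possible alternative, sketched below under my own assumptions, is to collect the symbols that actually occur and sort them once, which takes $O(k \log k)$ for k distinct symbols. `sortedSymbols` is a hypothetical name, and the input is assumed to be a table indexed by ASCII code like `probabilities`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Collects the symbols with non-zero probability and sorts them by ascending
// probability; ties are ordered by ASCII code because pairs compare their
// second member when the first members are equal.
std::vector<std::pair<double, char> > sortedSymbols(const std::vector<double>& probs) {
    assert(probs.size() <= 128);  // assumption: 7-bit ASCII table
    std::vector<std::pair<double, char> > present;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] > 0.0)
            present.push_back(std::make_pair(probs[i], static_cast<char>(i)));
    }
    std::sort(present.begin(), present.end());
    return present;
}
```

The sorted vector could then be enqueued front to back in place of the getMin() loop.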
Next, consider the second loop:

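    while (DeQueue(Q, HFnode1) && DeQueue(Q, HFnode2)) {
        newHFnode = merge(HFnode1, HFnode2);
        // Once the last two trees have been merged the queue is empty and the new node is the root.
        if (Q.front == Q.rear) {
            root = &newHFnode;
            break;
        }
        JumpQueue(Q, newHFnode);
    }
    // The Huffman tree rooted at root is built; traverse it to obtain the codes
    if (root != NULL)
        PreOrder(root, "");
    printf("finish!");
}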

The merge() function it calls is:


/// Merge two nodes into a new parent node
HFnode merge(HFnode HFnode1, HFnode HFnode2) {
    // Copy both children to the heap so that they outlive the queue entries
    HFnode *node1 = new HFnode(HFnode1);
    HFnode *node2 = new HFnode(HFnode2);
    HFnode newNode;// returned by value, so no heap allocation is needed for the parent
    newNode.data = '\0';// internal nodes carry no character
    newNode.weight = HFnode1.weight + HFnode2.weight;
    newNode.lchild = node1;
    newNode.rchild = node2;
    return newNode;
}

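Clearly, merge() runs in $O(1)$ time. The loop, however, also calls JumpQueue(), which walks the sorted queue to find the insertion point and is therefore linear in the queue length. Under the same assumption (every character of the text is distinct and appears exactly once), the loop performs $n-1$ merges with an $O(n)$ insertion each, so its worst-case time complexity is $O(n^2)$.
main also calls the preorder traversal function PreOrder(), whose code is: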

/// Preorder traversal of the binary tree to obtain the codes: left edge = 1, right edge = 0.
void PreOrder(HFtree root, string code) {
    if (root->lchild == NULL && root->rchild == NULL) {
        cout<<root->data<<": "+code<<endl;
    }
    if (root->lchild != NULL)
        PreOrder(root->lchild, code + "1");
    if (root->rchild != NULL)
        PreOrder(root->rchild, code + "0");
}

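Under the same assumption the tree has $2n-1$ nodes and each is visited once, so the traversal makes $O(n)$ calls in the worst case (ignoring the cost of copying the code string on each call).
In summary, the time complexity of this algorithm is $O(n^2)$.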

② Space complexity

The auxiliary storage used by the algorithm is:
Arrays:

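double probabilities[128]; // probabilities[c] is the probability of the character whose ASCII code is c
bool flags[128]; // marks whether an entry has already been selected: true = used, false = unused; must be initialised!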

Both arrays have a fixed size of 128, so their space complexity is $O(1)$.
The queue:

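// Huffman tree node
typedef struct HFnode {
    char data;
    float weight;// node weight, i.e. the character's probability
    HFnode *lchild;
    HFnode *rchild;
} *HFtree, HFnode;
// Linked queue (originally a circular singly linked list; a circular list is unnecessary, so a plain singly linked list is used)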
typedef struct QueueNode {
    HFnode data;
    QueueNode *next;
} QueueNode, *QueueNodePtr;

typedef struct LinkQueue {
//    The queue has a head node that stores no data
//    Empty condition: rear == front; there is no full condition
    QueueNode *front;
    QueueNode *rear;
} LinkQueue;