Algorithm Basics -- How Compressed Tries Work

sz66cm

Last modified 2025-04-09 15:58:58

Tags: algorithms

First published 2025-01-24 22:00:00

Article link: https://blog.youkuaiyun.com/sz66cm/article/details/145339443

This is the revised version of the article. It clarifies the concepts, tightens the data structures and the code, and adds a section on memory management.

How Compressed Tries Work

I. Basic Concepts of Tries and Compressed Tries

1. Plain Trie (Prefix Tree)

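A trie (also called a prefix tree or dictionary tree) is a tree-shaped data structure for storing and looking up strings efficiently. Its key properties are:

Nodes do not store whole words. Each word is spelled out by the characters along a root-to-node path, and words with a common prefix share that part of the path.
In the most common implementation each edge represents a single character, so each level of the tree corresponds to one character position. An edge may also be represented as an index into a character set, for example a slot in a per-node child array.

For example, inserting the strings ["he", "hello", "hi"] into a plain trie produces the following structure: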

        (root)
        /    \
      'h'    ...
       |
       v
      (node1)
     /      \
   'e'      'i'
    |        |
   (node2)  (node3) (end of word: hi)
   /   \
 'l'   (end of word: he)
  |
 (node4)
  |
 'l'
  |
 (node5)
  |
 'o'
  |
 (node6) (end of word: hello)

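The advantage of a plain trie is its clear structure: lookup and insertion take O(k) time, where k is the length of the string, provided the child for a given character can be found in constant time. Its drawback is that any stretch of a key with no branching becomes a chain of nodes with only one child each, such as the 'l' -> 'l' -> 'o' tail of "hello" above, which wastes memory.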
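To make the one-character-per-level layout concrete, here is a minimal sketch of a plain trie in C. It is not part of the article's implementation: the names PlainNode, plainCreate, plainInsert, plainSearch and plainCountNodes are illustrative, and the code assumes keys contain only the lowercase letters 'a' to 'z'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26 /* assumption: keys use only 'a'..'z' */

typedef struct PlainNode {
    struct PlainNode *next[ALPHABET_SIZE]; /* one slot per possible character */
    bool isEnd;                            /* true if a key ends at this node */
} PlainNode;

/* Allocate an empty node; returns NULL if malloc fails. */
PlainNode *plainCreate(void) {
    PlainNode *node = (PlainNode *)malloc(sizeof(PlainNode));
    if (!node) return NULL;
    for (int i = 0; i < ALPHABET_SIZE; i++) node->next[i] = NULL;
    node->isEnd = false;
    return node;
}

/* Insert key one character per level; returns false if allocation fails. */
bool plainInsert(PlainNode *root, const char *key) {
    PlainNode *cur = root;
    for (; *key != '\0'; key++) {
        assert(*key >= 'a' && *key <= 'z'); /* enforce the alphabet assumption */
        int idx = *key - 'a';
        if (!cur->next[idx]) {
            cur->next[idx] = plainCreate();
            if (!cur->next[idx]) return false;
        }
        cur = cur->next[idx];
    }
    cur->isEnd = true;
    return true;
}

/* Follow one child pointer per character; the key must end on a marked node. */
bool plainSearch(const PlainNode *root, const char *key) {
    const PlainNode *cur = root;
    for (; *key != '\0'; key++) {
        assert(*key >= 'a' && *key <= 'z');
        cur = cur->next[*key - 'a'];
        if (!cur) return false;
    }
    return cur->isEnd;
}

/* Count all nodes in the subtree, including the root. */
int plainCountNodes(const PlainNode *node) {
    if (!node) return 0;
    int total = 1;
    for (int i = 0; i < ALPHABET_SIZE; i++) {
        total += plainCountNodes(node->next[i]);
    }
    return total;
}
```

Inserting "he", "hello" and "hi" into a fresh root and then calling plainCountNodes gives 7 nodes (the root plus one node per distinct character position), three of which exist only to spell out the "llo" tail. The sketch omits a free function for brevity.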

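2. Compressed Trie

A compressed trie (also known as a radix tree) optimizes the plain trie: a node that has exactly one child and is not itself the end of a word is merged with that child, so a single edge can carry several characters. This reduces both the depth of the tree and the number of redundant nodes.

For example, the compressed trie for the strings ["he", "hello", "hi"]: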

        (root)
         |
        "h"
         |
       (node A)
       /      \
     "e"      "i"
      |        |
   (node B)   (node D) (end of word: hi)
   (end of word: he)
      |
    "llo"
      |
   (node C) (end of word: hello)

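Key operations on a compressed trie

Insertion:
1. Among the current node's edges, find the one whose label shares a prefix with the string being inserted;
2. If the whole label matches, move to the child node and continue with the rest of the string;
3. If only part of the label matches, split the edge so that the common prefix and the remainder are stored separately;
4. If no edge matches, create a new edge.
Search:
- Match the string against edge labels character by character. The string is present only if it is consumed exactly at a node that carries the end-of-word marker; otherwise the search fails.

Collapsing chains into single edges saves nodes without lengthening lookups, which gives a good balance between space usage and query efficiency.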
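The insertion cases listed above all come down to comparing the key with one edge label. As a minimal sketch (the names MatchKind, sharedPrefix and classifyMatch are illustrative and not part of the article's implementation), that comparison can be classified like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    MATCH_NONE,    /* no common prefix: try another edge or create a new one */
    MATCH_FULL,    /* whole label matched: descend into the child node */
    MATCH_PARTIAL  /* label matched only in part: the edge must be split */
} MatchKind;

/* Length of the longest common prefix of key and label. */
size_t sharedPrefix(const char *key, const char *label) {
    assert(key != NULL && label != NULL);
    size_t i = 0;
    /* Stops at the end of key, at the end of label, or at the first mismatch. */
    while (key[i] != '\0' && key[i] == label[i]) {
        i++;
    }
    return i;
}

/* Decide which insertion case applies for this key and edge label. */
MatchKind classifyMatch(const char *key, const char *label) {
    size_t n = sharedPrefix(key, label);
    if (n == 0) return MATCH_NONE;
    if (n == strlen(label)) return MATCH_FULL;
    return MATCH_PARTIAL;
}
```

For instance, the key "hello" against the label "he" is a full match (continue with "llo" in the child), the key "help" against the label "hello" is a partial match with a common prefix of length 3 (split into "hel" and "lo"), and the key "hi" against the label "e" does not match at all.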

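II. C Implementation

1. Data Structure Definitions

The implementation defines the following structs:

Edge struct

Represents one edge and contains:

char *label: the string fragment (substring) stored on the edge.
int labelLen: the length of label, cached so that strlen() does not have to be called repeatedly.
struct TrieNode *child: pointer to the node the edge leads to.

TrieNode struct

Represents one node in the trie and contains:

bool isEndOfWord: whether a word ends at this node.
int numChildren: the number of outgoing edges.
struct Edge **children: an array of pointers to Edge structs.

The structures relate as follows:

(root node)  ->  Edge[label="abc"] -> (nodeA)
                       \
                        ...

The complete struct definitions are: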

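#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct TrieNode {
    bool isEndOfWord;
    int numChildren;
    struct Edge **children; // array of pointers to the outgoing edges
} TrieNode;

typedef struct Edge {
    char *label;
    int labelLen; // cached length of label, avoids calling strlen
    TrieNode *child;
} Edge;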
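Since every edge owns its label and every node owns a growable array of edge pointers, allocation code tends to repeat. The following sketch factors it into two helpers. The names makeEdge and attachEdge are hypothetical and do not appear in the article's code; the struct definitions are repeated only so that the block compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct TrieNode {
    bool isEndOfWord;
    int numChildren;
    struct Edge **children;
} TrieNode;

typedef struct Edge {
    char *label;
    int labelLen;
    TrieNode *child;
} Edge;

/* Hypothetical helper: allocate an edge that owns a copy of the first
 * len characters of label. Returns NULL if allocation fails. */
Edge *makeEdge(const char *label, int len, TrieNode *child) {
    assert(label != NULL && len >= 0);
    Edge *edge = (Edge *)malloc(sizeof(Edge));
    if (!edge) return NULL;
    edge->label = (char *)malloc((size_t)len + 1);
    if (!edge->label) {
        free(edge);
        return NULL;
    }
    memcpy(edge->label, label, (size_t)len);
    edge->label[len] = '\0';
    edge->labelLen = len;
    edge->child = child;
    return edge;
}

/* Hypothetical helper: append edge to parent's child array.
 * Returns false (leaving parent unchanged) if realloc fails. */
bool attachEdge(TrieNode *parent, Edge *edge) {
    size_t newCount = (size_t)parent->numChildren + 1;
    Edge **grown = (Edge **)realloc(parent->children, sizeof(Edge *) * newCount);
    if (!grown) return false;
    grown[parent->numChildren] = edge;
    parent->children = grown;
    parent->numChildren++;
    return true;
}
```

Unlike the article's functions, these helpers report failure through their return value instead of calling exit(), which leaves the decision to the caller. That is a design choice of this sketch, not something the article prescribes.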

2. Basic Functions

Creating a node

TrieNode *createNode(void) {
    TrieNode *node = (TrieNode *)malloc(sizeof(TrieNode));
    if (!node) {
        perror("Memory allocation failed");
        exit(EXIT_FAILURE);
    }
    node->isEndOfWord = false;
    node->numChildren = 0;
    node->children = NULL;
    return node;
}

String utility functions

// Return a newly allocated copy of src[start..end); the caller must free it.
char *substr(const char *src, int start, int end) {
    int length = end - start;
    char *dest = (char *)malloc(length + 1);
    if (!dest) {
        perror("Memory allocation failed");
        exit(EXIT_FAILURE);
    }
    memcpy(dest, src + start, length);
    dest[length] = '\0';
    return dest;
}

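// Length of the common prefix of s1 and s2, comparing at most maxLen characters.
// s2 must hold at least maxLen characters; s1 may be shorter (its '\0' stops the loop).
int commonPrefixLength(const char *s1, const char *s2, int maxLen) {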
    int i = 0;
    while (i < maxLen && s1[i] == s2[i]) {
        i++;
    }
    return i;
}

3. Inserting a String

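void insertString(TrieNode *root, const char *key) {
    // Empty remainder: the key ends exactly at this node.
    if (*key == '\0') {
        root->isEndOfWord = true;
        return;
    }

    for (int i = 0; i < root->numChildren; i++) {
        Edge *edge = root->children[i];
        int prefixLen = commonPrefixLength(key, edge->label, edge->labelLen);

        // Sibling labels start with distinct characters, so at most one edge matches.
        if (prefixLen == 0) continue;

        // Whole label matched: descend and insert the rest of the key.
        if (prefixLen == edge->labelLen) {
            insertString(edge->child, key + prefixLen);
            return;
        }

        // Partial match: split the edge at prefixLen.
        // newNode takes over the tail of the old label and the old child.
        TrieNode *newNode = createNode();
        newNode->children = (Edge **)malloc(sizeof(Edge *));
        Edge *tail = (Edge *)malloc(sizeof(Edge));
        if (!newNode->children || !tail) {
            perror("Memory allocation failed");
            exit(EXIT_FAILURE);
        }
        tail->label = substr(edge->label, prefixLen, edge->labelLen);
        tail->labelLen = edge->labelLen - prefixLen;
        tail->child = edge->child;
        newNode->children[0] = tail;
        newNode->numChildren = 1;

        // The existing edge keeps only the common prefix and now points to newNode.
        char *oldLabel = edge->label;
        edge->label = substr(oldLabel, 0, prefixLen);
        free(oldLabel);
        edge->labelLen = prefixLen;
        edge->child = newNode;

        // Attach the rest of the key below the new middle node.
        insertString(newNode, key + prefixLen);
        return;
    }

    // No edge shares a prefix with the key: add a new edge holding the whole key.
    int keyLen = (int)strlen(key);
    Edge *newEdge = (Edge *)malloc(sizeof(Edge));
    if (!newEdge) {
        perror("Memory allocation failed");
        exit(EXIT_FAILURE);
    }
    newEdge->label = substr(key, 0, keyLen);
    newEdge->labelLen = keyLen;
    newEdge->child = createNode();
    newEdge->child->isEndOfWord = true; // the key ends at the new leaf

    Edge **newChildren = (Edge **)realloc(root->children, sizeof(Edge *) * (root->numChildren + 1));
    if (!newChildren) {
        perror("realloc failed");
        exit(EXIT_FAILURE);
    }
    root->children = newChildren;
    root->children[root->numChildren++] = newEdge;
}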
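The edge split is the least obvious step of insertion, so it may help to see it in isolation. The sketch below is a standalone variant, not the article's code: the function name splitEdge is hypothetical, the struct definitions are repeated so the block compiles on its own, and it shortens the existing label in place instead of allocating a new prefix string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct TrieNode {
    bool isEndOfWord;
    int numChildren;
    struct Edge **children;
} TrieNode;

typedef struct Edge {
    char *label;
    int labelLen;
    TrieNode *child;
} Edge;

/* Split edge at offset prefixLen, where 0 < prefixLen < edge->labelLen.
 *   before: parent --"hello"--> child
 *   after:  parent --"he"--> mid --"llo"--> child
 * Returns the new middle node, or NULL if allocation fails (edge unchanged). */
TrieNode *splitEdge(Edge *edge, int prefixLen) {
    assert(prefixLen > 0 && prefixLen < edge->labelLen);
    int tailLen = edge->labelLen - prefixLen;

    TrieNode *mid = (TrieNode *)malloc(sizeof(TrieNode));
    Edge *tail = (Edge *)malloc(sizeof(Edge));
    Edge **slots = (Edge **)malloc(sizeof(Edge *));
    char *tailLabel = (char *)malloc((size_t)tailLen + 1);
    if (!mid || !tail || !slots || !tailLabel) {
        free(mid);
        free(tail);
        free(slots);
        free(tailLabel);
        return NULL;
    }

    /* The tail edge carries the rest of the label and the old child. */
    memcpy(tailLabel, edge->label + prefixLen, (size_t)tailLen);
    tailLabel[tailLen] = '\0';
    tail->label = tailLabel;
    tail->labelLen = tailLen;
    tail->child = edge->child;

    /* The middle node is not a word end yet; the caller decides that. */
    slots[0] = tail;
    mid->isEndOfWord = false;
    mid->numChildren = 1;
    mid->children = slots;

    /* Shorten the original label in place; the buffer stays valid for free(). */
    edge->label[prefixLen] = '\0';
    edge->labelLen = prefixLen;
    edge->child = mid;
    return mid;
}
```

After the split, the caller inserts the remainder of the key under the returned middle node: if nothing remains, the middle node is marked as a word end; otherwise a second edge is added next to the tail. Truncating in place trades a few unused bytes in the old buffer for one fewer allocation.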

4. Searching for a String

bool searchString(TrieNode *root, const char *key) {
    // Key fully consumed: it is present only if a word ends at this node.
    if (*key == '\0') return root->isEndOfWord;

    for (int i = 0; i < root->numChildren; i++) {
        Edge *edge = root->children[i];
        int prefixLen = commonPrefixLength(key, edge->label, edge->labelLen);

        if (prefixLen == 0) continue;

        // Only one edge can share the key's first character, so a partial
        // match on it means the key is not in the trie.
        if (prefixLen != edge->labelLen) return false;
        return searchString(edge->child, key + prefixLen);
    }
    return false;
}
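The recursive search only ever recurses in tail position, so it can also be written as a loop. The following is a sketch of such a variant; the name searchIterative is hypothetical, the struct definitions are repeated so the block compiles on its own, and it relies on the invariant that edge labels are non-empty and that sibling labels start with different characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct TrieNode {
    bool isEndOfWord;
    int numChildren;
    struct Edge **children;
} TrieNode;

typedef struct Edge {
    char *label;
    int labelLen;
    TrieNode *child;
} Edge;

/* Iterative lookup: walk down one edge per loop iteration. */
bool searchIterative(const TrieNode *node, const char *key) {
    assert(node != NULL && key != NULL);
    while (*key != '\0') {
        const Edge *next = NULL;
        /* Sibling labels start with distinct characters, so the first
         * character alone identifies the only candidate edge. */
        for (int i = 0; i < node->numChildren; i++) {
            if (node->children[i]->label[0] == *key) {
                next = node->children[i];
                break;
            }
        }
        if (!next) return false;

        /* The whole label must match; a key shorter than the label fails
         * here because its '\0' differs from the label character. */
        if (strncmp(key, next->label, (size_t)next->labelLen) != 0) {
            return false;
        }
        key += next->labelLen;
        node = next->child;
    }
    /* Key consumed exactly at a node: present only if a word ends here. */
    return node->isEndOfWord;
}
```

On the compressed trie for "he", "hello" and "hi", this returns true for those three keys and false for "h" and "hel", which stop at an unmarked node or in the middle of the "llo" edge. The loop avoids recursion depth growing with the number of edges on the path.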

5. Releasing Memory

void freeTrie(TrieNode *node) {
    if (!node) return;
    for (int i = 0; i < node->numChildren; i++) {
        free(node->children[i]->label);
        freeTrie(node->children[i]->child);
        free(node->children[i]);
    }
    free(node->children);
    free(node);
}