Inverted Index vs. Forward Index: Differences and Implementation

Original article, published 2025-10-27 11:29:31 · 678 views


License: CC 4.0 BY-SA

Tags: #programming-languages #c++


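I. Concepts

1. Forward Index

Definition:
A forward index is document-centric: for each document it records which terms the document contains, optionally with each term's positions in that document.

Example structure:

Doc ID	Content	Term list
1	我爱自然语言处理 ("I love natural language processing")	[我, 爱, 自然, 语言, 处理]
2	语言处理很有趣 ("Language processing is fun")	[语言, 处理, 很, 有趣]

Use cases:
Search engines typically use a forward index to store documents, display results, and sort or rank them.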

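2. Inverted Index

Definition:
An inverted index is term-centric: for each term it records which documents contain the term, optionally with the term's positions in each document.

Example structure:

Term	Document list
我	[1]
爱	[1]
自然	[1]
语言	[1, 2]
处理	[1, 2]
很	[2]
有趣	[2]

Use cases:
The inverted index is the core data structure of search engines and information retrieval systems: it makes it efficient to find every document that contains a given term.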

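II. How They Are Implemented

1. Building a forward index

Steps:

Tokenize: segment the content of each document into a list of terms.
Build the index: associate each document ID with its term list, typically in a structure like this:
```
{
  doc_id: [term1, term2, ...]
}
```
Store: the index can live in a database, a key-value store, or plain files.

Extensions:
Each term can carry extra data such as its positions and term frequency, for use in later relevance scoring.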
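
The extension described above can be sketched in C++ roughly as follows. This is a minimal sketch, assuming whitespace-separated input; the names `TermStats`, `ForwardDoc`, and `build_forward_doc` are illustrative and do not come from the original article.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-term record: where one term occurs inside one document.
struct TermStats {
    std::vector<std::size_t> positions;  // 0-based token offsets, in ascending order

    // The term frequency is simply the number of recorded positions.
    std::size_t frequency() const { return positions.size(); }
};

// Hypothetical forward-index entry for a single document: term -> stats.
using ForwardDoc = std::unordered_map<std::string, TermStats>;

// Builds the entry for one document, assuming tokens are separated by whitespace.
ForwardDoc build_forward_doc(const std::string& text) {
    ForwardDoc entry;
    std::istringstream iss(text);
    std::string token;
    std::size_t pos = 0;
    while (iss >> token) {
        entry[token].positions.push_back(pos++);
    }
    return entry;
}
```

Storing positions makes a separate frequency counter unnecessary, at the cost of more memory per term; a real system might keep only the frequency when phrase queries are not needed.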

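2. Building an inverted index

Steps:

Tokenize: segment every document to obtain (term, document) pairs.
Build posting lists: for each term, create a list of the documents in which it occurs.
```
{
  term: [doc_id1, doc_id2, ...]
}
```
Optimize:
- Each document ID can be accompanied by the term frequency, positions, and similar data.
- Posting lists can be compressed to save space (for example with delta encoding plus variable-byte codes, or with bitmaps), and skip pointers can be added to speed up list intersection.
Store: the term dictionary is usually kept in an efficient lookup structure such as a hash table or B-tree, with the posting lists stored in an inverted file.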
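
One common way to compress a posting list is to store the gaps between consecutive document IDs and write each gap with a variable-byte code. The sketch below illustrates that idea under the assumption that document IDs are unsigned 32-bit integers in strictly increasing order; `encode_postings` and `decode_postings` are hypothetical names, and this is not presented as the scheme any particular engine uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodes a strictly increasing posting list as variable-byte-coded gaps.
// Each output byte carries 7 payload bits (least significant group first);
// the high bit is set when more bytes of the same gap follow.
std::vector<std::uint8_t> encode_postings(const std::vector<std::uint32_t>& doc_ids) {
    std::vector<std::uint8_t> out;
    std::uint32_t prev = 0;
    for (std::size_t i = 0; i < doc_ids.size(); ++i) {
        assert(i == 0 || doc_ids[i] > prev);  // input must be sorted, no duplicates
        std::uint32_t gap = doc_ids[i] - prev;
        prev = doc_ids[i];
        while (gap >= 0x80) {
            out.push_back(static_cast<std::uint8_t>((gap & 0x7F) | 0x80));
            gap >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(gap));
    }
    return out;
}

// Reverses encode_postings: rebuilds each gap, then sums gaps into document IDs.
std::vector<std::uint32_t> decode_postings(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> doc_ids;
    std::uint32_t prev = 0;
    std::uint32_t gap = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        gap |= static_cast<std::uint32_t>(b & 0x7F) << shift;
        if (b & 0x80) {
            shift += 7;          // continuation byte: keep accumulating
        } else {
            prev += gap;         // last byte of this gap
            doc_ids.push_back(prev);
            gap = 0;
            shift = 0;
        }
    }
    return doc_ids;
}
```

Because gaps between neighbouring IDs in a dense posting list are small, most of them fit in a single byte instead of four.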

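III. Summary of Differences

Aspect	Forward index	Inverted index
Mapping	Document -> terms	Term -> documents
Query efficiency	Good for looking up a document's content	Good for finding all documents that contain a term
Use cases	Document display, sorting, aggregation	Search queries, information retrieval
Storage layout	Keyed by document ID; value is the term list	Keyed by term; value is the list of document IDs
Construction	Tokenize each document and store its terms	Group by term and store each term's document list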
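
Since the two indexes hold the same information in opposite directions, an inverted index can be derived from a forward index. The following is a minimal sketch of that inversion; the type aliases and the `invert` function are illustrative names, and `std::map` is an assumption made here only so that documents are visited in ascending ID order.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using ForwardIndex = std::map<int, std::vector<std::string>>;   // doc ID -> terms
using InvertedIndex = std::map<std::string, std::vector<int>>;  // term -> doc IDs

// Turns a forward index into an inverted index.
InvertedIndex invert(const ForwardIndex& forward) {
    InvertedIndex inverted;
    for (const auto& entry : forward) {
        const int doc_id = entry.first;
        for (const auto& term : entry.second) {
            std::vector<int>& postings = inverted[term];
            // Documents are visited in ascending ID order, so a repeated term
            // in the same document can only collide with the last element.
            if (postings.empty() || postings.back() != doc_id) {
                postings.push_back(doc_id);
            }
        }
    }
    return inverted;
}
```

Each resulting posting list is sorted and free of duplicates, which is the property that list intersection and gap-based compression rely on.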

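IV. Worked Example

Suppose there are two documents:

Document 1: "我喜欢学习" ("I like studying")
Document 2: "学习使人进步" ("Studying helps people improve")

Forward index:

Doc ID	Term list
1	[我, 喜欢, 学习]
2	[学习, 使, 人, 进步]

Inverted index:

Term	Document list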
我	[1]
喜欢	[1]
学习	[1, 2]
使	[2]
人	[2]
进步	[2]

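V. Code Sketch (Python)

Forward index

# Sample documents, pre-segmented with spaces
docs = ["我 喜欢 学习", "学习 使 人 进步"]

def tokenize(text):
    # Whitespace splitting stands in for a real word segmenter
    return text.split()

forward_index = {}
for doc_id, content in enumerate(docs, start=1):  # IDs start at 1, as in the tables above
    forward_index[doc_id] = tokenize(content)

Inverted index

inverted_index = {}
for doc_id, content in enumerate(docs, start=1):
    for token in tokenize(content):
        postings = inverted_index.setdefault(token, [])
        # Record each document once, even if the token repeats in it
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)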

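VI. Summary

Forward index: a mapping from documents to terms, convenient for managing document content.
Inverted index: a mapping from terms to documents, which makes it fast to find the documents containing a term; it is the core of a search engine.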

VII. Full C++ Implementation

1. Setup

Assume the document collection is:

std::vector<std::string> docs = {
    "我 爱 自然语言处理",
    "语言处理 很 有趣",
    "自然语言 是 人工智能 的基础"
};

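The documents are pre-segmented, so we simply split on spaces; a real application would use a proper word segmentation algorithm.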

2. Tokenizer function

#include <vector>
#include <string>
#include <sstream>

std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream iss(text);
    std::string token;
    while (iss >> token) {
        tokens.push_back(token);
    }
    return tokens;
}
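
When the text is not pre-segmented, whitespace splitting yields a single token per document. One dictionary-free fallback sometimes used for Chinese text is to index overlapping character bigrams. The sketch below assumes well-formed UTF-8 input without whitespace; `utf8_chars` and `bigram_tokenize` are illustrative names that are not part of the original code, and a dedicated segmenter will usually produce better terms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits a UTF-8 string into code points, each returned as its own byte string.
// Assumes well-formed UTF-8; an unexpected lead byte is treated as a 1-byte unit.
std::vector<std::string> utf8_chars(const std::string& text) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        if ((lead & 0xE0) == 0xC0) len = 2;        // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;   // 1110xxxx (most CJK characters)
        else if ((lead & 0xF8) == 0xF0) len = 4;   // 11110xxx
        if (i + len > text.size()) len = text.size() - i;  // truncated tail
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

// Emits overlapping character bigrams; a single-character input yields itself.
std::vector<std::string> bigram_tokenize(const std::string& text) {
    const std::vector<std::string> chars = utf8_chars(text);
    std::vector<std::string> tokens;
    if (chars.size() == 1) tokens.push_back(chars[0]);
    for (std::size_t i = 0; i + 1 < chars.size(); ++i) {
        tokens.push_back(chars[i] + chars[i + 1]);
    }
    return tokens;
}
```

For the four-character input "自然语言" this produces the three bigrams "自然", "然语", and "语言". Bigram indexing trades a larger index and some false matches for not needing a dictionary.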

3. Forward index implementation

Forward index structure: std::unordered_map<int, std::vector<std::string>>
(doc ID -> term list)

#include <unordered_map>
#include <iostream>

void build_forward_index(const std::vector<std::string>& docs,
                         std::unordered_map<int, std::vector<std::string>>& forward_index) {
    for (size_t doc_id = 0; doc_id < docs.size(); ++doc_id) {
        forward_index[doc_id] = tokenize(docs[doc_id]);
    }
}

// Print the terms of one document
void print_doc_tokens(int doc_id, const std::unordered_map<int, std::vector<std::string>>& forward_index) {
    auto it = forward_index.find(doc_id);
    if (it != forward_index.end()) {
        std::cout << "Doc " << doc_id << ": ";
        for (const auto& token : it->second) {
            std::cout << token << " ";
        }
        std::cout << std::endl;
    } else {
        std::cout << "Doc not found." << std::endl;
    }
}

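4. Inverted index implementation

Inverted index structure: std::unordered_map<std::string, std::vector<int>>
(term -> list of doc IDs)

void build_inverted_index(const std::vector<std::string>& docs,
                          std::unordered_map<std::string, std::vector<int>>& inverted_index) {
    for (size_t doc_id = 0; doc_id < docs.size(); ++doc_id) {
        const int id = static_cast<int>(doc_id);
        for (const auto& token : tokenize(docs[doc_id])) {
            auto& postings = inverted_index[token];
            // Record each document once, even if the token repeats in it
            if (postings.empty() || postings.back() != id) {
                postings.push_back(id);
            }
        }
    }
}

// Print the documents that contain a term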
void print_docs_for_token(const std::string& token, const std::unordered_map<std::string, std::vector<int>>& inverted_index) {
    auto it = inverted_index.find(token);
    if (it != inverted_index.end()) {
        std::cout << "Token \"" << token << "\" in docs: ";
        for (int doc_id : it->second) {
            std::cout << doc_id << " ";
        }
        std::cout << std::endl;
    } else {
        std::cout << "Token not found." << std::endl;
    }
}
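
A single-term lookup extends naturally to multi-term AND queries by intersecting posting lists. The sketch below assumes every posting list is sorted in ascending order, which holds when documents are indexed in increasing ID order as in build_inverted_index; `intersect_postings` and `search_all` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Returns the doc IDs present in both lists using a linear two-pointer merge.
// Both inputs must be sorted in ascending order.
std::vector<int> intersect_postings(const std::vector<int>& a, const std::vector<int>& b) {
    assert(std::is_sorted(a.begin(), a.end()) && std::is_sorted(b.begin(), b.end()));
    std::vector<int> result;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            result.push_back(a[i]);
            ++i;
            ++j;
        }
    }
    return result;
}

// AND query: returns the documents that contain every term in `terms`.
std::vector<int> search_all(
    const std::vector<std::string>& terms,
    const std::unordered_map<std::string, std::vector<int>>& inverted_index) {
    std::vector<int> result;
    for (std::size_t k = 0; k < terms.size(); ++k) {
        auto it = inverted_index.find(terms[k]);
        if (it == inverted_index.end()) return {};  // an unknown term matches nothing
        if (k == 0) {
            result = it->second;
        } else {
            result = intersect_postings(result, it->second);
        }
    }
    return result;
}
```

The merge runs in time proportional to the combined length of the two lists; processing the rarest term first is a common refinement that keeps intermediate results small.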

5. Complete runnable example

#include <iostream>
#include <vector>
#include <string>
#include <unordered_map>
#include <sstream>

// Tokenizer
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream iss(text);
    std::string token;
    while (iss >> token) {
        tokens.push_back(token);
    }
    return tokens;
}

// Forward index
void build_forward_index(const std::vector<std::string>& docs,
                         std::unordered_map<int, std::vector<std::string>>& forward_index) {
    for (size_t doc_id = 0; doc_id < docs.size(); ++doc_id) {
        forward_index[doc_id] = tokenize(docs[doc_id]);
    }
}

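// Inverted index
void build_inverted_index(const std::vector<std::string>& docs,
                          std::unordered_map<std::string, std::vector<int>>& inverted_index) {
    for (size_t doc_id = 0; doc_id < docs.size(); ++doc_id) {
        const int id = static_cast<int>(doc_id);
        for (const auto& token : tokenize(docs[doc_id])) {
            auto& postings = inverted_index[token];
            // Record each document once, even if the token repeats in it
            if (postings.empty() || postings.back() != id) {
                postings.push_back(id);
            }
        }
    }
}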

// Look up the forward index
void print_doc_tokens(int doc_id, const std::unordered_map<int, std::vector<std::string>>& forward_index) {
    auto it = forward_index.find(doc_id);
    if (it != forward_index.end()) {
        std::cout << "Doc " << doc_id << ": ";
        for (const auto& token : it->second) {
            std::cout << token << " ";
        }
        std::cout << std::endl;
    } else {
        std::cout << "Doc not found." << std::endl;
    }
}

// Look up the inverted index
void print_docs_for_token(const std::string& token, const std::unordered_map<std::string, std::vector<int>>& inverted_index) {
    auto it = inverted_index.find(token);
    if (it != inverted_index.end()) {
        std::cout << "Token \"" << token << "\" in docs: ";
        for (int doc_id : it->second) {
            std::cout << doc_id << " ";
        }
        std::cout << std::endl;
    } else {
        std::cout << "Token not found." << std::endl;
    }
}

int main() {
    std::vector<std::string> docs = {
        "我 爱 自然语言处理",
        "语言处理 很 有趣",
        "自然语言 是 人工智能 的基础"
    };

    // Build the forward index
    std::unordered_map<int, std::vector<std::string>> forward_index;
    build_forward_index(docs, forward_index);

    // Build the inverted index
    std::unordered_map<std::string, std::vector<int>> inverted_index;
    build_inverted_index(docs, inverted_index);

    // Queries
    print_doc_tokens(0, forward_index); // terms of document 0
    print_docs_for_token("语言处理", inverted_index); // documents containing "语言处理"
    print_docs_for_token("自然语言", inverted_index); // documents containing "自然语言"
    print_docs_for_token("人工智能", inverted_index); // documents containing "人工智能"

    return 0;
}

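6. Extensions

- Term frequency: the inverted index can also store term frequencies or positions, for example
  std::unordered_map<std::string, std::unordered_map<int, int>>
  (term -> doc ID -> term frequency).
- Better tokenization: real applications need a dedicated word segmenter such as jieba or THULAC.
- Index compression and optimization: bitmaps or delta-encoded posting lists reduce storage, and skip pointers speed up posting-list intersection at query time.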

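7. Inverted index extension: term frequency

Besides the "term -> list of doc IDs" mapping, an inverted index can record how many times each term occurs in each document (its term frequency).

The data structure can be defined as:

// term -> (doc ID -> term frequency)
std::unordered_map<std::string, std::unordered_map<int, int>> inverted_index_with_freq;

Building the inverted index with term frequencies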


void build_inverted_index_with_freq(
    const std::vector<std::string>& docs,
    std::unordered_map<std::string, std::unordered_map<int, int>>& inverted_index_with_freq) {
    for (size_t doc_id = 0; doc_id < docs.size(); ++doc_id) {
        auto tokens = tokenize(docs[doc_id]);
        for (const auto& token : tokens) {
            inverted_index_with_freq[token][doc_id] += 1;
        }
    }
}

Printing a term's frequency in each document

void print_token_freq(const std::string& token,
    const std::unordered_map<std::string, std::unordered_map<int, int>>& inverted_index_with_freq) {
    auto it = inverted_index_with_freq.find(token);
    if (it != inverted_index_with_freq.end()) {
        std::cout << "Token \"" << token << "\" frequency in docs:\n";
        for (const auto& pair : it->second) {
            std::cout << "  doc " << pair.first << ": " << pair.second << "\n";
        }
    } else {
        std::cout << "Token not found." << std::endl;
    }
}
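
Term frequencies are usually consumed by a relevance score. As an illustration, the sketch below ranks documents for a single term with one common TF-IDF variant: raw term frequency multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf_scores`, the `FreqIndex` alias, and the `total_docs` parameter are assumptions made for this example; other weighting formulas are equally valid.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>

// Same layout as inverted_index_with_freq: term -> (doc ID -> term frequency).
using FreqIndex = std::unordered_map<std::string, std::unordered_map<int, int>>;

// Returns doc ID -> score for one query term, using tf * ln(N / df).
// Returns an empty map when the term is unknown or the collection is empty.
std::unordered_map<int, double> tf_idf_scores(const std::string& term,
                                              const FreqIndex& index,
                                              std::size_t total_docs) {
    std::unordered_map<int, double> scores;
    auto it = index.find(term);
    if (it == index.end() || it->second.empty() || total_docs == 0) {
        return scores;
    }
    const double df = static_cast<double>(it->second.size());
    const double idf = std::log(static_cast<double>(total_docs) / df);
    for (const auto& doc_freq : it->second) {
        scores[doc_freq.first] = doc_freq.second * idf;
    }
    return scores;
}
```

With this formula a term that occurs in every document gets an idf of zero, so it contributes nothing to the ranking; smoothed variants avoid that hard cutoff.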

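8. Phrase search

Phrase search, for example for "自然语言处理" ("natural language processing"), needs the position of every term in each document, so that the query can check whether the terms occur consecutively.

Data structure

// term -> (doc ID -> [positions])
std::unordered_map<std::string, std::unordered_map<int, std::vector<int>>> inverted_index_with_pos;

Building the inverted index with positions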

void build_inverted_index_with_pos(
    const std::vector<std::string>& docs,
    std::unordered_map<std::string, std::unordered_map<int, std::vector<int>>>& inverted_index_with_pos) {
    for (size_t doc_id = 0; doc_id < docs.size(); ++doc_id) {
        auto tokens = tokenize(docs[doc_id]);
        for (size_t pos = 0; pos < tokens.size(); ++pos) {
            inverted_index_with_pos[tokens[pos]][doc_id].push_back(pos);
        }
    }
}

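Phrase search (simple version)

This simple version does not consult inverted_index_with_pos; it re-tokenizes every document and scans it directly.

// Returns true if the document contains the phrase (tokens in order, consecutively)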
bool phrase_in_doc(const std::vector<std::string>& tokens,
                   const std::vector<std::string>& doc_tokens) {
    if (tokens.size() > doc_tokens.size()) return false;
    for (size_t i = 0; i <= doc_tokens.size() - tokens.size(); ++i) {
        bool match = true;
        for (size_t j = 0; j < tokens.size(); ++j) {
            if (doc_tokens[i + j] != tokens[j]) {
                match = false;
                break;
            }
        }
        if (match) return true;
    }
    return false;
}

// Print every document that contains the phrase
void search_phrase(const std::string& phrase,
                   const std::vector<std::string>& docs) {
    auto tokens = tokenize(phrase);
    std::cout << "Docs containing phrase \"" << phrase << "\": ";
    for (size_t doc_id = 0; doc_id < docs.size(); ++doc_id) {
        auto doc_tokens = tokenize(docs[doc_id]);
        if (phrase_in_doc(tokens, doc_tokens)) {
            std::cout << doc_id << " ";
        }
    }
    std::cout << std::endl;
}
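
The scan above touches every document. A phrase query can instead be answered from a positional index with the same layout as inverted_index_with_pos: a document matches if the first term occurs at some position p and the k-th term occurs at p + k. The sketch below assumes each position list is sorted in ascending order, which is how build_inverted_index_with_pos fills it; `PositionalIndex` and `phrase_search_positional` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// term -> (doc ID -> sorted positions), as in inverted_index_with_pos.
using DocPositions = std::unordered_map<int, std::vector<int>>;
using PositionalIndex = std::unordered_map<std::string, DocPositions>;

// Returns the sorted IDs of documents in which `terms` occur consecutively.
std::vector<int> phrase_search_positional(const std::vector<std::string>& terms,
                                          const PositionalIndex& index) {
    std::vector<int> hits;
    if (terms.empty()) return hits;

    // Fetch the postings of every term; a missing term means no match at all.
    std::vector<const DocPositions*> postings;
    for (const auto& term : terms) {
        auto it = index.find(term);
        if (it == index.end()) return hits;
        postings.push_back(&it->second);
    }

    // Candidates are the documents that contain the first term.
    for (const auto& first : *postings[0]) {
        const int doc_id = first.first;
        for (int start : first.second) {
            bool match = true;
            for (std::size_t k = 1; k < terms.size() && match; ++k) {
                auto doc_it = postings[k]->find(doc_id);
                match = doc_it != postings[k]->end() &&
                        std::binary_search(doc_it->second.begin(), doc_it->second.end(),
                                           start + static_cast<int>(k));
            }
            if (match) {
                hits.push_back(doc_id);
                break;  // one occurrence is enough for this document
            }
        }
    }
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

Only documents that contain the first term are examined, so the cost depends on posting-list sizes rather than on the size of the whole collection.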

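9. Persisting the index (writing to and reading from a file)

Using the term-frequency inverted index as the example, saving and loading can be implemented with a simple text format: one line per term, holding the term followed by space-separated doc_id:frequency pairs.

Writing the inverted index to a file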

#include <fstream>

void save_inverted_index_with_freq(
    const std::unordered_map<std::string, std::unordered_map<int, int>>& inverted_index_with_freq,
    const std::string& filename) {
    std::ofstream ofs(filename);
    for (const auto& token_pair : inverted_index_with_freq) {
        ofs << token_pair.first;
        for (const auto& doc_pair : token_pair.second) {
            ofs << " " << doc_pair.first << ":" << doc_pair.second;
        }
        ofs << "\n";
    }
    ofs.close();
}

Reading the inverted index from a file

void load_inverted_index_with_freq(
    std::unordered_map<std::string, std::unordered_map<int, int>>& inverted_index_with_freq,
    const std::string& filename) {
    std::ifstream ifs(filename);
    std::string line;
    while (std::getline(ifs, line)) {
        std::istringstream iss(line);
        std::string token;
        iss >> token;
        std::string doc_freq;
        while (iss >> doc_freq) {
            auto pos = doc_freq.find(':');
            if (pos == std::string::npos) continue;  // skip entries without a ':' separator
            int doc_id = std::stoi(doc_freq.substr(0, pos));
            int freq = std::stoi(doc_freq.substr(pos + 1));
            inverted_index_with_freq[token][doc_id] = freq;
        }
    }
    ifs.close();
}

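10. Putting it together

The main function below exercises the term-frequency index, phrase search, and persistence from the sections above: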

int main() {
    std::vector<std::string> docs = {
        "我 爱 自然语言处理",
        "语言处理 很 有趣",
        "自然语言 是 人工智能 的基础",
        "我 爱 人工智能"
    };

    // 1. Term-frequency inverted index
    std::unordered_map<std::string, std::unordered_map<int, int>> inverted_index_with_freq;
    build_inverted_index_with_freq(docs, inverted_index_with_freq);
    print_token_freq("人工智能", inverted_index_with_freq);

    // 2. Phrase search
    search_phrase("我 爱", docs);          // matches docs 0 and 3
    search_phrase("自然语言 处理", docs);  // no match: doc 0 holds the single token "自然语言处理"
    search_phrase("自然语言 是", docs);    // matches doc 2

    // 3. Persistence
    save_inverted_index_with_freq(inverted_index_with_freq, "index.txt");
    std::unordered_map<std::string, std::unordered_map<int, int>> loaded_index;
    load_inverted_index_with_freq(loaded_index, "index.txt");
    print_token_freq("人工智能", loaded_index);

    return 0;
}