[Hands-On NLP] Dissecting TextRank (02): How the Experts Implemented It in C++

Author: LogM

This article was originally published at https://segmentfault.com/u/logm/articles . Reposting is not permitted.

1. Where the source code comes from

comoody's source code: https://github.com/comoody/TextRank.git

Version covered in this article: Commits on Oct 23, 2018, 9736be10593b99adc1ea614c5d83f1bfeca17b94

lostfish's source code: https://github.com/lostfish/textrank.git

Version covered in this article: Commits on Sep 29, 2016, e89084374ae0e08490c9cc0fa79f8ae4bb10ad5b

TextRank paper: https://www.aclweb.org/anthology/W04-3252

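2. Overview

I have not found a C++ implementation of TextRank with a really large number of stars, so I picked two different implementations to analyze here.

As the previous post on the Python version of TextRank showed, there are two key points to look at when reading TextRank source code. Key point 1: how the similarity between two sentences is computed. Key point 2: how PageRank is implemented.

To keep this post short, I will point directly to where the corresponding functions live.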
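Key point 2 can be made concrete with a minimal sketch of the weighted PageRank iteration described in the TextRank paper. This is not code from either repository: the function name `rankSentences`, the dense `weights` matrix, and the defaults for the iteration limit and tolerance are assumptions made for illustration. The damping factor 0.85 is the value used in the paper.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from comoody's or lostfish's code.
// weights[j][i] is the similarity between sentence j and sentence i
// (assumed symmetric with a zero diagonal). Returns one score per sentence.
std::vector<double> rankSentences(const std::vector<std::vector<double>>& weights,
                                  double damping = 0.85,
                                  int maxIterations = 100,    // assumed default
                                  double tolerance = 1e-6)    // assumed default
{
    const std::size_t n = weights.size();

    // Total edge weight leaving each sentence; used to normalize its votes.
    std::vector<double> outSum(n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t k = 0; k < n; ++k)
            outSum[j] += weights[j][k];

    std::vector<double> scores(n, 1.0);
    for (int iter = 0; iter < maxIterations; ++iter) {
        // WS(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(V_j)
        std::vector<double> next(n, 1.0 - damping);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (outSum[j] > 0.0)
                    next[i] += damping * weights[j][i] / outSum[j] * scores[j];

        // Stop once the scores barely move between two iterations.
        double change = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            change += std::fabs(next[i] - scores[i]);
        scores = next;
        if (change < tolerance)
            break;
    }
    return scores;
}
```

Calling `rankSentences` on a 3x3 matrix in which sentence 0 is similar to both other sentences, while those two share nothing with each other, should give sentence 0 the highest score. A real implementation would likely use a sparse graph rather than a dense matrix.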

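3. comoody's source code

Let's first look at how the similarity between two sentences is computed. Together with the few comments I added, the code should be easy to follow. It broadly matches the formula in the paper, except that the denominator differs from the paper's. I do not know why this particular denominator was chosen.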
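For comparison, the paper defines the similarity of two sentences as the number of words they share divided by log(|S_i|) + log(|S_j|), where |S| is the number of words in a sentence. The sketch below is one possible reading of that formula, not code from the repository: the names `splitWords` and `paperSimilarity` are hypothetical, tokenization is plain whitespace splitting with no lowercasing, and shared words are counted once each.

```cpp
#include <cmath>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a sentence on whitespace.
std::vector<std::string> splitWords(const std::string& sentence)
{
    std::istringstream stream(sentence);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word)
        words.push_back(word);
    return words;
}

// Hypothetical sketch of the paper's formula:
// shared words / (log|S_i| + log|S_j|).
double paperSimilarity(const std::string& a, const std::string& b)
{
    const std::vector<std::string> aWords = splitWords(a);
    const std::vector<std::string> bWords = splitWords(b);
    if (aWords.empty() || bWords.empty())
        return 0.0;

    // Assumption: each shared word counts once, so compare distinct words.
    const std::set<std::string> aSet(aWords.begin(), aWords.end());
    const std::set<std::string> bSet(bWords.begin(), bWords.end());
    std::size_t shared = 0;
    for (const std::string& word : aSet)
        if (bSet.count(word) > 0)
            ++shared;

    const double denominator = std::log(static_cast<double>(aWords.size()))
                             + std::log(static_cast<double>(bWords.size()));
    // Two one-word sentences give log(1) + log(1) = 0; avoid dividing by zero.
    if (denominator <= 0.0)
        return 0.0;
    return static_cast<double>(shared) / denominator;
}
```

The log terms in the denominator keep long sentences from scoring high simply because they contain more words. The guard for a zero denominator is an addition of this sketch; the paper's formula does not spell out that case.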

// File: src/TextRanker.cpp
// Line: 153
float TextRanker::getSimilarity(std::string a, std::string b) const
{
    // no two equivalent sentences should ever be compared, but this logic is included just in case
    if(a == b)
        return 0.f;

    // convert to lowercase -> split into words -> put the words into a set
    std::transform(a.begin(), a.end(), a.begin(), ::tolower);
    std::vector<std::string> aWords = stringSplit(a, ' ');  
    std::set<std::string> aWordSet;
    for(auto word : aW
