TF-IDF与余弦相似性的应用（三）：自动摘要

最新推荐文章于 2022-10-25 11:12:13 发布

转载最新推荐文章于 2022-10-25 11:12:13 发布 · 322 阅读

python 同时被 3 个专栏收录

22 篇文章

订阅专栏

自然语言处理

16 篇文章

订阅专栏

人工智能方面

10 篇文章

订阅专栏

自动摘要技术利用词频统计，从长篇文章中提炼出关键信息，为读者节省阅读时间。通过计算句子中关键词的频率和分布，自动摘要算法能识别出包含最多信息的句子，形成文章摘要。此技术广泛应用于论文、新闻及搜索引擎等领域。

有时候，很简单的数学方法，就可以完成很复杂的任务。

这个系列的前两部分就是很好的例子。仅仅依靠统计词频，就能找出关键词和相似文章。虽然它们算不上效果最好的方法，但肯定是最简便易行的方法。今天，依然继续这个主题。讨论如何通过词频，对文章进行自动摘要（Automatic summarization）。

如果能从3000字的文章，提炼出150字的摘要，就可以为读者节省大量阅读时间。由人完成的摘要叫"人工摘要"，由机器完成的就叫"自动摘要"。许多网站都需要它，比如论文网站、新闻网站、搜索引擎等等。2007年，美国学者的论文《A Survey on Automatic Text Summarization》（Dipanjan Das, Andre F.T. Martins, 2007）总结了目前的自动摘要算法。其中，很重要的一种就是词频统计。这种方法最早出自1958年的IBM公司科学家H.P. Luhn的论文《The Automatic Creation of Literature Abstracts》。Luhn博士认为，文章的信息都包含在句子中，有些句子包含的信息多，有些句子包含的信息少。"自动摘要"就是要找出那些包含信息最多的句子。句子的信息量用"关键词"来衡量。如果包含的关键词越多，就说明这个句子越重要。Luhn提出用"簇"（cluster）表示关键词的聚集。所谓"簇"就是包含多个关键词的句子片段。

上图就是Luhn原始论文的插图，被框起来的部分就是一个"簇"。只要关键词之间的距离小于"门槛值"，它们就被认为处于同一个簇之中。Luhn建议的门槛值是4或5。也就是说，如果两个关键词之间有5个以上的其他词，就可以把这两个关键词分在两个簇。

下一步，对于每个簇，都计算它的重要性分值。

以前图为例，其中的簇一共有7个词，其中4个是关键词。因此，它的重要性分值等于 ( 4 x 4 ) / 7 = 2.3。

然后，找出包含分值最高的簇的句子（比如5句），把它们合在一起，就构成了这篇文章的自动摘要。具体实现可以参见《Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites》（O'Reilly, 2011）一书的第8章，python代码见github。

类似的算法已经被写成了工具，比如基于Java的Classifier4J库的SimpleSummariser模块、基于C语言的OTS库、以及基于classifier4J的C#实现和python实现。

python实现：

# -*- coding: utf-8 -*-
import sys
import json
import nltk
import numpy

N = 100  # 要考虑的字数
CLUSTER_THRESHOLD = 5  # 要考虑的单词之间的距离
TOP_SENTENCES = 5  # “前n个”摘要的返回句子数


# 从H.P.的“文献摘要自动创作”中获取的方法卢恩

def _score_sentences(sentences, important_words):
    scores = []
    sentence_idx = -1

    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:

        sentence_idx += 1
        word_idx = []

        # For each word in the word list...
        for w in important_words:
            try:
                # 计算句子中任何重要单词出现位置的索引
                word_idx.append(s.index(w))
            except ValueError as e:  # 不是在这个特定的句子
                print(e)

        word_idx.sort()

        # 有些句子可能根本不包含任何重要的单词
        if len(word_idx) == 0:
            continue

        # 使用单词index，使用最大距离阈值计算集群
        # 任何两个连续的单词

        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)

        # 为每个群集打分。 任何给定群集的最高分数是分数
        # for the sentence

        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster \
                    * significant_words_in_cluster / total_words_in_cluster

            if score > max_cluster_score:
                max_cluster_score = score

        scores.append((sentence_idx, score))

    return scores


def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]

    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    top_n_words = [w[0] for w in fdist.items()
                   if w[0] not in nltk.corpus.stopwords.words('english')][:N]

    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summaization Approach 1:
    # Filter out non-significant sentences by using the average score plus a
    # fraction of the std dev as a filter

    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences

    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries

    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])


if __name__ == '__main__':

    # Load in output from blogs_and_nlp__get_feed.py

    BLOG_DATA = sys.argv[1]
    blog_data = json.loads(open(BLOG_DATA).read())

    for post in blog_data:
        post.update(summarize(post['content']))
        print(post['title'])
        print('-' * len(post['title']))
        print('-------------')
        print('Top N Summary')
        print('-------------')
        print(' '.join(post['top_n_summary']))
        print('-------------------')
        print('Mean Scored Summary')
        print('-------------------')
        print(' '.join(post['mean_scored_summary']))

转载自：http://www.ruanyifeng.com/blog/2013/03/automatic_summarization.html