Extracting Text Summaries with Python: An Overview of Automatic Summarization Techniques

License: CC 4.0 BY-SA

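Enormous volumes of information are produced every day, and pulling the key points out of it quickly has become a pressing problem. Text summarization addresses this: it extracts the core information from a document so that readers spend far less time on it. Python is widely used in natural language processing, which makes it a natural choice for the task. This article walks through several mainstream text summarization methods in Python and illustrates them with practical examples.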

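1. Basic Concepts of Text Summarization

Text summarization compresses a long text into a shorter version while preserving the main information and meaning of the original. Depending on how the summary is produced, it falls into one of two categories: extractive summarization and abstractive summarization.

Extractive summarization: builds the summary by identifying and extracting key sentences or phrases from the document. The approach is simple and intuitive, but the extracted sentences may not fully capture the document's overall meaning.
Abstractive summarization: uses natural language generation to produce new sentences that rephrase the document's content rather than copying it verbatim. The approach is more flexible but harder to implement.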

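2. Extractive Summarization Methods

2.1 Word-Frequency-Based Method

The word-frequency approach is one of the simplest text summarization techniques. The basic idea is to count how often each word occurs in the document, score each sentence by the frequencies of the words it contains, and select the highest-scoring sentences as the summary. Commonly used Python libraries include nltk and gensim (note that gensim removed its summarization module in version 4.0, so that route only works with older releases).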

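import heapq
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time setup: the tokenizer models and stop-word list must be downloaded
# first, e.g. nltk.download("punkt") and nltk.download("stopwords").
# Newer NLTK releases may need "punkt_tab" instead of "punkt".

def summarize(text, n):
    stop_words = set(stopwords.words("english"))

    # Count content words only: skip stop words and pure punctuation tokens.
    freq_table = defaultdict(int)
    for word in word_tokenize(text.lower()):
        if word not in stop_words and any(c.isalnum() for c in word):
            freq_table[word] += 1

    # Score each sentence by the summed frequency of the distinct words it
    # contains. Tokenizing the sentence avoids accidental substring matches
    # (for example "art" inside "artificial").
    sentences = sent_tokenize(text)
    scores = {}
    for i, sentence in enumerate(sentences):
        tokens = set(word_tokenize(sentence.lower()))
        scores[i] = sum(freq_table[t] for t in tokens if t in freq_table)

    # Pick the n highest-scoring sentences, then restore document order
    # so the summary reads naturally.
    top = heapq.nlargest(n, scores, key=scores.get)
    return ' '.join(sentences[i] for i in sorted(top))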

text = """
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
"""

print(summarize(text, 2))

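2.2 TF-IDF-Based Method

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to one document relative to a collection of documents: a word scores high when it is frequent in that document but rare across the collection. The scikit-learn library (imported as sklearn) provides convenient tools for computing TF-IDF.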
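For reference, the textbook form of the weight can be written as below. The symbols are notation introduced here, not taken from the article: N is the number of documents in the collection and df(t) is the number of documents that contain term t. Note that scikit-learn's TfidfVectorizer does not use exactly this form by default; it applies a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and then L2-normalizes each document vector.

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
\]
```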

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
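The listing above stops after its imports in the source. As a minimal sketch of one way TF-IDF can drive extractive summarization (an assumption on our part, not the author's implementation), the code below treats each sentence as a small document, vectorizes the sentences with TfidfVectorizer, and ranks them by cosine similarity to the mean TF-IDF vector. The helper names split_sentences and tfidf_summarize, the regex-based sentence splitter, and the centroid scoring rule are all hypothetical choices made for illustration.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def split_sentences(text):
    # Naive splitter (an assumption for this sketch): break after ., ! or ?
    # followed by whitespace. nltk.sent_tokenize would be more robust.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tfidf_summarize(text, n=2):
    sentences = split_sentences(text)
    if len(sentences) <= n:
        return " ".join(sentences)

    # Treat each sentence as a "document" so that IDF down-weights words
    # appearing in most sentences. Dense conversion is fine for short texts.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(sentences).toarray()

    # Score each sentence by cosine similarity to the mean TF-IDF vector.
    centroid = matrix.mean(axis=0, keepdims=True)  # shape: (1, n_terms)
    scores = cosine_similarity(matrix, centroid).ravel()

    # Keep the n best sentences, restored to their original order.
    top = sorted(np.argsort(scores)[::-1][:n])
    return " ".join(sentences[i] for i in top)
```

Calling tfidf_summarize(text, 2) on the sample text from section 2.1 returns two of its sentences in their original order. Centroid similarity favors sentences that share vocabulary with the rest of the document; other scoring rules, such as summing each sentence's TF-IDF weights, are equally plausible.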
