31. Text Data Analysis with NLTK

Published 2025-07-21 11:08:25

License: CC 4.0 BY-SA

Column: Python Data Analysis Practical Guide. Tags: NLTK, text data analysis, natural language processing

Article link: https://blog.youkuaiyun.com/read5/article/details/149586757

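1. Extracting and Structuring Text

When working with text, we can use the NLTK (Natural Language Toolkit) library to extract and structure text data. Taking Shakespeare's Macbeth as an example, the nltk.corpus.gutenberg.sents() function returns the text as a structured list of sentences.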

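import nltk

nltk.download('gutenberg')  # fetch the Gutenberg corpus on first use

# sents() returns the text as a list of sentences, each a list of tokens
macbeth_sents = nltk.corpus.gutenberg.sents('shakespeare-macbeth.txt')
print(macbeth_sents[:5])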

The output is:

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ['Scoena', 'Prima', '.'], ['Thunder', 'and', 'Lightning', '.'], ['Enter', 'three', 'Witches', '.']]

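The result shows the structured form of the text: each sentence is one element of the outer list, and each sentence is itself a list of word and punctuation tokens.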

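2. Searching for Words in an NLTK Corpus

Once we have an NLTK corpus (that is, a list of words extracted from a text), one of the most basic operations is searching it. NLTK offers several ways to search for words.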

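2.1 Searching for a Word with `concordance()`

The concordance() method finds every occurrence in the corpus of the word passed as its argument and displays each one with its surrounding context.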

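# From here on the examples use the flat list of tokens (words and punctuation)
macbeth = nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
text = nltk.Text(macbeth)
text.concordance('Stage')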

The output lists every match for the word, for example:

Displaying 3 of 3 matches:
nts with Dishes and Seruice ouer the Stage . Then enter Macbeth Macb . If it we
with mans Act , Threatens his bloody Stage : byth ' Clock ' tis Day , And yet d
struts and frets his houre vpon the Stage , And then is heard no more . It is

Note that the first call may take a few seconds to return, because NLTK builds an index of the corpus; later calls are faster.
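
To make the idea behind a concordance concrete, here is a minimal pure-Python sketch of a keyword-in-context search. It is not NLTK's implementation: the function name `kwic` and its `window` parameter are made up for illustration, and the token list is a short excerpt from the output above.

```python
def kwic(tokens, keyword, window=3):
    """Return (left, match, right) for each occurrence of keyword.

    Hypothetical helper: matching is case-insensitive, and `window`
    is the number of tokens kept on each side of the match.
    """
    hits = []
    target = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((' '.join(left), tok, ' '.join(right)))
    return hits


tokens = ['struts', 'and', 'frets', 'his', 'houre', 'vpon', 'the', 'Stage',
          ',', 'And', 'then', 'is', 'heard', 'no', 'more', '.']

for left, match, right in kwic(tokens, 'stage'):
    print(f'{left:>20} [{match}] {right}')
```

Scanning the whole token list on every query is fine for a single play; an index from each word to its positions, built once, is what makes repeated lookups fast.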

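2.2 Finding a Word's Contexts with `common_contexts()`

The common_contexts() method lists the contexts in which the search word occurs, where a context is the word immediately before it and the word immediately after it.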

text.common_contexts(['Stage'])

The output might look like this:

the_ bloody_: the_,
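
As a rough illustration of what a "context" is, the sketch below collects the (previous, next) token pairs around a word in plain Python. The helper name `contexts` and the short token list are assumptions made for this example; NLTK's own bookkeeping is more involved.

```python
from collections import Counter


def contexts(tokens, word):
    """Count the (previous, next) token pairs around each occurrence of word.

    Hypothetical helper: occurrences at the very start or end of the
    token list have no complete context and are skipped.
    """
    target = word.lower()
    lowered = [t.lower() for t in tokens]
    found = Counter()
    for i in range(1, len(lowered) - 1):
        if lowered[i] == target:
            found[(lowered[i - 1], lowered[i + 1])] += 1
    return found


tokens = ['ouer', 'the', 'Stage', '.', 'his', 'bloody', 'Stage', ':',
          'vpon', 'the', 'Stage', ',', 'And']

for (before, after), count in contexts(tokens, 'Stage').items():
    print(f'{before}_{after}', count)
```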

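2.3 Searching for Synonyms with `similar()`

Building on the idea of word context, we can assume that words sharing the same contexts may be synonyms. The similar() method searches for all words that occur in the same contexts as the search word. Shared context is only a heuristic, so the results are candidates rather than guaranteed synonyms.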

text.similar('Stage')

The output might look like this:

fogge ayre bleeding reuolt good shew heeles skie other sea feare
consequence heart braine seruice herbenger lady round deed doore

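3. Analyzing Word Frequency

Analyzing how often words occur is one of the simplest and most basic examples of text analysis. NLTK provides the FreqDist class to compute the frequency distribution of words.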

fd = nltk.FreqDist(macbeth)
print(fd.most_common(10))

The output might look like this:

[(',', 1962), ('.', 1235), ("'", 637), ('the', 531), (':', 477), ('and', 376), ('I', 333), ('of', 315), ('to', 311), ('?', 241)]

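The results show that the most common items are punctuation marks, prepositions, and articles. Very frequent words of this kind usually carry little meaning for text analysis and are known as stopwords.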
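
Beyond most_common(), a FreqDist behaves like a counter with a few extra methods. The sketch below runs on a small made-up token list, so it needs no corpus download.

```python
import nltk

# Hypothetical toy tokens, used only to show the FreqDist interface
tokens = ['the', 'thane', 'of', 'cawdor', ',', 'the', 'king', ',', 'the', 'queen']
fd = nltk.FreqDist(tokens)

print(fd.N())             # total number of tokens counted
print(fd['the'])          # absolute count of one token
print(fd.freq('the'))     # relative frequency: count divided by total
print(fd.most_common(2))  # the two most frequent tokens with their counts
print(fd.hapaxes())       # tokens that occur exactly once
```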

3.1 Removing Stopwords

NLTK ships a predefined list of stopwords that we can use to filter the text.

nltk.download('stopwords')
sw = set(nltk.corpus.stopwords.words('english'))
macbeth_filtered = [w for w in macbeth if w.lower() not in sw]
fd = nltk.FreqDist(macbeth_filtered)
print(fd.most_common(10))

The output might look like this:

[(',', 1962), ('.', 1235), ("'", 637), (':', 477), ('?', 241), ('Macb', 137), ('haue', 117), ('-', 100), ('Enter', 80), ('thou', 63)]

The stopwords are gone, but punctuation marks remain in the result. We can go a step further and remove them as well.

3.2 Removing Punctuation

import string
punctuation = set(string.punctuation)
macbeth_filtered2 = [w.lower() for w in macbeth if w.lower() not in sw and w.lower() not in punctuation]
fd = nltk.FreqDist(macbeth_filtered2)
print(fd.most_common(10))

The output might look like this:

[('macb', 137), ('haue', 122), ('thou', 90), ('enter', 81), ('shall', 68), ('macbeth', 62), ('vpon', 62), ('thee', 61), ('macd', 58), ('vs', 57)]
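
One detail worth knowing: the filter above compares whole tokens against the single characters in string.punctuation, so a token made of several punctuation marks would pass through. The sketch below contrasts that test with a stricter one that keeps a token only if it contains a letter or digit. The token list is made up for illustration.

```python
import string

# Hypothetical tokens, including multi-character punctuation
tokens = ['Enter', 'Macbeth', '.', '--', 'haue', "'", '...', 'vpon']

single_chars = set(string.punctuation)

# Whole-token membership test: only single-character punctuation is removed
by_membership = [t for t in tokens if t not in single_chars]

# Content test: keep a token only if it has at least one letter or digit
by_content = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(by_membership)
print(by_content)
```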

4. Selecting Words from a Text

We can select words from a text based on specific characteristics. Here are two examples:

4.1 Selecting Words by Length

long_words = [w for w in macbeth if len(w) > 12]
print(sorted(long_words))

The output might look like this:

['Assassination', 'Chamberlaines', 'Distinguishes', 'Gallowgrosses', 'Metaphysicall', 'Northumberland', 'Voluptuousnesse', 'commendations', 'multitudinous', 'supernaturall', 'vnaccompanied']

4.2 Selecting Words by Character Sequence

ious_words = [w for w in macbeth if 'ious' in w]
ious_words = set(ious_words)
print(sorted(ious_words))

The output might look like this:

['Auaricious', 'Gracious', 'Industrious', 'Iudicious', 'Luxurious', 'Malicious', 'Obliuious', 'Pious', 'Rebellious', 'compunctious', 'furious', 'gracious', 'pernicious', 'pernitious', 'pious', 'precious', 'rebellious', 'sacrilegious', 'serious', 'spacious', 'tedious']
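
When a plain substring test is too loose, a regular expression can pin the pattern to a position in the word. The following sketch, on a made-up word list, contrasts "contains ious" with "ends in ious" and folds case variants such as Gracious and gracious together.

```python
import re

# Hypothetical sample words
words = ['Gracious', 'gracious', 'pernicious', 'Stage', 'seriously', 'pious']

# 'ious' anywhere in the word (the same test as the substring check above)
contains = [w for w in words if 'ious' in w]

# 'ious' only at the end of the word, ignoring case
ends_with = [w for w in words if re.search(r'ious$', w, flags=re.IGNORECASE)]

# Lowercase before building the set so case variants collapse into one entry
unique_lower = sorted({w.lower() for w in ends_with})

print(contains)
print(ends_with)
print(unique_lower)
```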

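5. Bigrams and Collocations

Another basic element of text analysis is to consider pairs of words (bigrams) rather than single words. Some bigrams are so common in literary works that the two words are almost always used together; these are called collocations.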

bgrms = nltk.FreqDist(nltk.bigrams(macbeth_filtered2))
print(bgrms.most_common(15))

The output might look like this:

[(('enter', 'macbeth'), 16), (('exeunt', 'scena'), 15), (('thane', 'cawdor'), 13), (('knock', 'knock'), 10), (('st', 'thou'), 9), (('thou', 'art'), 9), (('lord', 'macb'), 9), (('haue', 'done'), 8), (('macb', 'haue'), 8), (('good', 'lord'), 8), (('let', 'vs'), 7), (('enter', 'lady'), 7), (('wee', 'l'), 7), (('would', 'st'), 6), (('macbeth', 'macb'), 6)]

Besides bigrams, we can also consider trigrams (sequences of three words).

tgrms = nltk.FreqDist(nltk.trigrams(macbeth_filtered2))
print(tgrms.most_common(10))

The output might look like this:

[(('knock', 'knock', 'knock'), 6), (('enter', 'macbeth', 'macb'), 5), (('enter', 'three', 'witches'), 4), (('exeunt', 'scena', 'secunda'), 4), (('good', 'lord', 'macb'), 4), (('three', 'witches', '1'), 3), (('exeunt', 'scena', 'tertia'), 3), (('thunder', 'enter', 'three'), 3), (('exeunt', 'scena', 'quarta'), 3), (('scena', 'prima', 'enter'), 3)]
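
Raw counts favor pairs of words that are simply frequent on their own. One common way to ask whether two words "almost always go together" is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the words were independent. The sketch below is a plain-Python illustration under that assumption; `bigram_pmi`, its `min_count` threshold, and the toy tokens are hypothetical, and this is a different measure from the frequency ranking used above.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score each bigram seen at least min_count times by PMI, highest first."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated scores
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return sorted(scores.items(), key=lambda item: -item[1])


tokens = ('thane cawdor lord macb thane cawdor good lord '
          'good macb thane cawdor lord good').split()

for pair, score in bigram_pmi(tokens):
    print(pair, round(score, 2))
```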

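6. Text Preprocessing Steps

Preprocessing is one of the most important and fundamental stages of text analysis. Some common preprocessing operations follow.

6.1 Lowercasing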

text = 'This is a Demo Sentence'
lower_text = text.lower()
print(lower_text)

The output is:

this is a demo sentence

6.2 Word Tokenization

# Tokenizer models; NLTK 3.9 and later look for the 'punkt_tab' resource instead
nltk.download('punkt')
text = 'This is a Demo Sentence'
tokens = nltk.word_tokenize(text)
print(tokens)

The output is:

['This', 'is', 'a', 'Demo', 'Sentence']

6.3 Sentence Tokenization

text = 'This is a Demo Sentence. This is another sentence'
tokens = nltk.sent_tokenize(text)
print(tokens)

The output is:

['This is a Demo Sentence.', 'This is another sentence']

6.4 Removing Punctuation

from nltk.tokenize import RegexpTokenizer
text = 'This% is a #!!@ Sentence full of punctuation marks :-) '
regexpt = RegexpTokenizer(r'[a-zA-Z0-9]+')
tokens = regexpt.tokenize(text)
print(tokens)

The output is:

['This', 'is', 'a', 'Sentence', 'full', 'of', 'punctuation', 'marks']

6.5 Removing Stopwords

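from nltk.corpus import stopwords

nltk.download('stopwords')
text = 'This is a Demo Sentence. This is another sentence'
eng_sw = stopwords.words('english')
tokens = nltk.word_tokenize(text)
# The stopword list is lowercase and this comparison is case-sensitive,
# so the capitalized 'This' is kept; compare word.lower() to drop it too
clean_tokens = [word for word in tokens if word not in eng_sw]
print(clean_tokens)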

The output might look like this:

['This', 'Demo', 'Sentence', '.', 'This', 'another', 'sentence']

6.6 Stemming

from nltk.stem import SnowballStemmer
text = 'This operation operates for the operator curiosity. A decisive decision'
stemmer = SnowballStemmer('english')
tokens = nltk.word_tokenize(text)
stemmed_tokens = [stemmer.stem(word) for word in tokens]
print(stemmed_tokens)

The output might look like this:

['this', 'oper', 'oper', 'for', 'the', 'oper', 'curios', '.', 'a', 'decis', 'decis']

6.7 Lemmatization

nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
text = 'A verb: I split, it splits. Splitted verbs.'
tokens = nltk.word_tokenize(text)
lmtzr = WordNetLemmatizer()
lemma_tokens = [lmtzr.lemmatize(word) for word in tokens]
print(lemma_tokens)

The output might look like this:

['A', 'verb', ':', 'I', 'split', ',', 'it', 'split', '.', 'Splitted', 'verb', '.']
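
The individual steps in this section are usually chained. The sketch below combines lowercasing, regular-expression tokenization, stopword removal, and stemming in one function. To keep it runnable without corpus downloads it uses a tiny hand-written stopword set (`DEMO_STOPWORDS`, an assumption made for this example); in practice the list from stopwords.words('english') would take its place. The function name `preprocess` is also hypothetical.

```python
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-in for stopwords.words('english')
DEMO_STOPWORDS = {'this', 'for', 'the', 'a', 'is'}


def preprocess(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, tokenize on alphanumeric runs, drop stopwords, then stem."""
    tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')
    stemmer = SnowballStemmer('english')
    tokens = tokenizer.tokenize(text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in stop_words]


print(preprocess('This operation operates for the operator curiosity. A decisive decision'))
```

Lowercasing first means the stopword test catches "This" as well as "this", and filtering before stemming avoids comparing stems against an unstemmed stopword list.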

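7. Working with Web Text

All the examples so far used corpora that ship with NLTK. In practice, we may need to fetch text from the Internet and turn it into a corpus for analysis.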

from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
print(raw[:75])

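Encoding problems can occur. If the decoded text starts with a stray \ufeff character (a byte order mark), decode with 'utf-8-sig' instead. A response body can be read only once, so the request is repeated here.

response = request.urlopen(url)  # re-open: the earlier response has already been consumed
raw = response.read().decode('utf-8-sig')
print(raw[:75])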
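
The effect of the two codecs can be checked offline on a few bytes. The byte string below is made up for the demonstration; its first three bytes are the UTF-8 byte order mark.

```python
# A UTF-8 byte sequence that begins with a byte order mark (EF BB BF)
data = b'\xef\xbb\xbfThe Project Gutenberg EBook'

plain = data.decode('utf-8')      # the BOM survives as the character '\ufeff'
clean = data.decode('utf-8-sig')  # the BOM is stripped during decoding

print(repr(plain[:4]))
print(repr(clean[:4]))

# 'utf-8-sig' is also safe when no BOM is present
print(b'no bom'.decode('utf-8-sig'))
```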

Convert the text into an NLTK-compatible corpus:

tokens = nltk.word_tokenize(raw)
webtext = nltk.Text(tokens)
print(webtext[:12])

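8. Extracting Text from HTML Pages

Most documents on the Internet exist as HTML pages. We can download the HTML content with the urllib library and extract the text with the BeautifulSoup library.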

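from urllib import request
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

url = "https://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
# The "lxml" parser is a separate install; get_text() drops the markup
raw = BeautifulSoup(html, "lxml").get_text()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)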
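
For readers who want to see what "extracting the text" involves, here is a dependency-free sketch built on the standard library's html.parser. It is not how BeautifulSoup works internally; the class `TextExtractor`, the helper `html_to_text`, and the sample page are all made up for illustration. The sketch also skips the contents of script and style elements, which are rarely wanted in a corpus.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.parts)


sample = ('<html><head><title>Demo</title><style>p {color: red}</style></head>'
          '<body><p>First paragraph</p><script>var x = 1;</script>'
          '<p>Second paragraph</p></body></html>')
print(html_to_text(sample))
```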

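9. Sentiment Analysis

Sentiment analysis is an emerging research field that evaluates people's opinions about a particular topic. It builds on text analysis techniques and is applied mainly to social media and forums. Based on specific keywords, a sentiment analysis algorithm rates the degree of approval people express as positive, neutral, or negative.

The following example uses the movie_reviews corpus in NLTK for sentiment analysis: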

nltk.download('movie_reviews')
import random
reviews = nltk.corpus.movie_reviews
documents = [(list(reviews.words(fileid)), category)
             for category in reviews.categories()
             for fileid in reviews.fileids(category)]
random.shuffle(documents)

first_review = ' '.join(documents[0][0])
print(first_review)
print(documents[0][1])

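# Count word frequencies across the whole corpus (lowercased)
all_words = nltk.FreqDist(w.lower() for w in reviews.words())
# Every distinct word becomes a feature; slice this list to limit memory use
word_features = list(all_words)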

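def document_features(document, word_features):
    """Map each candidate word to True/False: does it occur in the document?"""
    document_words = set(document)  # a set makes each membership test fast
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features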

featuresets = [(document_features(d, word_features), c) for (d, c) in documents]

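The example builds, for every review, a dictionary of word-presence features paired with the review's category. Feature sets in this form are what a Naïve Bayes classifier consumes to decide, from the words in a review, whether it is positive or negative.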
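
The code above stops once the feature sets are built. As a sketch of the remaining step, the example below trains NLTK's NaiveBayesClassifier on a handful of made-up mini reviews that stand in for the movie_reviews documents, so it runs without any download. The toy data and the variable names are assumptions for illustration only.

```python
import nltk

# Hypothetical toy data standing in for the movie_reviews documents
train_docs = [
    ('great wonderful film'.split(), 'pos'),
    ('wonderful acting great plot'.split(), 'pos'),
    ('boring terrible film'.split(), 'neg'),
    ('terrible acting boring plot'.split(), 'neg'),
]

# Vocabulary of candidate feature words, sorted for reproducibility
word_features = sorted({w for words, _ in train_docs for w in words})


def document_features(document, word_features):
    """Map each candidate word to True/False: does it occur in the document?"""
    document_words = set(document)
    return {word: (word in document_words) for word in word_features}


train_set = [(document_features(d, word_features), c) for d, c in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

new_review = 'great wonderful plot'.split()
print(classifier.classify(document_features(new_review, word_features)))
```

With the real corpus, one would typically hold back part of the feature sets as a test set and measure performance on it with nltk.classify.accuracy() rather than judging the classifier on the data it was trained on.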

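Summary

This article showed how to analyze text data with the NLTK library: extracting text, searching for words, analyzing word frequency, selecting words, working with bigrams and collocations, preprocessing text, using web text, extracting text from HTML pages, and sentiment analysis. These operations help us understand and process text data and lay the groundwork for later natural language processing tasks.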

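Flowchart

graph TD;
    A[Text data] --> B[Text preprocessing];
    B --> B1[Lowercasing];
    B --> B2[Word tokenization];
    B --> B3[Sentence tokenization];
    B --> B4[Punctuation removal];
    B --> B5[Stopword removal];
    B --> B6[Stemming];
    B --> B7[Lemmatization];
    B --> C[Text analysis];
    C --> C1[Word search];
    C --> C2[Word frequency analysis];
    C --> C3[Word selection];
    C --> C4[Bigram and collocation analysis];
    C --> D[Sentiment analysis];
    A --> E[Web text retrieval];
    E --> E1[Download text from a URL];
    E --> E2[Extract text from an HTML page];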

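Table

Operation	Description	Code example
Lowercasing	Convert all words in the text to lowercase	`text.lower()`
Word tokenization	Split the text into a list of words	`nltk.word_tokenize(text)`
Sentence tokenization	Split the text into a list of sentences	`nltk.sent_tokenize(text)`
Punctuation removal	Strip punctuation with a regular expression	`RegexpTokenizer(r'[a-zA-Z0-9]+').tokenize(text)`
Stopword removal	Filter stopwords out of the text	`[word for word in tokens if word not in stopwords.words('english')]`
Stemming	Reduce a word to its stem	`SnowballStemmer('english').stem(word)`
Lemmatization	Reduce a word to its lemma (dictionary form)	`WordNetLemmatizer().lemmatize(word)`
Word search	Search the corpus for a specific word	`text.concordance('word')`
Word frequency analysis	Compute the frequency distribution of words	`nltk.FreqDist(text).most_common(10)`
Word selection	Select words that meet a condition	`[w for w in text if len(w) > 12]`
Bigram analysis	Analyze the bigrams in the text	`nltk.FreqDist(nltk.bigrams(text)).most_common(15)`
Trigram analysis	Analyze the trigrams in the text	`nltk.FreqDist(nltk.trigrams(text)).most_common(10)`
Sentiment analysis	Assess the sentiment of a text	`movie_reviews` corpus with the Naïve Bayes algorithm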

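10. Summary of Operations and Extensions

The previous sections covered many operations for analyzing text data with NLTK. The table below summarizes them by type, and the sections that follow discuss some possible extensions.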

Operation type	Operation	Key code
Text extraction and structuring	Get a structured list of sentences	`nltk.corpus.gutenberg.sents('file.txt')`
Word search	Find all occurrences of a word	`text.concordance('word')`
	Find the contexts of a word	`text.common_contexts(['word'])`
	Find candidate synonyms	`text.similar('word')`
Word frequency analysis	Compute the frequency distribution	`nltk.FreqDist(text)`
	Remove stopwords	`[w for w in text if w.lower() not in sw]`
	Remove punctuation	`[w.lower() for w in text if w.lower() not in sw and w.lower() not in punctuation]`
Word selection	Select words by length	`[w for w in text if len(w) > 12]`
	Select words by character sequence	`[w for w in text if 'ious' in w]`
Bigrams and trigrams	Analyze bigrams	`nltk.FreqDist(nltk.bigrams(text))`
	Analyze trigrams	`nltk.FreqDist(nltk.trigrams(text))`
Text preprocessing	Lowercasing	`text.lower()`
	Word tokenization	`nltk.word_tokenize(text)`
	Sentence tokenization	`nltk.sent_tokenize(text)`
	Punctuation removal	`RegexpTokenizer(r'[a-zA-Z0-9]+').tokenize(text)`
	Stopword removal	`[word for word in tokens if word not in stopwords.words('english')]`
	Stemming	`SnowballStemmer('english').stem(word)`
	Lemmatization	`WordNetLemmatizer().lemmatize(word)`
Web text processing	Download text from the web	`request.urlopen(url).read().decode('utf8')`
	Extract text from HTML	`BeautifulSoup(html, "lxml").get_text()`
Sentiment analysis	Build the feature sets	`[(document_features(d, word_features), c) for (d, c) in documents]`

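11. Practical Applications

These text analysis operations are widely used in practice. Some concrete examples:

Information retrieval: a search engine can preprocess page text and analyze word frequencies to locate the keywords a user searches for and so retrieve results more efficiently. For example, concordance() finds every position at which a keyword occurs in a page.
Text classification: in tasks such as news categorization and spam filtering, sentiment analysis and word selection help assign texts to categories. For example, a classifier trained on the movie_reviews corpus can label new movie reviews as positive or negative.
Language learning: language-learning software can analyze the bigrams and trigrams in a text to help learners master common expressions, such as the English collocations "fast food" and "pay attention".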

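12. Points to Watch

A few points deserve attention when analyzing text data with NLTK:

Encoding: text downloaded from the web may have encoding problems, such as the \ufeff character (a byte order mark) that can appear at the start of a downloaded file. Choose the decoding to suit the data, for example utf-8-sig.
Data volume: on large text collections some operations become slow; for example, the first call to concordance() has to build an index. Distributed computing or more efficient algorithms can improve throughput.
Choice of stopwords and punctuation: which stopwords and punctuation marks are filtered can change the results of the analysis. Choose them to suit the task at hand and the characteristics of the text.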
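
As an illustration of the last point, the Macbeth frequency list from section 3.2 is still dominated by speaker labels, stage directions, and Early Modern English spellings that a modern English stopword list does not cover. The sketch below filters them with a project-specific set; the choice of words in `extra_stopwords` is an assumption made for this example, not a recommended list.

```python
from collections import Counter

# Counts copied from the punctuation-free frequency list in section 3.2
counts = Counter({'macb': 137, 'haue': 122, 'thou': 90, 'enter': 81, 'shall': 68,
                  'macbeth': 62, 'vpon': 62, 'thee': 61, 'macd': 58, 'vs': 57})

# Hypothetical additions: speaker labels, stage directions, archaic spellings
extra_stopwords = {'macb', 'macd', 'enter', 'haue', 'thou', 'thee', 'vpon', 'vs'}

filtered = Counter({w: c for w, c in counts.items() if w not in extra_stopwords})
print(filtered.most_common())
```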

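13. Future Trends

As natural language processing advances, NLTK, an important toolkit in the field, will keep evolving as well. Possible directions include:

Integration with deep learning: combining deep learning with NLTK, for example using neural networks for sentiment analysis or text classification, could improve accuracy and efficiency.
Multilingual support: globalization is increasing the demand for multilingual text analysis. NLTK may strengthen its multilingual support further and provide more language resources and tools.
Real-time processing: applications such as social media monitoring and live news analysis need to process text in real time. NLTK may be optimized and improved in this area.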

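Flowchart

graph LR;
    A[Practical applications] --> B[Information retrieval];
    A --> C[Text classification];
    A --> D[Language learning];
    E[Points to watch] --> E1[Encoding];
    E --> E2[Data volume];
    E --> E3[Choice of stopwords and punctuation];
    F[Future trends] --> F1[Integration with deep learning];
    F --> F2[Multilingual support];
    F --> F3[Real-time processing];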

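Summary

This article covered the main methods and techniques for analyzing text data with NLTK: text extraction, search, frequency analysis, word selection, collocations, preprocessing, web text processing, and sentiment analysis. It also discussed practical applications of these operations, points to watch, and future trends. With these skills we can process and analyze text data more effectively and support research and applications in natural language processing. We hope the article helps readers with their own text analysis work and encourages further study and practice.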