NLP基础：NLTK使用

最新推荐文章于 2025-10-14 09:51:16 发布

原创最新推荐文章于 2025-10-14 09:51:16 发布 · 1.6k 阅读

36 ·

CC 4.0 BY-SA版权

文章标签：

#NLTK

NLP 专栏收录该内容

8 篇文章

订阅专栏

NLTK

NLTK在NLP上的应用

情感分析
文本相似度
文本分类

一、安装NLTK

sudo pip install nltk
python3
>>> import nltk
>>> nltk.download()

其中 nltk.download() 用来下载nltk自带的一些语料库

测试布朗大学的语料库：

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> len(brown.sents())
57340 #句子长度
>>> len(brown.words())
1161192 #单词个数

二、文本处理流程

1. 文本预处理

（1）Tokenize:把长句子拆分成“有意义”的小部分

>>> import nltk
>>> sentence = “hello, world"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['hello', ‘,', 'world']

（2）词形归一化，包括Stemming和Lemmatization

Stemming 词干提取:一般来说，就是把不影响词性的inflection的小尾巴砍掉

可以使用nltk中的 PorterStemmer/LancasterStemmer/SnowballStemmer/PorterStemmer

Lemmatization 词形归一:把各种类型的词的变形，都归为一个形式

可以使用nltk中的 WordNetLemmatizer

Part-Of-Speech:NLTK更好地实现Lemma

（3）去除停止词

from nltk.corpus import stopwords
# 先token得到word_list,再去除停止词
filtered_words =
[word for word in word_list if word not in stopwords.words('english')]

（4）最终得到一个干净的word list

分词的方式：
启发式Heuristic（查数据库）
机器学习／统计方法：HMM、CRF

三、应用：情感分析

最简单的方法是给每个词打分，然后判断句子总的分数。

配上ML的情感分析：

# 把训练集给做成标准形式

training_data = [[preprocess(s1), 'pos'],
[preprocess(s2), 'pos'],
[preprocess(s3), 'neg'],
[preprocess(s4), 'neg']]
# 构建model
model = NaiveBayesClassifier.train(training_data)

# 测试结果

print(model.classify(preprocess('this is a good book')))

四：应用:文本相似度

用元素频率表示文本特征

原理就是使用余弦定理来计算文本相似度:

$\theta ) = \dfrac{A \cdot B}{||A|| \: ||B||}$

使用NLTK中的FreqDist包来进行频率统计

>>> from nltk.corpus import stopwords
>>> from nltk import FreqDist
>>> corpus = 'this is my sentence this is my life this is the day'
>>> tokens = nltk.word_tokenize(corpus)
>>> fdist = FreqDist(tokens)
>>> standard_freq_vector = fdist.most_common(50)
>>> print(standard_freq_vector)
[('this', 3), ('is', 3), ('my', 2), ('sentence', 1), ('life', 1), ('the', 1), ('day', 1)]

这一步得到了一个类似词典的东西，里面保存了我们给出的句子里面每个单词出现的频率，并且取出来出现频率最高的50个单词。

然后按照出现频率 ,记录下每个单词的位置：

def position_lookup(v):
    res = {}
    counter = 0
    for word in v:
        res[word[0]] = counter
        counter += 1
    return res
# 把标准的单词位置记录下来
standard_position_dict = position_lookup(standard_freq_vector) print(standard_position_dict)
# 得到 个位置对照表
# {'is': 0, 'the': 3, 'day': 4, 'this': 1,'sentence': 5, 'my': 2, 'life': 6}

这时如果我们有个新句子，对于这个新句子的每个单词如果在我们的词库出现过那么就在"标准位置"上+1

sentence = 'this is cool'
# 先新建个跟我们的标准vector同样的向量 freq_vector = [0] * size
# 简单的Preprocessing
tokens = nltk.word_tokenize(sentence)
for word in tokens:
    try:
        freq_vector[standard_position_dict[word]] += 1
    except KeyError: # 如果是个新词就pass掉
        continue
print(freq_vector)
# [1, 1, 0, 0, 0, 0, 0]