Using the NLTK Tokenization Library in Python

孟船长

Last modified 2025-01-07 14:41:45

Tags: python, nlp, natural language processing, Chinese word segmentation

First published 2025-01-07 14:40:40

Article link: https://blog.youkuaiyun.com/weixin_52728306/article/details/144985140

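This article shows how to use NLTK for a range of natural language processing tasks.
The examples cover text preprocessing, word frequency statistics, lemmatization, named entity recognition, text classification, and more.

Note: NLTK does not handle Chinese word segmentation well. If you run the code, it is best to replace the sample text with English.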

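Text preprocessing

Text preprocessing is the foundational step of natural language processing. It includes tokenization, stop word removal, and punctuation removal.

Example: removing stop words and punctuation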

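import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Download the required resources
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases (3.9+) load the tokenizer data from punkt_tab
nltk.download('stopwords')

# Sample text (in English, because word_tokenize does not segment Chinese properly)
text = "Natural language processing (NLP) is an important area of computer science and artificial intelligence. It studies how to make computers understand and generate human language."

# Tokenize
tokens = word_tokenize(text)

# Remove stop words and punctuation
stop_words = set(stopwords.words('english'))  # Note: this loads NLTK's English stop word list
# For Chinese text, define a custom stop word list instead
filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word not in string.punctuation]

print("Original tokens:", tokens)
print("After removing stop words and punctuation:", filtered_tokens)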

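Note: the stop word list used above is for English. If you process Chinese text, you will probably need to define your own Chinese stop word list.

Example: custom Chinese stop words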

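import string

# Sample text (kept in Chinese, since this example is about Chinese stop words)
text = "自然语言处理（NLP）是计算机科学与人工智能领域的一个重要方向。它研究如何让计算机理解和生成人类语言。"

# Tokenize (NLTK has limited support for Chinese word segmentation; jieba is recommended)
# For simplicity, split the text into individual characters here
tokens = list(text)

# Custom Chinese stop word list
# Note: with character-level tokens, multi-character entries such as '一个' and '如何' never match
custom_stop_words = {'的', '与', '一个', '如何', '和'}

# string.punctuation only covers ASCII, so add the full-width punctuation used in the text
punctuation = set(string.punctuation) | set('（）。，、；：？！')

# Remove stop words, punctuation, and whitespace
filtered_tokens = [word for word in tokens if word not in custom_stop_words and word not in punctuation and word.strip()]

print("Result after removing stop words and punctuation:", filtered_tokens)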
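
`string.punctuation` only contains ASCII punctuation, so full-width marks such as （, ） and 。 are not covered by it. One possible way to handle both without listing every mark by hand is to check the Unicode category of each character. The sketch below uses only the standard library; `is_punctuation` and `remove_stop_words` are hypothetical helper names, and the token list is a made-up example of what a word segmenter such as jieba might return.

```python
import unicodedata


def is_punctuation(token: str) -> bool:
    """Return True if the token is non-empty and every character is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def remove_stop_words(tokens, stop_words):
    """Drop stop words, punctuation (ASCII and full-width), and whitespace-only tokens."""
    return [
        tok for tok in tokens
        if tok.strip() and tok not in stop_words and not is_punctuation(tok)
    ]


# Hypothetical pre-segmented tokens (assumed segmenter output, not produced by NLTK)
tokens = ["自然语言", "处理", "（", "NLP", "）", "是", "一个", "重要", "的", "方向", "。"]
stop_words = {"的", "与", "一个", "如何", "和"}

print(remove_stop_words(tokens, stop_words))
```

Because the tokens here are whole words rather than single characters, multi-character stop words such as 一个 can actually match. Note that symbols like `$` and `+` fall under the Unicode symbol categories (`S*`), not punctuation (`P*`), so they would need an extra check if they should also be removed.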

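Word frequency statistics

Counting how often each word appears in a text helps reveal the text's basic characteristics.

Example: word frequency statistics and visualization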

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import matplotlib.pyplot as plt

# Download the required resources
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases (3.9+) load the tokenizer data from punkt_tab

# Sample text
text = "Natural language processing enables computers to understand human language. It's a fascinating field of artificial intelligence."

# Tokenize
tokens = word_tokenize(text)
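
The listing breaks off at this point in the source. As a rough idea of what the counting step could look like, here is a minimal sketch that uses only the standard library. It is an assumption about where the example is heading, not the author's code: `collections.Counter` stands in for `FreqDist` (NLTK's `FreqDist` is a `Counter` subclass, so `most_common` is available on both), and a simple regular expression stands in for `word_tokenize`.

```python
import re
from collections import Counter

text = (
    "Natural language processing enables computers to understand human language. "
    "It's a fascinating field of artificial intelligence."
)

# Simple regex tokenizer (an assumption standing in for word_tokenize):
# lowercase the text and keep runs of letters, allowing an apostrophe as in "it's"
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

# Count how often each token occurs
freq = Counter(tokens)

# The three most frequent tokens with their counts
top_words = freq.most_common(3)
print(top_words)
```

In this sample text only "language" appears more than once, so a frequency plot of it would be nearly flat. A longer text gives a more informative distribution.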
