Resource stopwords not found. Please use the NLTK Downloader to obtain the resource

徐福记c

Published 2025-02-18 16:35:26

Tags: python

Article link: https://blog.youkuaiyun.com/xuukai/article/details/145708651

Chinese word segmentation:

import re

import jieba
import nltk
from nltk.corpus import stopwords

# Download the stopwords corpus (requires network access)
nltk.download('stopwords')


def to_keywords(input_string):
    """Convert a sentence into a sequence of search keywords."""
    # Segment in search-engine mode
    word_tokens = jieba.cut_for_search(input_string)
    # Load the Chinese stopword list
    stop_words = set(stopwords.words('chinese'))
    # Drop the stopwords
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return ' '.join(filtered_sentence)


def sent_tokenize(input_string):
    """Split text into sentences at sentence-ending punctuation."""
    # Split right after each sentence-ending punctuation mark
    sentences = re.split(r'(?<=[。！？；?!])', input_string)
    # Drop empty strings
    return [sentence for sentence in sentences if sentence.strip()]


if __name__ == "__main__":
    # Test keyword extraction
    print(to_keywords("小明硕士毕业于中国科学院计算所，后在日本京都大学深造"))
    # Test sentence splitting
    print(sent_tokenize("这是，第一句。这是第二句吗？是的！啊"))

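Running this code with NLTK raised the error shown in the title: "Resource stopwords not found. Please use the NLTK Downloader to obtain the resource".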

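nltk.download('stopwords') is a call in the Python Natural Language Toolkit (NLTK) library that downloads the stopwords resource NLTK provides.
Stopwords are common words that are usually ignored in text processing. They occur very frequently in a language but usually contribute little to meaning, for example:
In Chinese: "的", "是", "在", and so on.
In English: "the", "is", "and", "in", and so on.
Stopwords are removed in many natural language processing tasks to reduce noise and improve processing efficiency. NLTK provides stopword lists for many languages, and nltk.download('stopwords') downloads them.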
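To make the idea concrete, here is a minimal sketch of stopword filtering that does not need NLTK at all. The set DEMO_STOP_WORDS and the function remove_stop_words are hypothetical names made up for this illustration, and the set only contains the handful of words mentioned above, not NLTK's real list.

```python
# Tiny hand-picked stopword set, for illustration only.
# This is an assumption, not the list that NLTK ships.
DEMO_STOP_WORDS = {"the", "is", "and", "in", "的", "是", "在"}


def remove_stop_words(tokens, stop_words=DEMO_STOP_WORDS):
    """Return the tokens that are not in the stopword set."""
    # Lowercase before comparing so that "The" matches "the"
    return [t for t in tokens if t.lower() not in stop_words]


english = remove_stop_words("The cat is in the garden".split())
chinese = remove_stop_words(["小明", "在", "京都", "的", "大学"])
print(english)
print(chinese)
```

With a real stopword list the logic stays the same, only the set is larger.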

Cause:

The stopwords corpus has not been downloaded, so NLTK cannot find it in its data directories and raises a LookupError.
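One way to check whether the corpus is really missing is to look inside the data directories yourself. The sketch below is a simplified imitation of that lookup, not NLTK's own code: locate_corpus is a hypothetical helper, and it assumes the usual layout in which a corpus lives under a corpora/ subfolder, either unpacked or as a .zip file.

```python
from pathlib import Path


def locate_corpus(name, search_paths):
    """Return the first existing corpora/<name> folder or .zip, else None."""
    for base in search_paths:
        corpora = Path(base) / "corpora"
        # Accept both the unpacked folder and the zipped archive
        for candidate in (corpora / name, corpora / f"{name}.zip"):
            if candidate.exists():
                return candidate
    return None


# Check the per-user data directory; the result depends on your machine
print(locate_corpus("stopwords", [Path.home() / "nltk_data"]))
```

With NLTK installed, you can pass nltk.data.path as search_paths to check every directory NLTK searches. If the result is None, the corpus still has to be downloaded.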

Solution:

1. Download the stopwords corpus

https://www.nltk.org/nltk_data/
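If the archive has to be fetched by hand, for example because nltk.download cannot reach the network, it can be unpacked into an NLTK data directory. This is a minimal sketch under assumptions: install_corpus_zip is a hypothetical helper, and the archive is assumed to contain a top-level stopwords/ folder.

```python
import zipfile
from pathlib import Path


def install_corpus_zip(zip_path, nltk_data_dir):
    """Extract a downloaded corpus archive into <nltk_data_dir>/corpora/."""
    corpora_dir = Path(nltk_data_dir) / "corpora"
    # Create the corpora folder if it does not exist yet
    corpora_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(corpora_dir)
    return corpora_dir


# Example call (the file name stopwords.zip is an assumption):
# install_corpus_zip("stopwords.zip", Path.home() / "nltk_data")
```

The nltk_data folder in your home directory is one of the locations NLTK searches by default. If you unpack somewhere else, add that directory to the nltk.data.path list before loading the corpus.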