Article link: https://blog.youkuaiyun.com/chenghao1012/article/details/139144383

[Troubleshooting notes] Fixing errors when extracting Chinese Wikipedia corpus data with Gensim's built-in corpora module

Error message
Cause
Solution
Other issues
Final working code

# -*- coding: utf-8 -*-
# Original script; running it raises the error shown below

from gensim.corpora import WikiCorpus

space = ""
with open('wiki-zh-article.txt', 'w', encoding="utf8") as f:
    # lemmatize=False is what triggers the NotImplementedError
    wiki = WikiCorpus('zhwiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})

    for text in wiki.get_texts():
        f.write(space.join(text) + "\n")

print("Finished Saved")

Error message

NotImplementedError: The lemmatize parameter is no longer supported. If you need to lemmatize, use e.g. <https://github.com/clips/pattern>. 
Perform lemmatization as part of your tokenization function and pass it as the tokenizer_func parameter to this initializer.

Cause

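lemmatize parameter: if you pass lemmatize when creating a WikiCorpus instance, remove it. Recent Gensim releases no longer support this parameter and raise NotImplementedError whenever it is set, even to False.

dictionary parameter: WikiCorpus still accepts dictionary, so it is not the cause of the error. Passing dictionary={} tells WikiCorpus not to build a Dictionary from the dump. If you omit it, the constructor first makes a full pass over the dump to build one, which is slow for a file of this size.

get_texts method: get_texts is an ordinary instance method of WikiCorpus, so wiki.get_texts() works as written. WikiCorpus.get_texts(wiki) is the same call in a different spelling and is not required.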

Solution

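Fix 1: remove the lemmatize parameter. dictionary={} can stay, since it only skips building a Dictionary from the dump.
Fix 2: keep calling wiki.get_texts(). It is an instance method, so there is no need to replace it with WikiCorpus.get_texts(wiki), although that form also works.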
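
The error message also suggests doing any custom token processing inside a tokenizer and passing it as tokenizer_func. Below is a minimal sketch of such a function, assuming the interface described in the Gensim documentation: tokenizer_func(text, token_min_len, token_max_len, lower) returning a list of strings. The name simple_tokenizer and its regular expression are hypothetical and only illustrate the shape of the function.

```python
import re

# Hypothetical pattern (not part of gensim): runs of CJK characters
# or runs of ASCII letters/digits count as one token each.
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def simple_tokenizer(text, token_min_len, token_max_len, lower):
    """Split text into tokens and keep those within the length limits."""
    if lower:
        text = text.lower()
    return [
        tok for tok in _TOKEN_RE.findall(text)
        if token_min_len <= len(tok) <= token_max_len
    ]


print(simple_tokenizer("Gensim 处理 维基百科 dump", 2, 15, True))
```

Under that assumption it would be passed as WikiCorpus('zhwiki-latest-pages-articles.xml.bz2', dictionary={}, tokenizer_func=simple_tokenizer). Keeping the function at module level is the safer choice, because the worker processes need to be able to import it.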

Other issues

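Gensim uses multiprocessing to process the dump in parallel: WikiCorpus starts a pool of worker processes when it extracts texts, and also during initialization if no dictionary is passed. On platforms that start child processes by re-importing the main script (Windows, for example), this can fail with a RuntimeError saying "An attempt has been made to start a new process before the current process has finished its bootstrapping phase." The message means Gensim tried to start new processes before the main process had finished its bootstrapping phase.

To fix this:
Put the code that creates and iterates the WikiCorpus inside an if __name__ == '__main__': block, so it runs only when the script is executed as the main module and not when it is imported.
If the script is imported and run from another Python script, make sure the entry point of that script uses the same guard.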
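
The guard pattern can be sketched with the standard library alone. This is an illustration of the general multiprocessing idiom, not Gensim's internal code; count_chars and main are hypothetical names.

```python
import multiprocessing


def count_chars(line):
    """Placeholder for the real work done in each worker process."""
    return len(line)


def main():
    lines = ["维基百科", "gensim", "multiprocessing"]
    # Creating the Pool starts child processes, so it only happens inside
    # main(), which is called from the guarded block below.
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(count_chars, lines))


if __name__ == '__main__':
    # Child processes that re-import this file skip this block,
    # so they do not try to start pools of their own.
    main()
```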

Final working code

from gensim.corpora import WikiCorpus
import multiprocessing

if __name__ == '__main__':
    # Only needed when the program is frozen into an executable; can be omitted otherwise
    multiprocessing.freeze_support()
    space = ""
    with open('wiki-zh-article.txt', 'w', encoding="utf8") as f:
        # lemmatize removed; dictionary={} skips building a Dictionary from the whole dump
        wiki = WikiCorpus('zhwiki-latest-pages-articles.xml.bz2', dictionary={})
        for text in wiki.get_texts():  # get_texts is a regular instance method
            f.write(space.join(text) + "\n")

    print("Finished Saved")