1. Preparation
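- Download the Chinese Wikipedia dump: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 (for an English corpus, change both occurrences of "zh" in the link to "en")
- Install gensim
  pip install gensim
- Install OpenCC, the library used to convert Traditional Chinese to Simplified Chinese
  yum install opencc
  yum install opencc-tools
- Install the jieba word segmenter
  pip install jieba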
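The preparation steps above can be sanity-checked with a small script. This is only a minimal sketch: the names `DUMP_URL`, `dump_url`, and `missing_packages` are hypothetical helpers introduced here for illustration, and the URL pattern is simply the download link above with the language code made variable. It only checks the two pip packages; OpenCC is installed through yum as a system tool and is not covered.

```python
from importlib.util import find_spec

# Assumption: the dump URL differs between languages only in the language
# code, as the note next to the download link suggests ("zh" -> "en").
DUMP_URL = (
    "https://dumps.wikimedia.org/{lang}wiki/latest/"
    "{lang}wiki-latest-pages-articles.xml.bz2"
)


def dump_url(lang="zh"):
    """Return the latest pages-articles dump URL for a Wikipedia language code."""
    return DUMP_URL.format(lang=lang)


def missing_packages(names=("gensim", "jieba")):
    """Return the top-level packages in `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print(dump_url("zh"))
    print(dump_url("en"))
    print("missing packages:", missing_packages() or "none")
```

If `missing_packages()` returns a non-empty list, the corresponding `pip install` command has probably not been run in the Python environment that will execute the scripts in the next section.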
2. Corpus processing (Python 3.6)
- Convert the XML dump to plain text
# -*- coding: utf-8 -*-
import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit