jieba source code walkthrough: jieba.cut


License: CC 4.0 BY-SA

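This post shows how to use jieba for Chinese text analysis, covering custom dictionaries, stop-word lists, and word-frequency counting. jieba segments text efficiently with a Trie-based word-graph scan plus dynamic programming, and handles out-of-vocabulary (OOV) words with an HMM. The post also covers when an encoding has to be specified when reading files in Python, and basic file reading and writing.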

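References:
Hands-on Chinese text analysis in Python (jieba segmentation, custom dictionary, extra stop words, word-frequency counting): https://zhuanlan.zhihu.com/p/46922291
Example code from: https://zhuanlan.zhihu.com/p/143099147
jieba source code walkthrough: https://www.cnblogs.com/aloiswei/p/11567616.html
Stop-word list (common Chinese stop words): https://github.com/goto456/stopwords
jieba Chinese word segmentation: https://github.com/fxsjy/jieba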

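jieba has three segmentation modes: full mode, accurate mode, and search-engine mode. Full mode and accurate mode both go through jieba.cut and are selected with its cut_all parameter, while search-engine mode corresponds to jieba.cut_for_search. Both functions accept an HMM parameter that switches new-word recognition on or off. In full mode the parameter is accepted but has no effect, because new-word recognition only runs on the accurate-mode path, which cut_for_search also builds on.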
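To compare the modes side by side, here is a minimal sketch that runs one sentence through each of them. The helper name `segment_all_modes` and the dictionary keys are made up for illustration; `jieba.lcut` and `jieba.lcut_for_search` are the list-returning counterparts of `jieba.cut` and `jieba.cut_for_search`.

```python
def segment_all_modes(sentence):
    """Return the tokens each jieba mode produces for one sentence."""
    # Imported inside the function so this sketch can still be loaded where
    # jieba is not installed (pip install jieba).
    import jieba

    return {
        # Accurate mode (the default): one non-overlapping segmentation.
        "accurate": jieba.lcut(sentence, cut_all=False, HMM=True),
        # Accurate mode with new-word recognition switched off.
        "accurate_no_hmm": jieba.lcut(sentence, cut_all=False, HMM=False),
        # Full mode: every dictionary word found in the sentence, overlaps included.
        "full": jieba.lcut(sentence, cut_all=True),
        # Search-engine mode: accurate mode plus shorter words found inside long ones.
        "search": jieba.lcut_for_search(sentence, HMM=True),
    }
```

Calling `segment_all_modes("我来到北京清华大学")` returns a dict of four token lists. In accurate mode the tokens joined back together reproduce the input, whereas full mode and search-engine mode may emit overlapping words.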

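4. Algorithm
Build a directed acyclic graph (DAG) of every possible word in the sentence through an efficient word-graph scan over a Trie structure (recent jieba releases describe this structure as a prefix dictionary).
Use dynamic programming to find the maximum-probability path, which gives the best segmentation according to word frequency.
For out-of-vocabulary words, apply an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm.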
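The first two steps can be sketched in a few lines of plain Python. This is a simplified illustration, not jieba's actual code: the dictionary `FREQ` and its frequencies are invented, the function names `build_dag`, `best_route`, and `cut_by_max_prob` are hypothetical, the scan tries every substring instead of walking a prefix structure, and the HMM step for unknown words is left out.

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real counts).
FREQ = {
    "我": 50, "来到": 20, "北京": 30, "清华": 10, "清华大学": 15,
    "华大": 2, "大学": 25, "来": 5, "到": 5, "北": 3, "京": 3,
    "清": 2, "华": 2, "大": 5, "学": 5,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in freq]
        # A character that starts no dictionary word still gets a one-character edge.
        dag[start] = ends or [start]
    return dag


def best_route(sentence, dag, freq, total):
    """Right-to-left DP: route[i] = (best log-probability of sentence[i:], end of first word)."""
    n = len(sentence)
    log_total = math.log(total)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:end + 1], 1)) - log_total + route[end + 1][0], end)
            for end in dag[i]
        )
    return route


def cut_by_max_prob(sentence, freq=FREQ, total=TOTAL):
    """Follow the best route from the start of the sentence and collect the words."""
    route = best_route(sentence, build_dag(sentence, freq), freq, total)
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print(build_dag("我来到北京清华大学", FREQ)[5])  # [5, 6, 8]
print(cut_by_max_prob("我来到北京清华大学"))     # ['我', '来到', '北京', '清华大学']
```

Index 5 is the character 清, so the edges `[5, 6, 8]` stand for 清, 清华, and 清华大学. Working from the end of the sentence means each suffix is scored once and reused by every word that ends just before it.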

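# When reading a file in Python, when must you pass encoding='utf-8', and when can you leave it out?
# open() without an encoding argument uses the platform's default encoding, which on Windows is often a legacy code page rather than UTF-8. So for anything that touches Windows, or that moves files between systems, it is safest to always pass it.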
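The effect can be reproduced on any system with a short sketch. GBK is used here only as an illustrative stand-in for a non-UTF-8 default; the file name `sample.txt` and the sample text are made up.

```python
import os
import tempfile

text = "今天天气不错"

# Write and read back with an explicit encoding; this behaves the same on every OS.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Simulate a reader whose default is a legacy code page by decoding the
# same UTF-8 bytes as GBK.
with open(path, "rb") as f:
    raw = f.read()
try:
    misread = raw.decode("gbk")
except UnicodeDecodeError:
    misread = None  # the bytes are not valid GBK at all

print(round_trip == text)  # True
print(misread == text)     # False
```

Either outcome of the GBK decode, an exception or garbled characters, is what a script risks when it relies on the platform default for a UTF-8 file.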

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from collections import Counter
import jieba


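# jieba.load_userdict('userdict.txt')  # optionally load a custom dictionary first


# Load the stop-word list (one word per line).
# A set gives fast membership tests. The encoding is passed explicitly because
# open() otherwise falls back to the platform default, which on Windows is
# often not UTF-8.
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='UTF-8') as f:
        return {line.strip() for line in f}


# Segment one sentence and drop stop words.
def seg_sentence(sentence, stopwords):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += " "
    return outstr


stopwords = stopwordslist('stopwords.txt')  # path to the stop-word file, loaded once

# Segment the input file line by line and write the result.
with open('file1.txt', 'r', encoding='UTF-8') as inputs, \
        open('file2', 'w', encoding='UTF-8') as outputs:  # input path, output path
    for line in inputs:
        line_seg = seg_sentence(line, stopwords)  # returns a space-separated string
        outputs.write(line_seg)

# WordCount
with open('file2', 'r', encoding='UTF-8') as fr:  # read the file with stop words removed
    # fr.read() returns the whole file as one str. Passing that str straight to
    # Counter would count single characters, e.g.
    # {'你': 1, ' ': 3, '今': 1, '天': 1, '吃': 1}, so it has to be turned back
    # into tokens. The words are already separated by spaces, so split() is
    # enough and also keeps the spaces themselves out of the counts.
    data = fr.read().split()
data = dict(Counter(data))

with open('file3', 'w', encoding='UTF-8') as fw:  # path of the file that stores the word counts
    for k, v in data.items():
        fw.write('%s,%d\n' % (k, v))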
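The script writes the counts in arbitrary order. If a ranked list is more useful, `Counter.most_common` can sort them by frequency. A minimal sketch, in which the string `segmented` is a made-up stand-in for the contents of file2:

```python
from collections import Counter

# Hypothetical space-separated output, standing in for the contents of file2.
segmented = "今天 天气 不错 今天 吃 火锅 天气 今天"

counts = Counter(segmented.split())

# most_common(n) returns the n highest counts, largest first.
top = counts.most_common(2)
print(top)  # [('今天', 3), ('天气', 2)]
```

Keeping the `Counter` object instead of converting it to a plain dict is what makes `most_common` available.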