Details from a jieba Word-Segmentation Experiment

无人机_Fly

Last modified on 2023-04-02 13:32:59

Tags: learning, python, machine learning

First published on 2023-03-23 08:41:14

Article link: https://blog.youkuaiyun.com/qq_60337394/article/details/129723416

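This article covers details of Chinese word segmentation with jieba, including custom word frequencies and saving the segmentation results. It also looks at creating and manipulating a pandas DataFrame and at splitting data into training and test sets with sklearn. In addition, it touches on regular expressions, collections.Counter, and sequence operations in Python.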

A shaky foundation brings the whole building down, so here are the details I picked up during the experiment.

Pandas

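Creating a DataFrame, with notes on some parameters
- pd.read_csv and the other file-reading functions
- the pd.DataFrame constructor, which is where the row and column indexes are set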
```
matrix=pd.DataFrame(columns=list(dic_first.keys()),index=list(dic_first.keys()))
```
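Access

Index by column label and then by row label: df['m']['种族']. For assignment, df.loc['种族', 'm'] is the safer form, because chained indexing may write to a temporary copy.
Checking for NaN

pd.isna(df) returns a DataFrame of booleans with the same shape as df; called on a single value it returns one bool.
Saving

CSV is the most convenient format to save in: df.to_csv('file.csv') writes the table, and pd.read_csv('file.csv', index_col=0) reads it back with the same row and column labels. To save in another text format, change the file extension and the separator, for example: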
```
import csv
matrix.to_csv('transfer.txt', sep=' ', index=True,header=True,quoting=csv.QUOTE_NONE,escapechar=' ')
```
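The pandas notes above can be tried end to end with a small sketch. The labels below are hypothetical stand-ins for `dic_first.keys()`, and an in-memory buffer replaces the file on disk so that nothing is written.

```python
import io
import pandas as pd

# Hypothetical labels standing in for dic_first.keys().
labels = ["a", "b", "c"]

# An empty square table: every cell starts out as NaN.
matrix = pd.DataFrame(columns=labels, index=labels)

# .loc[row, column] addresses a single cell by its labels.
matrix.loc["a", "b"] = 3

# pd.isna on a DataFrame gives a same-shaped table of booleans.
missing = pd.isna(matrix)
print(missing.loc["a", "b"], missing.loc["a", "a"])

# Round-trip through CSV text; index_col=0 restores the row labels.
buffer = io.StringIO()
matrix.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.index), list(restored.columns))
```

Without `index_col=0` the row labels would come back as an ordinary first column, which is the usual reason a re-read table looks different from the one that was saved.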

scikit-learn

Splitting into training and test sets

from sklearn.model_selection import train_test_split
# arguments: x, y, test_size (fraction of the data held out as the test set)
x_train, x_test, y_train, y_test = train_test_split(words, labels, test_size=0.2)
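A small sketch of what the split returns, using made-up `words` and `labels` (ten items, hypothetical values). `random_state` is an addition here so that the shuffle is repeatable; it is not in the original call.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: ten "documents" and a 0/1 label derived from the index.
words = [f"w{i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = train_test_split(
    words, labels, test_size=0.2, random_state=0
)

# 20% of ten items go to the test set, the rest to the training set.
print(len(x_train), len(x_test))  # 8 2

# Samples and labels are shuffled together, so each pair stays aligned.
aligned = all(int(w[1:]) % 2 == y for w, y in zip(x_train, y_train))
print(aligned)  # True
```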

jieba

Word segmentation

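import jieba

# Load the custom words, one per line, dropping the trailing newline.
with open('race.txt', 'r', encoding='utf-8') as f3:
    words = [line.strip() for line in f3 if line.strip()]
# Pass one word (a str) at a time: a list or tuple is read as the pieces
# that the joined string should be cut into, not as separate words.
for w in words:
    jieba.suggest_freq(w, True)
jieba.suggest_freq('东南亚人', True)
jieba.suggest_freq('\t', True)

with open('Cut.txt', 'w', encoding='utf-8') as f2:
    with open('CDIAL-BIAS-race.txt', encoding='utf-8') as f1:
        for line in f1:
            c = jieba.lcut(line)
            # Write the token list as its Python repr, one list per line.
            f2.write(str(c) + '\n')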

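cut vs. lcut: cut returns a generator, lcut returns a list.
suggest_freq vs. load_userdict: suggest_freq adjusts the frequency of a single word at run time so that it is kept whole (or split apart), while load_userdict loads a whole file of custom words.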

Part-of-speech tagging

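import re
from jieba import posseg as ps

puntuation = r',|\.|\;|\?|\!|。|；|！|？'  # same pattern as in the re section
words = []
indexes = []
with open('CDIAL-BIAS-race.txt', encoding='utf-8') as f1:
    for txt in f1:
        # Split each line into clauses at the punctuation marks.
        for clause in re.split(puntuation, txt):
            if clause == '':
                continue
            rec = []      # tokens of this clause
            inderec = []  # part-of-speech flags of this clause
            for pair in ps.cut(clause):
                rec.append(pair.word)
                inderec.append(pair.flag)
            # Detail: rec[:] appends a copy. A plain rec would also work here,
            # because rec is bound to a fresh list for every clause.
            words.append(rec[:])
            indexes.append(inderec[:])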

collections

Counter: a quick and convenient counter

from collections import Counter
col=dict(Counter(col))
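A minimal sketch of what Counter does with a list, assuming `col` is a list of tokens (the values below are made up):

```python
from collections import Counter

# Hypothetical token list standing in for `col`.
col = ["race", "people", "race", "culture", "race", "people"]

counts = Counter(col)          # maps each token to its number of occurrences
print(counts["race"])          # 3
print(counts["missing"])       # 0: absent keys count as zero instead of raising
print(counts.most_common(2))   # [('race', 3), ('people', 2)]

as_dict = dict(counts)         # plain dict, as in the snippet above
print(as_dict)
```

Converting to `dict` is only needed when a plain dictionary is required; a Counter already supports the usual dictionary operations.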

re

Regular-expression features used in the current task

Sorting a dictionary by value rather than by key

b=dict(sorted(b.items(), key=lambda x: x[1], reverse=True))
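A quick illustration with a made-up dictionary. `x[1]` is the value of each `(key, value)` pair, so it decides the order; dictionaries keep insertion order in Python 3.7 and later, which is why the rebuilt `dict` stays sorted.

```python
# Hypothetical counts to sort.
b = {"apple": 2, "pear": 5, "plum": 1}

# Sort the (key, value) pairs by value, largest first.
by_value = dict(sorted(b.items(), key=lambda x: x[1], reverse=True))
print(list(by_value))  # ['pear', 'apple', 'plum']

# For comparison: without a key function the pairs sort by key.
by_key = dict(sorted(b.items()))
print(list(by_key))    # ['apple', 'pear', 'plum']
```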

Splitting a string into a list at specified symbols
```
import re
puntuation = r',|\.|\;|\?|\!|。|；|！|？'
txt=re.split(puntuation,txt)
```
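To be honest, I don't fully understand yet why this works, but regular expressions are awesome! In short: each | separates one alternative, and the backslashes escape characters such as . and ? that would otherwise have a special meaning.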
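To see the pattern at work, here is a small sketch on a made-up sentence. The character-class form `same_as_class` is an equivalent spelling added for comparison, not something from the original code.

```python
import re

puntuation = r',|\.|\;|\?|\!|。|；|！|？'
# Equivalent character class: inside [...] these marks need no escaping.
same_as_class = r'[,.;?!。；！？]'

txt = "今天天气好。出去走走？好!"  # hypothetical input line

parts = re.split(puntuation, txt)
print(parts)  # ['今天天气好', '出去走走', '好', '']

# A trailing delimiter leaves an empty string behind, hence the filter.
clauses = [p for p in parts if p]
print(clauses)  # ['今天天气好', '出去走走', '好']
```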

python

copy: deep copies

Consider this situation:

a=[1,2,3,4]
b=a
print(b)
#[1,2,3,4]

a.pop()
#4

print(b)
#[1,2,3]

In this case, make a copy instead of a second name for the same list:

import copy
b=copy.deepcopy(a)
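For a flat list of numbers a slice copy such as `a[:]` would be enough; `deepcopy` matters once the list contains other mutable objects. A minimal sketch with a nested list (the variable names are illustrative):

```python
import copy

a = [[1, 2], [3, 4]]
alias = a                  # second name for the same list object
shallow = a[:]             # new outer list, but the inner lists are shared
deep = copy.deepcopy(a)    # fully independent copy

a.append([5, 6])           # changes the outer list
a[0].append(99)            # changes an inner list

print(alias)    # [[1, 2, 99], [3, 4], [5, 6]]
print(shallow)  # [[1, 2, 99], [3, 4]]
print(deep)     # [[1, 2], [3, 4]]
```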

eval: a handy helper for reading back data of all kinds that was written out as strings
```
a=eval("[(1,2),(1,2)]")
print(a[0])
#(1,2)
```
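Since `eval` runs any expression it is given, `ast.literal_eval` from the standard library is a safer choice when the string should only contain literals such as lists, tuples, strings and numbers. A sketch, assuming the text was produced with `str(...)` on a token list (the tokens are made up):

```python
import ast

tokens = ["东南亚人", "很", "友好"]   # hypothetical segmentation result
line = str(tokens)                    # the text form that gets written to a file

restored = ast.literal_eval(line)     # parses literals only
print(restored == tokens)             # True

# Anything that is not a plain literal is rejected instead of executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)                       # True
```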
unhashable type: 'list':

I ran into this while using Counter: a list is unhashable, so each pair to be counted has to be converted to a tuple first.
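The error and the fix can be reproduced in a few lines. The pairs below are hypothetical (word, flag) pairs stored as lists:

```python
from collections import Counter

# Hypothetical (word, flag) pairs, stored as lists.
pairs = [["a", "n"], ["b", "v"], ["a", "n"]]

# Counter uses its items as dictionary keys, so lists are rejected.
try:
    Counter(pairs)
    message = ""
except TypeError as err:
    message = str(err)
print(message)

# Tuples are immutable and hashable, so they can be counted.
counts = Counter(tuple(p) for p in pairs)
print(counts[("a", "n")])  # 2
```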
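string
- split: splits a string into a list at a given separator (for several different separators, use re.split)
- replace: replaces every occurrence of a substring
- isdigit: checks whether the string consists only of digits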
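The three methods on one made-up line of text:

```python
line = "2023-03-23, jieba ,42\n"   # hypothetical input

# split cuts at every comma; strip removes the surrounding whitespace.
fields = [part.strip() for part in line.split(",")]
print(fields)   # ['2023-03-23', 'jieba', '42']

# replace swaps every occurrence of the substring.
date = fields[0].replace("-", "/")
print(date)     # 2023/03/23

# isdigit is True only when every character is a digit,
# so signs and decimal points make it False.
flags = [f.isdigit() for f in fields]
print(flags)                                  # [False, False, True]
print("-5".isdigit(), "3.14".isdigit())       # False False
```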
Lambda functions:

An anonymous function, that is, a function without a name.
```
lambda argument_list: expression
```
- map
- reduce
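A short sketch of a lambda used with map and reduce (the numbers are arbitrary). Note that in Python 3 `reduce` has to be imported from `functools`, and `map` returns a lazy iterator, so it is wrapped in `list` here.

```python
from functools import reduce

# A lambda is just an expression that evaluates to a function.
square = lambda x: x * x

# map applies the function to every element.
squares = list(map(square, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]

# reduce folds the sequence into one value, two items at a time.
total = reduce(lambda acc, x: acc + x, squares)
print(total)    # 30
```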