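Chinese Word Segmentation
- Download a full-length Chinese novel and convert it to UTF-8 encoding.
- Use the jieba library to compute Chinese word frequencies, and print the top 20 words with their counts.
- Exclude meaningless words and merge different forms of the same word.
- Give a brief interpretation of the word-frequency results.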
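
The first task, converting the downloaded text to UTF-8, could be sketched as follows. This is a minimal sketch, not the method used in this post: the helper name `convert_to_utf8` is hypothetical, and the source encoding is assumed to be GB18030 (a superset of GBK, common for Chinese text files); the real encoding of a downloaded file may differ.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8 and return the number of characters.

    Assumption: the source file is GB18030/GBK encoded. Pass a different
    src_encoding if the downloaded file uses something else.
    """
    # newline="" keeps the original line endings untouched in both directions
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return len(text)
```

A call such as `convert_to_utf8('raw.txt', 'D:\\xiaoshuo.txt')` (where `raw.txt` is a placeholder name) would produce the file that the script reads. If the assumed encoding is wrong, `open` raises `UnicodeDecodeError` rather than silently writing garbled text.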
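import jieba

# Read the text to analyze (the file must be UTF-8 encoded)
with open('D:\\xiaoshuo.txt', 'r', encoding='utf-8') as book:
    text = book.read()

# Strip punctuation, spaces and newlines before segmentation
for ch in ',。!、 \n “ ” ;':
    text = text.replace(ch, '')

# Segment the text and keep each distinct word once
words = set(jieba.cut(text))

# Count dictionary: word -> occurrences (single-character tokens are skipped).
# Note: text.count() counts substring matches in the raw text,
# not the number of segmented tokens.
counts = {}
for w in words:
    if len(w) > 1:
        counts[w] = text.count(w)

# Sort by count, descending, and print the top 20
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for item in items[:20]:
    print(item)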
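
The script above does not yet cover the task of excluding meaningless words and merging forms of the same word. One way to do that is sketched below, under stated assumptions: the `STOPWORDS` set and the `ALIASES` mapping are hypothetical examples (the stopwords are picked from the output shown in this post; the alias pair is made up for illustration), and the helper names `tokenize` and `top_words` are not part of jieba. Unlike the script, this version counts segmented tokens with `collections.Counter`, so its numbers may differ from the `text.count()` results.

```python
from collections import Counter

# Hypothetical examples; a real list would be tuned to the text being analyzed.
STOPWORDS = {"为了", "不必", "一会", "一半", "终于"}
ALIASES = {"爸爸": "父亲"}  # variant form -> canonical form


def tokenize(text):
    """Segment Chinese text into a list of words with jieba."""
    import jieba  # imported lazily so top_words() can be used without jieba
    return jieba.lcut(text)


def top_words(tokens, n=20, stopwords=STOPWORDS, aliases=ALIASES):
    """Return the n most frequent words as (word, count) pairs."""
    counter = Counter()
    for tok in tokens:
        tok = tok.strip()
        if len(tok) < 2:
            continue  # skip single characters, punctuation and whitespace
        tok = aliases.get(tok, tok)  # merge variants into one canonical word
        if tok in stopwords:
            continue  # drop words that carry little meaning
        counter[tok] += 1
    return counter.most_common(n)
```

With the text loaded as in the script, `for item in top_words(tokenize(text)): print(item)` would print the filtered top 20.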
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>>
============================= RESTART: D:/daa.py =============================
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\asus\AppData\Local\Temp\jieba.cache
Loading model cost 1.306 seconds.
Prefix dict has been built succesfully.
('父亲', 10)
('背影', 4)
('丧事', 3)
('北京', 3)
('散文', 3)
('茶房', 3)
('那年', 2)
('父母', 2)
('踌躇', 2)
('朱自清', 2)
('要紧', 2)
('终于', 2)
('日子', 2)
('一会', 2)
('一半', 2)
('子女', 2)
('描写', 2)
('回家', 2)
('不必', 2)
('为了', 2)
>>>