1. English word frequency count
Download the lyrics of an English song or an English article.
Replace all separators such as , . ? ! ' : with spaces.
Convert all uppercase letters to lowercase.
Generate the word list.
Generate the word frequency counts.
Sort by frequency.
Exclude function words: pronouns, articles, conjunctions.
Output the top 20 most frequent words.
Save the text to be analyzed as a UTF-8 encoded file and obtain the content for frequency analysis by reading the file.
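For the separator-replacement step, a loop of str.replace calls works, but the standard library's str.maketrans/str.translate can do all the substitutions in one pass; a minimal sketch:

```python
# Map each separator character to a space in a single pass
sep = ''',.?!'":'''
table = str.maketrans(sep, ' ' * len(sep))
print("Hello, world! It's fine.".translate(table))
```

The translation table is built once and can be reused on every line or file processed.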
file = open('E:/python/test.txt', 'r', encoding='utf-8')
news = file.read()
file.close()
# print(news)
sep = ''',.?!":()'''
for i in sep:
    news = news.replace(i, ' ')
wordList = news.lower().split()
wordDict = {}
wordSet = set(wordList)
wordCutSet = {'i', 'we', 'the', 'you', 'of', 'in', 'and', 'that', 'to', 'a', 'between', 'two', 'is', 'both', 'for', 'with'}
wordSet = wordSet - wordCutSet
for w in wordSet:
    wordDict[w] = wordList.count(w)
sortWord = sorted(wordDict.items(), key=lambda e: e[1], reverse=True)
save = open('E:/python/save.txt', 'w', encoding='utf-8')
save.write("Word frequency count\n")
for w in range(20):
    save.write(str(sortWord[w]) + "\n")
save.close()
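The count-and-sort steps above can also be written with the standard library's collections.Counter, which counts every word in one pass (calling wordList.count in a loop rescans the list for each word) and has a built-in most_common; the stop-word set here is just an illustrative subset:

```python
from collections import Counter

def top_words(text, stopwords, n=20):
    # Replace separators with spaces, lowercase, split into words
    for ch in ''',.?!'":()''':
        text = text.replace(ch, ' ')
    words = text.lower().split()
    # Count in a single pass, skipping function words
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

sample = "The sun is up, the sun is bright!"
print(top_words(sample, {'the', 'is'}, n=3))
# → [('sun', 2), ('up', 1), ('bright', 1)]
```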
Run result
2. Chinese word frequency count
Download a long Chinese article.
Read the text to be analyzed from a file.
news = open('gzccnews.txt', 'r', encoding='utf-8')
Install and use jieba for Chinese word segmentation.
pip install jieba
import jieba
words = jieba.lcut(news.read())
Generate the word frequency counts.
Sort by frequency.
Exclude function words: pronouns, articles, conjunctions.
Output the top 20 most frequent words (or save the result to a file).
import jieba

text = open('E:/python/围城.txt', 'r', encoding='utf-8')
story = text.read()
text.close()
sep = ''':。,?!;∶ ...“”'''
for i in sep:
    story = story.replace(i, ' ')
story_list = jieba.lcut(story)
exclude = [' ', '\n', '你', '我', '他', '和', '但', '了', '的', '来', '是', '去', '在', '上', '高',
           '她', '说', '—', '不', '也', '得', '就', '都', '里']
story_dict = {}
for w in story_list:
    story_dict[w] = story_dict.get(w, 0) + 1
for w in exclude:
    if w in story_dict:  # guard: deleting a word absent from the text would raise KeyError
        del story_dict[w]
for w in story_dict:
    print(w, story_dict[w])
dictList = list(story_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
# print(dictList)
for i in range(20):
    print(dictList[i])
outfile = open('E:/python/Top20.txt', 'a', encoding='utf-8')
for i in range(20):
    outfile.write(dictList[i][0] + " " + str(dictList[i][1]) + "\n")
outfile.close()
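The counting, stop-word removal, and sorting steps above can be condensed with collections.Counter; this sketch assumes story_list already holds the tokens produced by jieba.lcut, so a small stand-in token list is used here:

```python
from collections import Counter

# Stand-in token list; in the script above this comes from jieba.lcut(story)
story_list = ['我', '去', '城里', '看', '他', '城里', '人', '多']
exclude = {'我', '去', '他'}

# Count every token while skipping stop words, then take the most frequent
counts = Counter(w for w in story_list if w not in exclude)
print(counts.most_common(3))
# → [('城里', 2), ('看', 1), ('人', 1)]
```

Filtering during counting also avoids the separate deletion loop, so no KeyError guard is needed.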
Run result