1. English word frequency count
Download the lyrics of an English song or an English article.
Replace all separators such as , . ? ! ' : with spaces.
Convert all uppercase letters to lowercase.
Generate the word list.
Generate the word frequency counts.
Sort by frequency.
Exclude function words: pronouns, articles, conjunctions.
Output the top 20 most frequent words.
Save the text to be analyzed as a UTF-8 encoded file and obtain the content for frequency analysis by reading the file.
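For the separator-replacement step, a loop of str.replace calls works, but the standard library's str.maketrans/str.translate can do all the substitutions in one pass; a minimal sketch:

```python
# Map each separator character to a space in a single pass
sep = ''',.?!'":'''
table = str.maketrans(sep, ' ' * len(sep))
print("Hello, world! It's fine.".translate(table))
```

The translation table is built once and can be reused on every line or file processed.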
file = open('E:/python/test.txt', 'r', encoding='utf-8')
news = file.read()
file.close()
# print(news)
sep = ''',.?!":()'''
for i in sep:
    news = news.replace(i, ' ')
wordList = news.lower().split()
wordDict = {}
wordSet = set(wordList)
wordCutSet = {'i', 'we', 'the', 'you', 'of', 'in', 'and', 'that', 'to', 'a', 'between', 'two', 'is', 'both', 'for', 'with'}
wordSet = wordSet - wordCutSet
for w in wordSet:
    wordDict[w] = wordList.count(w)
sortWord = sorted(wordDict.items(), key=lambda e: e[1], reverse=True)
save = open('E:/python/save.txt', 'w', encoding='utf-8')
save.write("Word frequency count\n")
for w in range(20):
    save.write(str(sortWord[w]) + "\n")
save.close()
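The count-and-sort steps above can also be written with the standard library's collections.Counter, which counts every word in one pass (calling wordList.count in a loop rescans the list for each word) and has a built-in most_common; the stop-word set here is just an illustrative subset:

```python
from collections import Counter

def top_words(text, stopwords, n=20):
    # Replace separators with spaces, lowercase, split into words
    for ch in ''',.?!'":()''':
        text = text.replace(ch, ' ')
    words = text.lower().split()
    # Count in a single pass, skipping function words
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

sample = "The sun is up, the sun is bright!"
print(top_words(sample, {'the', 'is'}, n=3))
# → [('sun', 2), ('up', 1), ('bright', 1)]
```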
Run result
2. Chinese word frequency count
Download a long Chinese article.
Read the text to be analyzed from a file.
news = open('gzccnews.txt', 'r', encoding='utf-8')
Install and use jieba for Chinese word segmentation.
pip install jieba
import jieba
words = jieba.lcut(news.read())
Generate the word frequency counts.
Sort by frequency.
Exclude function words: pronouns, articles, conjunctions.
Output the top 20 most frequent words (or save the result to a file).
import jieba

text = open('E:/python/围城.txt', 'r', encoding='utf-8')
story = text.read()
text.close()
sep = ''':。,?!;∶ ...“”'''
for i in sep:
    story = story.replace(i, ' ')
story_list = jieba.lcut(story)
exclude = [' ', '\n', '你', '我', '他', '和', '但', '了', '的', '来', '是', '去', '在', '上', '高',
           '她', '说', '—', '不', '也', '得', '就', '都', '里']
story_dict = {}
for w in story_list:
    story_dict[w] = story_dict.get(w, 0) + 1
for w in exclude:
    if w in story_dict:  # guard: deleting a word absent from the text would raise KeyError
        del story_dict[w]
for w in story_dict:
    print(w, story_dict[w])
dictList = list(story_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
# print(dictList)
for i in range(20):
    print(dictList[i])
outfile = open('E:/python/Top20.txt', 'a', encoding='utf-8')
for i in range(20):
    outfile.write(dictList[i][0] + " " + str(dictList[i][1]) + "\n")
outfile.close()
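The counting, stop-word removal, and sorting steps above can be condensed with collections.Counter; this sketch assumes story_list already holds the tokens produced by jieba.lcut, so a small stand-in token list is used here:

```python
from collections import Counter

# Stand-in token list; in the script above this comes from jieba.lcut(story)
story_list = ['我', '去', '城里', '看', '他', '城里', '人', '多']
exclude = {'我', '去', '他'}

# Count every token while skipping stop words, then take the most frequent
counts = Counter(w for w in story_list if w not in exclude)
print(counts.most_common(3))
# → [('城里', 2), ('看', 1), ('人', 1)]
```

Filtering during counting also avoids the separate deletion loop, so no KeyError guard is needed.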
Run result