Counting word frequency in Python with the jieba library: how counts[word] = counts.get(word,0)+1 works

This post explains how to count word frequency in Python with counts[word] = counts.get(word,0)+1. When word is not yet a key in the counts dictionary, get returns the default value 0; when it is already present, get returns the current count, and adding 1 accumulates the tally.
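As a minimal sketch of the idiom (the token list below is made up purely for illustration), get(word, 0) supplies 0 for a key that is not yet in the dictionary, so no KeyError is ever raised:

```python
counts = {}
for word in ["苹果", "香蕉", "苹果"]:   # hypothetical tokens for illustration
    # first occurrence: get() falls back to 0, so the entry becomes 1
    # later occurrences: get() returns the running count, which is incremented
    counts[word] = counts.get(word, 0) + 1

print(counts)   # {'苹果': 2, '香蕉': 1}
```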

```python
import jieba

txt = open("阿甘正传-网络版.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)          # segment the Chinese text with jieba (precise mode)
counts = {}                      # start with an empty dictionary
for word in words:
    if len(word) == 1:           # skip single-character tokens (not counted)
        continue
    else:
        counts[word] = counts.get(word, 0) + 1   # get() returns 0 for an unseen word, otherwise the current count; adding 1 accumulates the tally
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)     # sort the (word, count) pairs by count, descending
for i in range(10):
    word, count = items[i]       # unpack the corresponding key-value pair
    print("{0}:{1}".format(word, count))
```
    

Note: counts[word] = counts.get(word,0)+1 is what tallies how often each word occurs. When word is not yet a key in counts, get returns 0; when it is, get returns the current count, and adding 1 accumulates the total.
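For comparison only, and not the approach used above, the standard library's collections.Counter performs the same get-or-default accumulation internally and can replace the manual dictionary:

```python
from collections import Counter
import jieba

txt = open("阿甘正传-网络版.txt", "r", encoding="utf-8").read()
# keep only tokens longer than one character, mirroring the loop above
counts = Counter(word for word in jieba.lcut(txt) if len(word) > 1)
for word, count in counts.most_common(10):
    print("{0}:{1}".format(word, count))
```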

Output of the program above (the top 10 words and their counts):

The related example below applies the same counting pattern to the text of 《红楼梦》, step by step: first a plain word count, then filtering single-character tokens, then stopwords, then merging alternate names of the same character, and finally rendering word clouds from the saved frequency file.

A first version reads the file and writes the top `topn` words and their counts to a `_词频.txt` file:

```python
def getText(filepath):
    f = open(filepath, "r", encoding='utf-8')
    text = f.read()
    f.close()
    return text

import jieba

def wordFreq(filepath, text, topn):
    words = jieba.lcut(text.strip())
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    f = open(filepath[:-4] + '_词频.txt', "w")
    for i in range(topn):
        word, count = items[i]
        f.writelines("{}\t{}\n".format(word, count))
        print("{0:<10}{1:>5}".format(word, count))
    f.close()
```

A second version skips single-character tokens:

```python
import jieba

def wordFreq(filepath, text, topn):
    words = jieba.lcut(text.strip())
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    f = open(filepath[:-4] + '_词频.txt', "w")
    for i in range(topn):
        word, count = items[i]
        f.writelines("{}\t{}\n".format(word, count))
    f.close()
```

A third version also filters stopwords loaded from a file:

```python
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

def wordFreq(filepath, text, topn):
    words = jieba.lcut(text.strip())
    counts = {}
    stopwords = stopwordslist('D:/python/stop_words.txt')
    for word in words:
        if len(word) == 1:
            continue
        elif word not in stopwords:
            counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    f = open(filepath[:-4] + '_词频.txt', "w")
    for i in range(topn):
        word, count = items[i]
        f.writelines("{}\t{}\n".format(word, count))
    f.close()
```

A fourth version additionally merges different references to the same character (for example, "凤姐儿" is counted as "凤姐") before counting:

```python
def wordFreq(filepath, text, topn):
    words = jieba.lcut(text.strip())
    counts = {}
    stopwords = stopwordslist('D:/python/stop_words.txt')
    for word in words:
        if len(word) == 1:
            continue
        elif word not in stopwords:
            if word == "凤姐儿":
                word = "凤姐"
            elif word == "林黛玉" or word == "林妹妹" or word == "黛玉笑":
                word = "黛玉"
            elif word == "宝二爷":
                word = "宝玉"
            elif word == "袭人道":
                word = "袭人"
            counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    f = open(filepath[:-4] + '_词频.txt', "w")
    for i in range(topn):
        word, count = items[i]
        f.writelines("{}\t{}\n".format(word, count))
    f.close()
```

Finally, the saved frequency file is rendered as a word cloud, first with default settings and then with a Chinese font (SimHei) and an image mask:

```python
import wordcloud

f = open("D:/python/红楼梦_词频.txt", 'r')
text = f.read()
wcloud = wordcloud.WordCloud(background_color="white", width=100, max_words=500,
                             height=860, margin=2).generate(text)
wcloud.to_file("D:/python/红楼梦1cloud.png")

import matplotlib.pyplot as plt
import wordcloud
from imageio import imread

bg_pic = imread('D:/python/star(1).jpg')
f = open("D:/python/红楼梦_词频.txt", 'r')
text = f.read()
f.close()
wcloud = wordcloud.WordCloud(font_path=r'C:\Windows\Fonts\simhei.ttf', background_color="white",
                             width=1000, max_words=500, mask=bg_pic, height=860,
                             margin=2).generate(text)
wcloud.to_file("D:/python/红楼梦2cloud_star.png")
plt.imshow(wcloud)
plt.axis('off')
plt.show()
```
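A minimal driver tying getText and wordFreq together might look like the following sketch; the file path and topn value are illustrative placeholders, not taken from the original code:

```python
# Hypothetical usage of the getText/wordFreq functions defined above;
# the path and the topn value are illustrative placeholders.
filepath = "D:/python/红楼梦.txt"
text = getText(filepath)
wordFreq(filepath, text, 20)   # writes the top 20 words to D:/python/红楼梦_词频.txt
```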
### More efficient word-frequency counting and word-cloud generation

To make word-frequency counting and word-cloud generation more efficient, the code can be improved in several ways: removing stopwords, merging synonyms, and using a specific font and background image for the word cloud. The details follow.

#### 1. Removing stopwords

Stopwords are words that appear frequently in a text but contribute little to its meaning (such as "的" or "是"). Loading a stopword list and filtering the segmented words against it improves the accuracy of the frequency count.

```python
import jieba
from collections import Counter

# Load the stopword list
def load_stopwords(stopwords_path):
    with open(stopwords_path, 'r', encoding='utf-8') as f:
        stopwords = set(f.read().splitlines())
    return stopwords

# Segment the text and drop stopwords
def preprocess_text(text, stopwords):
    words = jieba.lcut(text)
    filtered_words = [word for word in words if word not in stopwords and word.strip()]
    return filtered_words

stopwords = load_stopwords("stopwords.txt")  # replace with the path to your stopword file
text = "这是一段示例文本,用于演示如何去除停用词。"
filtered_words = preprocess_text(text, stopwords)
```

The code above filters the segmentation result against a stopword list[^1].

#### 2. Merging synonyms

To refine the count further, synonyms can be counted as a single item. For example, "学习" and "研究" may carry similar meanings, so a synonym dictionary can map them onto one key.

```python
# Build a synonym dictionary
synonyms_dict = {
    "学习": ["研究", "探索"],
    "高兴": ["开心", "愉快"]
}

# Merge synonym counts onto a single key
def merge_synonyms(word_counts, synonyms_dict):
    merged_counts = {}
    for word, count in word_counts.items():
        found = False
        for key, synonyms in synonyms_dict.items():
            if word in synonyms:
                merged_counts[key] = merged_counts.get(key, 0) + count
                found = True
                break
        if not found:
            merged_counts[word] = count
    return merged_counts

word_counts = Counter(filtered_words)
merged_word_counts = merge_synonyms(word_counts, synonyms_dict)
```

The code above shows how a synonym dictionary collapses several words into one counted item.

#### 3. Generating a word cloud with a specific font and background image

The word cloud itself is produced with the `wordcloud` package, which supports a custom font and an image mask.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# Generate and display a word cloud from a frequency dict
def generate_wordcloud(word_counts, font_path, mask_image_path):
    mask = np.array(Image.open(mask_image_path))
    wc = WordCloud(
        font_path=font_path,
        background_color="white",
        max_words=200,
        mask=mask,
        contour_width=3,
        contour_color='steelblue'
    )
    wc.generate_from_frequencies(word_counts)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

generate_wordcloud(merged_word_counts, "simhei.ttf", "background.png")  # replace with your font and background image paths
```

The code above generates a word cloud with the specified font and background image[^1].

### Notes

- Make sure the chosen font supports Chinese characters; otherwise the Chinese text in the word cloud may not render correctly.
- Choose a background image related to the topic to strengthen the visual effect.