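China University MOOC course "Python Programming" (Python语言程序设计), Chapter 6: text word-frequency and character counting, threekingdoms (Romance of the Three Kingdoms) code and walkthrough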

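This post shows how to count character-name frequencies in Romance of the Three Kingdoms with Python and the jieba library. By filtering out common words that are not names, it lists the 30 most frequently mentioned characters and explains the code step by step.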


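The following is what Professor Song Tian (嵩天) covered in class: finding the top 20 characters in Romance of the Three Kingdoms, that is, the 20 characters mentioned most often. To make copying easier for fellow students, I have put the code and explanation below, and with a bit of extra manual work I extended it to a top 30. The example is a good way to understand how to approach a real problem: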

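import jieba      # Import jieba; this assumes the third-party library is already installed.
# I use PyCharm Edu, which offers to install jieba from the warning marker shown on first use; that installs jieba3k.
# Unfortunately the install failed for me, so I followed an online tutorial, downloaded jieba locally and installed it from there.
# After that I created a new project; two options need to be ticked in Settings (notes on this are in my CSDN favorites).
# The course code started as a plain word-frequency count; it is adapted here into a character count by adding the excludes set.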

with open("threekingdoms.txt","r",encoding="utf-8") as f:  # the with block closes the file after reading
    txt=f.read()
excludes={"将军","却说","荆州","二人","不可","不能","如此","商议","如何","主公","军士",
          "左右","军马","引兵","次日","大喜","天下","东吴","于是","今日","不敢","魏兵",
          "陛下","一人","都督","人马","不知","汉中","只见","众将","后主","蜀兵","上马",
          "大叫","太守","此人","夫人","先主","后人","背后","城中","天子","一面","何不",
          "大军","忽报","先生","百姓","何故","然后","先锋","不如","赶来","原来","令人",
          "江东","下马","喊声","正是","徐州","忽然","因此","成都","不见","未知","大败",
          "大事","之后","一军","引军","起兵","军中","接应","进兵","大惊","可以","以为",
          "大怒","不得","心中","下文","一声","追赶","粮草","曹兵","一齐","分解","回报",
          "分付","只得","出马","三千","大将","许都","随后","报知","前面","之兵","且说",
          "众官","洛阳","领兵","何人","星夜","精兵","城上","之计","不肯","相见","其言",
          "一日","而行","文武","襄阳","准备","若何","出战","亲自","必有","此事","军师",
          "之中","伏兵","祁山","乘势","忽见","大笑","樊城","兄弟","首级","立于","西川",
          "朝廷","三军","大王","传令","当先","五百","一彪","坚守","此时","之间","投降",
          "五千","埋伏","长安","三路","遣使"}
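# The excludes set was grown by hand: after each run, words in the results that are not names were added to it.
# Many of the later additions occur around 160 times, and many others around 90 times.
# The ranking is still not exact: for example, if 都督 (commander-in-chief) refers to Zhou Yu (周瑜), his rank would probably be higher.
words=jieba.lcut(txt)  # jieba's precise mode; returns the segmentation result as a list
counts={}
for word in words:  # The key step: use each word in the words list as a key into the dict.
    # If the key already exists, add 1 to its value; otherwise start from 0 and add 1.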
    if len(word)==1:
        continue
    elif word=="诸葛亮" or word=="孔明曰":
        rword="孔明"
    elif word=="关公" or word=="云长":
        rword="关羽"
    elif word=="玄德" or word=="玄德曰":
        rword="刘备"
    elif word=="孟德" or word=="丞相":
        rword="曹操"
    else:
        rword=word
    counts[rword]=counts.get(rword,0)+1
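for word in excludes:
    counts.pop(word,None)   # Remove the words in excludes; pop with a default avoids a KeyError if a word never appeared
items=list(counts.items())  # Convert the dict to a list of (word, count) pairs so it can be sorted
items.sort(key=lambda x:x[1],reverse=True)  # Sort the pairs by their second element, the count.
# reverse defaults to False (ascending); True sorts from largest to smallest. Note the key= argument: the lambda tells list.sort which value to sort by.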
for i in range(30):
    word,count=items[i]
    if len(word)==2:
        print('{0:<5} {1:<12} {2:>6}'.format(i+1,word,count))
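        # Three-character names such as 司马懿 and 夏侯惇 pushed the count column one position too far to the right.
        # The cause is {1:<12}: a Chinese character counts as one character for padding but displays wider than a space, so this if/else uses a narrower field for longer names to keep the columns aligned.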
    else:
        print('{0:<5} {1:<11} {2:>6}'.format(i+1,word,count))
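
The if/else above handles two- and three-character names by hand. As a rough alternative, the sketch below pads by display width instead of character count. It assumes a monospace terminal that draws each Chinese character two columns wide; the helper names `display_width` and `pad_right` and the sample rows are made up for illustration and are not part of the course code.

```python
import unicodedata

def display_width(text):
    """Approximate column count: wide/fullwidth characters count as 2, others as 1."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

def pad_right(text, width):
    """Left-align text in a field that is `width` display columns wide."""
    return text + " " * max(0, width - display_width(text))

# Hypothetical sample rows, not real counts from the novel.
sample = [(1, "曹操", 100), (2, "司马懿", 50)]
for rank, name, count in sample:
    print("{0:<5} {1} {2:>6}".format(rank, pad_right(name, 14), count))
```

With a 14-column field this gives the same spacing as the original `{1:<12}` and `{1:<11}` pair under the two-column assumption, and it also covers names of other lengths.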

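The counting steps above (skip single-character words, merge aliases, drop excluded words, sort by count) can also be written more compactly. This is only a sketch using `collections.Counter` from the standard library, not the version taught in the course. The alias table is copied from the if/elif chain, while the function name `count_names` and the pre-segmented `tokens` list are hypothetical stand-ins for the output of `jieba.lcut(txt)`.

```python
from collections import Counter

# Alias table copied from the if/elif chain: variant -> canonical name.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

def count_names(words, excludes, top_n):
    """Count words longer than one character, merging aliases and dropping excludes."""
    counts = Counter(ALIASES.get(w, w) for w in words if len(w) > 1)
    for w in excludes:
        counts.pop(w, None)  # no error if an excluded word never appeared
    return counts.most_common(top_n)  # sorted from most to least frequent

# Hypothetical tokens standing in for jieba.lcut(txt).
tokens = ["玄德", "曰", "孔明", "诸葛亮", "将军", "玄德曰", "关公", "孔明曰"]
top = count_names(tokens, {"将军", "却说"}, 3)
print(top)  # [('孔明', 3), ('刘备', 2), ('关羽', 1)]
```

Keeping the aliases in a dict means a new variant name is one more entry rather than another elif branch.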