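The following comes from a lecture by Professor Song Tian (嵩天): finding the top 20 characters in Romance of the Three Kingdoms, that is, the 20 characters who appear most often. To make it easy for fellow students to copy, I have put the code and my notes below, and with a bit of extra manual work I extended it to a top 30. The example is a good way to understand how to approach a practical problem: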
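import jieba
# jieba is a third-party library and must already be installed.
# I use PyCharm Edu; on first use you can install jieba from the
# exclamation-mark hint (what it installs there is jieba3k).
# That install failed for me, so I followed an online tutorial, downloaded
# jieba locally, and installed it from there. After installing, create a new
# project; two options in Settings need to be checked (there are notes on
# this in my CSDN favorites).

# The code below started as a plain word-frequency count, not a character
# count. Adding the excludes set turns it into a character count.
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()

excludes = {"将军","却说","荆州","二人","不可","不能","如此","商议","如何","主公","军士",
            "左右","军马","引兵","次日","大喜","天下","东吴","于是","今日","不敢","魏兵",
            "陛下","一人","都督","人马","不知","汉中","只见","众将","后主","蜀兵","上马",
            "大叫","太守","此人","夫人","先主","后人","背后","城中","天子","一面","何不",
            "大军","忽报","先生","百姓","何故","然后","先锋","不如","赶来","原来","令人",
            "江东","下马","喊声","正是","徐州","忽然","因此","成都","不见","未知","大败",
            "大事","之后","一军","引军","起兵","军中","接应","进兵","大惊","可以","以为",
            "大怒","不得","心中","下文","一声","追赶","粮草","曹兵","一齐","分解","回报",
            "分付","只得","出马","三千","大将","许都","随后","报知","前面","之兵","且说",
            "众官","洛阳","领兵","何人","星夜","精兵","城上","之计","不肯","相见","其言",
            "一日","而行","文武","襄阳","准备","若何","出战","亲自","必有","此事","军师",
            "之中","伏兵","祁山","乘势","忽见","大笑","樊城","兄弟","首级","立于","西川",
            "朝廷","三军","大王","传令","当先","五百","一彪","坚守","此时","之间","投降",
            "五千","埋伏","长安","三路","遣使"}
# The excludes set was grown by hand: after each run I added the non-character
# words that showed up in the results. Many of the later words have counts
# around 160, and many more around 90.
# The character ranking is still not exact. For example, if 都督 refers to
# 周瑜, then 周瑜 would probably rank higher than shown.

words = jieba.lcut(txt)  # jieba's accurate mode; returns the tokens as a list
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character tokens
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    # The key line: look the word up in the dictionary. If the key already
    # exists, add 1 to its value; if not, start from 0 and then add 1.
    counts[rword] = counts.get(rword, 0) + 1

for word in excludes:
    # Remove the words listed in excludes. pop() with a default is used
    # instead of del so that a word absent from counts does not raise KeyError.
    counts.pop(word, None)

items = list(counts.items())  # convert the dict to a list so it can be sorted
# Sort the (word, count) pairs by their second element. reverse defaults to
# False (ascending); True sorts in descending order. Note the 'key=' argument:
# this is the lambda form of list.sort and is worth memorizing.
items.sort(key=lambda x: x[1], reverse=True)

for i in range(min(30, len(items))):  # guard against fewer than 30 entries
    word, count = items[i]
    # Three-character names such as 司马懿 and 夏侯惇 stick out by one
    # character on the right. The cause is {1:<12}: each Chinese character
    # counts as one character for padding but renders wider than a space.
    # This if/else narrows the field for longer names so the columns line up.
    if len(word) == 2:
        print('{0:<5} {1:<12} {2:>6}'.format(i + 1, word, count))
    else:
        print('{0:<5} {1:<11} {2:>6}'.format(i + 1, word, count))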
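
The merge-and-count step and the column alignment could also be written another way. Below is a minimal sketch, not the version from the lecture: it swaps the elif chain for a lookup table and pads by display width instead of checking the name length. The names ALIASES, count_names, display_width, pad and format_rows are my own, chosen for illustration, and the sketch assumes a terminal that draws each Chinese character two columns wide. It takes an already tokenized list, so it runs without jieba.

```python
import unicodedata
from collections import Counter

# Hypothetical alias table: alternate token -> canonical name.
# The pairs are the same merges done by the elif chain in the post.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_names(words, excludes=frozenset(), aliases=ALIASES):
    """Tally tokens, merging aliases and skipping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue  # single-character tokens are never names here
        name = aliases.get(word, word)  # fall back to the token itself
        if name not in excludes:
            counts[name] += 1
    return counts


def display_width(text):
    """Columns the text occupies, assuming wide characters take two."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad(text, width):
    """Left-align text in a field of `width` display columns."""
    return text + " " * max(0, width - display_width(text))


def format_rows(counts, top_n):
    """Return one formatted 'rank name count' line per top entry."""
    rows = []
    for rank, (name, count) in enumerate(counts.most_common(top_n), start=1):
        rows.append(f"{rank:<5}{pad(name, 12)}{count:>6}")
    return rows
```

To try it on the novel, pass the jieba token list and the exclusion set, for example print("\n".join(format_rows(count_names(jieba.lcut(txt), excludes), 30))). Because padding is computed from the display width, two-, three- and four-character names line up without a special case for each length.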