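Problem 0006: You have a directory holding a month of your diary entries, all as txt files. To sidestep word-segmentation issues, assume the content is entirely in English. Find the word you consider most important in each entry.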

License: CC 4.0 BY-SA

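This post presents a simple Python program that counts how often each word appears in a given file, excluding a few common stop words. The program reads the file, strips non-letter characters, converts the words to lowercase, and builds a dictionary that maps each word to its number of occurrences.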

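Cutting a corner here: I'm not writing the directory traversal, since I don't like keeping a diary anyway.
Strictly speaking, the ignored words should come from a proper stop-word list; that isn't needed yet, so I haven't looked into it.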
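
The stop-word list mentioned above could live outside the code. A minimal sketch, assuming a hypothetical plain-text file with one stop word per line; the helper name `load_stop_words` and the file format are illustrative assumptions, not part of the original program:

```python
# Fallback list: the same words the program in this post ignores
DEFAULT_STOP_WORDS = frozenset(
    ['a', 'an', 'is', 'it', 'are', 'of', 'by', 'the', 'and', 'for', 'in', 'to']
)


def load_stop_words(path=None):
    """Return a set of lowercase stop words.

    `path` is a hypothetical text file with one word per line; when it is
    omitted, the small built-in list above is used instead.
    """
    if path is None:
        return set(DEFAULT_STOP_WORDS)
    with open(path) as fh:
        # Skip blank lines and normalise case so lookups match lowercased words
        return {line.strip().lower() for line in fh if line.strip()}
```

Returning a set rather than a list keeps each `word in stop_words` check at constant time on average, which starts to matter once the list grows beyond a handful of entries.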

import os

os.chdir('C:/workspace')

def count_words(inputname):
    # Read the whole file; the with-block closes it even if read() raises
    with open(inputname) as fh:
        read_fh = fh.read()
    number=1
    is_alpha=[]
    dict_words={}
    ignore_words=['a','an','is','it','are','of','by','the','and','for','in','to']

    for char in read_fh:  # keep letters and whitespace; drop digits, punctuation, etc.
        if char.isalpha() or char in '\t\n ':
            is_alpha.append(char)
    fh_alpha=''.join(is_alpha)
    fh_words=fh_alpha.split()
    for words in fh_words:  # build the word -> count dictionary
        words = words.lower()
        if words in ignore_words:
            continue
        dict_words[words] = dict_words.get(words, 0) + 1
    # Sort the (word, count) pairs by count, highest first
    dict_sort = sorted(dict_words.items(), key=lambda d: d[1], reverse=True)

    # Print up to three of the most frequent words (fewer if the file has fewer distinct words)
    for rank, (word, count) in zip(('Most', 'Second most', 'Third most'), dict_sort):
        print('%s frequent word is "%s" and it appears %d times' % (rank, word, count))

count_words("words.txt")
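
The directory traversal that was skipped above could look roughly like this. It is only a sketch under a few assumptions: the function names `top_word` and `top_words_in_dir` are made up for illustration, "most important" is taken to mean "most frequent after dropping the ignored words", and words are extracted with a regular expression, so `don't` becomes `don` and `t` rather than `dont` as in the program above.

```python
import os
import re
from collections import Counter

# Same ignore list as in the program above
IGNORE_WORDS = {'a', 'an', 'is', 'it', 'are', 'of', 'by', 'the', 'and', 'for', 'in', 'to'}


def top_word(path):
    """Return (word, count) for the most frequent non-ignored word, or None."""
    with open(path) as fh:
        text = fh.read().lower()
    # [a-z]+ keeps runs of ASCII letters; digits and punctuation act as separators
    words = [w for w in re.findall(r'[a-z]+', text) if w not in IGNORE_WORDS]
    counts = Counter(words)
    return counts.most_common(1)[0] if counts else None


def top_words_in_dir(diary_dir):
    """Map each .txt file name in diary_dir to its most frequent word."""
    result = {}
    for name in sorted(os.listdir(diary_dir)):
        if name.lower().endswith('.txt'):
            result[name] = top_word(os.path.join(diary_dir, name))
    return result
```

Calling `top_words_in_dir('C:/workspace')` would then return one entry per diary file, with `None` for any file that contains no countable words.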