Word Frequency Statistics: Basic Functions

weixin_30642029

License: CC 4.0 BY-SA

Tags: git, artificial intelligence, python

Original link: http://www.cnblogs.com/zhuangzq/p/9815267.html

1. Basic Information

1.1 Assignment URL: https://edu.cnblogs.com/campus/ntu/Embedded_Application/homework/2088

1.2 Project Git repository: https://gitee.com/ntucs/PairProg

2. Project Analysis

2.1 Program modules (methods and functions)

① Task 1: reading the file into a buffer

import re
import nltk     # needed by Phrase_statistics (Task 2)

def process_file(dst):     # read the file into a buffer
    try:     # open the file and read all of its text
        with open(dst, 'r') as f:
            bvffer = f.read()
    except IOError as s:
        print(s)
        return None
    return bvffer

② Task 1: counting lines, and counting word frequencies with regular expressions into a dictionary

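def line_count(dst):
    # Count the lines of the file and print the total
    with open(dst, 'r') as f:
        count = sum(1 for _ in f)
    print("text line :", count)

def process_buffer(bvffer):
    c = bvffer.lower()
    # Blank out digit runs that are directly followed by letters, e.g. "2nd"
    result = re.sub(r"[0-9]+[a-z]+", " ", c)
    # A word starts with a letter and has at least two characters
    re1 = re.findall(r'[a-z]+\w+', result)
    # Load the stop words into a set so that whole words are compared;
    # testing against the raw string would also match substrings
    with open("stopwords.txt", 'r') as f:
        stop_words = set(f.read().split())
    word_freq = {}
    # Count the frequency of each word and store it in the word_freq dictionary
    for word in re1:
        if word not in stop_words:
            word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq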
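
To make the two regular expressions concrete, the sketch below applies them to a made-up sentence. The helper name `tokenize` and the sample text are assumptions for illustration only and are not part of the project.

```python
import re

def tokenize(text):
    """Apply the same two regex steps as process_buffer and return the word list."""
    lowered = text.lower()
    # Step 1: blank out digit runs that are directly followed by letters, e.g. "2nd"
    cleaned = re.sub(r"[0-9]+[a-z]+", " ", lowered)
    # Step 2: keep tokens that start with a letter and have at least two characters
    return re.findall(r"[a-z]+\w+", cleaned)

sample = "The 2nd test: a word, word2 and Python3!"
print(tokenize(sample))
```

In this sample the token "2nd" is removed by the first step and the single-letter word "a" is dropped by the second, while "word2" and "python3" survive because `\w` also matches digits.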

③ Task 1: saving the top ten results to result.txt

def output_result(word_freq):
    # Print the top 10 words and write the same lines to result.txt
    with open('result.txt', 'w') as doc:
        if word_freq:
            sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
            print(len(word_freq))     # number of distinct words
            for item in sorted_word_freq[:10]:  # output the top 10 words
                print(item[0], ":", item[1])
                print(item[0], ":", item[1], file=doc)
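
As a possible alternative to sorting the whole dictionary, the standard library's `collections.Counter` can return the top entries directly. This is only a sketch; the helper name `top_words` and the sample frequencies are assumptions, not taken from the project.

```python
from collections import Counter

def top_words(word_freq, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return Counter(word_freq).most_common(n)

# Made-up frequencies for illustration
freq = {"apple": 3, "pear": 1, "fig": 5}
for word, count in top_words(freq, 2):
    print(word, ":", count)
```

The result has the same shape as `sorted_word_freq[:10]`, so the printing loop in output_result would not need to change.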

④ Task 1: main function that calls each module

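if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('dst')     # path of the text file to analyze
    args = parser.parse_args()
    dst = args.dst
    line_count(dst)
    bvffer = process_file(dst)
    if bvffer is not None:     # process_file returns None when reading fails
        word_freq = process_buffer(bvffer)
        output_result(word_freq)
        Phrase_statistics(bvffer)     # Task 2: list frequent phrases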

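⑤ Task 2: stop word module

Implementation: nltk (Natural Language Toolkit, one of the most widely used Python libraries in NLP) is used to download an English stop word list, which is stored in the list_stopWords set. Each English word to be processed is then compared against list_stopWords; if it matches an entry, the word is skipped. This is the stop word filter. The code below reads the stop words from the file stopwords.txt.

The code is as follows: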

    with open("stopwords.txt", 'r') as f:       # load the stop word list
        stop_words = set(f.read().split())
    word_freq = {}
    # Count the frequency of each word and store it in the word_freq dictionary
    for word in re1:
        if word not in stop_words:       # skip stop words
            word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq
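
The description above mentions downloading the stop word list through nltk, while the listing reads it from a file. A minimal sketch of the nltk route might look like the following, assuming nltk is installed and its `stopwords` corpus can be downloaded. The function names `load_english_stop_words` and `remove_stop_words` are hypothetical; `list_stopWords` is the name used in the description.

```python
def load_english_stop_words():
    """Download the NLTK stop word corpus if needed and return the English list as a set."""
    import nltk
    from nltk.corpus import stopwords
    nltk.download("stopwords", quiet=True)  # fetches the corpus on first use
    return set(stopwords.words("english"))

def remove_stop_words(words, list_stopWords):
    """Return the words that do not appear in list_stopWords."""
    return [w for w in words if w not in list_stopWords]

# Demo with a tiny hand-written stop list, so it runs without nltk
print(remove_stop_words(["the", "quick", "fox"], {"the", "a"}))

# With nltk available (assumption), the real list would be used like this:
# list_stopWords = load_english_stop_words()
# print(remove_stop_words(["the", "quick", "fox"], list_stopWords))
```

Keeping the stop words in a set means each word is compared as a whole, which matches the "skip if equal" behavior described above.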

⑥ Task 2: listing high-frequency phrases

def Phrase_statistics(bvffer):       # list high-frequency phrases
    text = nltk.text.Text(bvffer.split())
    text.collocations()     # collocations() prints its result itself and returns None
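
To illustrate the idea of a frequent phrase without depending on nltk, the sketch below counts pairs of adjacent words. It is a simplified stand-in, not what `collocations()` does internally: nltk ranks word pairs with a statistical association score and filters out stop words, whereas this version uses raw counts. The function name `frequent_bigrams` and the sample text are assumptions.

```python
from collections import Counter

def frequent_bigrams(bvffer, n=3):
    """Return the n most common pairs of adjacent words in bvffer."""
    words = bvffer.lower().split()
    pairs = zip(words, words[1:])   # each word paired with its successor
    return Counter(pairs).most_common(n)

# Made-up sample text for illustration
text = "new york is big and new york is busy"
print(frequent_bigrams(text, 2))
```

In the sample, the pairs ("new", "york") and ("york", "is") each occur twice and every other pair occurs once.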

2.2 Time and space complexity analysis of the algorithm

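def process_buffer(bvffer):
    c = bvffer.lower()
    # Blank out digit runs that are directly followed by letters, e.g. "2nd"
    result = re.sub(r"[0-9]+[a-z]+", " ", c)
    # A word starts with a letter and has at least two characters
    re1 = re.findall(r'[a-z]+\w+', result)
    # Load the stop words into a set so that whole words are compared
    with open("stopwords.txt", 'r') as f:
        stop_words = set(f.read().split())
    word_freq = {}
    # Count the frequency of each word and store it in the word_freq dictionary
    for word in re1:
        if word not in stop_words:
            word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq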
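
The listing is given without a written analysis, so the following is only a rough reading of it. For a text with n words, u of them distinct, the lowercasing and the two regex passes are linear in the length of the text, and the loop runs n times. If the stop words are held in a set, each membership test takes constant time on average, giving roughly O(n) time overall and O(u) space for the dictionary, plus O(n) for the word list. If the stop words are instead kept as one long string of s characters and tested with `in`, each test is a substring search, so the loop costs about O(n·s). A single-pass sketch of the set-based variant, with the hypothetical name `count_words` and made-up input:

```python
from collections import Counter

def count_words(words, stop_words):
    """Count non-stop words in one pass over the word list."""
    stop_set = set(stop_words)      # constant-time membership test on average
    return Counter(w for w in words if w not in stop_set)

# Made-up word list for illustration
words = ["data", "the", "data", "set", "the", "model"]
print(count_words(words, ["the", "a"]))
```

The set also avoids false matches: with a plain string, a word such as "he" would count as a stop word whenever the string contains "the".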