Natural Language Processing Lab: Extracting High-Frequency Words with jieba


c与j


Tags: natural language processing, artificial intelligence

Article link: https://blog.youkuaiyun.com/yyn15854/article/details/137473237


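This article shows how to use Python's jieba library to segment a news text drawn from a corpus of 1000 articles, remove stop words, and extract the 10 most frequent words, illustrating how term frequency (TF) is used.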


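1. Objective:

Work through high-frequency word extraction with jieba in order to understand how term frequency (TF) is used, understand what a corpus is, and become familiar with the jieba tool.

Requirements:

Randomly select one article from 1000 real news texts, segment it with jieba, and find the ten words that occur most often.

Note: the data consists of 1000 text files stored in the data folder. The stop-word list used in the example, stop_words.utf8, is also stored in the data folder.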
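The "randomly select one article" step can be sketched with the standard library alone. This is a minimal illustration, not the author's code: the helper name `pick_random_file`, the `*.txt` file pattern, and the optional `seed` parameter are all assumptions, since the source does not say how the corpus files are named.

```python
import glob
import os
import random


def pick_random_file(folder, pattern='*.txt', seed=None):
    """Return the path of one randomly chosen file in `folder`.

    Hypothetical helper: the '*.txt' pattern is an assumption about how
    the corpus files are named; adjust it to match your data folder.
    """
    # Sort the matches so that a given seed always selects the same file
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    if not files:
        raise FileNotFoundError(f'no files matching {pattern!r} in {folder!r}')
    # A private Random instance avoids touching the global random state
    rng = random.Random(seed)
    return rng.choice(files)
```

A call such as `pick_random_file('data')` would return a path that can be passed to a file-reading function; passing a fixed `seed` makes the choice reproducible between runs.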

Code:

import jieba

def getContent(path):
    with open(path, encoding='utf-8', errors='ignore') as f:
        content = ''
        for line in f:
            # Strip surrounding whitespace (including the trailing newline)
            line = line.strip()
            if line:  # Skip blank lines; only append non-empty text
                content += line
        return content

def get_TF(words, topK=10):
    tf_dic = {}
    # For each word, add one to its count (starting from 0 the first time it is seen)
    for w in words:
        tf_dic[w] = tf_dic.get(w, 0) + 1
    # Sort the dictionary items by count, descending, and keep the first topK
    return sorted(tf_dic.items(), key=lambda x: x[1], reverse=True)[:topK]

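def stop_words(path):
    # Load the stop-word list into a set for fast membership tests
    with open(path, encoding='utf-8') as f:
        return {l.strip() for l in f}

# Segment the text and drop stop words; path_to_stop_words is the location of your stop-word list
def cut(content, path_to_stop_words):
    # Load the stop words once, instead of re-reading the file for every token
    stop_set = stop_words(path_to_stop_words)
    split_words = [x for x in jieba.cut(content) if x not in stop_set]
    return split_words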

def main():
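    # Path to the text file to analyze (adjust to where your file lives)
    path_to_corpus = 'corpus.txt'
    # Path to the stop-word list stop_words.utf8 (adjust to where your file lives)
    path_to_stop_words = 'stop_words.utf8'

    # Read the file content
    corpus_content = getContent(path_to_corpus)

    # Segment the text and remove stop words
    split_words = cut(corpus_content, path_to_stop_words)

    # Print the results
    print('Original text:')
    print(corpus_content)
    print('Segmentation result:')
    print('/'.join(split_words))
    print('Top 10 words by TF (term frequency):')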
    print(get_TF(split_words))

if __name__ == '__main__':
    main()
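The get_TF function above reports raw occurrence counts. As a hedged alternative sketch, the same ranking can be produced with `collections.Counter` from the standard library, and each count can be divided by the total number of tokens to give a relative frequency, which is one common way to define TF. The function name `top_terms` and its `top_k` parameter are hypothetical and not part of the original code.

```python
from collections import Counter


def top_terms(words, top_k=10):
    """Return (word, count, relative_frequency) for the top_k most common words.

    Hypothetical helper for illustration; relative frequency is the word's
    count divided by the total number of tokens in `words`.
    """
    counts = Counter(words)
    total = sum(counts.values())
    # most_common sorts by count, descending; ties keep first-seen order.
    # For an empty input the loop body never runs, so no division by zero.
    return [(w, n, n / total) for w, n in counts.most_common(top_k)]
```

Calling `top_terms(split_words)` on the token list produced by cut would give the same top-10 ordering as get_TF, with the share of all tokens added as a third field. Relative frequencies make results comparable across articles of different lengths, which raw counts do not.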

Screenshot of results: