Task 1: Preparing the THUCNews Dataset


I. Preparing the THUCNews dataset
Download link: https://pan.baidu.com/s/1hugrfRu  Password: qfud
II. Preprocessing the dataset
1. Building the vocabulary:
from collections import Counter

def getVocabularyText(content_list, size):
    size = size - 1
    # reserve one slot for the <PAD> token
    allContent = ''.join(content_list)
    # merge every article in the content list into a single string
    counter = Counter(allContent)
    # count the frequency of each character in that string
    vocabulary = []
    vocabulary.append('<PAD>')
    for char, _ in counter.most_common(size):
        vocabulary.append(char)
    with open('vocabulary.txt', 'w', encoding='utf8') as file:
        for vocab in vocabulary:
            file.write(vocab + '\n')
            # write the vocabulary to disk, one token per line
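A minimal usage sketch (assuming the training texts have already been loaded into train_content_list, as in the next step):
getVocabularyText(train_content_list, 5000)
# writes <PAD> plus the 4999 most frequent characters to vocabulary.txt
Note that the loading code below reads cnews.vocab.txt, the vocabulary file shipped with the dataset, rather than the vocabulary.txt written here.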
2. Reading the data
with open('cnews.vocab.txt',encoding='utf8') as file:
    vocabulary_list = [k.strip() for k in file.readlines()]
# read the vocabulary, one token per line
with open('cnews.train.txt',encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
    # read every line of the training file
    train_label_list = [k.split()[0] for k in line_list]
    # the first field of each line is the label
    train_content_list = [k.split(maxsplit = 1)[1] for k in line_list]
    # the remainder of the line is the article content
# read the test data the same way
with open('cnews.test.txt',encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
    test_label_list = [k.split()[0] for k in line_list]
    test_content_list = [k.split(maxsplit = 1)[1] for k in line_list]
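A quick sanity check of what was loaded (a sketch; the printed counts depend on the dataset version you downloaded):
print(len(train_label_list), len(test_label_list))
# number of training and test samples
print(train_label_list[0], train_content_list[0][:20])
# first training label and the first 20 characters of its article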
3. Vectorizing the sentences
Because a sentence vector built from the word vectors of an entire text would have an enormous dimension, we do not construct sentence vectors by concatenating word vectors. Instead, we concatenate each word's id (its row number) in the vocabulary, which effectively avoids the huge space overhead of full sentence vectors.
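A toy illustration of the id mapping (the four-token vocabulary here is invented for the example):
toy_vocab = ['<PAD>', '的', '一', '是']
toy_word2id = {w: i for i, w in enumerate(toy_vocab)}
print([toy_word2id[c] for c in '一是的'])  # prints [2, 3, 1]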
vocab_size = 5000  # vocabulary size
seq_length = 600  # sentence sequence length
num_classes = 10  # number of classes
The length of a raw sentence vector would be the total number of words times the dimension of each word vector; multiplied out, the dimension is enormous. Moreover, sentences vary in length, and a CNN expects fixed-size inputs and outputs, so variable-length sentences cannot be used as training samples directly. We therefore fix every sentence to a length of seq_length (the larger this sequence length, the slower the training, though accuracy may improve somewhat). Sentences that are too short are padded with 0 (thanks to the pooling layer, the padding has no effect on the result), and sentences that are too long have the excess cut off.
Use Keras to normalize the lengths:
import tensorflow.contrib.keras as kr

word2id_dict = dict((b, a) for a, b in enumerate(vocabulary_list))
# map each character to its row number (id) in the vocabulary
def content2vector(content_list):
    content_vector_list = []
    for content in content_list:
        content_vector = []
        for word in content:
            if word in word2id_dict:
                content_vector.append(word2id_dict[word])
            else:
                content_vector.append(word2id_dict['<PAD>'])
                # characters outside the vocabulary fall back to the <PAD> id
        content_vector_list.append(content_vector)
    return content_vector_list

train_vector_list = content2vector(train_content_list)
test_vector_list = content2vector(test_content_list)

train_X = kr.preprocessing.sequence.pad_sequences(train_vector_list, seq_length)
test_X = kr.preprocessing.sequence.pad_sequences(test_vector_list, seq_length)

print(len(train_content_list[0]))
print(len(train_vector_list[0]))
print('************************************')
print(len(test_content_list[0]))
print(len(test_vector_list[0]))
746
746
************************************
1720
1720
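These are the lengths before padding. After pad_sequences, every sample has exactly seq_length columns; note that pad_sequences pads and truncates at the front by default (padding='pre', truncating='pre'). A quick check, run after the code above:
print(train_X.shape, test_X.shape)
# each row now has exactly 600 columns; the first axis is the number of samples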
4. Training and test sets
import tensorflow.contrib.keras as kr
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
train_Y = kr.utils.to_categorical(label.fit_transform(train_label_list),num_classes=num_classes)
test_Y = kr.utils.to_categorical(label.transform(test_label_list),num_classes=num_classes)
# fit the encoder on the training labels, then reuse the same mapping for the test labels

train_Y[:2]
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
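To see which category each one-hot column corresponds to, the fitted encoder exposes its sorted class names (a quick sketch):
print(label.classes_)
# column i of the one-hot vectors corresponds to label.classes_[i]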
Reference: http://www.pianshen.com/article/1928259550/

