News Text Classification - Task3

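This article describes a hands-on text classification exercise using traditional machine learning methods, with TF-IDF as the text feature representation. In the author's experiments, the random forest model ran faster than SVM, KNN, and similar models and reached an F1 score of about 0.74. Adding n-gram features and feature truncation raised the F1 score to 0.82, at the cost of longer processing time.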


Task03: Text Classification with Machine Learning

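This task uses traditional machine learning methods for text classification. The main idea is to represent the text with TF-IDF. TF-IDF has been covered extensively elsewhere, so it is not explained again here; readers who want the underlying theory can refer to this blog post: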
https://blog.youkuaiyun.com/hongyesuifeng/article/details/90256387

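The basic idea is to represent the features of each sentence with TF-IDF, starting by computing the TF-IDF value of every word. No basic data cleaning is done here, so stop words and punctuation may be counted as well; the goal of this exercise is to get the full workflow running, so preprocessing is kept to a minimum. During the experiments, models such as SVM and KNN turned out to be slow, so only the random forest result is reported; the training code for the other models is included (commented out) for reference.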
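To make the TF-IDF representation concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed IDF formula that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The toy corpus and the helper name tfidf_rows are hypothetical and not part of the original experiment.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy corpus (hypothetical); tokens are whitespace-separated.
docs = ["stock market rises", "stock market falls", "team wins match"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

In the first document, the term "rises" appears in only one document and therefore gets a higher weight than "stock" and "market", which are shared by two documents.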

Here is the code:

import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from datetime import datetime


print("reading data......")
data = pd.read_csv("./train_set.csv",sep="\t")

vectorizer = CountVectorizer()  # converts the texts into a term-count matrix; element a[i][j] is the count of term j in document i
transformer = TfidfTransformer()  # computes the TF-IDF weight of every term
tfidf = transformer.fit_transform(vectorizer.fit_transform(data['text']))

X = tfidf
y = np.array(data['label'])
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=666)
print(X.shape,y.shape)

#print(tfidf.toarray())
#print(tfidf.toarray().shape)
#print(data['label'].values)

#print("this is KNN Classifier:")
#model = KNeighborsClassifier()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_test, y_predict, average='macro'))

#time_svm = datetime.now()
#
#print("this is the SVM Classifier F score:")
#model = SVC()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_test, y_predict, average='macro'))
#print("this is the SVM time:",(datetime.now() - time_svm).total_seconds())
#
#time_lr = datetime.now()
#print("\nthis is the LR Classifier F score:")
#model = LogisticRegression()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_test, y_predict, average='macro'))
#print("this is the LR time:",(datetime.now() - time_lr).total_seconds())
#
#time_DT = datetime.now()
#print("\nthis is the DT Classifier F score:")
#model = DecisionTreeClassifier()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_test, y_predict, average='macro'))
#print("this is the DT time:",(datetime.now() - time_DT).total_seconds())

time_RF = datetime.now()
print("\nthis is the RF Classifier F score:")
model = RandomForestClassifier()
model.fit(X_train,y_train)
y_predict = model.predict(X_test)
print(f1_score(y_test, y_predict, average='macro'))  # documented order is (y_true, y_pred)
print("this is the RF time:",(datetime.now() - time_RF).total_seconds())

#time_GBDT = datetime.now()
#print("\nthis is the GBDT Classifier F score:")
#model = GradientBoostingClassifier()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_test, y_predict, average='macro'))
#print("this is the GBDT time:",(datetime.now() - time_GBDT).total_seconds())

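The macro F1 score on the held-out 20% split is about 0.74. The random forest is fairly fast and finished in a few minutes, whereas SVM ran for a long time without producing a result, which indicates that the default SVC is slow to train on this high-dimensional sparse data.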

this is the RF Classifier F score:
0.7450054941922312
this is the RF time: 153.232528
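The score printed above is a macro-averaged F1: the F1 of each class is computed separately and the per-class values are averaged with equal weight. A minimal pure-Python sketch of that computation on made-up labels is shown below; the function name macro_f1 and the example labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # Every label comes from one of the two lists, so the denominator is > 0.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 4))  # prints 0.6556
```

Because each per-class F1 treats false positives and false negatives symmetrically, swapping the two arguments gives the same macro score, although the order documented by scikit-learn is f1_score(y_true, y_pred).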

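To handle some characters better, n-gram features are added and the vocabulary is capped at the most frequent features (feature truncation). This is done by changing the TF-IDF feature-extraction code.
Original:

vectorizer = CountVectorizer()  # converts the texts into a term-count matrix; element a[i][j] is the count of term j in document i
transformer = TfidfTransformer()  # computes the TF-IDF weight of every term
tfidf = transformer.fit_transform(vectorizer.fit_transform(data['text']))

Modified:

from sklearn.feature_extraction.text import TfidfVectorizer  # not imported in the script above

# word 1- to 3-grams, keeping only the 3000 most frequent features
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
tfidf = tfidf.fit_transform(data['text'])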

The result improves noticeably, but TF-IDF feature processing takes longer because of the added n-gram and feature truncation steps.
this is the RF Classifier F score:
0.8237360545086597
this is the RF time: 259.655584
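In the code above, the TF-IDF vocabulary and IDF weights are fitted on the full dataset before train_test_split, so the held-out rows influence the features. One common alternative, sketched here as an assumption rather than the author's method, wraps the vectorizer and the classifier in a scikit-learn Pipeline so that feature extraction is fitted on the training split only. The tiny corpus, the labels, and the small n_estimators value are made up to keep the example fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up toy corpus: two classes with clearly different vocabularies.
texts = ["stock market rises", "stock prices fall", "market shares drop",
         "bank raises rates", "team wins match", "player scores goal",
         "coach praises team", "match ends in draw"] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=666, stratify=labels)

# The vectorizer is fitted inside the pipeline, on the training texts only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), max_features=3000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=666)),
])
pipeline.fit(X_train, y_train)
y_predict = pipeline.predict(X_test)
print(f1_score(y_test, y_predict, average="macro"))
```

On the real dataset, the same pipeline object would be fitted on the text column of the training split; the scores reported above were not produced this way, so they are not directly comparable.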
