Task03: Text Classification with Machine Learning
This task tackles text classification with traditional machine-learning methods. The core idea is to represent text with TF-IDF features. TF-IDF has been covered extensively elsewhere, so it is not re-introduced here; for the underlying principles, see this blog post:
https://blog.youkuaiyun.com/hongyesuifeng/article/details/90256387
The basic idea is to represent each sentence as a TF-IDF feature vector by computing a TF-IDF weight for every token. Since no basic data cleaning is done here, stop words and punctuation may end up in the vocabulary; the goal of this exercise is simply to get the full pipeline running, so heavy preprocessing is skipped. In experiments, SVM and KNN turned out to be quite slow, so only the random forest result is computed; the training code for the other models is included (commented out) for reference.
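As a quick illustration of what this featurization produces, here is a minimal sketch on a toy corpus (my own example, not the competition data): TfidfVectorizer maps raw text to a sparse document-term matrix of TF-IDF weights.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",      # toy documents purely for illustration
        "the dog sat on the log",
        "cats and dogs"]
vec = TfidfVectorizer()
mat = vec.fit_transform(docs)          # sparse matrix of shape (n_docs, n_terms)
print(vec.get_feature_names_out())     # the learned vocabulary
print(mat.toarray().round(2))          # entry [i][j] is the TF-IDF weight of term j in document i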
Now for the full code:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from datetime import datetime
print("reading data......")
data = pd.read_csv("./train_set.csv",sep="\t")
vectorizer = CountVectorizer()  # converts the corpus into a term-frequency matrix; element a[i][j] is the count of term j in document i
transformer = TfidfTransformer()  # computes the TF-IDF weight of each term from the counts
tfidf = transformer.fit_transform(vectorizer.fit_transform(data['text']))
X = tfidf
y = np.array(data['label'])
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=666)
print(X.shape,y.shape)
#print("this is KNN Classifier:")
#model = KNeighborsClassifier()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_predict, y_test, average='macro'))
#time_svm = datetime.now()
#
#print("this is the SVM Classifier F score:")
#model = SVC()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_predict, y_test, average='macro'))
#print("this is the SVM time:",(datetime.now() - time_svm).total_seconds())
#
#time_lr = datetime.now()
#print("\nthis is the LR Classifier F score:")
#model = LogisticRegression()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_test, y_predict, average='macro'))
#print("this is the LR time:",(datetime.now() - time_lr).total_seconds())
#
#time_DT = datetime.now()
#print("\nthis is the DT Classifier F score:")
#model = DecisionTreeClassifier()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_test, y_predict, average='macro'))
#print("this is the DT time:",(datetime.now() - time_DT).total_seconds())
time_RF = datetime.now()
print("\nthis is the RF Classifier F score:")
model = RandomForestClassifier()
model.fit(X_train,y_train)
y_predict = model.predict(X_test)
print(f1_score(y_test, y_predict, average='macro'))
print("this is the RF time:",(datetime.now() - time_RF).total_seconds())
#time_GBDT = datetime.now()
#print("\nthis is the GBDT Classifier F score:")
#model = GradientBoostingClassifier()
#model.fit(X_train,y_train)
#y_predict = model.predict(X_test)
#print(f1_score(y_test, y_predict, average='macro'))
#print("this is the GBDT time:",(datetime.now() - time_GBDT).total_seconds())
The macro F1 score is about 0.74. The random forest is fairly fast and finishes in a few minutes, whereas the SVM ran for a long time without producing a result, which shows how poorly kernelized SVC handles high-dimensional sparse data.
this is the RF Classifier F score:
0.7450054941922312
this is the RF time: 153.232528
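If an SVM baseline is still wanted, a linear-kernel implementation scales much better to high-dimensional sparse TF-IDF features than the kernelized SVC above. Below is a minimal sketch (my addition, not part of the original experiment, reusing the X_train/X_test/y_train/y_test split from the code above):

from sklearn.svm import LinearSVC

time_lsvc = datetime.now()
print("\nthis is the LinearSVC Classifier F score:")
model = LinearSVC()  # liblinear-based linear SVM; much faster than SVC on sparse high-dimensional data
model.fit(X_train,y_train)
y_predict = model.predict(X_test)
print(f1_score(y_test, y_predict, average='macro'))
print("this is the LinearSVC time:",(datetime.now() - time_lsvc).total_seconds())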
To add n-gram features and cap the feature dimensionality (max_features truncates the vocabulary to the top terms by frequency), modify the TF-IDF feature-extraction code.
Change:
vectorizer = CountVectorizer()  # converts the corpus into a term-frequency matrix; element a[i][j] is the count of term j in document i
transformer = TfidfTransformer()  # computes the TF-IDF weight of each term from the counts
tfidf = transformer.fit_transform(vectorizer.fit_transform(data['text']))
to:
from sklearn.feature_extraction.text import TfidfVectorizer  # this import also needs to be added
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
tfidf = tfidf.fit_transform(data['text'])
The result improves noticeably, but the TF-IDF feature extraction takes longer because of the extra n-gram generation and vocabulary truncation.
this is the RF Classifier F score:
0.8237360545086597
this is the RF time: 259.655584
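To see concretely what ngram_range and max_features do, here is a toy sketch (my own illustration; the token strings are hypothetical, merely mimicking space-separated text):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["57 44 66 56 44 57",   # hypothetical documents of space-separated tokens
        "57 44 3750 648 66"]
vec = TfidfVectorizer(ngram_range=(1, 3), max_features=10)
vec.fit(docs)
# The vocabulary now contains unigrams, bigrams and trigrams,
# capped at the 10 most frequent terms across the corpus.
print(vec.get_feature_names_out())

With ngram_range=(1,3) the vocabulary grows very quickly, so max_features is what keeps the matrix at a fixed 3000 columns in the modified code above.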