Datawhale | Natural Language Processing (6)

Article link: https://blog.youkuaiyun.com/orient928/article/details/89374099

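Preface:

I already summarized SVM in an earlier blog post, so I won't repeat it here and will simply link to it. This post only contains the code I ran; thanks for your understanding.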

Contents

I. SVM Algorithm
II. Text Classification with SVM and TF-IDF

I. SVM Algorithm

https://blog.youkuaiyun.com/orient928/article/details/89220862

II. Text Classification with SVM and TF-IDF

1. Loading the data

from sklearn.datasets import fetch_20newsgroups
import numpy as np
import pandas as pd

# The first time this dataset is used, it is downloaded when this call runs
data = fetch_20newsgroups()

categories = ["sci.space",              # science - space
              "rec.sport.hockey",       # sports - hockey
              "talk.politics.guns",     # politics - guns
              "talk.politics.mideast"]  # politics - Middle East
train = fetch_20newsgroups(subset="train",categories = categories)
test = fetch_20newsgroups(subset="test",categories = categories)
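Before vectorizing, it can help to check how many documents each category contributes. The sketch below does not download anything: `train_like` is a hypothetical stand-in with made-up documents that only mimics the `data`, `target`, and `target_names` attributes of the object returned by `fetch_20newsgroups`. With the real `train` object the same counting line should apply.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-in for the loaded dataset (made-up documents and labels).
# The real object exposes the same three attributes: data, target, target_names.
train_like = SimpleNamespace(
    data=["launch window moved", "overtime goal", "orbit insertion burn", "new gun bill"],
    target=[1, 0, 1, 2],
    target_names=["rec.sport.hockey", "sci.space", "talk.politics.guns"],
)

# Each integer in target is an index into target_names.
counts = Counter(train_like.target_names[i] for i in train_like.target)

for name, n in sorted(counts.items()):
    print(name, n)
```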

2. Encoding the text with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF

Xtrain = train.data
Xtest = test.data
Ytrain = train.target
Ytest = test.target
tfidf = TFIDF().fit(Xtrain)
Xtrain_ = tfidf.transform(Xtrain)
Xtest_ = tfidf.transform(Xtest)
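Xtrain_  # sparse matrix of shape (n_documents, n_terms)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tosee = pd.DataFrame(Xtrain_.toarray(), columns=tfidf.get_feature_names_out())
tosee.head()
tosee.shape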
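To make the numbers in that matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The helper name `tfidf_rows` and the three sample sentences are made up for illustration, and the whitespace tokenizer is a simplification of scikit-learn's own tokenization, so the vocabulary on real text will differ.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows); rows[i][j] is the L2-normalised TF-IDF of vocab[j] in docs[i]."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf, as in scikit-learn's default (smooth_idf=True).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = [
    "the rocket reached orbit",
    "the team won the hockey game",
    "the rocket launch was delayed",
]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
print({tok: round(w, 3) for tok, w in first.items() if w > 0})
```

In the first sentence, "orbit" occurs in only one document and so gets a larger weight than "rocket" (two documents), which in turn outweighs "the" (all three).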

3. Building the SVM model

from sklearn.svm import SVC

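# probability=True is needed for predict_proba below; it is not available on a default SVC()
clf = SVC(probability=True)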
clf.fit(Xtrain_,Ytrain)
y_pred = clf.predict(Xtest_)
proba = clf.predict_proba(Xtest_)
score = clf.score(Xtest_,Ytest)

print("\tAccuracy:{:.3f}".format(score))
print("\n")
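The vectorizing and fitting steps can also be chained so that raw text goes in and labels come out. The following is only a sketch under stated assumptions: it swaps in `LinearSVC` (a linear SVM that is commonly used for sparse TF-IDF features) instead of the `SVC` used above, and the tiny corpus and its labels are made up so the snippet runs without downloading the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus: label 0 = space, label 1 = hockey.
train_docs = [
    "the rocket reached orbit after launch",
    "the shuttle docked with the space station",
    "the satellite entered a stable orbit",
    "the goalie stopped the puck in overtime",
    "the team scored a late goal to win the game",
    "the hockey season opens with a home game",
]
train_labels = [0, 0, 0, 1, 1, 1]

# The pipeline fits TF-IDF on the training text only, then trains the linear SVM on it.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

new_docs = ["the shuttle entered orbit", "a late goal won the game"]
pred = model.predict(new_docs)
print(list(pred))
```

With the real data, `model.fit(Xtrain, Ytrain)` followed by `model.score(Xtest, Ytest)` would take the place of the separate transform and fit calls.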