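Preface:
I have already covered SVMs in an earlier blog post, so I will not repeat the theory here and simply link to it. This post contains only the code I ran; thanks for understanding.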
I. The SVM algorithm
https://blog.youkuaiyun.com/orient928/article/details/89220862
II. Text classification with SVM and TF-IDF
1. Loading the data
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import pandas as pd
# The first time this dataset is used, it is downloaded when the fetch function is called
data = fetch_20newsgroups()
categories = ["sci.space" # science - space
,"rec.sport.hockey" # sports - hockey
,"talk.politics.guns" # politics - guns
,"talk.politics.mideast"] # politics - Middle East
train = fetch_20newsgroups(subset="train",categories = categories)
test = fetch_20newsgroups(subset="test",categories = categories)
2. Encoding the text data with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
Xtrain = train.data
Xtest = test.data
Ytrain = train.target
Ytest = test.target
tfidf = TFIDF().fit(Xtrain)
Xtrain_ = tfidf.transform(Xtrain)
Xtest_ = tfidf.transform(Xtest)
Xtrain_
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tosee = pd.DataFrame(Xtrain_.toarray(),columns=tfidf.get_feature_names_out())
tosee.head()
tosee.shape
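To make the numbers in the matrix above less opaque, here is a minimal pure-Python sketch of how a TF-IDF weight can be computed. It assumes the default TfidfVectorizer settings (lowercasing, tokens of two or more word characters, smoothed idf, L2 normalization). The function names `tokenize` and `tfidf_rows` and the three-sentence corpus are made up for illustration; this is not scikit-learn's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # which mirrors the default token pattern of TfidfVectorizer.
    return re.findall(r"\b\w\w+\b", doc.lower())


def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical three-document corpus, for illustration only.
toy_docs = [
    "the shuttle reached orbit",
    "the team won the hockey game",
    "orbit data from the shuttle",
]
vocab, rows = tfidf_rows(toy_docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first toy document, a word that appears in every document ("the") ends up with a lower weight than a word unique to that document ("reached"), which is the effect TF-IDF is meant to produce.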
3. Building the SVM model
from sklearn.svm import SVC
# probability=True is needed for predict_proba below; it makes fitting slower
clf = SVC(probability=True)
clf.fit(Xtrain_,Ytrain)
y_pred = clf.predict(Xtest_)
proba = clf.predict_proba(Xtest_)
score = clf.score(Xtest_,Ytest)
print("\tAccuracy:{:.3f}".format(score))
print("\n")
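For a classifier, `clf.score` returns the mean accuracy on the given data. A single number can hide which categories get confused, so the sketch below shows how accuracy and per-class recall can be derived from true and predicted labels. The helper names and the label lists are hypothetical and only stand in for `Ytest` and `y_pred`; they are not results from the run above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    # For each true class: the share of its samples that were predicted correctly.
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Hypothetical labels for four classes, for illustration only.
toy_true = [0, 0, 1, 1, 2, 2, 3, 3]
toy_pred = [0, 0, 1, 2, 2, 2, 3, 1]

print("Accuracy:{:.3f}".format(accuracy(toy_true, toy_pred)))
for label, recall in per_class_recall(toy_true, toy_pred).items():
    print("class {}: recall {:.2f}".format(label, recall))
```

With the real data, the same functions could be called as `accuracy(list(Ytest), list(y_pred))`, assuming both are one-dimensional label arrays of equal length.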