Probabilistic Graphical Models 1: Naive Bayes for Spam SMS Classification

阿值

Last modified: 2023-05-16 13:06:09


First published: 2023-05-16 13:05:21

Article link: https://blog.youkuaiyun.com/weixin_60536251/article/details/130703140

1. Data loading
2. Text vectorization (word counts)
3. TF-IDF transformation
4. Train/test split
5. Modeling
6. Prediction

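Spam SMS classification project:

(1) Load the data
(2) Vectorize the text (word counts)
(3) Compute TF-IDF weights from the word counts; the classifier uses these term-frequency features to decide whether a message is spam
(4) Build the models
(5) Predict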

1. Data loading

import pandas as pd
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# The file is tab-separated with no header row (sep defaults to a comma)
messages = pd.read_csv('./data/messages.csv', sep='\t', header=None)
messages  # column 0 is the message class, column 1 is the message text
# axis=1 renames columns; inplace=True modifies messages directly
messages.rename({0: 'label', 1: 'message'}, axis=1, inplace=True)
messages

y = messages['label']
y  # labels can stay as strings; scikit-learn classifiers accept string class labels
0        ham
1        ham
        ... 
5570     ham
5571     ham
Name: label, Length: 5572, dtype: object
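
The loading step can be tried without the original file. The sketch below is only an illustration: the two-line sample string is made up, and passing `names=` at load time is an alternative to the separate `rename` call, not what the post itself does.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same tab-separated, header-less layout
sample = "ham\tSee you at lunch\nspam\tWIN a free prize now\n"

# names= assigns the column names while reading, so no rename step is needed
demo = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                   names=["label", "message"])
print(demo)

# Class counts are worth checking early, since spam data is usually imbalanced
print(demo["label"].value_counts())
```

On the real data, `messages['label'].value_counts()` would show how many ham and spam messages the 5572 rows contain.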

2. Text vectorization (word counts)

cv = CountVectorizer()  # bag-of-words vectorizer: raw text cannot be fed to a model directly
X = cv.fit_transform(messages['message'])  # turn each message into a vector of word counts
X  # 5572 samples x 8713 distinct words
<5572x8713 sparse matrix of type '<class 'numpy.int64'>'
	with 74169 stored elements in Compressed Sparse Row format>
5572*8713
48548836
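
Of the 48,548,836 cells in this matrix only 74,169 are non-zero (about 0.15%), which is why the result is stored sparsely. A minimal sketch on a made-up three-message corpus shows what `CountVectorizer` produces; the corpus and the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["free prize call now", "call me now", "see you at lunch"]

demo_cv = CountVectorizer()
counts = demo_cv.fit_transform(docs)  # sparse matrix: rows = messages, columns = words

# vocabulary_ maps each word to its column index; sort by index to get column order
vocab = sorted(demo_cv.vocabulary_, key=demo_cv.vocabulary_.get)
print(vocab)
print(counts.toarray())

# Fraction of cells that actually hold a count
density = counts.nnz / (counts.shape[0] * counts.shape[1])
print(f"shape={counts.shape}, non-zero cells={counts.nnz}, density={density:.2f}")
```

Note that the default tokenizer lowercases the text and keeps only tokens of two or more characters.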

3. TF-IDF transformation

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfTransformer()
X2 = tf_idf.fit_transform(X)  # the result is still sparse; train_test_split can split sparse matrices too
tf_idf2 = TfidfVectorizer()  # TfidfVectorizer = CountVectorizer followed by TfidfTransformer
X3 = tf_idf2.fit_transform(messages['message'])
X3
<5572x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 74169 stored elements in Compressed Sparse Row format>
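
The equivalence between the two routes can be checked numerically. This is a small sketch on a made-up corpus, assuming default parameters for all three classes; with non-default settings the two routes must be configured identically to match.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus
docs = ["free prize call now", "call me now", "see you at lunch"]

# Route 1: count words, then reweight the counts with TF-IDF
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# Route 2: do both in a single estimator
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```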

4. Train/test split

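# Default test_size is 0.25; no random_state is set, so the split differs on every run
X_train, X_test, y_train, y_test = train_test_split(X2, y)
display(X_train, X_test)  # display() is available in Jupyter/IPython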
<4179x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 55087 stored elements in Compressed Sparse Row format>
<1393x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 19082 stored elements in Compressed Sparse Row format>
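
Because the split above is unseeded, the scores in the next section will vary slightly from run to run. One possible variation, sketched here on made-up labels, fixes `random_state` and passes `stratify` so both parts keep the same ham/spam ratio. Both arguments are suggestions, not something the post uses.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 8 ham, 4 spam
labels = ["ham"] * 8 + ["spam"] * 4
texts = [f"message {i}" for i in range(12)]

t_train, t_test, l_train, l_test = train_test_split(
    texts, labels,
    test_size=0.25,    # same proportion as the default
    random_state=0,    # fixed seed makes the split reproducible
    stratify=labels,   # preserve the class ratio in both parts
)

print(Counter(l_train))
print(Counter(l_test))
```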

5. Modeling

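%%time
# 5.1 Gaussian naive Bayes (%%time must be the first line of the cell)
gNB = GaussianNB()
gNB.fit(X_train.toarray(), y_train)  # GaussianNB requires dense input, so convert with toarray()
gNB.score(X_test.toarray(), y_test)  # toarray() converts a sparse matrix to a dense array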
Wall time: 7.04 s
0.8994974874371859
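
The dense conversion is the likely reason this model is the slowest of the three: a 4179 x 8713 array of float64 values takes roughly 291 MB. The sketch below, on a made-up 4 x 3 matrix, illustrates that `GaussianNB` rejects sparse input and works once the matrix is converted; the exact exception type may depend on the scikit-learn version, so both `TypeError` and `ValueError` are caught.

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 4 samples, 3 features, stored sparsely
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0],
                                       [0.9, 0.1, 0.0],
                                       [0.0, 0.0, 1.0],
                                       [0.0, 0.2, 0.8]]))
y_toy = ["spam", "spam", "ham", "ham"]

try:
    GaussianNB().fit(X_sparse, y_toy)
    failed_on_sparse = False
except (TypeError, ValueError) as err:
    failed_on_sparse = True
    print("sparse input rejected:", type(err).__name__)

# Converting to a dense array makes the fit succeed
dense_model = GaussianNB().fit(X_sparse.toarray(), y_toy)

# Memory needed for the post's dense training matrix (float64 = 8 bytes per cell)
dense_bytes = 4179 * 8713 * 8
print(f"{dense_bytes / 1e6:.0f} MB")
```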

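%%time
# 5.2 Bernoulli naive Bayes
# The author's rationale: word usage in natural language fits a binary (present/absent) model well.
# With the default binarize=0.0, BernoulliNB treats every positive TF-IDF value as 1.
bNB = BernoulliNB()  # accepts sparse input
bNB.fit(X_train, y_train)  # no dense conversion needed, unlike GaussianNB
bNB.score(X_test, y_test)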
Wall time: 386 ms
0.9806173725771715
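
One consequence of how `BernoulliNB` works is worth spelling out: with its default `binarize=0.0` it only looks at whether a feature is positive, so the TF-IDF weights themselves do not influence it. A minimal sketch on made-up messages, assuming default parameters throughout, compares a model fitted on raw counts with one fitted on TF-IDF values.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy messages; "free" is repeated so counts and TF-IDF values differ
docs = ["free free prize call now", "win free cash prize",
        "see you at lunch", "call me after lunch"]
labels = ["spam", "spam", "ham", "ham"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

on_counts = BernoulliNB().fit(counts, labels)
on_tfidf = BernoulliNB().fit(tfidf, labels)

# Both inputs are binarized to the same 0/1 pattern, so the models agree
same_proba = np.allclose(on_counts.predict_proba(counts),
                         on_tfidf.predict_proba(tfidf))
print(same_proba)
```

`MultinomialNB`, by contrast, does use the feature magnitudes, so for it the TF-IDF step changes the model.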

%%time
# 5.3 Multinomial naive Bayes
mNB = MultinomialNB()
mNB.fit(X_train, y_train)  # also accepts sparse input
mNB.score(X_test, y_test)
Wall time: 39 ms
0.95908111988514

6. Prediction

# 6.1 Texts to classify
# Note: this reuses the name X_test and overwrites the test split from section 4
X_test = ['Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify.I see the letter B on my car Please call now 08000930705 for delivery tomorrow',
          'Precious things are very few in the world,that is the reason there is only one you',
          "GENT! We are trying to contact you. Last weekends draw shows that you won a £1000 prize GUARANTEED. U don't know how stubborn I am. Congrats! 1 year special cinema pass for 2 is yours.",
          'Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!']
X_test
['Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify.I see the letter B on my car Please call now 08000930705 for delivery tomorrow',
 'Precious things are very few in the world,that is the reason there is only one you',
 "GENT! We are trying to contact you. Last weekends draw shows that you won a £1000 prize GUARANTEED. U don't know how stubborn I am. Congrats! 1 year special cinema pass for 2 is yours.",
 'Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!']
 
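# 6.2 Convert the new texts to TF-IDF features
# Use transform (not fit_transform) so the fitted vocabulary and IDF weights are reused
X_test_tf_idf = tf_idf.transform(cv.transform(X_test))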
X_test_tf_idf
<4x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 94 stored elements in Compressed Sparse Row format>
	
# 6.3 Predict
bNB.predict(X_test_tf_idf)  # classify the new messages with the trained model
array(['spam', 'ham', 'spam', 'spam'], dtype='<U4')
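
The vectorize, transform and predict steps have to be applied to new text in the same order with the same fitted objects. One way to keep them together is a scikit-learn pipeline. The following is a sketch on made-up training messages, not the post's code; `spam_clf` and the toy data are hypothetical names introduced for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
train_texts = ["free prize call now", "win free cash prize",
               "see you at lunch", "call me after lunch"]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier together
spam_clf = make_pipeline(TfidfVectorizer(), BernoulliNB())
spam_clf.fit(train_texts, train_labels)

# Raw strings go straight in; words never seen during training are ignored
preds = spam_clf.predict(["free prize call now", "lunch tomorrow?"])
print(preds)
```

With a pipeline there is no risk of calling `fit_transform` on new messages by mistake, which would rebuild the vocabulary and break the match with the trained model.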