Probabilistic Graphical Models 1: Naive Bayes for Spam SMS Classification


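Spam SMS classification project:

  • (1) Load the data
  • (2) Build word vectors
  • (3) Apply TF-IDF, which weights each word count by how rare the word is across messages, so that the weights help indicate whether a message is spam
  • (4) Split the dataset
  • (5) Build the models
  • (6) Predict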

1. Data loading

import pandas as pd
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
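messages = pd.read_csv('./data/messages.csv',sep = '\t',header=None) # sep: field delimiter (defaults to a comma); this file is tab-separated
messages # column 0 is the message label; column 1 is the message text
messages.rename({0:'label',1:'message'},axis = 1,inplace = True) # axis = 1: rename columns; inplace = True: modify messages directly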
messages

y = messages['label']
y # the labels can stay as strings ('ham' / 'spam')
0        ham
1        ham
        ... 
5570     ham
5571     ham
Name: label, Length: 5572, dtype: object

2. Word vectors

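cv = CountVectorizer()  # turns text into word-count vectors; raw text cannot be modeled directly
X = cv.fit_transform(messages['message'])  # X must be numeric, so each message becomes a vector of word counts
X             # 5572 samples, 8713 distinct words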
<5572x8713 sparse matrix of type '<class 'numpy.int64'>'
	with 74169 stored elements in Compressed Sparse Row format>
5572*8713  # cells a dense matrix would need; the sparse matrix stores only 74169 of them
48548836
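
To make the counting step concrete, here is a minimal pure-Python sketch of what a bag-of-words vectorizer does. It only approximates CountVectorizer's default behavior (lowercasing, tokens of two or more word characters); the helper names and the two toy messages are made up for illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

# Approximates CountVectorizer's default tokenization: lowercase the text,
# then keep runs of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(doc, vocabulary):
    """Return one dense row of token counts; unknown tokens are ignored."""
    counts = Counter(tokenize(doc))
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row

toy_docs = ["Free prize, call now", "Call me now now"]  # made-up messages
vocab = fit_vocabulary(toy_docs)
rows = [count_vector(doc, vocab) for doc in toy_docs]
print(vocab)
print(rows)
```

Each row has one column per vocabulary word, which is why the real matrix above has 8713 columns and is almost entirely zeros.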

3. TF-IDF transformation

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfTransformer()
X2 = tf_idf.fit_transform(X) # X is a sparse matrix; sparse matrices can also be split later
tf_idf2 = TfidfVectorizer()  # TfidfVectorizer = CountVectorizer followed by TfidfTransformer
X3 = tf_idf2.fit_transform(messages['message']) 
X3
<5572x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 74169 stored elements in Compressed Sparse Row format>
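
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand using the smoothed formula that scikit-learn documents for its defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name and the toy counts are assumptions made for this example; it is a simplified dense version, not the library's implementation.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with smoothed IDF and L2 row normalization (dense sketch)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # df[t]: number of documents in which term t occurs at least once
    df = [sum(1 for row in count_rows if row[t] > 0) for t in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Columns: call, free, now (a made-up three-term vocabulary)
toy_counts = [
    [1, 1, 1],
    [1, 0, 2],
    [1, 0, 0],
]
weights = tfidf_rows(toy_counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first row all three words occur once, yet "free" (found in one message) ends up with a larger weight than "now" (two messages) and "call" (all three), which is the effect TF-IDF is meant to produce.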

4. Splitting the dataset

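X_train,X_test,y_train,y_test = train_test_split(X2,y) # default test_size=0.25; no random_state is set, so the split and the scores below vary between runs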
display(X_train,X_test)
<4179x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 55087 stored elements in Compressed Sparse Row format>
<1393x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 19082 stored elements in Compressed Sparse Row format>
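
The 4179 / 1393 sizes follow from the default 25% test share. The sketch below shows a seeded hold-out split in plain Python; `holdout_split` is a hypothetical helper written for this example and does not reproduce scikit-learn's shuffling, it only illustrates the arithmetic and why fixing a seed makes a split repeatable.

```python
import math
import random

def holdout_split(n_samples, test_size=0.25, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(5572)
print(len(train_idx), len(test_idx))
```

With train_test_split itself, passing random_state (for example random_state=0) should serve the same purpose of making the split repeatable.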

5. Modeling

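%%time
# 5.1 Gaussian naive Bayes (%%time must be the first line of the cell)
gNB = GaussianNB()
gNB.fit(X_train.toarray(),y_train) # GaussianNB requires a dense matrix as input
gNB.score(X_test.toarray(),y_test) # toarray(): convert the sparse matrix to a dense array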
Wall time: 7.04 s
0.8994974874371859

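%%time
# 5.2 Bernoulli naive Bayes
# BernoulliNB models each word as present or absent (a Bernoulli variable), which fits short messages well
bNB = BernoulliNB() # accepts the sparse matrix directly
bNB.fit(X_train,y_train) # unlike GaussianNB, no dense conversion is needed
bNB.score(X_test,y_test)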
Wall time: 386 ms
0.9806173725771715
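
To show what the Bernoulli model estimates, here is a small from-scratch sketch with Laplace smoothing (alpha = 1). It treats any value above zero as "word present", mirroring BernoulliNB's default binarize=0.0. The function names and toy data are invented for illustration; this is not scikit-learn's code.

```python
import math

def fit_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate log priors and per-class word-presence probabilities."""
    n_features = len(rows[0])
    model = {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        n_c = len(class_rows)
        # Smoothed probability that feature j is present in class c
        probs = [
            (sum(1 for r in class_rows if r[j] > 0) + alpha) / (n_c + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (math.log(n_c / len(rows)), probs)
    return model

def predict_bernoulli_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for value, p in zip(row, probs):
            # Absent words contribute log(1 - p), so absence is evidence too
            score += math.log(p) if value > 0 else math.log(1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Columns: call, free, meeting, prize (made-up TF-IDF-like weights)
toy_rows = [[0.5, 0.7, 0, 0.5], [0.6, 0.8, 0, 0], [0.4, 0, 0.9, 0], [0, 0, 1.0, 0]]
toy_labels = ["spam", "spam", "ham", "ham"]
model = fit_bernoulli_nb(toy_rows, toy_labels)
print(predict_bernoulli_nb(model, [0, 0.3, 0, 0.9]))  # "free" and "prize" present
print(predict_bernoulli_nb(model, [0, 0, 0.2, 0]))    # only "meeting" present
```

On this toy data the first message is classified as spam and the second as ham. Because only presence matters, the TF-IDF magnitudes passed to this model are effectively ignored.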

%%time
# 5.3 Multinomial naive Bayes
mNB = MultinomialNB()
mNB.fit(X_train,y_train)
mNB.score(X_test,y_test)
Wall time: 39 ms
0.95908111988514
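
For comparison, a multinomial model uses the feature values themselves rather than mere presence. The sketch below is a minimal version with Laplace smoothing, where the per-class word probability is (total weight of the word in the class + alpha) / (total weight in the class + alpha × number of words). Names and toy counts are assumptions for this example only.

```python
import math

def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    n_features = len(rows[0])
    model = {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        totals = [sum(r[j] for r in class_rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_theta = [math.log((t + alpha) / denom) for t in totals]
        model[c] = (math.log(len(class_rows) / len(rows)), log_theta)
    return model

def predict_multinomial_nb(model, row):
    """Score = log prior + sum of (feature value * log word probability)."""
    scores = {
        c: log_prior + sum(x * lt for x, lt in zip(row, log_theta))
        for c, (log_prior, log_theta) in model.items()
    }
    return max(scores, key=scores.get)

# Columns: call, free, meeting, prize (made-up word counts)
toy_counts = [[1, 2, 0, 1], [1, 1, 0, 0], [1, 0, 2, 0], [0, 0, 1, 0]]
toy_labels = ["spam", "spam", "ham", "ham"]
mnb = fit_multinomial_nb(toy_counts, toy_labels)
print(predict_multinomial_nb(mnb, [0, 1, 0, 1]))  # "free" and "prize"
print(predict_multinomial_nb(mnb, [1, 0, 1, 0]))  # "call" and "meeting"
```

Unlike the Bernoulli variant, words that are absent from a message contribute nothing to the score here, which is one plausible reason the two models score differently on short messages.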

6. Prediction

# 6.1 Messages to classify (this reuses the name X_test and overwrites the test split from step 4)
X_test = ['Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify.I see the letter B on my car Please call now 08000930705 for delivery tomorrow',
          'Precious things are very few in the world,that is the reason there is only one you',
          "GENT! We are trying to contact you. Last weekends draw shows that you won a £1000 prize GUARANTEED. U don't know how stubborn I am. Congrats! 1 year special cinema pass for 2 is yours.",
          'Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!']
X_test
['Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify.I see the letter B on my car Please call now 08000930705 for delivery tomorrow',
 'Precious things are very few in the world,that is the reason there is only one you',
 "GENT! We are trying to contact you. Last weekends draw shows that you won a £1000 prize GUARANTEED. U don't know how stubborn I am. Congrats! 1 year special cinema pass for 2 is yours.",
 'Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!']
 
# 6.2 Convert the new messages with the already fitted CountVectorizer and TfidfTransformer
X_test_tf_idf = tf_idf.transform(cv.transform(X_test))  
X_test_tf_idf
<4x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 94 stored elements in Compressed Sparse Row format>
	
# 6.3 Predict
bNB.predict(X_test_tf_idf) # classify the new messages with the trained model
array(['spam', 'ham', 'spam', 'spam'], dtype='<U4')
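
One detail worth spelling out is why step 6.2 calls transform rather than fit_transform: the new messages have to be mapped onto the vocabulary learned from the training data, so the result keeps the same 8713 columns the model was trained on. The sketch below illustrates the idea with a hypothetical four-word vocabulary and an invented helper name; words that were never seen during fitting are simply dropped.

```python
import re

TOKEN_RE = re.compile(r"\b\w\w+\b")

def transform_with_vocabulary(doc, vocabulary):
    """Count tokens against a fixed vocabulary; the column count never changes."""
    row = [0] * len(vocabulary)
    for tok in TOKEN_RE.findall(doc.lower()):
        idx = vocabulary.get(tok)
        if idx is not None:  # words unseen during fitting are dropped
            row[idx] += 1
    return row

# Hypothetical vocabulary, standing in for the one learned by cv.fit_transform
train_vocab = {"call": 0, "free": 1, "now": 2, "prize": 3}
new_row = transform_with_vocabulary("FREE cinema pass, call now!", train_vocab)
print(new_row)
```

Here "cinema" and "pass" are not in the vocabulary, so they leave no trace in the row, while the row length stays equal to the vocabulary size.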