Case Overview
Spam filtering is one of the most fundamental machine learning case studies, and it touches on a fairly broad range of concepts, which makes it a good case to work through and summarize.
Goal:
Automatically identify spam emails. At its core this is a binary classification problem; we break it into steps to make it easier to implement.
Steps
1. Read the data and split it into two classes: spam and ham (normal mail).
def open_file(path):
    texts = []
    # Read the file line by line
    with open(path, encoding='utf-8') as lines:
        for line in lines:
            line = line.strip('\n')
            # The bag-of-words model counts individual words, so the Chinese text
            # must be segmented first with jieba. jieba.cut returns a generator,
            # which join() turns back into a space-separated string.
            seg_list = jieba.cut(clean_str(line), cut_all=False)
            text = " ".join(seg_list)
            # Accumulated in a list, e.g. ['wo 是 大 大', ...]
            texts.append(text)
    return texts
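The cleaning step above can be illustrated without jieba installed. Below is a minimal, standard-library-only sketch of the punctuation stripping that `clean_str` performs before segmentation (the sample sentence is made up for illustration):

```python
import re

# Same idea as the post's clean_str: remove common Chinese/English
# sentence punctuation before the text is segmented into tokens.
def clean_str(s):
    return re.sub(r'[.。,,“”!?!?]', '', s)

line = "你好,世界!hello."
cleaned = clean_str(line)
print(cleaned)  # -> 你好世界hello
```

After cleaning, jieba splits the remaining characters into words, and joining them with spaces gives CountVectorizer a whitespace-tokenizable string.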
2. Convert the text to word-frequency features. Email bodies are plain text, but the model needs numbers, so we use sklearn's CountVectorizer (from sklearn.feature_extraction.text import CountVectorizer) to do the conversion.
def format_data():
    # Load the two classes (ham labeled 1, spam labeled 0)
    x_pos = open_file('./data/ham_5000.utf8')
    x_neg = open_file('./data/spam_5000.utf8')
    # Build the label lists
    y_pos = [1] * len(x_pos)
    y_neg = [0] * len(x_neg)
    # Merge into one dataset
    X = x_pos + x_neg
    y = y_pos + y_neg
    # Split into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    # Vectorize: fit the vocabulary on the training set only,
    # then transform both sets into count vectors
    counter_vec = CountVectorizer(max_features=1000)
    x_train = counter_vec.fit_transform(x_train)
    x_test = counter_vec.transform(x_test)
    return x_train, x_test, y_train, y_test
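What CountVectorizer does can be sketched in a few lines of plain Python: build a vocabulary from the training documents, then map each document to a vector of token counts over that fixed vocabulary. This is an illustrative toy (the English documents are invented), not sklearn's actual implementation:

```python
from collections import Counter

# Toy training corpus, already whitespace-tokenized (as after jieba)
train_docs = ["spam spam offer", "hello friend hello"]

# 1) "Fit": collect the vocabulary from the training set only
vocab = sorted({tok for doc in train_docs for tok in doc.split()})

# 2) "Transform": count each vocabulary token in a document;
#    tokens unseen at fit time are simply ignored
def transform(doc):
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]

print(vocab)                        # ['friend', 'hello', 'offer', 'spam']
print(transform("spam offer now"))  # [0, 0, 1, 1] ('now' is dropped)
```

This is also why the post fits the vectorizer on the training set and only transforms the test set: the vocabulary must be fixed before the model sees test data.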
3. Train the model on the data. We use the simplest option, MultinomialNB, whose default Laplace smoothing means no hyperparameters need to be tuned.
def main():
    x_train, x_test, y_train, y_test = format_data()
    # print(x_train)  # uncomment to inspect the sparse count matrix
    model = MultinomialNB()
    model.fit(x_train, y_train)
    y_pre = model.predict(x_test)
    # Print per-class precision, recall, and F1
    print(classification_report(y_test, y_pre))
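To see what MultinomialNB with Laplace smoothing actually computes, here is a from-scratch sketch on a tiny invented corpus. It scores a document for each class c as log P(c) plus the sum of log P(w|c) over its tokens, with P(w|c) = (count(w,c) + alpha) / (total_c + alpha * |V|) and alpha = 1 (a simplified illustration of the model, not sklearn's code):

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    # Vocabulary over the whole training corpus
    vocab = {tok for doc in docs for tok in doc.split()}
    priors, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for d in class_docs for tok in d.split())
        denom = sum(counts.values()) + alpha * len(vocab)
        # Laplace smoothing: every word gets a nonzero probability
        cond[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return priors, cond

def predict_nb(priors, cond, doc):
    scores = {}
    for c in priors:
        score = priors[c]
        for tok in doc.split():
            if tok in cond[c]:  # ignore out-of-vocabulary tokens
                score += cond[c][tok]
        scores[c] = score
    return max(scores, key=scores.get)

priors, cond = train_nb(["free offer free", "hi mom"], [0, 1])
print(predict_nb(priors, cond, "free offer"))  # -> 0 (spam-like)
```

The smoothing term is why the model never assigns zero probability to a word it saw in one class but not the other.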

The classification report shows that precision and recall are both high, at about 97%.
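The precision and recall that classification_report prints can be computed by hand. Below is a minimal sketch for one class (spam, label 0 in this post), using a small invented prediction list:

```python
# Toy ground truth and predictions (0 = spam, 1 = ham)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true spam
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # ham flagged
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # spam missed

precision = tp / (tp + fp)  # of everything flagged spam, how much was spam
recall = tp / (tp + fn)     # of all actual spam, how much was caught
print(precision, recall)    # 2/3 each on this toy data
```

For spam filtering, precision on the spam class matters especially: a false positive means a legitimate mail lands in the junk folder.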
Data link: click
Full code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import jieba
import re

def clean_str(s):
    # Strip common Chinese/English sentence punctuation
    return re.sub(r'[.。,,“”!?!?]', '', s)

def open_file(path):
    texts = []
    with open(path, encoding='utf-8') as lines:
        for line in lines:
            line = line.strip('\n')
            # Segment with jieba, then rejoin with spaces for CountVectorizer
            seg_list = jieba.cut(clean_str(line), cut_all=False)
            texts.append(" ".join(seg_list))
    return texts

def format_data():
    # Load the two classes (ham labeled 1, spam labeled 0)
    x_pos = open_file('./data/ham_5000.utf8')
    x_neg = open_file('./data/spam_5000.utf8')
    # Build the label lists
    y_pos = [1] * len(x_pos)
    y_neg = [0] * len(x_neg)
    # Merge into one dataset
    X = x_pos + x_neg
    y = y_pos + y_neg
    # Split into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    # Vectorize: fit the vocabulary on the training set, transform both sets
    counter_vec = CountVectorizer(max_features=1000)
    x_train = counter_vec.fit_transform(x_train)
    x_test = counter_vec.transform(x_test)
    return x_train, x_test, y_train, y_test

def main():
    x_train, x_test, y_train, y_test = format_data()
    model = MultinomialNB()
    model.fit(x_train, y_train)
    y_pre = model.predict(x_test)
    print(classification_report(y_test, y_pre))

if __name__ == '__main__':
    main()

This post walks through a spam-filtering system built with jieba word segmentation and sklearn's MultinomialNB model. The data is first read in and segmented with jieba, then converted to word-frequency features with CountVectorizer. The dataset is split into training and test sets and used to train the model, which reaches about 97% precision and recall on the test set.