The Chinese spam dataset used here contains 100 ham (legitimate) emails and 50 spam emails. As training data that is far from enough, but it is plenty for walking through the overall workflow of handling Chinese spam.
Here is the basic structure of the dataset; it looks roughly like this:
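As a stand-in for the screenshot, the sketch below shows the layout the later code expects. Only the type column with "spam"/"ham" labels is confirmed by the filtering code further down; the text column name and the sample rows are invented purely for illustration.
# Illustrative sketch of the assumed layout (not the real data):
import pandas as pd

example = pd.DataFrame({
    "type": ["ham", "spam"],              # label column, used by the code below
    "text": ["会议纪要请查收", "限时优惠,点击领取大奖"],  # assumed name for the email body column
})
print(example)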
Without further ado, on to the code.
Import the packages:
%pylab inline   # notebook magic: pulls numpy/matplotlib into the namespace and enables inline plots
import matplotlib.pyplot as plt   # plotting
import pandas as pd               # dataframe holding the email table
import string
import codecs
import os
import jieba                      # Chinese word segmentation
from sklearn.feature_extraction.text import CountVectorizer  # bag-of-words features
from wordcloud import WordCloud   # word-cloud visualization
from sklearn import naive_bayes as bayes                     # naive Bayes classifiers
from sklearn.model_selection import train_test_split         # train/test split
Open the data file:
# open the Excel file (first sheet) from the local data directory
file_path = "C:\\Users\\lenovo\\Documents\\tfstudy"
emailframe = pd.read_excel(os.path.join(file_path, "chinesespam.xlsx"), 0)
Inspect the data:
# inspect the data: preview rows, overall shape, and per-class counts
print("inspect top five rows:")
print(emailframe.head(5))
print("data shape:", emailframe.shape)
print("spams in rows:", emailframe.loc[emailframe['type'] == "spam"].shape[0])
print("ham in rows:", emailframe.loc[emailframe['type'] == "ham"].shape[0])