1. Data source
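The data used here is already labeled; for a detailed description, see SMS Spam Collection v. 1. It can be downloaded from GitHub, where the data is in Chapter 4. Each row has two columns separated by a comma: the first column is the class (label) and the second is the message text.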
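Because many messages contain commas themselves, it can help to check the parsing on a tiny sample first. The following is a minimal sketch, not the actual dataset: the in-memory sample, its `v1,v2` header row, and the names `sample`, `demo`, and `label_counts` are assumptions made for illustration. It assumes that message text containing commas is wrapped in double quotes, as standard CSV requires.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the real file:
# one header row, then label,text. Text containing commas is quoted.
sample = io.StringIO(
    'v1,v2\n'
    'ham,"Go until jurong point, crazy.."\n'
    'spam,Free entry in 2 a wkly comp\n'
)

# header=0 consumes the file's own header row; names replaces it with ours.
demo = pd.read_csv(sample, sep=',', header=0, names=['label', 'text'])

# Count messages per class to confirm the labels parsed cleanly.
label_counts = demo['label'].value_counts()
print(demo.shape)
print(label_counts.to_dict())
```

If the shape shows more than two columns or the labels contain anything other than `ham` and `spam`, the quoting in the file is likely off. The real file is loaded the same way: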
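import pandas as pd
sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])
# Note: without parentheses, sms.head is not called; IPython shows the bound
# method's repr, which embeds the DataFrame. Use sms.head() for the first 5 rows.
sms.head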
Out[5]:
<bound method DataFrame.head of label text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
5 spam Free