[Data Science Project 02] NLP Application: Spam SMS and Email Detection (an End-to-End Project)

Article link: https://blog.youkuaiyun.com/2401_84164527/article/details/138845620


## 1. Data Collection and Loading


We will use a dataset provided on Kaggle: [dataset](https://bbs.youkuaiyun.com/topics/618545628)


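The dataset contains a collection of labeled SMS texts, each classified as a **normal message (ham)** or a **spam message (spam)**. Each row holds one message and has two columns: `v1` is the label (`spam` or `ham`) and `v2` is the text content.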

    import pandas as pd

    # encoding must be specified as latin-1 for this file
    df = pd.read_csv('/content/spam/spam.csv', encoding='latin-1')
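
To see why the `encoding` argument matters, here is a small standard-library sketch. The sample bytes are made up for illustration (they are not taken from `spam.csv`): a pound sign stored as Latin-1 is the single byte `0xA3`, which is not a valid UTF-8 start byte, so decoding as UTF-8 fails while `latin-1` succeeds.

```python
import csv
import io

# Hypothetical sample row, encoded the way a Latin-1 file would store it.
raw = "v1,v2\nspam,Win £1000 cash now\n".encode("latin-1")

# Decoding as UTF-8 fails: 0xA3 is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Latin-1 always succeeds, since every byte maps to a character.
rows = list(csv.reader(io.StringIO(raw.decode("latin-1"))))

print(utf8_ok)
print(rows[1])
```

Because Latin-1 decoding never raises, it can also silently mis-decode a file that is really in another encoding, so it is worth eyeballing a few rows after loading.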

Take a quick look at the data:

    df.head()




|  | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


In this data:

> * `v1` is the message label: **ham means a normal message, spam means a spam message**
> * `v2` is the message text

    # Drop the unneeded columns, keeping only the first two
    df = df.iloc[:, :2]

    # Rename the columns
    df = df.rename(columns={"v1": "label", "v2": "message"})
    df.head()




|  | label | message |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
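
As an alternative to slicing with `iloc` after loading, the unwanted columns can be skipped at read time. The sketch below is only an illustration: it assumes the empty `Unnamed` columns come from trailing commas in each row, and it uses a made-up two-row sample (`sample`, `df_small` are hypothetical names) in place of `spam.csv`.

```python
import io

import pandas as pd

# Made-up sample mimicking the layout above: two real columns followed by
# three empty ones (assumed here to come from trailing commas).
sample = (
    "v1,v2,,,\n"
    "ham,Ok lar... Joking wif u oni...,,,\n"
    "spam,Free entry in 2 a wkly comp,,,\n"
)

# Read only the two useful columns, then give them descriptive names.
df_small = pd.read_csv(io.StringIO(sample), usecols=["v1", "v2"])
df_small = df_small.rename(columns={"v1": "label", "v2": "message"})

print(list(df_small.columns))
```

Selecting columns by name is a little more robust than `iloc[:, :2]` if the column order ever changes, at the cost of depending on the header names.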

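Encode the `label` column as integers, where 0 is ham and 1 is spam. (`LabelEncoder` performs integer label encoding, not one-hot encoding.)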

    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    df['label'] = encoder.fit_transform(df['label'])
    df['label'].value_counts()

Output:

    0    4825
    1     747
    Name: label, dtype: int64
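
The mapping 0 = ham, 1 = spam follows from how `LabelEncoder` assigns codes: it sorts the distinct labels and numbers them in that order. A pure-Python sketch of the same idea, using the first five labels from the table above (the names `classes`, `mapping`, and `encoded` are illustrative and not part of scikit-learn):

```python
# First five labels from the df.head() table above.
labels = ["ham", "ham", "spam", "ham", "ham"]

# Sort the distinct labels and number them in that order.
classes = sorted(set(labels))
mapping = {name: index for index, name in enumerate(classes)}

# Replace every label with its integer code.
encoded = [mapping[name] for name in labels]

print(mapping)
print(encoded)
```

With the fitted scikit-learn encoder, `encoder.classes_` exposes the same sorted order, and `encoder.inverse_transform` maps the integers back to the original strings.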


The counts show 747 spam messages against 4,825 ham messages.
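
These counts imply a noticeable class imbalance. A quick arithmetic sketch (variable names are illustrative) computes the spam share and the accuracy that a trivial "always predict ham" classifier would reach, which is a useful baseline when judging any model trained on this data:

```python
# Counts taken from the value_counts() output above.
ham_count, spam_count = 4825, 747

total = ham_count + spam_count
spam_share = spam_count / total

# Accuracy of a classifier that labels every message as ham.
majority_baseline = ham_count / total

print(total)
print(round(spam_share * 100, 1))
print(round(majority_baseline * 100, 1))
```

Since the baseline is already high, accuracy alone can be misleading here; metrics such as precision and recall on the spam class are usually more informative.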

Check for missing values:

    df.isnull().sum()

The data has no missing values:

    label      0
    message    0
    dtype: int64


## 2. Exploratory Data Analysis (EDA)


Visualizing the data helps us understand it better.

    import matplotlib.pyplot as plt
    plt.style.use('ggplot')
    plt.figure(figsize=(9,4))
    plt.subplot(1,2,1)
    plt.pie(df['label'].va