## 1. Data Collection and Loading
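We will use a dataset from Kaggle: [dataset](https://bbs.youkuaiyun.com/topics/618545628)

The dataset contains a set of labeled SMS messages, each classified as **ham** (legitimate) or **spam**. Each row holds one message and has two columns: `v1` is the label (`spam` or `ham`) and `v2` is the message text.

    import pandas as pd

    # The encoding must be set to latin-1 here
    df = pd.read_csv('/content/spam/spam.csv', encoding='latin-1')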
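
The comment above does not say why `latin-1` is needed. A likely reason, stated here as an assumption, is that the file contains bytes that are not valid UTF-8, which is the default encoding pandas uses. The sketch below uses a made-up byte string (not taken from the dataset) to show the difference between the two decoders:

```python
# Hypothetical raw bytes, not taken from spam.csv: 0xA3 is the pound sign in latin-1.
raw = b"Win \xa3100 now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xA3 on its own is not a valid start of a UTF-8 character.
    utf8_ok = False

# latin-1 maps every byte 0x00-0xFF to exactly one code point, so decoding cannot fail.
text = raw.decode("latin-1")

print(utf8_ok)
print(text)
```

The trade-off is that `latin-1` never raises an error, so a file that is really in some other encoding would load silently with wrong characters.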
Take a quick look at the data:

    df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |

The data is a set of labeled SMS messages, where:
>
> * `v1` is the message label: **ham is a legitimate message, spam is a spam message**
> * `v2` is the message text

    # Drop the columns we don't need
    df = df.iloc[:, :2]
    # Rename the columns
    df = df.rename(columns={"v1": "label", "v2": "message"})
    df.head()

| | label | message |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
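
Slicing with `iloc[:, :2]` relies on the label and the text being the first two columns. As an alternative sketch, the columns can be selected by name, which keeps working if the column order changes. The small DataFrame below is made up to mirror the layout shown above; it is not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the raw layout (values are illustrative only).
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Ok lar...", "Free entry..."],
    "Unnamed: 2": [np.nan, np.nan],
})

# Select by name, then rename, instead of relying on column positions.
clean = raw[["v1", "v2"]].rename(columns={"v1": "label", "v2": "message"})

print(clean.columns.tolist())
```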

Encode `label` as integers with `LabelEncoder`, where 0 is ham and 1 is spam (this is label encoding, not one-hot encoding):

    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    df['label'] = encoder.fit_transform(df['label'])
    df['label'].value_counts()

0 4825
1 747
Name: label, dtype: int64

This shows 747 spam messages out of 5,572 in total (about 13%), so the two classes are imbalanced.
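
`LabelEncoder` numbers the classes in sorted order, which is why `ham` becomes 0 and `spam` becomes 1. The plain-Python sketch below mimics that rule on a few made-up labels, so the mapping can be checked without scikit-learn. It illustrates the behavior and is not the library's implementation.

```python
# Made-up labels for illustration.
labels = ["ham", "spam", "ham", "ham"]

# Sort the unique labels and use each one's position as its integer code.
classes = sorted(set(labels))
mapping = {name: code for code, name in enumerate(classes)}
encoded = [mapping[name] for name in labels]

print(mapping)
print(encoded)
```

With the real encoder, `encoder.classes_` holds the same sorted class list after fitting, so it can be inspected to confirm which integer stands for which label.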

Check for missing values:

    df.isnull().sum()

The data has no missing values:

label 0
message 0
dtype: int64

## 2. Exploratory Data Analysis (EDA)
Visualizing the data helps us understand it better.

    import matplotlib.pyplot as plt

    plt.style.use('ggplot')
    plt.figure(figsize=(9, 4))
    plt.subplot(1, 2, 1)
    plt.pie(df['label'].va