I: Dataset overview
1: Downloading the dataset
https://archive.ics.uci.edu/ml/datasets/Wine+Quality
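I use the red wine samples here.
Features and labels
Features: 11; label: red wine quality, scored between 0 and 10, giving 11 possible classes
2: Inspecting the dataset
Opened in a spreadsheet, all the data sits in a single column, so it needs to be split.
II: Data processing
1: Splitting the data into columns
Within that single column the values are separated by semicolons, so the column can be split on that delimiter.
Select the data to split > Data menu > Text to Columns > Delimited > Semicolon > OK
The data after splitting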
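As an alternative to splitting the column by hand in a spreadsheet, pandas can parse a semicolon-delimited file directly through the `sep` parameter of `read_csv`. The sketch below is only an illustration: it reads a small in-memory sample written in the semicolon format described above (values taken from the first rows of the preview shown in this post) rather than the real file, and the names `raw` and `sample` are made up for the example.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the raw file: one header line and
# three rows, with fields separated by semicolons.
raw = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";'
    '"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";'
    '"pH";"sulphates";"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
    "7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5\n"
)

# sep=";" tells pandas to split fields on semicolons instead of commas.
sample = pd.read_csv(raw, sep=";")

print(sample.shape)
print(list(sample.columns))
```

With the real file, the same call would take the file path in place of `raw`, which makes the manual splitting step optional.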
2: Loading the data
import pandas as pd
# Load the data
data = pd.read_csv("F:\\书籍学习:python数据挖掘与机器学习实战\\葡萄酒数据集的随机森林分类\\winequality-red.csv")
data.head()  # Preview the first five rows
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
# Import all required libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
3: Splitting the data into features and labels
# Drop the label column by keyword; the positional form data.drop('quality', 1)
# raises a FutureWarning in pandas 1.x and is no longer accepted in pandas 2.x
features = data.drop(columns='quality')
# Alternative: features = data.iloc[:, :11]  # take the first 11 columns
labels = data['quality']
print(features.shape)
print(labels.shape)
(1599, 11)
(1599,)
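Although the quality score is nominally on a 0 to 10 scale, it can be worth checking which scores actually occur and how often, since a classifier cannot learn classes that never appear. A minimal sketch, assuming a Series like `labels` above; `sample_labels` is a hypothetical stand-in holding only the five quality values from the `data.head()` preview, so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the real labels: the quality values of the five preview rows.
sample_labels = pd.Series([5, 5, 5, 6, 5], name="quality")

# Count how many samples fall into each quality score, ordered by score.
class_counts = sample_labels.value_counts().sort_index()
print(class_counts)

# Number of distinct classes actually present in the data.
print(sample_labels.nunique())
```

Running the same two calls on the full `labels` Series would show the class balance of the whole dataset.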
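III: Data analysis
1: Descriptive statistics
# Descriptive statistics
print(features.describe())
# Histograms
# hist() plots one histogram per feature so their distributions can be compared
features.hist()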
| | fixed acidity | volatile acidity | citric acid | residual sugar |
|---|---|---|---|---|
| count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 |
| std | 1.741096 | 0.179060 | 0.194801 | 1.409928 |
| min | 4.600000 | 0.120000 | 0.000000 | 0.900000 |
| 25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 |
| 50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 |
| 75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 |
| max | 15.900000 | 1.580000 | 1.000000 | 15.500000 |

| | chlorides | free sulfur dioxide | total sulfur dioxide | density |
|---|---|---|---|---|
| count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
| mean | 0.087467 | 15.874922 | 46.467792 | 0.996747 |
| std | 0.047065 | 10.460157 | 32.895324 | 0.001887 |
| min | 0.012000 | 1.000000 | 6.000000 | 0.990070 |
| 25% | 0.070000 | 7.000000 | 22.000000 | 0.995600 |
| 50% | 0.079000 | 14.000000 | 38.000000 | 0.996750 |
| 75% | 0.090000 | 21.000000 | 62.000000 | 0.997835 |
| max | 0.611000 | 72.000000 | 289.000000 | 1.003690 |

| | pH | sulphates | alcohol |
|---|---|---|---|
| count | 1599.000000 | 1599.000000 | 1599.000000 |
| mean | 3.311113 | 0.658149 | 10.422983 |
| std | 0.154386 | 0.169507 | 1.065668 |
| min | 2.740000 | 0.330000 | 8.400000 |
| 25% | 3.210000 | 0.550000 | 9.500000 |
| 50% | 3.310000 | 0.620000 | 10.200000 |
| 75% | 3.400000 | 0.730000 | 11.100000 |
| max | 4.010000 | 2.000000 | 14.900000 |
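With 11 columns, the printed summary wraps into several blocks. One option that may be easier to scan is to transpose the result so that each feature becomes a row. A small sketch on a hypothetical three-column sample built from the preview rows shown earlier (the names `sample` and `summary` are assumptions for the example):

```python
import pandas as pd

# Hypothetical sample: three of the feature columns from the preview rows.
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
})

# describe() returns a DataFrame, so .T transposes it: one row per feature,
# one column per statistic (count, mean, std, min, 25%, 50%, 75%, max).
summary = sample.describe().T
print(summary)
```

On the full data, `features.describe().T` would give 11 rows and 8 columns, which usually fits on screen without wrapping.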
[Figure: histograms produced by features.hist(), one panel per feature: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol]