Day 5 Check-in: One-Hot Encoding

@浙大疏锦行

A review of yesterday's methods for filling missing values

fillna() is the core function for filling missing values. Below are some commonly used statistical functions.

When filling missing values, fill numeric data with the median using the median() method, and fill categorical data with the mode using the mode() method.
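As a minimal sketch of that rule, here is the median/mode fill on a small made-up frame. The column names and values are hypothetical and are not taken from data.csv.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the dataset used in these notes)
toy = pd.DataFrame({
    "income": [10.0, 20.0, np.nan, 1000.0],
    "term": ["Short", None, "Short", "Long"],
})

# Numeric column: the median is not pulled up by the outlier 1000.0
income_median = toy["income"].median()
toy["income"] = toy["income"].fillna(income_median)

# Categorical column: mode() returns a Series, so take its first element
term_mode = toy["term"].mode()[0]
toy["term"] = toy["term"].fillna(term_mode)

print(toy)
```

The mean of the same income column would be dragged far above the typical value by the single outlier, which is the usual argument for preferring the median on skewed numeric data.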

Some methods for converting to a list
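The list of methods from the original notes is not reproduced here. As a rough sketch, these are a few conversions that pandas does provide, shown on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame used only to illustrate the conversions
toy = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

col_names = toy.columns.tolist()    # Index -> list of column names
col_names_alt = list(toy.columns)   # same result through the list() builtin
values_a = toy["a"].tolist()        # Series -> list of values
rows = toy.to_numpy().tolist()      # DataFrame -> list of row lists

print(col_names, values_a, rows)
```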

Iterate over only the columns that have missing values and fill them in a loop

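# Loop over the columns, check each dtype, and fill numeric columns with the median and categorical columns with the mode.
# Earlier inspection showed that only a few columns have missing values, so iterate over just those columns to avoid unnecessary work.
missing_cols = data_csv.columns[data_csv.isnull().sum() > 0]
print(f'Columns with missing values to process: {missing_cols.tolist()}')
for column in missing_cols:
    if data_csv[column].dtype in ['float64', 'int64']:
        # median() must be called; passing the method object itself does not fill with a number
        data_csv[column] = data_csv[column].fillna(data_csv[column].median())
    else:
        mode_val = data_csv[column].mode()
        if not mode_val.empty:
            data_csv[column] = data_csv[column].fillna(mode_val[0])
        else:
            data_csv[column] = data_csv[column].fillna(' ')
print('Missing values after filling:\n', data_csv.isnull().sum())


Columns with missing values to process: []
Missing values after filling: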
 Id                              0
Home Ownership                  0
Annual Income                   0
Years in current job            0
Tax Liens                       0
Number of Open Accounts         0
Years of Credit History         0
Maximum Open Credit             0
Number of Credit Problems       0
Months since last delinquent    0
Bankruptcies                    0
Purpose                         0
Term                            0
Current Loan Amount             0
Current Credit Balance          0
Monthly Debt                    0
Credit Score                    0
Credit Default                  0
dtype: int64

Now for today's material

Today's task breaks down into the following steps

1. Read the data

2. Find all discrete features

# 1. Read the data
import pandas as pd
data = pd.read_csv(r'data.csv')
print(data.columns)
# 2. Find all discrete features and store them in a list so each one can be one-hot encoded later
# The list has to be initialized as empty first so that append() can be called on it
discrete_feature = []
for feature in data.columns:
    if data[feature].dtype == 'object':
        discrete_feature.append(feature)
        print(feature)
print(type(discrete_feature))

Index(['Id', 'Home Ownership', 'Annual Income', 'Years in current job',
       'Tax Liens', 'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems',
       'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',
       'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
       'Credit Score', 'Credit Default'],
      dtype='object')
Home Ownership
Years in current job
Purpose
Term
<class 'list'>
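The same list can probably be built without an explicit loop. This is a sketch on a hypothetical three-column frame, not the real data.csv. It uses select_dtypes(exclude="number"), which is an assumption on my part about what counts as "discrete": it would also pick up bool or datetime columns if the data had any.

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy = pd.DataFrame({
    "Id": [0, 1],
    "Term": ["Short Term", "Long Term"],
    "Annual Income": [482087.0, 1025487.0],
})

# Keep every column that is not numeric
discrete_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(discrete_cols)
```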

3. Pick one discrete feature and one-hot encode it

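Encoding discrete features usually falls into two cases: variables whose values have an inherent order, called ordinal variables, and variables with no rank or order at all (nominal variables).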

Ordinal variables can be encoded with label encoding.
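A minimal sketch of label encoding for an ordinal variable, using an explicit mapping. The feature and its order below are invented for illustration; the point is that the integers follow the intended order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical ordinal feature; the order small < medium < large is an assumption
sizes = pd.Series(["small", "large", "medium", "small"])
order = {"small": 0, "medium": 1, "large": 2}

# map() replaces each label with its rank
encoded = sizes.map(order)
print(encoded.tolist())  # [0, 2, 1, 0]
```

Writing the mapping by hand keeps control over the order. An encoder that sorts labels alphabetically would put "large" before "medium" here.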

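Variables with no order can be one-hot encoded, which represents a discrete feature as a 0/1 matrix. With k categories, one-hot encoding in the strict sense produces k binary columns, but k-1 columns are enough to encode every category, because the dropped category is identified by all of the remaining columns being 0.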
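To make the k versus k-1 point concrete, here is a small sketch with a hypothetical three-category column. With drop_first=True, pandas drops the first category and the row that belonged to it becomes all zeros.

```python
import pandas as pd

# Hypothetical nominal feature with k = 3 categories
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot: one column per category (k columns)
full = pd.get_dummies(toy, columns=["color"], dtype=int)

# Reduced encoding: the first category ("blue") is dropped (k-1 columns)
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced.iloc[2].tolist())  # the "blue" row is encoded as all zeros
```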

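`value_counts()` is a pandas Series method that counts how often each unique value occurs and returns a Series sorted in descending order of count. This is important for understanding the data distribution, especially for categorical data.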

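Categorical data has to be converted to numeric form, for example with one-hot encoding. Before encoding, it is important to know how each category is distributed, because that affects the later processing. For example, if one category dominates, it may need special handling, or it may point to an imbalanced-data problem worth checking.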

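The main reasons for using this method are: 1. to understand the data distribution and judge whether rare categories should be merged; 2. to check for abnormal or missing values; 3. to inform feature engineering, such as deciding between one-hot encoding and another encoding scheme.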
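A short sketch of the options that support those three checks, on a hypothetical Series. normalize=True gives proportions, and dropna=False makes missing values show up in the counts.

```python
import pandas as pd

# Hypothetical categorical column with one missing value
s = pd.Series(["Rent", "Home Mortgage", "Rent", "Own Home", None, "Rent"])

counts = s.value_counts()                # counts, missing values excluded
shares = s.value_counts(normalize=True)  # proportions instead of counts
with_nan = s.value_counts(dropna=False)  # missing values counted as well

print(counts)
print(shares)
print(with_nan)
```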

# 3. Pick one discrete feature and one-hot encode it
print('Selected discrete feature:', discrete_feature[0])
print(data[discrete_feature[0]])
print(data[discrete_feature[0]].value_counts())

Home Ownership
Home Mortgage    3637
Rent             3204
Own Home          647
Have Mortgage      12
Name: count, dtype: int64

The categories have no inherent order and no relationship to one another.

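Strictly speaking, whether to merge the rare categories should be decided from the real-world meaning of the data. I don't know the background well, so I'll skip any extra processing for now and just get familiar with how to do one-hot encoding.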
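For reference only, merging rare categories could look roughly like the sketch below. It rebuilds a column from the counts printed above. The 1% threshold and the "Other" label are assumptions for illustration, not something these notes apply to the real data.

```python
import pandas as pd

# Rebuild a column with the category counts printed above
home = pd.Series(
    ["Home Mortgage"] * 3637 + ["Rent"] * 3204
    + ["Own Home"] * 647 + ["Have Mortgage"] * 12
)

threshold = 0.01  # assumed cutoff: categories under 1% count as rare
shares = home.value_counts(normalize=True)
rare = shares[shares < threshold].index

# Replace rare labels with a single placeholder label (hypothetical name)
merged = home.where(~home.isin(rare), "Other")
print(merged.value_counts())
```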

if discrete_feature:
    # drop_first=True avoids multicollinearity
    data = pd.get_dummies(data, columns=[discrete_feature[0]], prefix=discrete_feature[0], drop_first=True)
    print("First 5 rows after one-hot encoding:\n", data.head())
else:
    print("The discrete feature list is empty; nothing to one-hot encode")
print(data.columns)

First 5 rows after one-hot encoding:
    Id  Annual Income  ... Home Ownership_Own Home  Home Ownership_Rent
0   0       482087.0  ...                    True                False       
1   1      1025487.0  ...                    True                False       
2   2       751412.0  ...                   False                False       
3   3       805068.0  ...                    True                False       
4   4       776264.0  ...                   False                 True       

[5 rows x 20 columns]
Index(['Id', 'Annual Income', 'Years in current job', 'Tax Liens',
       'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems',
       'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',    
       'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',      
       'Credit Score', 'Credit Default', 'Home Ownership_Home Mortgage',     
       'Home Ownership_Own Home', 'Home Ownership_Rent'],
      dtype='object')

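The bool output should be converted to int to make later computations easier.

Common methods for type conversion are as follows:

Method 2 is clearly not recommended. For convenience I use method 3 here (the dtype=int argument passed to get_dummies). The updated code below completes the type conversion.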

if discrete_feature:
    # drop_first=True avoids multicollinearity
    data = pd.get_dummies(data, columns=[discrete_feature[0]], prefix=discrete_feature[0], drop_first=True, dtype=int)
    print("First 5 rows after one-hot encoding:\n", data.head())
else:
    print("The discrete feature list is empty; nothing to one-hot encode")
print(data.columns)

First 5 rows after one-hot encoding:
    Id  ...  Home Ownership_Rent
0   0  ...                    0
1   1  ...                    0
2   2  ...                    0
3   3  ...                    0
4   4  ...                    1

[5 rows x 20 columns]
Index(['Id', 'Annual Income', 'Years in current job', 'Tax Liens',
       'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems',  
       'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',
       'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
       'Credit Score', 'Credit Default', 'Home Ownership_Home Mortgage',
       'Home Ownership_Own Home', 'Home Ownership_Rent'],   
      dtype='object')
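If the dummy columns have already been created as bool, they can also be converted afterwards. This is a sketch on a hypothetical frame; dtype=bool is passed explicitly so the starting point is the same regardless of pandas version.

```python
import pandas as pd

# Hypothetical frame with one discrete column
toy = pd.DataFrame({
    "Id": [0, 1, 2],
    "Term": ["Short Term", "Long Term", "Short Term"],
})

# Produce bool dummy columns first
encoded = pd.get_dummies(toy, columns=["Term"], drop_first=True, dtype=bool)

# Convert only the bool columns to int, leaving the others untouched
bool_cols = encoded.select_dtypes(include="bool").columns
encoded[bool_cols] = encoded[bool_cols].astype(int)

print(encoded)
```

Passing dtype=int to get_dummies, as in the code above, reaches the same result in one step, so this variant is mainly useful when the encoding was done earlier.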

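4. Loop over all discrete features and one-hot encode them

The first discrete feature was already encoded in step 3, so its original column no longer exists. Encoding all features afterwards therefore raises an error, and the step 3 code has to be commented out first.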

# The discrete features are already stored in a list, so only the encoding step remains. get_dummies accepts multiple column names, so no explicit loop is needed; it handles each column internally
if discrete_feature:
    # drop_first=True avoids multicollinearity
    data = pd.get_dummies(data, columns=discrete_feature, prefix=discrete_feature, drop_first=True, dtype=int)
    print("First 5 rows after one-hot encoding:\n", data.head())
else:
    print("The discrete feature list is empty; nothing to one-hot encode")
print("All column names after encoding:\n", data.columns.tolist())

First 5 rows after one-hot encoding:
    Id  Annual Income  Tax Liens  ...  Purpose_vacation  Purpose_wedding  Term_Short Term
0   0       482087.0        0.0  ...                 0                0                1
1   1      1025487.0        0.0  ...                 0                0                0
2   2       751412.0        0.0  ...                 0                0                1
3   3       805068.0        0.0  ...                 0                0                1
4   4       776264.0        0.0  ...                 0                0                1

[5 rows x 42 columns]
All column names after encoding:
 ['Id', 'Annual Income', 'Tax Liens', 'Number of Open Accounts', 'Years of Credit History', 'Maximum Open Credit', 'Number of Credit Problems', 'Months since last delinquent', 'Bankruptcies', 'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt', 'Credit Score', 'Credit Default', 'Home Ownership_Home Mortgage', 'Home Ownership_Own Home', 'Home Ownership_Rent', 'Years in current job_10+ years', 'Years in current job_2 years', 'Years in current job_3 years', 'Years in current job_4 years', 'Years in current job_5 years', 'Years in current job_6 years', 'Years in current job_7 years', 'Years in current job_8 years', 'Years in current job_9 years', 'Years in current job_< 1 year', 'Purpose_buy a car', 'Purpose_buy house', 'Purpose_debt consolidation', 'Purpose_educational expenses', 'Purpose_home improvements', 'Purpose_major purchase', 'Purpose_medical bills', 'Purpose_moving', 'Purpose_other', 'Purpose_renewable energy', 'Purpose_small business', 'Purpose_take a trip', 'Purpose_vacation', 'Purpose_wedding', 'Term_Short Term']
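Instead of commenting out step 3, the error from re-encoding an already encoded column could be avoided by filtering the list first. This is a sketch on a hypothetical two-column frame; the variable name remaining is mine.

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy = pd.DataFrame({
    "Home Ownership": ["Rent", "Own Home", "Rent"],
    "Term": ["Short Term", "Long Term", "Short Term"],
})
discrete_feature = ["Home Ownership", "Term"]

# First pass encodes a single column, as in step 3
toy = pd.get_dummies(toy, columns=["Home Ownership"], drop_first=True, dtype=int)

# Second pass: keep only the discrete columns that still exist
remaining = [col for col in discrete_feature if col in toy.columns]
toy = pd.get_dummies(toy, columns=remaining, drop_first=True, dtype=int)

print(remaining)
print(toy.columns.tolist())
```

With the filter in place the cell can be rerun without failing, because an already encoded column is simply skipped.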

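5. Combine with yesterday's material and handle all missing values

# 5. Combine with yesterday's material and handle all missing values
data2 = pd.read_csv(r'data.csv')
# Set difference gives the columns created by encoding; note that the order of the result is arbitrary
list_final = list(set(data.columns) - set(data2.columns))
print(list_final)
print(data.dtypes)
data.isnull().sum()
# Fill missing values with the column mean
for i in data.columns:
    if data[i].isnull().sum() > 0:
        data[i] = data[i].fillna(data[i].mean())
print(data.isnull().sum())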

['Purpose_vacation', 'Purpose_buy a car', 'Purpose_moving', 'Purpose_educational expenses', 'Years in current job_10+ years', 'Years in current job_2 years', 'Purpose_medical bills', 'Purpose_take a trip', 'Years in current job_8 years', 'Purpose_major purchase', 'Years in current job_3 years', 'Years in current job_4 years', 'Purpose_debt consolidation', 'Purpose_renewable energy', 'Years in current job_9 years', 'Purpose_home improvements', 'Home Ownership_Rent', 'Years in current job_7 years', 'Home Ownership_Home Mortgage', 'Term_Short Term', 'Years in current job_6 years', 'Purpose_wedding', 'Years in current job_< 1 year', 'Purpose_other', 'Purpose_buy house', 'Home Ownership_Own Home', 'Years in current job_5 years', 'Purpose_small business']
Id                                  int64
Annual Income                     float64
Tax Liens                         float64
Number of Open Accounts           float64
Years of Credit History           float64
Maximum Open Credit               float64
Number of Credit Problems         float64
Months since last delinquent      float64
Bankruptcies                      float64
Current Loan Amount               float64
Current Credit Balance            float64
Monthly Debt                      float64
Credit Score                      float64
Credit Default                      int64
Home Ownership_Home Mortgage        int64
Home Ownership_Own Home             int64
Home Ownership_Rent                 int64
Years in current job_10+ years      int64
Years in current job_2 years        int64
Years in current job_3 years        int64
Years in current job_4 years        int64
Years in current job_5 years        int64
Years in current job_6 years        int64
Years in current job_7 years        int64
Years in current job_8 years        int64
Years in current job_9 years        int64
Years in current job_< 1 year       int64
Purpose_buy a car                   int64
Purpose_buy house                   int64
Purpose_debt consolidation          int64
Purpose_educational expenses        int64
Purpose_home improvements           int64
Purpose_major purchase              int64
Purpose_medical bills               int64
Purpose_moving                      int64
Purpose_other                       int64
Purpose_renewable energy            int64
Purpose_small business              int64
Purpose_take a trip                 int64
Purpose_vacation                    int64
Purpose_wedding                     int64
Term_Short Term                     int64
dtype: object
Id                                0
Annual Income                     0
Tax Liens                         0
Number of Open Accounts           0
Years of Credit History           0
Maximum Open Credit               0
Number of Credit Problems         0
Months since last delinquent      0
Bankruptcies                      0
Current Loan Amount               0
Current Credit Balance            0
Monthly Debt                      0
Credit Score                      0
Credit Default                    0
Home Ownership_Home Mortgage      0
Home Ownership_Own Home           0
Home Ownership_Rent               0
Years in current job_10+ years    0
Years in current job_2 years      0
Years in current job_3 years      0
Years in current job_4 years      0
Years in current job_5 years      0
Years in current job_6 years      0
Years in current job_7 years      0
Years in current job_8 years      0
Years in current job_9 years      0
Years in current job_< 1 year     0
Purpose_buy a car                 0
Purpose_buy house                 0
Purpose_debt consolidation        0
Purpose_educational expenses      0
Purpose_home improvements         0
Purpose_major purchase            0
Purpose_medical bills             0
Purpose_moving                    0
Purpose_other                     0
Purpose_renewable energy          0
Purpose_small business            0
Purpose_take a trip               0
Purpose_vacation                  0
Purpose_wedding                   0
Term_Short Term                   0
dtype: int64
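The fill loop can likely be written as a single call, since fillna accepts a Series of per-column fill values. This is a sketch on a hypothetical frame, assuming all columns are numeric after encoding, as the dtypes above show.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with gaps in two columns
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 8.0],
    "c": [1, 0, 1],
})

# mean() returns one value per column; fillna matches them up by column name
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)
```

Swapping mean for median in the same call would match the median rule from the recap at the top of these notes.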

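I have a rough grasp of the workflow now, but I'm still not comfortable using the debugger.

Tomorrow I'll try the same steps on the data for project two.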

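### How to use one-hot encoding in time series forecasting

#### Use cases

In time series forecasting tasks, categorical variables (such as season, month, or day of the week) are usually non-numeric features. One-hot encoding converts them into numeric form so that a machine learning model can understand and use them[^2].

#### Data preprocessing

Categorical variables in time series data can be one-hot encoded as follows:

- **Extract date features**: parse the year, quarter, month, day of the week, and similar information from the raw timestamp field.
- **Apply one-hot encoding**: convert the extracted categorical variables into binary vectors.

The following Python example shows the workflow:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a mock time series dataset
data = {'date': ['2023-01-01', '2023-01-02', '2023-01-03'], 'value': [10, 20, 30]}
df = pd.DataFrame(data)

# Convert the 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])

# Extract date features
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek

# Initialize the encoder (scikit-learn 1.2+ uses sparse_output; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)

# One-hot encode the 'month' and 'day_of_week' columns
encoded_features = encoder.fit_transform(df[['month', 'day_of_week']])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['month', 'day_of_week']))

# Merge the original frame with the one-hot encoded frame
final_df = pd.concat([df.drop(columns=['month', 'day_of_week']), encoded_df], axis=1)
print(final_df)
```

This script shows how to first extract useful information from the date field and then use the `OneHotEncoder` class from `sklearn` to do the conversion[^2].

#### Caveats

Two points need attention when adding one-hot encoded features to the final dataset:

1. Risk of dimensionality explosion: as the number of categories grows, so does the number of new dimensions, which can consume excessive compute and invite overfitting;
2. Feature importance: some algorithms are affected by the added dummy variables in ways that change how the model is interpreted, so the practical meaning and contribution of each new feature should be weighed before modeling.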