Handling Discrete Features with the get_dummies() Method

勤劳的大乐乐

License: CC 4.0 BY-SA

Category: python. Tags: pandas, encoding

Article link: https://blog.youkuaiyun.com/qq_26255311/article/details/90256333

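This article introduces the get_dummies method in the pandas library for encoding discrete features, including when it applies and how its parameters are set. get_dummies is mainly used for one-hot encoding and suits discrete features whose values have no ordering. The article also covers numeric mapping for discrete features whose values are ordered. When using get_dummies, parameters such as prefix, prefix_sep, and dummy_na customize the encoded result. For example, setting drop_first=True drops the first level of each feature, so k levels yield k-1 columns and the redundant column is avoided.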

Official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

get_dummies(): performs one-hot encoding on discrete data.
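As a minimal sketch of what one-hot encoding produces, the snippet below encodes a small Series. The sample data is made up for illustration, and dtype=int is passed explicitly so the output is 0/1 regardless of the pandas version's default dtype.

```python
import pandas as pd

# Hypothetical sample data: one discrete feature whose values have no ordering.
colors = pd.Series(["red", "blue", "red", "green"], name="color")

# One indicator column is created per distinct value.
# dtype=int is set explicitly so the columns hold 0/1 integers.
encoded = pd.get_dummies(colors, dtype=int)
print(encoded)
```

Each row contains exactly one 1, in the column that matches its original value.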

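Encoding discrete features falls into two cases:

1. The values of the feature have no ordering, for example color: [red, blue]. In that case, use one-hot encoding.

2. The values of the feature are ordered, for example size: [X, XL, XXL]. In that case, map the values to numbers, such as {X: 1, XL: 2, XXL: 3}.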
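The two cases can be combined on one DataFrame, roughly as follows. This is a sketch with invented sample rows; the name size_order is a hypothetical helper, and the mapping values follow the example in the text.

```python
import pandas as pd

# Hypothetical sample data with one unordered and one ordered feature.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["X", "XXL", "XL"],
})

# Case 2: ordered feature, mapped to numbers that preserve the ordering.
size_order = {"X": 1, "XL": 2, "XXL": 3}
df["size"] = df["size"].map(size_order)

# Case 1: unordered feature, one-hot encoded (only the color column).
df = pd.get_dummies(df, columns=["color"], dtype=int)
print(df)
```

Series.map returns NaN for any value missing from the dictionary, so it is worth checking that the mapping covers every level.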

Usage of get_dummies():

Parameters:

data : array-like, Series, or DataFrame
prefix : string, list of strings, or dict of strings, default None
String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
prefix_sep : string, default '_'. If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
dummy_na : bool, default False. Add a column to indicate NaNs; if False, NaNs are ignored.
columns : list-like, default None. Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
sparse : bool, default False. Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
drop_first : bool, default False. Whether to get k-1 dummies out of k categorical levels by removing the first level. New in version 0.18.0.
dtype : dtype, default np.uint8 (bool since pandas 2.0). Data type for new columns. Only a single dtype is allowed. New in version 0.23.0.
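The effect of several of these parameters can be seen in a short sketch. The DataFrame below is invented for illustration, and the variable names (by_prefix, with_na, k_minus_1) are hypothetical; dtype=int is passed throughout so the output does not depend on the version's default dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "color" contains one missing value.
df = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red"],
    "shape": ["circle", "square", "circle", "circle"],
    "price": [1.0, 2.5, 3.0, 4.2],
})

# columns + prefix + prefix_sep: encode only "color" and name the
# new columns "c-blue" and "c-red"; other columns pass through unchanged.
by_prefix = pd.get_dummies(
    df, columns=["color"], prefix="c", prefix_sep="-", dtype=int
)
print(by_prefix)

# dummy_na=True appends an extra indicator column for missing values.
with_na = pd.get_dummies(df["color"], dummy_na=True, dtype=int)
print(with_na)

# drop_first=True drops the first level of each encoded feature,
# leaving k-1 columns for a feature with k levels.
k_minus_1 = pd.get_dummies(df[["color", "shape"]], drop_first=True, dtype=int)
print(k_minus_1)
```

With dummy_na left at False, the row with the missing color is encoded as all zeros. Combined with drop_first=True, that row becomes indistinguishable from the dropped level, so dummy_na=True may be preferable when missing values matter.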