Categorical Variables and One-Hot Encoding


Categorical Variables

Categorical variables are those that can take values from a finite set. For some of these variables, the values have an intrinsic ordering (e.g., speed: {low, medium, high}); such variables are called Ordinal Categorical Variables. Other categorical variables have no intrinsic ordering (e.g., Gender: {Male, Female}); such variables are called Nominal Categorical Variables.

If you are familiar with the common practice of encoding categorical variables into numbers, you know that it is often suggested to one-hot encode them. There are two reasons for that:

  1. Most machine learning algorithms cannot work with categorical variables directly; the values need to be converted to numbers.
  2. Even if we found a way to work with categorical variables directly, without converting them to numbers, our model would become biased towards the language we use. For example, in an animal classification task with the labels {‘rat’, ‘dog’, ‘ant’}, such a labelling method would train our model to predict labels only in English, which would put a linguistic restriction on the possible applications of the model.

To get around these issues, we can encode the categorical variable values to numbers, as shown below:
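
For illustration, here is a minimal sketch of such a naive integer encoding (the sample rows and the exact integer mapping are assumptions, chosen to mirror the two examples in this article):

```python
import pandas as pd

# Toy data mirroring the article's two examples (values are illustrative).
df = pd.DataFrame({
    "speed": ["low", "medium", "high", "medium", "low"],   # ordinal feature
    "animal": ["rat", "dog", "ant", "dog", "rat"],          # nominal label
})

# Naive integer encoding: map each category to an arbitrary integer.
speed_map = {"low": 1, "medium": 2, "high": 3}
animal_map = {"rat": 1, "dog": 2, "ant": 3}

df["speed_encoded"] = df["speed"].map(speed_map)
df["animal_encoded"] = df["animal"].map(animal_map)
print(df)
```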

Let’s discuss case 2 first:

There is one major issue with this: the labels in the animal classification problem should not be encoded as integers (as we have done above), since that would enforce an incorrect natural ordering: ‘rat’ < ‘dog’ < ‘ant’. We understand that no such ordering really exists and that the numbers 1, 2 and 3 do not impose any numerical ordering on the labels, but our machine learning model cannot intuitively understand that. If we feed these numbers directly into a model, the cost/loss function is likely to be affected by their values. We need to model this understanding of ours mathematically, and one-hot encoding is how we do it.
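
As a minimal sketch of this (using scikit-learn's LabelBinarizer on the label set from the example above), one-hot encoding turns each label into a vector containing a single 1, with no ordering implied:

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["rat", "dog", "ant", "dog", "rat"]

binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(labels)

print(binarizer.classes_)  # classes in alphabetical order, e.g. ['ant' 'dog' 'rat']
print(one_hot)             # each row contains exactly one 1
```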

 Case 1:

Speed is an ordinal variable. We may argue that the relation ‘low’ < ‘medium’ < ‘high’ makes sense and that using the labels 1, 2 and 3 should therefore not be an issue. Unfortunately, that is not so. Using the labels 100, 101 and 300000 in place of 1, 2 and 3 would preserve the same ordering as ‘low’, ‘medium’ and ‘high’; there is nothing special about 1, 2 and 3. In other words, we do not know how much greater a ‘medium’ speed is than a ‘low’ speed, or how much smaller it is than a ‘high’ speed. The differences between these labels can affect the model we train. So we might want to one-hot encode the variable ‘speed’ as well.
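
For example, a minimal sketch with pandas (the sample rows are an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"speed": ["low", "medium", "high", "medium", "low"]})

# One-hot encode the 'speed' column: one indicator column per category.
encoded = pd.get_dummies(df["speed"], prefix="speed", dtype=int)
print(encoded)  # columns speed_high, speed_low, speed_medium; exactly one 1 per row
```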

At this point, I hope we understand what categorical variables are all about, and why we would like to one-hot-encode them.


Multicollinearity

Multicollinearity occurs when two or more independent variables (a.k.a. features) in the dataset are correlated with each other.

In a nutshell, multicollinearity is said to exist in a dataset when the independent variables are (nearly) linearly related to each other.

Cases like the one shown in the figure are called Perfect Multicollinearity. There are also cases of Imperfect Multicollinearity, in which one or more relationships are nearly, but not exactly, linear and may still be of concern. Both directly impact linear regression analysis.
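
As a rough illustration (the feature values below are made up), exactly or nearly linear relationships between features can be spotted from the correlation matrix or from the rank of the feature matrix:

```python
import numpy as np

# Made-up feature matrix: x3 is an exact linear combination of x1 and x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
x3 = 2 * x1 + 3 * x2                  # perfect multicollinearity
X = np.column_stack([x1, x2, x3])

print(np.corrcoef(X, rowvar=False))   # pairwise correlations between the columns
print(np.linalg.matrix_rank(X))       # 2 < 3 columns -> linearly dependent
```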

At this point, I hope we understand what multicollinearity is. 


Dummy Variable Trap

The dummy variable trap arises directly from one-hot encoding applied to categorical variables. As discussed earlier, the size of a one-hot vector equals the number of unique values that the categorical column takes, and each such vector contains exactly one ‘1’. This introduces multicollinearity into our dataset.

From the encoded dataset in fig. 4, we can observe the following linear relationship (fig. 6): for every row, speed_high + speed_medium + speed_low = 1.

For example, for data 0: speed_high + speed_medium + speed_low = 0 + 1 + 0 = 1

and for data 1: speed_high + speed_medium + speed_low = 1 + 0 + 0 = 1

This is a case of perfect multicollinearity. The vectors that we use to encode the categorical columns are called ‘Dummy Variables’. We intended to solve the problem of using categorical variables, but got trapped by the problem of Multicollinearity. This is called the Dummy Variable Trap.
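
A quick sketch that verifies this relationship on a toy ‘speed’ column (the sample rows are an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"speed": ["low", "medium", "high", "medium", "low"]})
dummies = pd.get_dummies(df["speed"], prefix="speed", dtype=int)

# The three dummy columns always sum to 1 -- a perfect linear relationship.
print(dummies.sum(axis=1).tolist())   # [1, 1, 1, 1, 1]
```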

If we keep all three columns and see an incomplete row like [0, _, 0], we can immediately tell that the blank must be 1; one of the columns carries no extra information. If instead we drop one column and see [0, _], we cannot tell whether the row should be [0, 0] or [0, 1], so every remaining bit of the encoding carries information.

Now suppose we have a set of data from which we build an input feature matrix X:

             Speed_high   Speed_medium   Speed_low
    Data1        ?              0            0
    Data2        ?              0            1
    Data3        ?              1            0
    Data4        ?              0            1
    Data5        ?              0            0

Using the linear relationship mentioned above, the entire speed_high column can be expressed in terms of the remaining columns of X (together with the intercept column of ones that linear regression typically includes). In other words, the columns of X are linearly dependent. As a result, the Gram matrix of X, X^T X, is singular and has no inverse.
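
A small numerical sketch of this (the rows are the dummy-encoded data from the table above, with the ‘?’ entries filled in via speed_high = 1 - speed_medium - speed_low, and an intercept column of ones prepended):

```python
import numpy as np

# Columns: intercept, speed_high, speed_medium, speed_low.
X = np.array([
    [1, 1, 0, 0],   # Data1
    [1, 0, 0, 1],   # Data2
    [1, 0, 1, 0],   # Data3
    [1, 0, 0, 1],   # Data4
    [1, 1, 0, 0],   # Data5
], dtype=float)

gram = X.T @ X
print(np.linalg.matrix_rank(X))      # 3, although X has 4 columns
print(np.linalg.matrix_rank(gram))   # also 3 -> X^T X is singular, so it has no inverse
```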

Remember that what we want to compute in linear regression is the least-squares solution β̂ = (X^T X)^(-1) X^T y, and a singular X^T X makes all of this infeasible.

In conclusion, this is why we drop one column after one-hot encoding a categorical variable: we want the columns of the input matrix to be linearly independent.
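
A minimal sketch of this fix with pandas, assuming the same toy ‘speed’ column as before; drop_first=True drops one (arbitrary) dummy column and restores full column rank:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"speed": ["low", "medium", "high", "medium", "low"]})

# Drop one dummy column to break the exact linear relationship.
dummies = pd.get_dummies(df["speed"], prefix="speed", drop_first=True, dtype=int)
print(dummies.columns.tolist())   # e.g. ['speed_low', 'speed_medium']

X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
print(np.linalg.matrix_rank(X))   # 3: full column rank, so X^T X is invertible
```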

As mentioned earlier, this directly impacts linear regression analysis, because linear regression assumes that multicollinearity is not present in the dataset.

The issue above has a direct implication for linear regression in particular. What follows, therefore, is a summary of issues that apply more generally to most algorithms:

  1. We can substitute speed_high with 1 - (speed_medium + speed_low). This means that (at least) one of the features we are working with is redundant, and that feature could be any one of the three. So we are making our model learn an additional weight that is not really needed, which consumes computational power and time. It also yields an optimisation objective that might not be very reasonable and might be difficult to work with. Too many independent variables can also lead to the Curse of Dimensionality, and if multicollinearity comes along with that, things become worse.
  2. We not only want our model to predict well, we also want it to be interpretable. For example, Logistic Regression is expected to learn relatively larger weights for relatively more important features, since more important features have a greater impact on the final prediction. But if features are correlated, it becomes hard to judge which feature has more “say” in the final decision, because their values depend on one another. This affects the learned weights: the weights are decided not only by how an independent variable correlates with the dependent variable, but also by how the independent variables correlate with one another. For example, if speed_high takes higher values than the others, then speed_low and speed_medium have to be lower so that the sum is always 1. Assume that importance(speed_high) = importance(speed_medium) = importance(speed_low). Since speed_high takes higher values than the other independent variables, the weight learned for it will be much lower than the other two, whereas in reality we would want their respective weights to be (almost) equal.

Since one-hot encoding directly induces perfect multicollinearity, we drop one of the columns from the encoded features; the choice of which column to drop is completely arbitrary.
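
In scikit-learn, the same idea is available through OneHotEncoder's drop parameter; a minimal sketch on the toy ‘speed’ values used earlier:

```python
from sklearn.preprocessing import OneHotEncoder

data = [["low"], ["medium"], ["high"], ["medium"], ["low"]]

# drop="first" removes the first category's column, avoiding the dummy variable trap.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(data).toarray()

print(encoder.categories_)   # the categories found in the data
print(encoded)               # two columns instead of three
```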


Suppose your four categories are eye colors, coded as follows:

category 1: brown, category 2: blue, category 3: green, category 4: hazel

If we simply assign values to them (brown = 1, blue = 2, green = 3, hazel = 4), in no way (that I can currently imagine) would we mean that green = 3 × brown, or that hazel = 2 × blue, as our codes imply, even though 3 = 3 × 1 and 4 = 2 × 2.

Therefore, we need to use some sort of coding. Dummy coding is one example, which eliminates such relationships from the statistical stories we want to tell about eye color. 

Here category 4 (hazel) is the reference category, assuming that there is a constant in your model, such as:

𝑦 = 𝛽0 + 𝛽1·brown + 𝛽2·blue + 𝛽3·green + 𝜀

where brown, blue and green are 0/1 dummy indicators, 𝛽0 is the mean value of 𝑦 when category = 4, and the 𝛽 terms associated with each dummy indicate by what amount 𝑦 changes from 𝛽0 for that category.

If you do not have a constant (𝛽0) term in the model, then you need one more dummy predictor (perhaps less often termed an "indicator variable"); in effect, the dummies then each behave as the model constant for their own category:

𝑦 = 𝛽1·brown + 𝛽2·blue + 𝛽3·green + 𝛽4·hazel + 𝜀

So this gets around the issue, mentioned at the start, of creating nonsensical quantitative relationships between category codes: we do not want to imply that green = 3 × brown, and so on.
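
A minimal sketch of dummy coding with a reference category, assuming made-up data and with hazel chosen as the reference (so the intercept plays the role of 𝛽0):

```python
import numpy as np
import pandas as pd

# Made-up example data.
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green", "hazel", "brown", "green", "hazel", "blue"],
    "y":         [ 2.0,     3.1,    1.8,     2.5,     2.2,     1.9,     2.4,     3.0 ],
})

# Dummy-code eye_color and drop the reference category (hazel).
dummies = pd.get_dummies(df["eye_color"], dtype=int).drop(columns=["hazel"])

# Design matrix: intercept (beta_0) plus the three remaining dummies.
X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
coef, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)

# The intercept equals the mean of y for the reference category (hazel);
# each dummy coefficient is that category's shift relative to hazel.
print(dict(zip(["intercept(beta_0)"] + dummies.columns.tolist(), coef.round(3))))
```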

