Data normalization is generally performed during the data pre-processing step.
1. Why we need normalization
There are two major reasons why data normalization is essential for machine learning algorithms.
- Data normalization can improve performance on common machine learning problems.
- Data normalization can speed up the convergence of the gradient descent algorithm.
Andrew Ng's machine learning course illustrates the second point with contour plots of the cost function: with unnormalized features the contours are elongated ellipses and gradient descent zig-zags slowly toward the minimum, while with normalized features the contours are closer to circles and descent converges much faster.
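To make the convergence effect concrete, here is a small sketch on made-up data: a toy least-squares problem where one feature spans [0, 1] and another spans [0, 1000]. Gradient descent is run with a step size tuned to each scaling, and we count the steps it needs.

```python
import numpy as np

def gd_steps(X, y, max_steps=5000, tol=1e-8):
    """Count gradient descent steps on least squares until updates are tiny.

    The learning rate is set from the largest eigenvalue of the Gram matrix,
    so each run uses a step size that is safe for its own feature scaling.
    """
    gram = X.T @ X / len(y)
    lr = 1.0 / np.linalg.eigvalsh(gram).max()
    w = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        grad = X.T @ (X @ w - y) / len(y)
        w_new = w - lr * grad
        if np.linalg.norm(w_new - w) < tol:
            return step
        w = w_new
    return max_steps

rng = np.random.default_rng(0)
# Two features on very different scales: [0, 1] vs [0, 1000].
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = X @ np.array([3.0, 0.5]) + rng.normal(0, 0.1, 200)

# Zero-mean, unit-variance normalization of each feature.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

steps_raw, steps_norm = gd_steps(X, y), gd_steps(Xn, y)
print(steps_raw, steps_norm)  # normalized features need far fewer steps
```

The unnormalized Gram matrix is badly conditioned, so the safe step size is tiny along the small feature's direction and descent crawls; after normalization the contours are nearly circular and descent finishes in a handful of steps.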
2. How to normalize data
Three common methods are used to perform feature normalization in machine learning algorithms.
- Rescaling (min-max normalization)
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{1}$$
$$x' = 2\,\frac{x - \min(x)}{\max(x) - \min(x)} - 1 \tag{2}$$
where $x$ is the original value and $x'$ is the normalized value.
Equation (1) rescales data into [0,1], and equation (2) rescales data into [-1,1].
Note: the parameters $\min(x)$ and $\max(x)$ should be computed on the training data only, but are then applied to the training, validation, and testing data.
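As a minimal sketch of equation (1) with the train-only fit (the arrays here are made-up examples):

```python
import numpy as np

def fit_minmax(train):
    """Learn per-feature min and max on the training split only."""
    return train.min(axis=0), train.max(axis=0)

def rescale(X, mn, mx):
    """Equation (1): map each feature into [0, 1] using training statistics."""
    return (X - mn) / (mx - mn)

train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 400.0]])
test = np.array([[2.0, 300.0]])

mn, mx = fit_minmax(train)
train_scaled = rescale(train, mn, mx)
test_scaled = rescale(test, mn, mx)  # reuse the *training* min/max, not the test set's
print(test_scaled)
```

Note that test values are scaled with the training min/max, so a test value outside the training range can legitimately fall outside [0, 1].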
There are also methods that normalize features using a non-linear function, such as the
logarithmic function: $x' = \dfrac{\log_{10}(x)}{\log_{10}(\max(x))}$
inverse tangent function: $x' = \dfrac{2}{\pi}\arctan(x)$
sigmoid function: $x' = \dfrac{1}{1 + e^{-x}}$
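A quick sketch of these three non-linear mappings on made-up sample values:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])

# Logarithmic: maps x in [1, max(x)] into [0, 1].
log_scaled = np.log10(x) / np.log10(x.max())

# Inverse tangent: maps positive x into (0, 1).
atan_scaled = (2 / np.pi) * np.arctan(x)

# Sigmoid: maps any real x into (0, 1).
sigmoid_scaled = 1 / (1 + np.exp(-x))

print(log_scaled)  # [0, 1/3, 2/3, 1]
```

Non-linear mappings like these are useful when a feature spans several orders of magnitude, since a linear rescaling would squash most values near zero.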
- Standardization
Feature standardization makes the values of each feature in the data have zero mean and unit variance. This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and neural networks). The general formula is:
$$x' = \frac{x - \bar{x}}{\sigma}$$
where $\bar{x}$ is the mean and $\sigma$ is the standard deviation of the feature.
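A minimal sketch of standardization, again fitting the statistics on the training data (the arrays are made-up examples):

```python
import numpy as np

def standardize(X, mean, std):
    """x' = (x - mean) / std, with statistics taken from the training data."""
    return (X - mean) / std

train = np.array([[1.0, 10.0],
                  [3.0, 20.0],
                  [5.0, 30.0]])
mean, std = train.mean(axis=0), train.std(axis=0)

Z = standardize(train, mean, std)
# Each column of Z now has zero mean and unit variance.
print(Z.mean(axis=0), Z.std(axis=0))
```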
- Scaling to unit length
This method rescales each sample (feature vector) to have unit Euclidean length:
$$x' = \frac{x}{\|x\|}$$
This is especially important if a scalar metric such as the Euclidean distance is used as a distance measure in the following learning steps.
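A small sketch of unit-length scaling; note that unlike the previous two methods it operates per sample (row), not per feature (column). The sample values are made up:

```python
import numpy as np

def scale_to_unit_length(X):
    """Divide each sample (row) by its Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])
U = scale_to_unit_length(X)
print(U)  # [[0.6, 0.8], [1.0, 0.0]] -- each row now has length 1
```

After this scaling, the Euclidean distance between two samples depends only on the angle between them, which is why it matters when a distance measure is used downstream.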
3. Some cases where you don't need data normalization
3.1 Using a similarity function instead of a distance function
You can define a similarity function rather than a distance function and plug it into a kernel (technically, this function must generate positive-definite matrices).
3.2 Random forest
Random forests never compare one feature with another in magnitude, so feature ranges don't matter.