General ML interview questions

How to combat overfitting

  1. Regularization (tune the regularization coefficient)
  2. Dropout (Randomly drops neurons during training)
  3. Simplify Model (Reduce the number of trainable parameters)
  4. Early Stopping
  5. Ensemble Methods (Bagging like Random Forest, Boosting like XGBoost)
  6. Data Augmentation
  7. Feature Selection (Remove irrelevant or highly correlated features)
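
As a rough illustration, here is a minimal PyTorch sketch that combines three of these techniques: dropout, L2 regularization via weight_decay, and early stopping on a validation loss. The model, synthetic data, and patience value are made up for illustration.

import torch
import torch.nn as nn

# Synthetic data, purely illustrative.
x_train, y_train = torch.randn(256, 20), torch.randint(0, 2, (256,))
x_val, y_val = torch.randn(64, 20), torch.randint(0, 2, (64,))

# A small model with dropout; weight_decay in the optimizer adds L2 regularization.
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break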

Difference between Random Forest and XGBoost

  1. Random Forest trains multiple decision trees independently, each on a random subset of the whole dataset. During inference, the trees process the input independently and their results are aggregated (majority vote for classification, average for regression).
  2. XGBoost trains multiple decision trees sequentially, where each later tree tries to correct the errors made by the trees before it. During inference, the first tree produces an initial prediction and each following tree produces an increment that adjusts the previous prediction; the final prediction is the sum of the outputs of all trees.
| Feature | Random Forest | XGBoost |
| --- | --- | --- |
| Ensemble Method | Bagging | Boosting |
| Tree Growth | Independent, deep trees | Sequential, shallow trees |
| Objective Function | Averaging/voting | Gradient-based with regularization |
| Bias-Variance Tradeoff | Reduces variance | Reduces both bias and variance |
| Speed | Faster due to parallelism | Slower, but optimized with advanced techniques |
| Interpretability | More interpretable | Less interpretable without tools like SHAP |
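
As a quick illustration of bagging vs boosting in practice, here is a minimal sketch assuming scikit-learn and the xgboost package are installed; the synthetic dataset and hyperparameters are only illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: deep trees trained independently on bootstrap samples, predictions are voted.
rf = RandomForestClassifier(n_estimators=200, max_depth=None, n_jobs=-1, random_state=0).fit(X, y)

# Boosting: shallow trees trained sequentially, each correcting the ensemble built so far.
xgb = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=0).fit(X, y)

print(rf.predict(X[:5]), xgb.predict(X[:5]))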

How to handle missing values?

  • Fill the missing value with the mean, median, or mode.
  • Fill the missing value by predicting it with another model.
  • Treat missingness as a special feature value; XGBoost and LightGBM can handle missing values automatically.
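
For the simple imputation strategies, a minimal sketch with scikit-learn's SimpleImputer (the tiny array is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

mean_imputer = SimpleImputer(strategy="mean")  # or "median" / "most_frequent"
X_filled = mean_imputer.fit_transform(X)
print(X_filled)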

How to train a regression tree?

A regression tree splits the feature space into several subspaces, and each subspace has a fixed prediction value. The goal is usually to minimize the MSE or MAE of the split; the corresponding leaf prediction is the mean (for MSE) or the median (for MAE) of the samples falling into that subspace.
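
A minimal numpy sketch of the core step, assuming a single feature and the MSE criterion: scan candidate thresholds, pick the one minimizing the total squared error of the two children, and let each leaf predict the mean of its samples.

import numpy as np

def best_split(x, y):
    best = (None, np.inf)              # (threshold, total squared error)
    for t in np.unique(x)[:-1]:        # candidate thresholds on one feature
        left, right = y[x <= t], y[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[1]:
            best = (t, err)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # splits around x = 3; the leaves predict the means ~1.0 and ~5.0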

Difference between Gradient Descent (GD) and Stochastic Gradient Descent (SGD)

GD uses the whole dataset to perform forward propagation, compute the gradient of the loss function, and back-propagate. In contrast, SGD uses only one sample for each propagation and parameter update.

Usually, GD converges more smoothly but the computation is expensive, especially for large datasets. SGD converges more quickly but is noisier and less stable, which means the order of the samples can influence the final performance.
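
A minimal numpy sketch contrasting the two update rules on least-squares linear regression; the data, learning rates, and epoch counts are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Full-batch GD: one update per pass, using the gradient over the whole dataset.
w = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad

# SGD: one update per sample, noisier but cheaper per step.
w_sgd = np.zeros(3)
for _ in range(5):
    for i in rng.permutation(len(y)):
        grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= 0.01 * grad_i

print(w, w_sgd)  # both should approach w_true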

More Variants of SGD

  • Mini-Batch Gradient Descent
    Uses a mini-batch of size $B$ to compute the gradient. When $B=1$ it becomes SGD; when $B=\mathrm{len}(\text{Dataset})$ it becomes full-batch GD. Mini-batch GD is more stable than SGD and less resource-consuming than GD.
  • Momentum-Based SGD
    Momentum acts like a velocity vector that accumulates past gradients, so the parameters move more consistently, which accelerates convergence, reduces oscillation, and makes it more likely to pass through local minima.
  • Adagrad (Adaptive Gradient Algorithm)
    Scales down the updates of frequently updated parameters (dividing by the square root of the sum of past squared gradients), which reduces oscillation. However, the accumulated sum keeps growing, so the effective learning rate keeps shrinking toward 0, which can end the training process earlier than expected.
  • RMSprop (Root Mean Square Propagation)
    To prevent the updates from shrinking to zero, RMSprop replaces Adagrad's accumulated sum in the denominator with an exponentially decaying average of past squared gradients, which stays stable and does not decay to zero.
  • Adam (Adaptive Moment Estimation)
    Combines RMSprop and Momentum: the numerator is an exponentially decaying average of past gradients (the momentum term) and the denominator is an exponentially decaying average of past squared gradients (see the sketch after this list).
  • AdamW
    Adam with L2 regularization adds the penalty term to the loss and computes gradients through it, so the penalty is scaled by the adaptive statistics. AdamW instead applies the weight-decay term directly when each parameter is updated (decoupled weight decay).
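
As a rough sketch of how these pieces fit together, here is an Adam-style update in numpy with the standard bias correction; the hyperparameters and the toy quadratic loss are illustrative.

import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # momentum: decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSprop: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 1001):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))  # gradient of a toy quadratic loss
    w, m, v = adam_step(w, grad, m, v, t)
print(w)  # approaches the minimizer [1.0, -2.0, 0.5]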

How to optimize a non-differentiable function?

  • Subgradient: a set of slopes that "lie under" the function, i.e. any $g$ such that $f(y) \ge f(x) + g^\top (y-x),\ \forall y$. For a differentiable function the subgradient at a point is unique (the gradient); for a convex function the subgradient is a set; for a non-convex function a subgradient may not exist.
  • Proximal Gradient: optimize the differentiable part with ordinary gradient steps, and handle the non-differentiable part with its proximal operator (see the sketch after this list).
  • Smoothed Approximations: Find a differentiable approximation.
  • Gradient-Free Optimization Methods: Genetic Algorithm or Simulated Annealing.
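
As an example for the proximal-gradient idea, the proximal operator of the L1 norm is soft-thresholding, the building block of methods such as ISTA; the values below are illustrative.

import numpy as np

def soft_threshold(x, lam):
    # prox of lam * ||.||_1 at x: shrink toward zero and clip to exactly zero
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([-1.5, -0.2, 0.0, 0.3, 2.0]), lam=0.5))
# roughly [-1.0, 0.0, 0.0, 0.0, 1.5]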

L1 & L2 Regularization

L1 regularization tends to produce sparse parameters where many weights are exactly zero, while L2 regularization tends to produce parameters with small but non-zero weights.
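
A minimal PyTorch sketch: L2 regularization is usually added through the optimizer's weight_decay, while an L1 penalty is typically added to the loss by hand; the model, data, and coefficients are illustrative.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # L2 penalty

x, y = torch.randn(32, 10), torch.randn(32, 1)
l1_lambda = 1e-3

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss = loss + l1_lambda * sum(p.abs().sum() for p in model.parameters())  # L1 pushes weights to exactly 0
loss.backward()
optimizer.step()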

Batch Normalization (BN) vs Layer Normalization (LN)

BN is usually applied to CNNs, while LN is usually applied to RNNs and other sequential data. BN normalizes each channel (or each position in the sequence) across the batch. For sequential data, the lengths of the sequences are not guaranteed to be the same; even if we pad the sequences, there may not be enough data at every position to train robust BN statistics. Therefore, we usually apply LN to sequential data. LN normalizes the features within each individual sample, so it does not require batch statistics and is applicable to sequences of any length.

BN additionally maintains a running (global) mean and std for inference, to avoid the instability caused by small batch sizes. LN behaves identically at training and test time.

Normalization keeps the hidden states in a stable distribution and helps prevent gradient explosion and vanishing. Viewing each block as an independent classifier, normalization ensures that the input to each block follows a similar distribution; otherwise the input distribution can drift more and more as the network gets deeper.

Normalization also has a mild regularization effect because it injects some noise into the hidden layers.
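
A minimal PyTorch sketch of which axes the statistics are computed over; the tensor shapes are illustrative.

import torch
import torch.nn as nn

x = torch.randn(8, 16, 32)  # (batch, seq_len, hidden) for a sequence model

ln = nn.LayerNorm(32)       # statistics over the last (feature) dimension, per token
print(ln(x).shape)          # torch.Size([8, 16, 32])

bn = nn.BatchNorm1d(32)     # statistics over batch and sequence positions, per channel
print(bn(x.transpose(1, 2)).transpose(1, 2).shape)  # BatchNorm1d expects (batch, channels, length)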

Dropout Principles & Implementation

During each training iteration, randomly set the output of certain neurons to 0 to perform regularization.

  • Discourages dependencies among neurons, making the network more generalizable.
  • At inference time, the full network can be viewed as an approximate ensemble of the subnetworks sampled during training.
  • During training, the remaining outputs need to be rescaled by $\frac{1}{1-p}$ because a fraction $p$ of the neurons is dropped, so the expected magnitude of the output stays consistent.
import numpy as np

def dropout(x, prob):
    # Bernoulli mask: keep each element with probability 1 - prob
    mask = (np.random.rand(*x.shape) > prob).astype(int)
    # Inverted dropout: rescale so the expected output magnitude is unchanged
    x = x * mask / (1 - prob)
    return x

Remember to use * to unpack the tuple x.shape when calling np.random.rand.

Order of Normalization, Activation and Dropout

Usually normalization -> activation -> dropout. Normalization keeps the inputs to each layer in a similar distribution to make training stable; then dropout is applied to the output of the activation layer to perform regularization.
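
A minimal PyTorch sketch of this ordering inside one block (layer sizes are illustrative):

import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),  # normalization
    nn.ReLU(),            # activation
    nn.Dropout(p=0.1),    # dropout
)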

Implement Cross Entropy Loss

The definition of cross-entropy loss is $\mathcal{L}(p, q) = -\sum_i p_i \log q_i$. Literally, the computation should go through every value of the distribution. However, for most classification tasks the ground truth is a one-hot encoding, where the probability of one class is 1 and the probabilities of all other classes are 0. Therefore, for each sample the loss is just $-\log q_{\text{ground truth}}$.
Note that here we assume the predicted labels are already after softmax, whereas torch.nn.CrossEntropyLoss applies log-softmax internally and expects raw logits.

import torch

def CrossEntropy(y, y_pred):
    # y: (B,) integer class labels; y_pred: (B, C) probabilities after softmax
    prob = y_pred[torch.arange(y_pred.shape[0]), y]  # probability of the true class per sample
    log_prob = torch.log(prob)
    sum_up = torch.sum(log_prob)
    return -sum_up
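
A quick sanity check against PyTorch's built-in loss, which takes raw logits and applies log-softmax internally; the random logits below are only illustrative.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)
y = torch.tensor([0, 2, 1, 4])
print(CrossEntropy(y, F.softmax(logits, dim=1)))
print(F.cross_entropy(logits, y, reduction="sum"))  # should match up to floating-point error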

Implement Focal Loss

Focal loss is designed to assign more weight to poorly predicted samples; it can be seen as a variant of cross-entropy loss. In detail, if the predicted probability of the ground-truth class is $p$, the assigned weight is $(1-p)^\gamma$, which emphasizes poorly predicted samples and down-weights well-predicted ones.

def Focal_Loss(y, y_pred, gamma):
    # y: (B,) integer class labels; y_pred: (B, C) probabilities after softmax
    prob = y_pred[torch.arange(y_pred.shape[0]), y]  # probability of the true class per sample
    log_prob = torch.log(prob)
    sum_up = torch.sum(((1 - prob) ** gamma) * log_prob)
    return -sum_up
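
A quick sanity check (imports repeated for completeness): with gamma = 0 the focal loss reduces to the cross-entropy loss defined above.

import torch
import torch.nn.functional as F

probs = F.softmax(torch.randn(4, 5), dim=1)
y = torch.tensor([0, 2, 1, 4])
print(Focal_Loss(y, probs, gamma=0), CrossEntropy(y, probs))  # equal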