General ML interview questions

How to combat overfitting

  1. Regularization (tune the regularization coefficient)
  2. Dropout (Randomly drops neurons during training)
  3. Simplify Model (Reduce the number of trainable parameters)
  4. Early Stopping
  5. Ensemble Methods (Bagging like Random Forest, Boosting like XGBoost)
  6. Data Augmentation
  7. Feature Selection (Remove irrelevant or highly correlated features)
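
As a rough illustration, here is a minimal PyTorch sketch that combines three of these techniques: dropout, L2 regularization via weight_decay, and early stopping on a validation loss. The model, synthetic data, and patience value are made up for illustration.

import torch
import torch.nn as nn

# Synthetic data, purely illustrative.
x_train, y_train = torch.randn(256, 20), torch.randint(0, 2, (256,))
x_val, y_val = torch.randn(64, 20), torch.randint(0, 2, (64,))

# A small model with dropout; weight_decay in the optimizer adds L2 regularization.
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break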

Difference between Random Forest and XGBoost

  1. Random Forest trains multiple decision trees independently, each on a random subset of the whole dataset. During inference, the trees process the input independently and their results are aggregated (majority vote for classification, average for regression).
  2. XGBoost trains multiple decision trees sequentially, where each later tree tries to correct the errors made by the trees before it. During inference, the first tree produces an initial prediction and each following tree produces an increment that adjusts the previous prediction; the final prediction is the sum of the outputs of all trees.
| Feature | Random Forest | XGBoost |
| --- | --- | --- |
| Ensemble Method | Bagging | Boosting |
| Tree Growth | Independent, deep trees | Sequential, shallow trees |
| Objective Function | Averaging/voting | Gradient-based with regularization |
| Bias-Variance Tradeoff | Reduces variance | Reduces both bias and variance |
| Speed | Faster due to parallelism | Slower, but optimized with advanced techniques |
| Interpretability | More interpretable | Less interpretable without tools like SHAP |
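
As a quick illustration of bagging vs boosting in practice, here is a minimal sketch assuming scikit-learn and the xgboost package are installed; the synthetic dataset and hyperparameters are only illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: deep trees trained independently on bootstrap samples, predictions are voted.
rf = RandomForestClassifier(n_estimators=200, max_depth=None, n_jobs=-1, random_state=0).fit(X, y)

# Boosting: shallow trees trained sequentially, each correcting the ensemble built so far.
xgb = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=0).fit(X, y)

print(rf.predict(X[:5]), xgb.predict(X[:5]))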

How to handle missing values?

  • Fill the missing value with the mean, median, or mode.
  • Fill the missing value by predicting it with another model.
  • Treat missingness as a special feature value; XGBoost and LightGBM can handle missing values automatically.
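
For the simple imputation strategies, a minimal sketch with scikit-learn's SimpleImputer (the tiny array is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

mean_imputer = SimpleImputer(strategy="mean")  # or "median" / "most_frequent"
X_filled = mean_imputer.fit_transform(X)
print(X_filled)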

How to train a regression tree?

A regression tree splits the feature space into several subspaces, and each subspace has a fixed prediction value. The goal is usually to minimize the MSE or MAE of the split; the corresponding leaf prediction is the mean (for MSE) or the median (for MAE) of the samples falling into that subspace.
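
A minimal numpy sketch of the core step, assuming a single feature and the MSE criterion: scan candidate thresholds, pick the one minimizing the total squared error of the two children, and let each leaf predict the mean of its samples.

import numpy as np

def best_split(x, y):
    best = (None, np.inf)              # (threshold, total squared error)
    for t in np.unique(x)[:-1]:        # candidate thresholds on one feature
        left, right = y[x <= t], y[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[1]:
            best = (t, err)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # splits around x = 3; the leaves predict the means ~1.0 and ~5.0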

Difference between Gradient Descent (GD) and Stochastic Gradient Descent (SGD)

GD uses the whole dataset to perform forward propagation, compute the gradient of the loss function, and back-propagate. In contrast, SGD uses only one sample for each propagation and parameter update.

Usually, GD converges more smoothly but the computation is expensive, especially for large datasets. SGD converges more quickly but is noisier and less stable, which means the order of the samples can influence the final performance.
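
A minimal numpy sketch contrasting the two update rules on least-squares linear regression; the data, learning rates, and epoch counts are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Full-batch GD: one update per pass, using the gradient over the whole dataset.
w = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad

# SGD: one update per sample, noisier but cheaper per step.
w_sgd = np.zeros(3)
for _ in range(5):
    for i in rng.permutation(len(y)):
        grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= 0.01 * grad_i

print(w, w_sgd)  # both should approach w_true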

More Variants of SGD

  • Mini-Batch Gradient Descent
    Uses a mini-batch of size $B$ to compute the gradient. When $B=1$ it becomes SGD; when $B=\mathrm{len}(\text{Dataset})$ it becomes full-batch GD. Mini-batch GD is more stable than SGD and less resource-consuming than GD.
  • Momentum-Based SGD
    Momentum acts like a velocity vector that accumulates past gradients, so the parameters move more consistently, which accelerates convergence, reduces oscillation, and makes it more likely to pass through local minima.
  • Adagrad (Adaptive Gradient Algorithm)
    Scales down the updates of frequently updated parameters (dividing by the square root of the sum of past squared gradients), which reduces oscillation. However, the accumulated sum keeps growing, so the effective learning rate keeps shrinking toward 0, which can end the training process earlier than expected.
  • RMSprop (Root Mean Square Propagation)
    To prevent the updates from shrinking to zero, RMSprop replaces Adagrad's accumulated sum in the denominator with an exponentially decaying average of past squared gradients, which stays stable and does not decay to zero.
  • Adam (Adaptive Moment Estimation)
    Combines RMSprop and Momentum: the numerator is an exponentially decaying average of past gradients (the momentum term) and the denominator is an exponentially decaying average of past squared gradients (see the sketch after this list).
  • AdamW
    Adam with L2 regularization adds the penalty term to the loss and computes gradients through it, so the penalty is scaled by the adaptive statistics. AdamW instead applies the weight-decay term directly when each parameter is updated (decoupled weight decay).
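
As a rough sketch of how these pieces fit together, here is an Adam-style update in numpy with the standard bias correction; the hyperparameters and the toy quadratic loss are illustrative.

import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # momentum: decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSprop: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 1001):
    grad = 2 * (w - np.array([1.0, -2.0, 0.5]))  # gradient of a toy quadratic loss
    w, m, v = adam_step(w, grad, m, v, t)
print(w)  # approaches the minimizer [1.0, -2.0, 0.5]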

How to optimize a non-differentiable function?

  • Subgradient: a set of slopes that "lie under" the function, i.e. any $g$ such that $f(y) \ge f(x) + g^\top (y-x),\ \forall y$. For a differentiable function the subgradient at a point is unique (the gradient); for a convex function the subgradient is a set; for a non-convex function a subgradient may not exist.
  • Proximal Gradient: optimize the differentiable part with ordinary gradient steps, and handle the non-differentiable part with its proximal operator (see the sketch after this list).
  • Smoothed Approximations: Find a differentiable approximation.
  • Gradient-Free Optimization Methods: Genetic Algorithm or Simulated Annealing.
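
As an example for the proximal-gradient idea, the proximal operator of the L1 norm is soft-thresholding, the building block of methods such as ISTA; the values below are illustrative.

import numpy as np

def soft_threshold(x, lam):
    # prox of lam * ||.||_1 at x: shrink toward zero and clip to exactly zero
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([-1.5, -0.2, 0.0, 0.3, 2.0]), lam=0.5))
# roughly [-1.0, 0.0, 0.0, 0.0, 1.5]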

L1 & L2 Regularization

L1 regularization tends to produce sparse parameters where many weights are exactly zero, while L2 regularization tends to produce parameters with small but non-zero weights.
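
A minimal PyTorch sketch: L2 regularization is usually added through the optimizer's weight_decay, while an L1 penalty is typically added to the loss by hand; the model, data, and coefficients are illustrative.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # L2 penalty

x, y = torch.randn(32, 10), torch.randn(32, 1)
l1_lambda = 1e-3

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss = loss + l1_lambda * sum(p.abs().sum() for p in model.parameters())  # L1 pushes weights to exactly 0
loss.backward()
optimizer.step()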

Batch Normalization (BN) vs Layer Normalization (LN)

BN is usually applied to CNNs, while LN is usually applied to RNNs and other sequential data. BN normalizes each channel (or each position in the sequence) across the batch. For sequential data, the lengths of the sequences are not guaranteed to be the same; even if we pad the sequences, there may not be enough data at every position to train robust BN statistics. Therefore, we usually apply LN to sequential data. LN normalizes the features within each individual sample, so it does not require batch statistics and is applicable to sequences of any length.

BN additionally maintains a running (global) mean and std for inference, to avoid the instability caused by small batch sizes. LN behaves identically at training and test time.

Normalization keeps the hidden states in a stable distribution and helps prevent gradient explosion and vanishing. Viewing each block as an independent classifier, normalization ensures that the input to each block follows a similar distribution; otherwise the input distribution can drift more and more as the network gets deeper.

Normalization also has a mild regularization effect because it injects some noise into the hidden layers.
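
A minimal PyTorch sketch of which axes the statistics are computed over; the tensor shapes are illustrative.

import torch
import torch.nn as nn

x = torch.randn(8, 16, 32)  # (batch, seq_len, hidden) for a sequence model

ln = nn.LayerNorm(32)       # statistics over the last (feature) dimension, per token
print(ln(x).shape)          # torch.Size([8, 16, 32])

bn = nn.BatchNorm1d(32)     # statistics over batch and sequence positions, per channel
print(bn(x.transpose(1, 2)).transpose(1, 2).shape)  # BatchNorm1d expects (batch, channels, length)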

Dropout Principles & Implementation

During each training iteration, randomly set the output of certain neurons to 0 to perform regularization.

  • Discourages dependencies among neurons, making the network more generalizable.
  • At inference time, the full network can be viewed as an approximate ensemble of the subnetworks sampled during training.
  • During training, the remaining outputs need to be rescaled by $\frac{1}{1-p}$ because a fraction $p$ of the neurons is dropped, so the expected magnitude of the output stays consistent.
import numpy as np

def dropout(x, prob):
    # Bernoulli mask: keep each element with probability 1 - prob
    mask = (np.random.rand(*x.shape) > prob).astype(int)
    # Inverted dropout: rescale so the expected output magnitude is unchanged
    x = x * mask / (1 - prob)
    return x

Remember to use * to unpack the tuple x.shape when calling np.random.rand.

Order of Normalization, Activation and Dropout

Usually normalization -> activation -> dropout. Normalization keeps the inputs to each layer in a similar distribution to make training stable; then dropout is applied to the output of the activation layer to perform regularization.
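
A minimal PyTorch sketch of this ordering inside one block (layer sizes are illustrative):

import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),  # normalization
    nn.ReLU(),            # activation
    nn.Dropout(p=0.1),    # dropout
)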

Implement Cross Entropy Loss

The definition of cross-entropy loss is $\mathcal{L}(p, q) = -\sum_i p_i \log q_i$. Literally, the computation should go through every value of the distribution. However, for most classification tasks the ground truth is a one-hot encoding, where the probability of one class is 1 and the probabilities of all other classes are 0. Therefore, for each sample the loss is just $-\log q_{\text{ground truth}}$.
Note that here we assume the predicted labels are already after softmax, whereas torch.nn.CrossEntropyLoss applies log-softmax internally and expects raw logits.

import torch

def CrossEntropy(y, y_pred):
    # y: (B,) integer class labels; y_pred: (B, C) probabilities after softmax
    prob = y_pred[torch.arange(y_pred.shape[0]), y]  # probability of the true class per sample
    log_prob = torch.log(prob)
    sum_up = torch.sum(log_prob)
    return -sum_up
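
A quick sanity check against PyTorch's built-in loss, which takes raw logits and applies log-softmax internally; the random logits below are only illustrative.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)
y = torch.tensor([0, 2, 1, 4])
print(CrossEntropy(y, F.softmax(logits, dim=1)))
print(F.cross_entropy(logits, y, reduction="sum"))  # should match up to floating-point error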

Implement Focal Loss

Focal loss is designed to assign more weight to poorly predicted samples; it can be seen as a variant of cross-entropy loss. In detail, if the predicted probability of the ground-truth class is $p$, the assigned weight is $(1-p)^\gamma$, which emphasizes poorly predicted samples and down-weights well-predicted ones.

def Focal_Loss(y, y_pred, gamma):
    # y: (B,) integer class labels; y_pred: (B, C) probabilities after softmax
    prob = y_pred[torch.arange(y_pred.shape[0]), y]  # probability of the true class per sample
    log_prob = torch.log(prob)
    sum_up = torch.sum(((1 - prob) ** gamma) * log_prob)
    return -sum_up
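
A quick sanity check (imports repeated for completeness): with gamma = 0 the focal loss reduces to the cross-entropy loss defined above.

import torch
import torch.nn.functional as F

probs = F.softmax(torch.randn(4, 5), dim=1)
y = torch.tensor([0, 2, 1, 4])
print(Focal_Loss(y, probs, gamma=0), CrossEntropy(y, probs))  # equal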