Error analysis
Methods to address overfitting
- more training examples
- try smaller sets of features
- try increasing λ \lambda λ
Methods to address underfitting
- getting additional features
- try adding polynomial features
- try decreasing λ \lambda λ
Recommended approach
- start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data
- plot learning curves to decide if more data, more features, etc. are likely to help.
- Error analysis: See if you spot any systematic trend in what type of examples it is making errors on
Error metrics for skewed classes
| Predicted \ Actual | 1 | 0 |
|---|---|---|
| 1 | True Positive | False Positive |
| 0 | False Negative | True Negative |
$precision=\frac{TP}{TP+FP}$

$recall=\frac{TP}{TP+FN}$

$F_1\ score=\frac{2PR}{P+R}$
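As a minimal sketch of these metrics (assuming NumPy and binary labels with 1 as the positive class; the function name is just for illustration):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 from true and predicted binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```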
Data for machine learning
More data is likely to help when the following conditions hold:
- the features $x\in R^{n+1}$ carry sufficient information to predict y accurately
- the learning algorithm has many parameters, e.g. logistic or linear regression with many features, or a neural network with many hidden units
Support Vector Machine
Logistic regression Cost function
$$min_\theta\ \frac{1}{m}\Big[\sum_{i=1}^m y^{(i)}\big(-\log h_\theta(x^{(i)})\big)+(1-y^{(i)})\big(-\log(1-h_\theta(x^{(i)}))\big)\Big]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$
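A small NumPy sketch of this regularized cost (illustrative only; `logistic_cost` and its arguments are assumed names, and the intercept `theta[0]` is left unregularized, matching the sum over j = 1..n above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost; theta[0] (the intercept) is not regularized."""
    m = len(y)
    h = sigmoid(X @ theta)
    cost = (1.0 / m) * np.sum(y * (-np.log(h)) + (1 - y) * (-np.log(1 - h)))
    reg = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    return cost + reg
```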
SVM hypothesis
$$min_\theta\ C\sum_{i=1}^m\Big[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})\Big]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$$
$$h_\theta(x)=\begin{cases}1 & \text{if }\theta^Tx\ge 0\\ 0 & \text{otherwise}\end{cases}$$
SVM parameters
$C=\frac{1}{\lambda}$

- large C: lower bias, higher variance
- small C: higher bias, lower variance

$\sigma^2$

- large $\sigma^2$: features $f_i$ vary more smoothly; higher bias, lower variance
- small $\sigma^2$: features $f_i$ vary less smoothly; lower bias, higher variance
Kernel function
- no kernel (linear kernel)
- Gaussian kernel: $f=\exp\left(-\frac{\Vert x_1-x_2\Vert^2}{2\sigma^2}\right)$
- polynomial kernel: $k(x,l)=(x^Tl+constant)^{degree}$
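A sketch of these two kernels as plain NumPy functions (names and default values are illustrative, not a library API):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity between example x and landmark l; close to 1 when they are near each other."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, l, constant=1.0, degree=2):
    """(x^T l + constant)^degree."""
    return (x @ l + constant) ** degree
```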
Multi-class classification
Train K SVMs, one to distinguish $y=i$ from the rest.
Logistic regression vs SVMs
n = number of features ($x\in R^{n+1}$), m = number of training examples
- n is large relative to m: use logistic regression or SVM without a kernel
- n is small, m is intermediate: use SVM with a Gaussian kernel
- n is small and m is large: create or add more features, then use logistic regression or SVM without a kernel
K-means
Input
- K (number of clusters)
- training set $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$
K-means algorithm
1. Randomly initialize K cluster centroids $\mu_1,\mu_2,\cdots,\mu_K\in R^{n}$
2. Repeat {
       for i = 1 to m:
           c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
       for k = 1 to K:
           mu_k := mean of the points assigned to cluster k
   }
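A compact NumPy version of these two alternating steps (a single run; as a simplification, the initialization just picks K distinct training examples as centroids):

```python
import numpy as np

def kmeans(X, K, n_iters=100):
    """One run of K-means; returns assignments c and centroids mu."""
    m, n = X.shape
    mu = X[np.random.choice(m, K, replace=False)].astype(float)
    for _ in range(n_iters):
        # cluster-assignment step: index of the closest centroid for each example
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
        # move-centroid step: mean of the points assigned to each cluster
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```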
K-means optimization objective
- $c^{(i)}$ = index $(1,2,\dots,K)$ of the cluster to which example $x^{(i)}$ is currently assigned
- $\mu_k$ = cluster centroid k ($\mu_k\in R^n$)
- $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
- $$min_{c^{(1)},\cdots,c^{(m)},\,\mu_1,\cdots,\mu_K}\ \frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-\mu_{c^{(i)}}\Vert^2$$
Random initialization
for i in range(100):
    randomly initialize K-means
    run K-means, get c^(1),...,c^(m), mu_1,...,mu_K
    compute the cost (distortion) J
pick the clustering that gave the lowest cost J
For K roughly in the range 2-10, running K-means with many random initializations usually works well and helps avoid bad local optima.
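A sketch of that multiple-restart loop, reusing the `kmeans` function sketched above (both function names come from this note, not from a library):

```python
import numpy as np

def kmeans_cost(X, c, mu):
    """Distortion J: average squared distance from each example to its assigned centroid."""
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))

def kmeans_best_of(X, K, n_restarts=100):
    """Run K-means from many random initializations; keep the clustering with the lowest J."""
    best = None
    for _ in range(n_restarts):
        c, mu = kmeans(X, K)
        J = kmeans_cost(X, c, mu)
        if best is None or J < best[0]:
            best = (J, c, mu)
    return best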
Choosing the number of clusters
Elbow method
choose the number of clusters based on the downstream purpose the clusters will serve
Principal Component Analysis
Data preprocessing
Training set: $x^{(1)},x^{(2)},\cdots,x^{(m)}$
Preprocessing (mean normalization): compute $\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)}$ and replace each $x_j^{(i)}$ with $x_j^{(i)}-\mu_j$; if different features are on different scales, also scale the features to have comparable ranges of values.
Choosing the number of principal components
Average squared projection error: $\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-x_{approx}^{(i)}\Vert^2$
Total variation in the data: $\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}\Vert^2$
Typically, choose the smallest value of k such that
$$\frac{\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-x_{approx}^{(i)}\Vert^2}{\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}\Vert^2}\le 0.01\ (1\%)$$
(99% of the variance is retained).
In practice, compute `[U,S,V]=svd(Sigma)` and pick the smallest value of k for which $\displaystyle \frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}}\ge 0.99$.
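A NumPy sketch of that selection rule (mirroring the `svd(Sigma)` recipe above; `X_norm` is assumed to be the mean-normalized data matrix):

```python
import numpy as np

def choose_k(X_norm, retain=0.99):
    """Smallest k with sum_{i<=k} S_ii / sum_i S_ii >= retain."""
    m = X_norm.shape[0]
    Sigma = (X_norm.T @ X_norm) / m          # covariance matrix of the normalized data
    U, S, Vt = np.linalg.svd(Sigma)          # S holds the values S_ii in decreasing order
    ratio = np.cumsum(S) / np.sum(S)
    k = int(np.argmax(ratio >= retain)) + 1  # first index where the threshold is reached
    return k, U
```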
Application of PCA
- Compression
  - reduce the memory or disk space needed to store data
  - speed up the learning algorithm
- Visualization
  - plot 2- or 3-dimensional data
- A bad way to prevent overfitting
  - using fewer features might work OK, but it is not a good way to address overfitting; use regularization instead
- A bad default when designing ML systems
  - first try running with the original (raw) features; only bring in PCA if that does not do what you want
Anomaly detection
Examples
- Fraud detection
- Manufacturing
- Monitoring computers in a data center
Anomaly detection algorithm
- Choose features $x_i$ that you think might be indicative of anomalous examples.
- Fit parameters $\mu_1,\cdots,\mu_n,\sigma_1^2,\cdots,\sigma_n^2$:
  $$\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)}\qquad \sigma_j^2=\frac{1}{m}\sum_{i=1}^m\big(x_j^{(i)}-\mu_j\big)^2$$
- Given a new example x, compute $p(x)$:
  $$p(x)=\prod_{j=1}^np(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$
- Flag an anomaly if $p(x)<\epsilon$
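Putting the three steps together as a small NumPy sketch (function names are illustrative; the threshold epsilon is chosen separately, e.g. on a cross-validation set as described below):

```python
import numpy as np

def fit_gaussians(X):
    """Per-feature mean and variance estimated on the (unlabeled) training set."""
    return X.mean(axis=0), X.var(axis=0)

def p(X, mu, sigma2):
    """Product over features of univariate Gaussian densities, one value per row of X."""
    d = np.exp(-((X - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return d.prod(axis=1)

def is_anomaly(X, mu, sigma2, epsilon):
    return p(X, mu, sigma2) < epsilon
```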
Algorithm evaluation
- Fit the model $p(x)$ on the training set $\{x^{(1)},\cdots,x^{(m)}\}$
- On a cross-validation or test example $x$, predict
  $$y=\begin{cases}1 & \text{if } p(x)<\epsilon\ \text{(anomaly)}\\ 0 & \text{if } p(x)\ge\epsilon\ \text{(normal)}\end{cases}$$
- Possible evaluation metrics:
  - precision and recall
  - $F_1$ score
- A suitable $\epsilon$ can be chosen by maximizing the $F_1$ score on the cross-validation set
Anomaly detection vs supervised learning
- Use anomaly detection when:
  - there is a very small number of positive examples (0-20) and a large number of negative examples
  - there are many different types of anomalies, so it is hard for any algorithm to learn from the positive examples what the anomalies look like
  - future anomalies may look nothing like any of the anomalous examples seen so far
  - examples: fraud detection, manufacturing, monitoring machines in a data center
- Use supervised learning when:
  - there is a large number of both positive and negative examples
  - there are enough positive examples for the algorithm to get a sense of what positive examples are like, and future positive examples are likely to be similar to those in the training set
  - examples: email spam classification, weather prediction, cancer classification
Choosing what features to use
Transform non-Gaussian features to be more Gaussian, e.g.:
- $x_1\leftarrow \log(x_1)$
- $x_2\leftarrow \log(x_2+c)$
- $x_3\leftarrow \sqrt{x_3}$
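For example, a quick way to try such transforms on a skewed feature (the exponential data and the constant c = 1 are placeholders; in practice you would inspect histograms to pick one):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=1000)   # a skewed, clearly non-Gaussian feature

x_log  = np.log(x + 1.0)   # log(x + c), with c = 1 here
x_sqrt = np.sqrt(x)
x_pow  = x ** 0.3          # another knob to try
```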
Multivariate Gaussian distribution
$x\in R^n$. Don't model $p(x_1),p(x_2),\cdots$ separately; model $p(x)$ all in one go. Parameters: $\mu\in R^n$, $\Sigma\in R^{n\times n}$.
$$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
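A direct NumPy implementation of this density (vectorized over the rows of X; the function name is illustrative, and fitting mu and Sigma is shown in the trailing comment):

```python
import numpy as np

def multivariate_gaussian(X, mu, Sigma):
    """p(x; mu, Sigma) evaluated at every row of X."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / (np.power(2.0 * np.pi, n / 2.0) * np.sqrt(np.linalg.det(Sigma)))
    quad = np.sum((diff @ inv) * diff, axis=1)      # (x - mu)^T Sigma^{-1} (x - mu) per row
    return norm * np.exp(-0.5 * quad)

# fitting: mu = X.mean(axis=0); Sigma = (X - mu).T @ (X - mu) / X.shape[0]
```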
Use the original model
$$p(x)=\prod_{i=1}^np(x_i;\mu_i,\sigma_i^2)$$
- Manually create features to capture anomalies where $x_1,x_2$ take unusual combinations of values
- Computationally cheaper
- OK even if m (training set size) is small
Use Multivariate Gaussian
- Automatically captures correlations between features
- Computationally more expensive
- Must have $m>n$, or else $\Sigma$ is non-invertible
Recommender Systems
$n_u$ = number of users
$n_m$ = number of movies
$m^{(j)}$ = number of movies rated by user j
$r(i,j)=1$ if user j has rated movie i
$y^{(i,j)}$ = rating given by user j to movie i (defined only if $r(i,j)=1$)
Content-based recommendations
For each user j, learn a parameter vector $\theta^{(j)}\in R^3$. Predict user j's rating of movie i as $(\theta^{(j)})^Tx^{(i)}$ stars.
optimization objective
To learn $\theta^{(j)}$, the parameter vector for user j:
$$min_{\theta^{(j)}}\ \frac{1}{2}\sum_{i:r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$$
To learn $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(n_u)}$ for all users:
$$min_{\theta^{(1)},\cdots,\theta^{(n_u)}}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$$
Collaborative filtering
Given $x^{(1)},\cdots,x^{(n_m)}$ and the movie ratings, we can estimate $\theta^{(1)},\cdots,\theta^{(n_u)}$:
$$min_{\theta^{(1)},\cdots,\theta^{(n_u)}}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$$
Given $\theta^{(1)},\cdots,\theta^{(n_u)}$, we can estimate $x^{(1)},\cdots,x^{(n_m)}$:
$$min_{x^{(1)},\cdots,x^{(n_m)}}\ \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\big(x_k^{(i)}\big)^2$$
We can iterate $x\rightarrow \theta \rightarrow x\rightarrow \theta\cdots$, or minimize over $x$ and $\theta$ simultaneously:
$$min_{\theta^{(1)},\cdots,\theta^{(n_u)},\,x^{(1)},\cdots,x^{(n_m)}}\ \frac{1}{2}\sum_{(i,j):r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\big(x_k^{(i)}\big)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$$
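A NumPy sketch of this joint cost (shapes and names are assumptions: X is n_m x n movie features, Theta is n_u x n user parameters, Y is the n_m x n_u rating matrix, and R is the 0/1 indicator with R[i, j] = 1 when r(i, j) = 1):

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost over all (i, j) pairs with r(i, j) = 1."""
    err = (X @ Theta.T - Y) * R                     # zero out pairs that were never rated
    J = 0.5 * np.sum(err ** 2)
    J += (lam / 2.0) * (np.sum(X ** 2) + np.sum(Theta ** 2))
    return J
```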
predicted ratings
Low-rank matrix factorization
$$\begin{bmatrix} (\theta^{(1)})^Tx^{(1)}&(\theta^{(2)})^Tx^{(1)}&\cdots&(\theta^{(n_u)})^Tx^{(1)}\\ \vdots&\vdots&\ddots&\vdots\\ (\theta^{(1)})^Tx^{(n_m)}&(\theta^{(2)})^Tx^{(n_m)}&\cdots&(\theta^{(n_u)})^Tx^{(n_m)} \end{bmatrix}=X\Theta^T$$
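In code, the full matrix of predicted ratings is just that low-rank product (a one-liner, assuming the same X and Theta shapes as in the cost sketch above):

```python
import numpy as np

def predict_ratings(X, Theta):
    """Entry (i, j) is (theta^(j))^T x^(i): user j's predicted rating of movie i."""
    return X @ Theta.T
```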
Learning with large datasets
$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
If m is very large, sweeping over the entire training set for every update becomes impractical.
Stochastic gradient descent
- Randomly shuffle the training examples
- Repeat, feeding in one example at a time and immediately updating all of the parameters:

repeat {
    for i in range(m):
        for j in range(n):
            update theta_j using only example (x^(i), y^(i))
}
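A runnable sketch of stochastic gradient descent for linear regression (the function name, learning rate, and epoch count are arbitrary placeholders):

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, n_epochs=10):
    """Update theta from one example at a time instead of the full batch gradient."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):            # randomly shuffle the examples
            grad = (X[i] @ theta - y[i]) * X[i]        # gradient of the cost on example i
            theta -= alpha * grad
    return theta
```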
Mini-batch gradient descent
- Batch gradient descent: use all m examples in each iteration
- Stochastic gradient descent: use 1 example in each iteration
- Mini-batch gradient descent: use b examples in each iteration

e.g. b = 10, m = 10000:
repeat {
    for i in range(1, m, 10):
        for j in range(n):
            update theta_j using examples (x^(i), y^(i)), ..., (x^(i+9), y^(i+9))
}
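The same idea as a NumPy sketch with mini-batches of size b (again illustrative; b = 10 matches the example above):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, n_epochs=10):
    """Average the gradient over b examples per parameter update."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for start in range(0, m, b):
            Xb, yb = X[start:start + b], y[start:start + b]
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)  # gradient averaged over the mini-batch
            theta -= alpha * grad
    return theta
```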
Online learning
repeat forever{
get (x,y) corresponding to user
update \theta using (x,y)
}
Online learning can adapt to changing user preferences and a changing environment.
Application example: Photo OCR
Photo OCR pipeline (a machine learning pipeline)
$$Image\rightarrow Text\ detection\rightarrow Character\ segmentation\rightarrow Character\ recognition$$
Synthesizing data by introducing distortions
- The distortion introduced should be representative of the type of noise or distortions present in the test set
- It usually does not help to add purely random or meaningless noise (e.g. Gaussian noise) to your data
Ceiling analysis
What part of the pipeline should you spend the most time to work on next?
Ceiling analysis: feed each component of the pipeline 100% correct (manually provided) input and measure how much overall accuracy would improve if that component were perfect; then spend the most time on the component whose perfection would raise accuracy the most.