Error analysis
Methods to address overfitting
- more training examples
- try smaller sets of features
- try increasing λ \lambda λ
Methods to address underfitting
- getting additional features
- try adding polynomial features
- try decreasing λ \lambda λ
Recommended approach
- start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data
- plot learning curves to decide if more data, more features, etc. are likely to help.
- Error analysis: See if you spot any systematic trend in what type of examples it is making errors on
Error metrics for skewed classes
| Predicted \ Actual | 1 | 0 |
|---|---|---|
| 1 | True Positive | False Positive |
| 0 | False Negative | True Negative |
$precision=\frac{TP}{TP+FP}$

$recall=\frac{TP}{TP+FN}$

$F_1\ score=\frac{2PR}{P+R}$
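As a minimal sketch of these metrics (assuming NumPy and binary labels with 1 as the positive class; the function name is just for illustration):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 from true and predicted binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```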
Data for machine learning
More data is likely to help when the following conditions hold:
- the features $x\in R^{n+1}$ carry sufficient information to predict y accurately
- the learning algorithm has many parameters, e.g. logistic or linear regression with many features, or a neural network with many hidden units
Support Vector Machine
Logistic regression Cost function
$$min_\theta\ \frac{1}{m}\Big[\sum_{i=1}^m y^{(i)}\big(-\log h_\theta(x^{(i)})\big)+(1-y^{(i)})\big(-\log(1-h_\theta(x^{(i)}))\big)\Big]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$
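A small NumPy sketch of this regularized cost (illustrative only; `logistic_cost` and its arguments are assumed names, and the intercept `theta[0]` is left unregularized, matching the sum over j = 1..n above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost; theta[0] (the intercept) is not regularized."""
    m = len(y)
    h = sigmoid(X @ theta)
    cost = (1.0 / m) * np.sum(y * (-np.log(h)) + (1 - y) * (-np.log(1 - h)))
    reg = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    return cost + reg
```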
SVM hypothesis
$$min_\theta\ C\sum_{i=1}^m\Big[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})\Big]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$$
$$h_\theta(x)=\begin{cases}1 & \text{if }\theta^Tx\ge 0\\ 0 & \text{otherwise}\end{cases}$$
SVM parameters
$C=\frac{1}{\lambda}$

- large C: lower bias, higher variance
- small C: higher bias, lower variance

$\sigma^2$

- large $\sigma^2$: features $f_i$ vary more smoothly; higher bias, lower variance
- small $\sigma^2$: features $f_i$ vary less smoothly; lower bias, higher variance
Kernel function
- no kernel (linear kernel)
- Gaussian kernel: $f=\exp\left(-\frac{\Vert x_1-x_2\Vert^2}{2\sigma^2}\right)$
- polynomial kernel: $k(x,l)=(x^Tl+constant)^{degree}$
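A sketch of these two kernels as plain NumPy functions (names and default values are illustrative, not a library API):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity between example x and landmark l; close to 1 when they are near each other."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, l, constant=1.0, degree=2):
    """(x^T l + constant)^degree."""
    return (x @ l + constant) ** degree
```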
Multi-class classification
Train K SVMs, one to distinguish $y=i$ from the rest.
Logistic regression vs SVMs
n = number of features ($x\in R^{n+1}$), m = number of training examples
- n is large relative to m: use logistic regression or SVM without a kernel
- n is small, m is intermediate: use SVM with a Gaussian kernel
- n is small and m is large: create or add more features, then use logistic regression or SVM without a kernel
K-means
Input
- K (number of clusters)
- training set $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$
K-means algorithm
1. Randomly initialize K cluster centroids $\mu_1,\mu_2,\cdots,\mu_K\in R^{n}$
2. Repeat {
       for i = 1 to m:
           c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
       for k = 1 to K:
           mu_k := mean of the points assigned to cluster k
   }
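A compact NumPy version of these two alternating steps (a single run; as a simplification, the initialization just picks K distinct training examples as centroids):

```python
import numpy as np

def kmeans(X, K, n_iters=100):
    """One run of K-means; returns assignments c and centroids mu."""
    m, n = X.shape
    mu = X[np.random.choice(m, K, replace=False)].astype(float)
    for _ in range(n_iters):
        # cluster-assignment step: index of the closest centroid for each example
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
        # move-centroid step: mean of the points assigned to each cluster
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```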
K-means optimization objective
- $c^{(i)}$ = index $(1,2,\dots,K)$ of the cluster to which example $x^{(i)}$ is currently assigned
- $\mu_k$ = cluster centroid k ($\mu_k\in R^n$)
- $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
- $$min_{c^{(1)},\cdots,c^{(m)},\,\mu_1,\cdots,\mu_K}\ \frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-\mu_{c^{(i)}}\Vert^2$$
Random initialization
for i in range(100):
    randomly initialize K-means
    run K-means, get c^(1),...,c^(m), mu_1,...,mu_K
    compute the cost (distortion) J
pick the clustering that gave the lowest cost J
For K roughly in the range 2-10, running K-means with many random initializations usually works well and helps avoid bad local optima.
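A sketch of that multiple-restart loop, reusing the `kmeans` function sketched above (both function names come from this note, not from a library):

```python
import numpy as np

def kmeans_cost(X, c, mu):
    """Distortion J: average squared distance from each example to its assigned centroid."""
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))

def kmeans_best_of(X, K, n_restarts=100):
    """Run K-means from many random initializations; keep the clustering with the lowest J."""
    best = None
    for _ in range(n_restarts):
        c, mu = kmeans(X, K)
        J = kmeans_cost(X, c, mu)
        if best is None or J < best[0]:
            best = (J, c, mu)
    return best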
Choosing the number of clusters
Elbow method
choose the number of clusters based on the downstream purpose the clusters will serve
Principal Component Analysis
Data preprocessing
Training set: $x^{(1)},x^{(2)},\cdots,x^{(m)}$
Preprocessing (mean normalization): compute $\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)}$ and replace each $x_j^{(i)}$ with $x_j^{(i)}-\mu_j$; if different features are on different scales, also scale the features to have comparable ranges of values.
Choosing the number of principal components
Average squared projection error: $\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-x_{approx}^{(i)}\Vert^2$
Total variation in the data: $\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}\Vert^2$
Typically, choose the smallest value of k such that
$$\frac{\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-x_{approx}^{(i)}\Vert^2}{\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}\Vert^2}\le 0.01\ (1\%)$$
(99% of the variance is retained).
In practice, compute `[U,S,V]=svd(Sigma)` and pick the smallest value of k for which $\displaystyle \frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}}\ge 0.99$.
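A NumPy sketch of that selection rule (mirroring the `svd(Sigma)` recipe above; `X_norm` is assumed to be the mean-normalized data matrix):

```python
import numpy as np

def choose_k(X_norm, retain=0.99):
    """Smallest k with sum_{i<=k} S_ii / sum_i S_ii >= retain."""
    m = X_norm.shape[0]
    Sigma = (X_norm.T @ X_norm) / m          # covariance matrix of the normalized data
    U, S, Vt = np.linalg.svd(Sigma)          # S holds the values S_ii in decreasing order
    ratio = np.cumsum(S) / np.sum(S)
    k = int(np.argmax(ratio >= retain)) + 1  # first index where the threshold is reached
    return k, U
```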
Application of PCA
- Compression
  - reduce the memory or disk space needed to store data
  - speed up the learning algorithm
- Visualization
  - plot 2- or 3-dimensional data
- A bad way to prevent overfitting
  - using fewer features might work OK, but it is not a good way to address overfitting; use regularization instead
- A bad default when designing ML systems
  - first try running with the original (raw) features; only bring in PCA if that does not do what you want
Anomaly detection
Examples
- Fraud detection
- Manufacturing
- Monitoring computers in a data center
Anomaly detection algorithm
- Choose features $x_i$ that you think might be indicative of anomalous examples.
- Fit parameters $\mu_1,\cdots,\mu_n,\sigma_1^2,\cdots,\sigma_n^2$:
  $$\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)}\qquad \sigma_j^2=\frac{1}{m}\sum_{i=1}^m\big(x_j^{(i)}-\mu_j\big)^2$$
- Given a new example x, compute $p(x)$:
  $$p(x)=\prod_{j=1}^np(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$
- Flag an anomaly if $p(x)<\epsilon$
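Putting the three steps together as a small NumPy sketch (function names are illustrative; the threshold epsilon is chosen separately, e.g. on a cross-validation set as described below):

```python
import numpy as np

def fit_gaussians(X):
    """Per-feature mean and variance estimated on the (unlabeled) training set."""
    return X.mean(axis=0), X.var(axis=0)

def p(X, mu, sigma2):
    """Product over features of univariate Gaussian densities, one value per row of X."""
    d = np.exp(-((X - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return d.prod(axis=1)

def is_anomaly(X, mu, sigma2, epsilon):
    return p(X, mu, sigma2) < epsilon
```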
Algorithm evaluation
- Fit the model $p(x)$ on the training set $\{x^{(1)},\cdots,x^{(m)}\}$
- On a cross-validation or test example $x$, predict
  $$y=\begin{cases}1 & \text{if } p(x)<\epsilon\ \text{(anomaly)}\\ 0 & \text{if } p(x)\ge\epsilon\ \text{(normal)}\end{cases}$$
- Possible evaluation metrics:
  - precision and recall
  - $F_1$ score
- A suitable $\epsilon$ can be chosen by maximizing the $F_1$ score on the cross-validation set
Anomaly detection vs supervised learning
- Use anomaly detection when:
  - there is a very small number of positive examples (0-20) and a large number of negative examples
  - there are many different types of anomalies, so it is hard for any algorithm to learn from the positive examples what the anomalies look like
  - future anomalies may look nothing like any of the anomalous examples seen so far
  - examples: fraud detection, manufacturing, monitoring machines in a data center
- Use supervised learning when:
  - there is a large number of both positive and negative examples
  - there are enough positive examples for the algorithm to get a sense of what positive examples are like, and future positive examples are likely to be similar to those in the training set
  - examples: email spam classification, weather prediction, cancer classification
Choosing what features to use
Transform non-Gaussian features to be more Gaussian, e.g.:
- $x_1\leftarrow \log(x_1)$
- $x_2\leftarrow \log(x_2+c)$
- $x_3\leftarrow \sqrt{x_3}$
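For example, a quick way to try such transforms on a skewed feature (the exponential data and the constant c = 1 are placeholders; in practice you would inspect histograms to pick one):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=1000)   # a skewed, clearly non-Gaussian feature

x_log  = np.log(x + 1.0)   # log(x + c), with c = 1 here
x_sqrt = np.sqrt(x)
x_pow  = x ** 0.3          # another knob to try
```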
Multivariate Gaussian distribution
$x\in R^n$. Don't model $p(x_1),p(x_2),\cdots$ separately; model $p(x)$ all in one go. Parameters: $\mu\in R^n$, $\Sigma\in R^{n\times n}$.
$$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
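A direct NumPy implementation of this density (vectorized over the rows of X; the function name is illustrative, and fitting mu and Sigma is shown in the trailing comment):

```python
import numpy as np

def multivariate_gaussian(X, mu, Sigma):
    """p(x; mu, Sigma) evaluated at every row of X."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / (np.power(2.0 * np.pi, n / 2.0) * np.sqrt(np.linalg.det(Sigma)))
    quad = np.sum((diff @ inv) * diff, axis=1)      # (x - mu)^T Sigma^{-1} (x - mu) per row
    return norm * np.exp(-0.5 * quad)

# fitting: mu = X.mean(axis=0); Sigma = (X - mu).T @ (X - mu) / X.shape[0]
```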
Use the original model
$$p(x)=\prod_{i=1}^np(x_i;\mu_i,\sigma_i^2)$$
- Manually create features to capture anomalies where $x_1,x_2$ take unusual combinations of values
- Computationally cheaper
- OK even if m (training set size) is small
Use Multivariate Gaussian
- Automatically captures correlations between features
- Computationally more expensive
- Must have $m>n$, or else $\Sigma$ is non-invertible
Recommender Systems
$n_u$ = number of users
$n_m$ = number of movies
$m^{(j)}$ = number of movies rated by user j
$r(i,j)=1$ if user j has rated movie i
$y^{(i,j)}$ = rating given by user j to movie i (defined only if $r(i,j)=1$)
Content-based recommendations
For each user j, learn a parameter vector $\theta^{(j)}\in R^3$. Predict user j's rating of movie i as $(\theta^{(j)})^Tx^{(i)}$ stars.
optimization objective
To learn $\theta^{(j)}$, the parameter vector for user j:
$$min_{\theta^{(j)}}\ \frac{1}{2}\sum_{i:r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$$
To learn $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(n_u)}$ for all users:
$$min_{\theta^{(1)},\cdots,\theta^{(n_u)}}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$$
Collaborative filtering
Given $x^{(1)},\cdots,x^{(n_m)}$ and the movie ratings, we can estimate $\theta^{(1)},\cdots,\theta^{(n_u)}$:
$$min_{\theta^{(1)},\cdots,\theta^{(n_u)}}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$$
Given $\theta^{(1)},\cdots,\theta^{(n_u)}$, we can estimate $x^{(1)},\cdots,x^{(n_m)}$:
$$min_{x^{(1)},\cdots,x^{(n_m)}}\ \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\big(x_k^{(i)}\big)^2$$
We can iterate $x\rightarrow \theta \rightarrow x\rightarrow \theta\cdots$, or minimize over $x$ and $\theta$ simultaneously:
$$min_{\theta^{(1)},\cdots,\theta^{(n_u)},\,x^{(1)},\cdots,x^{(n_m)}}\ \frac{1}{2}\sum_{(i,j):r(i,j)=1}\big((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\big)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\big(x_k^{(i)}\big)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$$
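A NumPy sketch of this joint cost (shapes and names are assumptions: X is n_m x n movie features, Theta is n_u x n user parameters, Y is the n_m x n_u rating matrix, and R is the 0/1 indicator with R[i, j] = 1 when r(i, j) = 1):

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost over all (i, j) pairs with r(i, j) = 1."""
    err = (X @ Theta.T - Y) * R                     # zero out pairs that were never rated
    J = 0.5 * np.sum(err ** 2)
    J += (lam / 2.0) * (np.sum(X ** 2) + np.sum(Theta ** 2))
    return J
```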
predicted ratings
Low-rank matrix factorization
$$\begin{bmatrix} (\theta^{(1)})^Tx^{(1)}&(\theta^{(2)})^Tx^{(1)}&\cdots&(\theta^{(n_u)})^Tx^{(1)}\\ \vdots&\vdots&\ddots&\vdots\\ (\theta^{(1)})^Tx^{(n_m)}&(\theta^{(2)})^Tx^{(n_m)}&\cdots&(\theta^{(n_u)})^Tx^{(n_m)} \end{bmatrix}=X\Theta^T$$
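In code, the full matrix of predicted ratings is just that low-rank product (a one-liner, assuming the same X and Theta shapes as in the cost sketch above):

```python
import numpy as np

def predict_ratings(X, Theta):
    """Entry (i, j) is (theta^(j))^T x^(i): user j's predicted rating of movie i."""
    return X @ Theta.T
```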
Learning with large datasets
$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
If m is very large, sweeping over the entire training set for every update becomes impractical.
Stochastic gradient descent
- Randomly shuffle the training examples
- Repeat, feeding in one example at a time and immediately updating all of the parameters:

repeat {
    for i in range(m):
        for j in range(n):
            update theta_j using only example (x^(i), y^(i))
}
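A runnable sketch of stochastic gradient descent for linear regression (the function name, learning rate, and epoch count are arbitrary placeholders):

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, n_epochs=10):
    """Update theta from one example at a time instead of the full batch gradient."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):            # randomly shuffle the examples
            grad = (X[i] @ theta - y[i]) * X[i]        # gradient of the cost on example i
            theta -= alpha * grad
    return theta
```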
Mini-batch gradient descent
- Batch gradient descent: use all m examples in each iteration
- Stochastic gradient descent: use 1 example in each iteration
- Mini-batch gradient descent: use b examples in each iteration

e.g. b = 10, m = 10000:
repeat {
    for i in range(1, m, 10):
        for j in range(n):
            update theta_j using examples (x^(i), y^(i)), ..., (x^(i+9), y^(i+9))
}
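The same idea as a NumPy sketch with mini-batches of size b (again illustrative; b = 10 matches the example above):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, n_epochs=10):
    """Average the gradient over b examples per parameter update."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for start in range(0, m, b):
            Xb, yb = X[start:start + b], y[start:start + b]
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)  # gradient averaged over the mini-batch
            theta -= alpha * grad
    return theta
```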
Online learning
repeat forever{
get (x,y) corresponding to user
update \theta using (x,y)
}
Online learning can adapt to changing user preferences and a changing environment.
Application example: Photo OCR
Photo OCR pipeline (a machine learning pipeline)
$$Image\rightarrow Text\ detection\rightarrow Character\ segmentation\rightarrow Character\ recognition$$
Synthesizing data by introducing distortions
- The distortion introduced should be representative of the type of noise or distortions present in the test set
- It usually does not help to add purely random or meaningless noise (e.g. Gaussian noise) to your data
Ceiling analysis
What part of the pipeline should you spend the most time to work on next?
Ceiling analysis: feed each component of the pipeline 100% correct (manually provided) input and measure how much overall accuracy would improve if that component were perfect; then spend the most time on the component whose perfection would raise accuracy the most.