Andrew Ng Machine Learning Notes

Error analysis

Methods to address overfitting

  • more training examples
  • try smaller sets of features
  • try increasing $\lambda$

Methods to address underfitting

  • getting additional features
  • try adding polynomial features
  • try decreasing $\lambda$

Recommended approach

  • start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data
  • plot learning curves to decide if more data, more features, etc. are likely to help.
  • Error analysis: See if you spot any systematic trend in what type of examples it is making errors on

Error metrics for skewed classes

|             | Actual class 1 | Actual class 0 |
|-------------|----------------|----------------|
| Predicted 1 | True Positive  | False Positive |
| Predicted 0 | False Negative | True Negative  |

$precision=\frac{TP}{TP+FP}$

$recall=\frac{TP}{TP+FN}$

$F_1\ score=\frac{2PR}{P+R}$
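
As a quick sanity check, here is a small Python sketch that computes these three metrics from raw counts; the counts in the example call are made up for illustration.

    def precision_recall_f1(tp, fp, fn):
        # precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2PR / (P + R)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # hypothetical counts from a skewed cross-validation set
    print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, 0.667, 0.727)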

Data for machine learning

In the following conditions, more data makes sense.

  • Assume feature $x\in R^{n+1}$ has sufficient information to predict y accurately.
  • Use a learning algorithm with many parameters, such as logistic regression or linear regression with many features, or a neural network with many hidden units.

Support Vector Machine

Logistic regression cost function
$min_\theta\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\left(-\log h_\theta(x^{(i)})\right)+(1-y^{(i)})\left(-\log(1-h_\theta(x^{(i)}))\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$
SVM hypothesis
$min_\theta\ C\sum_{i=1}^m\left[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$

$h_\theta(x)=\begin{cases}1 & \text{if } \theta^Tx\ge 0\\ 0 & \text{otherwise}\end{cases}$

SVM parameters

$C=\frac{1}{\lambda}$

  • large C: low bias, high variance
  • small C: higher bias, low variance

$\sigma^2$

  • large $\sigma^2$: features $f_i$ vary more smoothly; higher bias, lower variance
  • small $\sigma^2$: features $f_i$ vary less smoothly; lower bias, higher variance

Kernel function

  • no kernel

  • Gaussian kernel (see the sketch after this list)
    $f=\exp\left(-\frac{\Vert x_1-x_2\Vert^2}{2\sigma^2}\right)$

  • Polynomial kernel: $k(x,l)=(x^Tl+constant)^{degree}$
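
A minimal NumPy sketch of the Gaussian kernel as a similarity between an example x and a landmark l; the function name and the example vectors below are illustrative, not from the course materials.

    import numpy as np

    def gaussian_kernel(x, l, sigma):
        # similarity f = exp(-||x - l||^2 / (2 sigma^2))
        return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

    x = np.array([1.0, 2.0])
    l = np.array([1.5, 1.0])
    print(gaussian_kernel(x, l, sigma=1.0))  # near 1 when x is close to l, near 0 when far away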

Multi-class classification

Train K SVMs, one to distinguish $y=i$ from the rest (one-vs-all).

Logistic regression vs SVMs

n = number of features ($x\in R^{n+1}$), m = number of training examples

  • n is large relative to m: use logistic regression or SVM without a kernel
  • n is small, m is intermediate: use SVM with a Gaussian kernel (see the sketch below)
  • n is small and m is large: create or add more features, then use logistic regression or SVM without a kernel
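
A minimal sketch of this rule of thumb using scikit-learn, which is my own choice here rather than part of the course: SVC with kernel="rbf" corresponds to the Gaussian-kernel case, and a linear model to the no-kernel case. The toy data is synthetic.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                       # m = 200 examples, n = 2 features
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # non-linear decision boundary

    # n small, m intermediate -> Gaussian (RBF) kernel; gamma plays the role of 1 / (2 sigma^2)
    rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

    # n large relative to m -> a linear model (logistic regression or linear SVM) is usually enough
    linear = LogisticRegression(C=1.0).fit(X, y)

    print(rbf_svm.score(X, y), linear.score(X, y))      # the RBF SVM should fit this data better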

K-means

Input

  • K (number of clusters)
  • training set $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$

K-means algorithm

  1. randomly initialize K cluster centroids $\mu_1,\mu_2,\cdots,\mu_K\in R^{n}$
  2. repeat{
         for i = 1 to m:
             c^(i) = index (from 1 to K) of the cluster centroid closest to x^(i)
         for k = 1 to K:
             mu_k = mean of points assigned to cluster k
     }
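
A compact NumPy sketch of these two alternating steps (cluster assignment and centroid update); the function name, the fixed iteration count, and the initialization at K random examples are my own choices.

    import numpy as np

    def kmeans(X, K, n_iters=20, seed=0):
        rng = np.random.default_rng(seed)
        m, n = X.shape
        mu = X[rng.choice(m, size=K, replace=False)].copy()   # initialize centroids at K random examples
        for _ in range(n_iters):
            # cluster assignment step: c[i] = index of the centroid closest to x^(i)
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            c = np.argmin(dists, axis=1)
            # move centroid step: mu_k = mean of the points assigned to cluster k
            for k in range(K):
                if np.any(c == k):
                    mu[k] = X[c == k].mean(axis=0)
        return c, mu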

K-means optimization objective

  • $c^{(i)}$ = index $(1,2,\dots,K)$ of the cluster to which example $x^{(i)}$ is currently assigned

  • $\mu_k$ = cluster centroid k ($\mu_k\in R^n$)

  • $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

  • $min_{c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K}\ \frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-\mu_{c^{(i)}}\Vert^2$

Random initialization

for i in range(100):
    randomly initialize K-means
    run K-means, get c^(1),...,c^(m) and mu_1,...,mu_K
    compute the cost function J(c^(1),...,c^(m), mu_1,...,mu_K)
pick the clustering that gave the lowest cost J

Usually, for K = 2 to 10, K-means with multiple random initializations works well; the random restarts help it avoid bad local optima.
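
A sketch of this restart loop, reusing the kmeans function from the earlier sketch and computing the distortion J to pick the best run; all names are mine.

    import numpy as np

    def distortion(X, c, mu):
        # J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2
        return np.mean(np.sum((X - mu[c]) ** 2, axis=1))

    def best_of_n_runs(X, K, n_runs=100):
        best = None
        for seed in range(n_runs):
            c, mu = kmeans(X, K, seed=seed)        # kmeans as sketched above
            J = distortion(X, c, mu)
            if best is None or J < best[0]:
                best = (J, c, mu)
        return best                                # (lowest J, its assignments, its centroids)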

Choosing the number of clusters

Elbow method

choose the number for the purpose

Principle Component Analysis


Data preprocessing

Training set: $x^{(1)},x^{(2)},\cdots,x^{(m)}$

Preprocessing (mean normalization): compute $\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)}$ and replace each $x_j^{(i)}$ with $x_j^{(i)}-\mu_j$. If different features are on different scales, also scale the features so they have a comparable range of values.

Choosing the number of principal components

Averaged squared projection error: $\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-x_{approx}^{(i)}\Vert^2$

Total variation in the data: $\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}\Vert^2$

Typically, choose k to be the smallest value so that $\displaystyle \frac{\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}-x_{approx}^{(i)}\Vert^2}{\frac{1}{m}\sum_{i=1}^m\Vert x^{(i)}\Vert^2}\le 0.01$ (1%), i.e. 99% of the variance is retained.

[U,S,V]=svd(Sigma)

Equivalently, pick the smallest value of k for which $\displaystyle \frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}}\ge 0.99$
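
A NumPy sketch of this procedure: mean-normalize, form the covariance matrix, take its SVD, pick k from the retained-variance ratio, and project. The 99% threshold follows the notes above; everything else (names, return values) is illustrative.

    import numpy as np

    def pca_choose_k(X, variance_retained=0.99):
        Xn = X - X.mean(axis=0)                     # mean normalization
        Sigma = Xn.T @ Xn / Xn.shape[0]             # covariance matrix, n x n
        U, S, Vt = np.linalg.svd(Sigma)             # [U, S, V] = svd(Sigma)
        ratio = np.cumsum(S) / np.sum(S)            # sum_{i<=k} S_ii / sum_i S_ii
        k = int(np.argmax(ratio >= variance_retained)) + 1
        Z = Xn @ U[:, :k]                           # project the data onto the first k components
        return k, Z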

Application of PCA

  • Compression

    • reduce memory or disk needed to store data
    • speed up learning algorithm
  • Visualization

    • plot 2 or 3 dimensional data
  • A bad way to prevent overfitting

    • Using fewer features might work ok, but it isn't a good way to address overfitting. Use regularization instead.
  • A bad default when designing ML systems

    • Before building a system around PCA, first try running with the original features; only bring in PCA if that does not do what you want (for example, it is too slow or uses too much memory).

Anomaly detection

Examples

  • Fraud detection
  • Manufacturing
  • Monitoring computers in a data center

Anomaly detection algorithm

  1. Choose features $x_i$ that you think might be indicative of anomalous examples.

  2. Fit parameters $\mu_1,\cdots,\mu_n,\sigma_1^2,\cdots,\sigma_n^2$:
    $\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)}$, $\sigma_j^2=\frac{1}{m}\sum_{i=1}^m(x_j^{(i)}-\mu_j)^2$

  3. Given a new example x, compute $p(x)$:
    $p(x)=\prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$

  4. Anomaly if $p(x)<\epsilon $
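
A NumPy sketch of steps 1–4 above under the per-feature Gaussian model; the data, the threshold $\epsilon$, and all names are placeholders.

    import numpy as np

    def fit_gaussians(X):
        # per-feature mu_j and sigma_j^2 estimated from the (mostly normal) training set
        return X.mean(axis=0), X.var(axis=0)

    def p(x, mu, sigma2):
        # product over features of the univariate Gaussian densities
        dens = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        return np.prod(dens)

    X_train = np.random.default_rng(0).normal(size=(1000, 3))
    mu, sigma2 = fit_gaussians(X_train)
    x_new = np.array([0.1, -0.2, 5.0])          # third feature is far outside its usual range
    eps = 1e-6                                  # arbitrary threshold for the sketch
    print(p(x_new, mu, sigma2) < eps)           # True -> flag x_new as an anomaly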

Algorithm evaluation

  • fit model $p(x)$ on the training set $\{x^{(1)},\cdots,x^{(m)}\}$

  • on a cross-validation or test example $x$, predict
    $y=\begin{cases}1 & \text{if } p(x)<\epsilon\ \text{(anomaly)}\\ 0 & \text{if } p(x)\ge\epsilon\ \text{(normal)}\end{cases}$

  • possible evaluation

    • precision or recall
    • $F_1$ score
    • a suitable $\epsilon$ can be chosen by maximizing the $F_1$ score on the cross-validation set (see the sketch below)
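
A sketch of that threshold search: p_cv holds the densities $p(x)$ of the cross-validation examples and y_cv their 0/1 labels; both arrays and the grid of 1000 candidate thresholds are assumptions for illustration.

    import numpy as np

    def select_epsilon(p_cv, y_cv):
        best_eps, best_f1 = 0.0, 0.0
        for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
            pred = (p_cv < eps).astype(int)            # 1 = flagged as an anomaly
            tp = np.sum((pred == 1) & (y_cv == 1))
            fp = np.sum((pred == 1) & (y_cv == 0))
            fn = np.sum((pred == 0) & (y_cv == 1))
            if tp == 0:
                continue
            prec, rec = tp / (tp + fp), tp / (tp + fn)
            f1 = 2 * prec * rec / (prec + rec)
            if f1 > best_f1:
                best_eps, best_f1 = eps, f1
        return best_eps, best_f1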

Anomaly detection vs supervised learning

  • Use Anomaly detection when

    • very small number of positive examples (0-20) and large number of negative examples
    • many different types of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like
    • future anomalies may look nothing like any of the anomalous examples we’ve seen so far
    • examples like
      • fraud detection
      • manufacturing
      • monitoring machines in a data center
  • Use Supervised learning when

    • large number of positive and negative examples
    • enough positive examples for algorithm to get a sense of what positive examples are like, future examples likely to be similar to ones in training set
    • examples like
      • email spam classification
      • weather prediction
      • cancer classification

Choosing what features to use

Make non-Gaussian features more Gaussian, for example by transforming them:

  • $x_i\leftarrow \log(x_i)$
  • $x_2\leftarrow \log(x_2+c)$
  • $x_3\leftarrow \sqrt{x_3}$

Multivariate Gaussian distribution

$x\in R^n$. Don't model $p(x_1),p(x_2),\cdots$ separately; model $p(x)$ all in one go. Parameters: $\mu\in R^n$, $\Sigma\in R^{n\times n}$
$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
Use the original model $p(x)=\prod_{i=1}^n p(x_i;\mu_i,\sigma_i^2)$ when:

  • Manually create features to capture anomalies where x 1 , x 2 x_1,x_2 x1,x2 take unusual combinations of values
  • Computationally cheaper
  • ok even if m (training set size) is small

Use the multivariate Gaussian when:

  • Automatically captures correlations between features
  • Computationally more expensive
  • Must have m > n m>n m>n or else Σ \Sigma Σ is non-invertible
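
A NumPy sketch that fits and evaluates this multivariate model (sample mean and sample covariance, then the density formula above); the data and names are illustrative.

    import numpy as np

    def fit_multivariate(X):
        mu = X.mean(axis=0)
        Xc = X - mu
        Sigma = Xc.T @ Xc / X.shape[0]          # needs m > n for Sigma to be invertible
        return mu, Sigma

    def p_multivariate(x, mu, Sigma):
        n = mu.shape[0]
        diff = x - mu
        norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

    X = np.random.default_rng(0).normal(size=(500, 2))
    X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]     # make the two features correlated
    mu, Sigma = fit_multivariate(X)
    print(p_multivariate(np.array([2.0, -2.0]), mu, Sigma))   # unusual combination -> small p(x)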

Recommender Systems

$n_u$ = number of users

$n_m$ = number of movies

$m^{(j)}$ = number of movies rated by user j

$r(i,j)=1$ if user j has rated movie i

$y^{(i,j)}$ = rating given by user j to movie i (defined only if $r(i,j)=1$)

Content-based recommendations

For each user j, learn a parameter vector $\theta^{(j)}\in R^3$. Predict that user j rates movie i with $(\theta^{(j)})^Tx^{(i)}$ stars.

optimization objective

To learn $\theta^{(j)}$, the parameter for user j:
$min_{\theta^{(j)}}\frac{1}{2}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{k=1}^n(\theta_k^{(j)})^2$
To learn $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(n_u)}$:
$min_{\theta^{(1)},\cdots,\theta^{(n_u)}}\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$

Collaborative filtering

Given $x^{(1)},\cdots,x^{(n_m)}$ and the movie ratings, we can estimate $\theta^{(1)},\cdots,\theta^{(n_u)}$:
$min_{\theta^{(1)},\cdots,\theta^{(n_u)}}\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$
Given $\theta^{(1)},\cdots,\theta^{(n_u)}$, we can estimate $x^{(1)},\cdots,x^{(n_m)}$:
$min_{x^{(1)},\cdots,x^{(n_m)}}\frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2$
We can iterate $x\rightarrow \theta \rightarrow x\rightarrow \theta\cdots$, or minimize over both sets of parameters simultaneously:
$min_{x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)}}\frac{1}{2}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$
Predicted rating of user j for movie i: $(\theta^{(j)})^Tx^{(i)}$

Low-rank matrix factorization
$\begin{bmatrix} (\theta^{(1)})^Tx^{(1)}&(\theta^{(2)})^Tx^{(1)}&\cdots&(\theta^{(n_u)})^Tx^{(1)}\\ \vdots&\vdots&\ddots&\vdots\\ (\theta^{(1)})^Tx^{(n_m)}&(\theta^{(2)})^Tx^{(n_m)}&\cdots&(\theta^{(n_u)})^Tx^{(n_m)} \end{bmatrix}=X\Theta^T$
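
A NumPy sketch of the combined collaborative-filtering cost, written in the $X\Theta^T$ form with a 0/1 matrix R standing in for $r(i,j)$; the array shapes and random data are assumptions for illustration.

    import numpy as np

    def cofi_cost(X, Theta, Y, R, lam):
        # X: (n_m, n) movie features, Theta: (n_u, n) user parameters,
        # Y: (n_m, n_u) ratings, R[i, j] = 1 if movie i was rated by user j
        err = (X @ Theta.T - Y) * R                 # only rated entries contribute
        J = 0.5 * np.sum(err ** 2)
        J += (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))
        return J

    rng = np.random.default_rng(0)
    n_m, n_u, n = 5, 4, 3
    X, Theta = rng.normal(size=(n_m, n)), rng.normal(size=(n_u, n))
    Y = rng.integers(1, 6, size=(n_m, n_u)).astype(float)
    R = rng.integers(0, 2, size=(n_m, n_u)).astype(float)
    print(cofi_cost(X, Theta, Y, R, lam=1.0))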

Learning with large datasets

Batch gradient descent update: $\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$

If m is very large, a single gradient step that sums over the entire training set becomes expensive.

Stochastic gradient descent

  1. Randomly shuffle training examples

  2. Repeat, updating the parameters using one example at a time:
    # stochastic gradient descent: each step looks at a single example and immediately updates all parameters
    repeat{
        for i in range(m):
            for j in range(n):
                theta_j := theta_j - alpha * (h_theta(x^(i)) - y^(i)) * x_j^(i)
    }
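
A runnable sketch of stochastic gradient descent for linear regression; the hypothesis, learning rate, number of epochs, and synthetic data are all illustrative.

    import numpy as np

    def sgd_linear_regression(X, y, alpha=0.01, epochs=5, seed=0):
        m, n = X.shape
        theta = np.zeros(n)
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            for i in rng.permutation(m):           # step 1: shuffle the training examples
                err = X[i] @ theta - y[i]          # h_theta(x^(i)) - y^(i)
                theta -= alpha * err * X[i]        # update every theta_j from this single example
        return theta

    rng = np.random.default_rng(0)
    X = np.c_[np.ones(1000), rng.normal(size=(1000, 1))]    # intercept column plus one feature
    y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=1000)
    print(sgd_linear_regression(X, y))             # should end up close to [2, -3]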

Mini-batch gradient descent

  • Batch gradient descent: use all m examples in each iteration
  • Stochastic gradient descent: use 1 example in each iteration
  • Mini-batch gradient descent: use b examples in each iteration (see the sketch after this block)
b = 10, m = 10000
repeat{
    for i in range(0, m, b):
        for j in range(n):
            theta_j := theta_j - alpha * (1/b) * sum_{k=i}^{i+b-1} (h_theta(x^(k)) - y^(k)) * x_j^(k)
}
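
A vectorized mini-batch variant of the SGD sketch above, again with illustrative names and a batch size of b = 10.

    import numpy as np

    def minibatch_gd(X, y, b=10, alpha=0.01, epochs=20, seed=0):
        m, n = X.shape
        theta = np.zeros(n)
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            order = rng.permutation(m)
            for start in range(0, m, b):
                idx = order[start:start + b]
                err = X[idx] @ theta - y[idx]
                theta -= alpha * (X[idx].T @ err) / len(idx)   # average gradient over the mini-batch
        return theta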

Online learning

repeat forever{
    get (x, y) corresponding to the current user
    update theta using this (x, y): theta_j := theta_j - alpha * (h_theta(x) - y) * x_j   for j = 1, ..., n
}

Online learning lets the model adapt to changing user preferences and a changing environment.

Application example Photo OCR

Photo OCR pipeline (a machine learning pipeline):
Image → Text detection → Character segmentation → Character recognition
Synthesizing data by introducing distortions

  • The distortions introduced should be representative of the type of noise or distortions in the test set.
  • It usually does not help to add purely random or meaningless noise (for example, per-pixel Gaussian noise) to your data.

Ceiling analysis

What part of the pipeline should you spend the most time working on next?


Ceiling analysis: for each component, feed it 100% correct (ground-truth) input from the preceding stage and measure how much overall accuracy would improve if that component were perfect; then spend the most effort on the component whose improvement raises accuracy the most.
