Course 1: Supervised Machine Learning: Regression and Classification
Week 1: Introduction to Machine Learning
supervised learning v.s. unsupervised learning
supervised learning:
algorithms that learn a mapping from input x to output y. You give your learning algorithm examples to learn from that include the “right answers” (output labels).
e.g.
| input(X) | output(Y) | application |
|---|---|---|
| email | spam? (0/1) | spam filtering |
| audio | text transcript | speech recognition |
| English | Spanish | machine translation |
| ad, user info | click? (0/1) | online advertising |
| image, radar info | position of other cars | self-driving car |
| image of phone | defect? (0/1) | visual inspection |
Regression: predict a number from infinitely many possible outputs
Classification: predict categories from a small number of possible outputs
unsupervised learning:
given data that isn’t associated with any output label y, find some structure, pattern, or otherwise interesting property in the unlabeled data
Clustering: group similar data points together. e.g. Google news, DNA microarray, grouping customers
Anomaly Detection: find unusual data points. e.g. fraud detection
Dimensionality Reduction: compress data using fewer numbers
Regression model
Linear Regression with one variable
Notation:
$x$ = “input” variable, feature
$y$ = “output” variable, “target” variable
$m$ = number of training examples
$(x, y)$ = single training example
$(x^{(i)}, y^{(i)})$ = i-th training example
Univariate linear regression: linear regression with one variable, $f_{w,b}(x) = wx + b$
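As a quick sketch (my own illustration, not code from the course), the model $f_{w,b}(x) = wx + b$ in NumPy:

```python
import numpy as np

def f_wb(x, w, b):
    """Univariate linear regression model: f_{w,b}(x) = w*x + b.

    x can be a scalar or a NumPy array of feature values.
    """
    return w * x + b

# Example with made-up parameters
x_train = np.array([1.0, 2.0, 3.0])
print(f_wb(x_train, w=2.0, b=0.5))  # -> [2.5 4.5 6.5]
```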
Cost Function:
squared-error cost function
$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
where $\hat{y}^{(i)} = f_{w,b}(x^{(i)})$
bowl-shaped for squared-error cost function
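A minimal sketch of the cost computation, assuming NumPy arrays x and y of length m (the helper name compute_cost is my own, not from the course):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w,b) = (1/2m) * sum((f_wb(x_i) - y_i)^2)."""
    m = x.shape[0]
    y_hat = w * x + b                     # predictions f_{w,b}(x^{(i)})
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Example on a tiny made-up dataset
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0: this line fits exactly
```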
Train the model with gradient descent
Gradient Descent:
repeat until convergence:
$$w = w - \alpha \frac{\partial}{\partial w} J(w,b)$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w,b)$$
where $\alpha$ is the learning rate
Note: update $w$ and $b$ simultaneously. Simultaneously means that you calculate the partial derivatives for all the parameters before updating any of the parameters.

Choosing a different starting point (even one just a few steps away from the original) may lead gradient descent to a different local minimum.
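A small sketch of what the simultaneous update looks like in Python; grad_w and grad_b are hypothetical callables standing in for the partial derivatives of J:

```python
def gradient_descent_step(w, b, grad_w, grad_b, alpha):
    """One simultaneous update of w and b.

    grad_w and grad_b are assumed to be functions returning the partial
    derivatives of the cost J with respect to w and b at the given point.
    """
    # Evaluate BOTH partial derivatives at the current (w, b) first...
    tmp_dj_dw = grad_w(w, b)
    tmp_dj_db = grad_b(w, b)
    # ...then update. The incorrect (non-simultaneous) version would update w
    # first and then evaluate grad_b at the already-updated w.
    w = w - alpha * tmp_dj_dw
    b = b - alpha * tmp_dj_db
    return w, b
```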

Learning Rate:
if $\alpha$ is too small, gradient descent will work but may be slow.
if $\alpha$ is too large, gradient descent may overshoot and never reach the minimum; it may fail to converge, or even diverge.
If already at a local minimum, gradient descent leaves $w$ unchanged (since the slope is 0).
Gradient descent can reach a local minimum with a fixed learning rate, because as we get nearer to a local minimum the derivative automatically gets smaller, so gradient descent automatically takes smaller steps.
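A toy illustration of these learning-rate effects (my own example, not from the course), minimizing $J(w) = w^2$, whose derivative is $2w$:

```python
# Toy example: minimize J(w) = w^2 (derivative dJ/dw = 2w), starting from w = 1.0.
def run(alpha, steps=5):
    w = 1.0
    for _ in range(steps):
        w = w - alpha * 2 * w   # gradient descent step on J(w) = w^2
    return w

print(run(alpha=0.01))  # too small: w only creeps toward the minimum at 0
print(run(alpha=0.5))   # well chosen: jumps straight to the minimum w = 0
print(run(alpha=1.1))   # too large: |w| grows every step, so it diverges
```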
Gradient Descent for Linear Regression:
$$w = w - \alpha \frac{\partial}{\partial w} J(w,b)$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w,b)$$
where
$$\frac{\partial}{\partial w} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x^{(i)}$$
$$\frac{\partial}{\partial b} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$

The squared-error cost function is convex: it is bowl-shaped and has a single global minimum with no other local minima. So as long as the learning rate is chosen appropriately, gradient descent will always converge to the global minimum.
“Batch” gradient descent: each step of gradient descent uses all the training examples.
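Putting the pieces together, a sketch of batch gradient descent for univariate linear regression in NumPy (the function names, dataset, and hyperparameters are illustrative assumptions, not prescribed by the course):

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Partial derivatives of the squared-error cost J(w,b), averaged over all m examples."""
    m = x.shape[0]
    err = (w * x + b) - y              # f_{w,b}(x^{(i)}) - y^{(i)}
    dj_dw = np.sum(err * x) / m
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Batch gradient descent: every update uses all the training examples."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)  # gradients at the current (w, b)
        w = w - alpha * dj_dw                        # simultaneous update
        b = b - alpha * dj_db
    return w, b

# Tiny made-up dataset for illustration
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
w, b = gradient_descent(x_train, y_train, w=0.0, b=0.0, alpha=0.01, num_iters=10000)
print(w, b)  # approaches w ≈ 200, b ≈ 100, where the cost is 0
```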
Summary: this note covers the difference between supervised and unsupervised learning. Supervised learning includes classification and regression (e.g. spam filtering, speech recognition), while unsupervised learning covers clustering, anomaly detection, and dimensionality reduction. Linear regression, as an example of supervised learning, is trained with gradient descent, adjusting the weight and bias to minimize the squared-error cost function. The choice of learning rate is critical to how quickly gradient descent converges and whether it can find the global minimum.