I am working through Stanford's Machine Learning course by Andrew Ng and taking notes as I go, to review and consolidate the material.
My knowledge is limited, so please bear with any errors or omissions and do point them out. Fellow learners are very welcome to join the discussion!
Week 02
2.1 Multivariate Linear Regression
2.1.1 Multiple Features
- The multivariable form of the hypothesis function:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$

- Remark: for convenience, assume $x_0^{(i)} = 1$ for $i \in 1, \dots, m$.
- The cost function $J(\theta)$ has the same form as before:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
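As a quick sanity check, here is a minimal NumPy sketch of the vectorized hypothesis and cost function (the toy `X`, `y`, and `theta` values are my own illustration, not from the course):

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, evaluated for every row of X.

    X is the m x (n+1) design matrix whose first column is all ones,
    following the x_0 = 1 convention in the remark above."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    residuals = hypothesis(theta, X) - y
    return residuals @ residuals / (2 * m)

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # x_0 = 1 prepended
y = np.array([5.0, 7.0, 9.0])
print(cost(np.array([1.0, 2.0]), X, y))  # 0.0: this theta fits exactly
```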
2.1.2 Gradient Descent
- Gradient descent for multivariate linear regression - Algorithm 1’
Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

(simultaneously update $\theta_j$ for $j = 0, \dots, n$)
}
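A minimal sketch of Algorithm 1’ in NumPy, assuming the same design-matrix convention as above; the defaults for `alpha` and `num_iters` are arbitrary placeholders, not values prescribed by the course:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    The vectorized expression X^T (X theta - y) / m computes
    (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i) for every j
    at once, so all theta_j are updated simultaneously."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta
```

With the toy `X` and `y` from the previous sketch, `gradient_descent(X, y, alpha=0.1, num_iters=5000)` should land very close to `[1.0, 2.0]`.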
2.1.3 Practical Tricks in GD
- Feature Scaling $s_i$
  - Idea: make sure features are on a similar scale. Gradient descent converges quickly when all features share small, similar ranges; otherwise $\theta$ oscillates inefficiently on its way down to the optimum.
  - Get every feature into approximately a $-1 \le x_i \le 1$ range (the bound 1 is not strict).
  - Remark: the quizzes in this course use the range; the programming exercises use the standard deviation.
- Mean Normalization $\mu_i$
  - Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean (do not apply this to $x_0 = 1$).
  - In general, combining both tricks (see the feature-scaling sketch after this list):

$$x_i := \frac{x_i - \mu_i}{s_i}$$

  where $\mu_i$ is the average of all values of feature $i$ and $s_i$ is either the range of values ($\max - \min$) or the standard deviation.
- Learning Rate Check
  - To debug gradient descent, plot $J(\theta)$ against the number of iterations on the x-axis, and check that $J(\theta)$ decreases steadily toward its minimum (see the sketches after this list):
    - If $\alpha$ is too small: slow convergence.
    - If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration, and may not converge at all.
  - When judging from the plot, try values of the form $1 \times 10^k$ or $3 \times 10^k$ (e.g., $0.001, 0.003, 0.01, 0.03, \dots$).
  - It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
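Here is a minimal sketch of the scaling rule above, using the range for $s_i$ (swap in `X.std(axis=0)` for the standard-deviation variant used in the programming exercises); `mean_normalize` and the feature values are my own illustration:

```python
import numpy as np

def mean_normalize(X):
    """Apply x_i := (x_i - mu_i) / s_i column-wise, with s_i = max - min.

    Apply only to the real features; the constant x_0 = 1 column must be
    excluded (its range is zero). Keep mu and s to rescale new inputs."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

features = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0]])
scaled, mu, s = mean_normalize(features)
print(scaled.mean(axis=0))  # approximately zero in every column
```

Note that the same `mu` and `s` must be reused when scaling later inputs, e.g. at prediction time.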
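And a sketch of the learning-rate check itself: run gradient descent under several candidate $\alpha$ values and record $J(\theta)$ at every iteration (in practice you would plot each history; here only the final costs are printed, and `cost_history` plus the toy data are my own illustration):

```python
import numpy as np

def cost_history(X, y, alpha, num_iters=100):
    """Record J(theta) at every gradient-descent iteration.

    A smoothly decreasing history suggests alpha is acceptable;
    a growing or oscillating one suggests alpha is too large."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(num_iters):
        residuals = X @ theta - y
        history.append(residuals @ residuals / (2 * m))
        theta -= alpha * X.T @ residuals / m
    return history

X = np.array([[1.0, -0.5], [1.0, 0.0], [1.0, 0.5]])  # toy, already scaled
y = np.array([0.0, 1.0, 2.0])
for alpha in (0.001, 0.01, 0.1, 1.0):  # the 1x10^k / 3x10^k grid idea
    print(alpha, cost_history(X, y, alpha)[-1])
```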
2.1.4 Improvement of Linear Regression
- Feature Combination
- Combine several features into one using a variety of methods (e.g., multiply frontage and depth of a house lot into a single area feature).
- Polynomial Regression
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1^{a_1} + \theta_2 x_2^{a_2} + \cdots + \theta_n x_n^{a_n}$$

- Remark: one important thing to keep in mind is that if you choose your features this way, feature scaling becomes very important (see the sketch below).
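For instance, here is a hypothetical sketch (helper name and values my own) that builds powers of a single raw feature and immediately scales them, since $x$, $x^2$, and $x^3$ live on wildly different ranges:

```python
import numpy as np

def polynomial_features(x, degree):
    """Build columns [x^1, x^2, ..., x^degree] from one raw feature,
    then mean-normalize each column (per the remark above, scaling
    is essential once powers of a feature are mixed together)."""
    cols = np.column_stack([x ** d for d in range(1, degree + 1)])
    mu = cols.mean(axis=0)
    s = cols.max(axis=0) - cols.min(axis=0)
    return (cols - mu) / s

x = np.array([1.0, 2.0, 3.0, 4.0])  # e.g., house size
print(polynomial_features(x, degree=3))
```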
2.2 Another Method: The Normal Equation
2.2.1 Normal Equation
$$x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \ (1 \le i \le m), \quad X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

and

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$
Then the normal equation formula is given below:

$$\theta = (X^T X)^{-1} X^T y$$

It comes from setting the gradient of the cost to zero: $\nabla_\theta J = \frac{1}{m} X^T (X\theta - y) = 0$ gives $X^T X \theta = X^T y$.
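A minimal NumPy sketch (`normal_equation` is my own helper name); solving the system $X^T X \theta = X^T y$ with `np.linalg.solve` is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form theta = (X^T X)^(-1) X^T y, via a linear solve.

    If X^T X is singular (redundant features, or more features than
    examples), np.linalg.pinv(X) @ y is a common fallback."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

With the toy data from section 2.1.1, `normal_equation(X, y)` returns `[1.0, 2.0]` in one step, with no $\alpha$ and no iterations.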
2.2.2 Comparison of GD and NE
- Gradient Descent
  - Needs to choose the learning rate $\alpha$
  - Needs many iterations
  - $O(kn^2)$ for $k$ iterations
  - Works well even when $n$ is large
- Normal Equation
  - No need to choose $\alpha$ or to iterate
  - $O(n^3)$, dominated by computing $(X^T X)^{-1}$
  - Slow if $n$ is large
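The trade-off above can be seen on a small synthetic problem (the data below is randomly generated purely for illustration): the normal equation recovers $\theta$ in one solve, while gradient descent needs many cheap iterations to reach the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
theta_true = np.array([4.0, 3.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=m)

# Normal equation: one O(n^3) solve, no alpha, no iterations.
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: cheap per-step updates, but many of them.
theta_gd = np.zeros(4)
for _ in range(5000):
    theta_gd -= 0.1 * X.T @ (X @ theta_gd - y) / m

print(np.allclose(theta_ne, theta_gd, atol=1e-4))  # True: both recover theta_true
```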