Machine Learning Study Notes (Week 1)

These notes cover the basic concepts of machine learning, including supervised and unsupervised learning, and look in detail at the linear regression model and its cost function. They walk through one-variable linear regression, the gradient descent algorithm and its batch form, and review linear algebra fundamentals: matrices, vectors, and matrix multiplication.

Week1 Introduction

Introduction

Welcome

Machine learning is everywhere.

The aim of machine learning is to build machines that are as intelligent as people.

Two goals:

  • know the algorithms & math
  • implement each algorithm

Why machine learning is successful today

  • grew out of work in AI
  • New capability for computers

Examples:

  • Database mining
  • applications that can't be programmed by hand

handwriting recognition, NLP, CV

  • Self-customizing programs

Recommendations

  • understanding human learning (brain, real AI)

What is Machine Learning

  • Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
  • Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Machine learning algorithms:

  • Supervised learning

  • Unsupervised learning

    Others: Reinforcement learning, recommender system

Supervised Learning

Example:

  • housing price prediction

    “right answers” given

    Regression: Predict continuous valued output (price)

  • Breast cancer (malignant, benign)

    Classification: Discrete valued output (0 or 1)

    Two input features (Age, Tumor Size) are used to predict whether the tumor is malignant or benign.

Unsupervised Learning

No labels; the algorithm clusters the data by itself.

no feedback based on the prediction results

Examples:
  • Clustering in 2-D data

    • Age, Tumor Size data
    • Genes VS. individuals clustering
  • Organize computing clusters

  • Social network analysis

  • Market segmentation

  • Astronomical data analysis

  • cocktail party problem

    More than one person is speaking; the goal is to separate the individual voices from the microphone recordings.

Week1 Linear Regression with One Variable

Model and Cost function

Model Representation

Linear regression
  • Features

    supervised, regression

  • notation

    training set

    m = Number of training examples

    x’s = “input” variable / features

    y’s = “output” variable / “target” variable

    $(x, y)$: one training example

    $(x^{(i)}, y^{(i)})$: the $i$-th training example

ML workflow

(figure: ML workflow)

$h_\theta(x) = \theta_0 + \theta_1 x$

For historical reasons, this function h is called a hypothesis. When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

What machine learning learns is $h$, i.e. a hypothesis. The hypothesis (mathematically, the parameters $\theta$) describes the relationship between the input data and the output: given an input $X$, what result $y$ we should expect.

Cost Function

Using linear regression as example:

$h_\theta(x) = \theta_0 + \theta_1 x$

Idea: Choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$.

We want to $\underset{\theta_0,\theta_1}{\text{minimize}} \; \frac{1}{2m} \displaystyle\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$,

that is, to minimize the cost function $J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$,

which is half of the mean squared error; the factor $\frac{1}{2}$ is a convenience that makes the derivatives used in gradient descent cleaner.
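
A minimal NumPy sketch of this cost function (the toy data and the names `compute_cost`, `theta0`, `theta1` are illustrative, not from the course):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum((h_theta(x^(i)) - y^(i))^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x      # h_theta(x) for every training example
    errors = predictions - y
    return np.sum(errors ** 2) / (2 * m)

# toy data generated by y = 2x, so (theta0, theta1) = (0, 2) gives J = 0
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))        # 0.0
print(compute_cost(0.0, 1.0, x, y))        # (1 + 4 + 9) / 6 ≈ 2.33
```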

Cost Function Intuition

  • Hypothesis:

$h_\theta(x) = \theta_0 + \theta_1 x$

a function of $x$

  • Parameters:

    $\theta_0, \theta_1$

  • Cost Function:

    $J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

    a function of $\theta$

  • Goal:

    $\underset{\theta_0,\theta_1}{\text{minimize}} \; J(\theta_0,\theta_1)$

With one parameter, $J$ is a bowl-shaped quadratic curve; with two parameters it can be visualized as a 3-D surface or, more conveniently, as a contour plot.
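
To see the one-parameter quadratic shape, one can fix $\theta_0 = 0$ and trace $J$ over a few $\theta_1$ values; a self-contained sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])              # generated by y = 2x

# J(0, theta1) traced over theta1: a parabola whose minimum is at theta1 = 2
for theta1 in [0.0, 1.0, 2.0, 3.0, 4.0]:
    J = np.sum((theta1 * x - y) ** 2) / (2 * len(y))
    print(theta1, round(J, 3))             # 9.333, 2.333, 0.0, 2.333, 9.333
```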

Parameter Learning

Gradient Descent

Have some function $J(\theta)$

Want $\underset{\theta}{\text{min}} \; J(\theta)$

Outline:

  • Start with some θ \theta θ
  • Keep changing θ \theta θ to reduce J ( θ ) J(\theta) J(θ) until we hopefully end up at a minimum
Gradient descent algorithm

repeat until convergence {
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \quad (\text{for } j = 0 \text{ and } j = 1)$
}

$\alpha$ is the learning rate

Correct: Simultaneous update

$\text{temp0} := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1)$

$\text{temp1} := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1)$

$\theta_0 := \text{temp0}$

$\theta_1 := \text{temp1}$

Incorrect: Simultaneous update

$\text{temp0} := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1)$

$\theta_0 := \text{temp0}$

$\text{temp1} := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1)$

$\theta_1 := \text{temp1}$

In the incorrect form, $\theta_0$ and $\theta_1$ are not updated simultaneously: $\theta_0$ has already changed before $\theta_1$ is updated, so the second gradient is computed at a partially updated point.
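
The same distinction as a Python sketch, using toy data and the partial-derivative formulas derived in the linear-regression subsection below; all names here are illustrative:

```python
import numpy as np

# toy training set generated by y = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(y)
alpha = 0.1
theta0, theta1 = 0.0, 0.0

def dJ_dtheta0(t0, t1):
    # partial derivative of J with respect to theta0
    return np.sum((t0 + t1 * x) - y) / m

def dJ_dtheta1(t0, t1):
    # partial derivative of J with respect to theta1
    return np.sum(((t0 + t1 * x) - y) * x) / m

# Correct: both gradients are evaluated at the *old* (theta0, theta1),
# and only then are the parameters overwritten.
temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
theta0, theta1 = temp0, temp1
print(float(theta0), float(theta1))      # 0.4 0.933... after one simultaneous step

# Incorrect (shown for contrast): theta0 is overwritten first, so the
# theta1 gradient would be computed at a half-updated point.
# theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
# theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
```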

Gradient Descent Intuition

Gradient descent keeps taking steps toward a local minimum.

If the learning rate is too small, convergence is very slow; if it is too large, gradient descent can overshoot the minimum and fail to converge, or even diverge.

Gradient Descent For Linear Regression

repeat until convergence {
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \quad (\text{for } j = 0 \text{ and } j = 1)$
}

$h_\theta(x) = \theta_0 + \theta_1 x$

$J(\theta_0,\theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

$$\begin{aligned} \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1) &= \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\ &= \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^m \left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2 \end{aligned}$$

$$\begin{aligned} \theta_0 &:= \theta_0 - \alpha \frac{\partial}{\partial\theta_0} J(\theta_0,\theta_1) \\ &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \end{aligned}$$

$$\begin{aligned} \theta_1 &:= \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_0,\theta_1) \\ &:= \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)} \end{aligned}$$

The cost function $J$ for linear regression is a convex function, so it has no local optima other than the single global optimum.

“Batch” Gradient Descent

“Batch”: Each step of gradient descent uses all the training examples.
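
A minimal sketch of batch gradient descent for one-variable linear regression (NumPy, with illustrative data and hyperparameters; not the course's Octave code):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent.

    Every iteration uses all m training examples ("batch"), and the two
    parameters are updated simultaneously.
    """
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = (theta0 + theta1 * x) - y          # h_theta(x^(i)) - y^(i)
        grad0 = np.sum(errors) / m                  # dJ/dtheta0
        grad1 = np.sum(errors * x) / m              # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])                  # exactly y = 1 + 2x
theta0, theta1 = batch_gradient_descent(x, y, alpha=0.1, num_iters=5000)
print(float(theta0), float(theta1))                 # ≈ 1.0 2.0 for this toy data
```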

Week1 Linear Algebra Review

Linear Algebra Review

Matrices and Vectors

  • matrices

    $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$

    $A_{12} = 2$

    The dimension is given as the number of rows first, then the number of columns.

  • Vector

    An $n \times 1$ matrix, i.e. a matrix with a single column.
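
The same objects in NumPy, as a quick sketch; note that NumPy indexing is 0-based, so the 1-indexed entry $A_{12}$ above is `A[0, 1]`:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # a 2 x 3 matrix
print(A.shape)                 # (2, 3): number of rows first, then columns
print(A[0, 1])                 # 2 -- the entry written A_12 above

v = np.array([[1], [5]])       # an n x 1 matrix, i.e. a column vector
print(v.shape)                 # (2, 1)
```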

Addition and Scalar

  • Matrix Addition

    $\begin{bmatrix} 1 & 0 \\ 2 & 5 \\ 3 & 1 \end{bmatrix} + \begin{bmatrix} 4 & 0.5 \\ 2 & 5 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 0.5 \\ 4 & 10 \\ 3 & 2 \end{bmatrix}$

    Only matrices of the same size can be added.

  • Scalar Multiplication

    $3 \times \begin{bmatrix} 1 & 0 \\ 2 & 5 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 0 \\ 6 & 15 \\ 9 & 3 \end{bmatrix}$

    $\begin{bmatrix} 4 & 0 \\ 6 & 3 \end{bmatrix} / 4 = \frac{1}{4} \begin{bmatrix} 4 & 0 \\ 6 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \frac{3}{2} & \frac{3}{4} \end{bmatrix}$

  • Combination of Operands

    $3 \times \begin{bmatrix} 1 \\ 4 \\ 2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix} - \begin{bmatrix} 3 \\ 0 \\ 2 \end{bmatrix} / 3 = \begin{bmatrix} 2 \\ 12 \\ 10\frac{1}{3} \end{bmatrix}$
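
These are all elementwise operations and can be checked directly in NumPy; a small sketch verifying the combined expression above:

```python
import numpy as np

a = np.array([1.0, 4.0, 2.0])
b = np.array([0.0, 0.0, 5.0])
c = np.array([3.0, 0.0, 2.0])

# 3a + b - c/3, computed elementwise
print(3 * a + b - c / 3)       # [ 2.         12.         10.33333333]
```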

Matrix Vector Multiplication

$\begin{bmatrix} 1 & 3 \\ 4 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 5 \end{bmatrix} = \begin{bmatrix} 16 \\ 4 \\ 7 \end{bmatrix}$

$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} \begin{bmatrix} g \\ h \end{bmatrix} = \begin{bmatrix} a \times g + b \times h \\ c \times g + d \times h \\ e \times g + f \times h \end{bmatrix}$
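
A sketch of the same product, once with explicit loops mirroring the formula and once with NumPy's `@` operator:

```python
import numpy as np

A = [[1, 3],
     [4, 0],
     [2, 1]]
v = [1, 5]

# loop version mirroring the formula: result_i = sum_j A[i][j] * v[j]
result = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]
print(result)                      # [16, 4, 7]

# the same product with NumPy's matrix-vector operator
print(np.array(A) @ np.array(v))   # [16  4  7]
```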

Matrix Matrix Multiplication

$\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 \\ 0 & 1 \\ 5 & 2 \end{bmatrix} = \begin{bmatrix} 11 & 10 \\ 9 & 14 \end{bmatrix}$

$\begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix} \begin{bmatrix} g & h \\ i & j \\ k & l \end{bmatrix} = \begin{bmatrix} a \times g + b \times i + c \times k & a \times h + b \times j + c \times l \\ d \times g + e \times i + f \times k & d \times h + e \times j + f \times l \end{bmatrix}$
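
And the matrix-matrix example above checked with NumPy (a sketch):

```python
import numpy as np

A = np.array([[1, 3, 2],
              [4, 0, 1]])
B = np.array([[1, 3],
              [0, 1],
              [5, 2]])
print(A @ B)
# [[11 10]
#  [ 9 14]]
```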

Matrix Multiplication Properties

  • Commutative

    $A \times B \neq B \times A$ in general (not commutative)

  • Associative

    $(A \times B) \times C = A \times (B \times C)$

  • Identity Matrix

    $I_{n \times n} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$

    For any matrix $A$: $A_{m \times n} I_{n \times n} = I_{m \times m} A_{m \times n} = A_{m \times n}$
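
These properties can be spot-checked numerically; a sketch with random matrices (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, just for reproducibility
A, B, C = (rng.random((3, 3)) for _ in range(3))
I = np.eye(3)

print(np.allclose(A @ B, B @ A))              # False (in general): not commutative
print(np.allclose((A @ B) @ C, A @ (B @ C)))  # True: associative
print(np.allclose(A @ I, A), np.allclose(I @ A, A))   # True True: identity matrix
```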

Matrix Inverse and Transpose

  • Matrix Inverse

    If $A$ is an $m \times m$ matrix and it has an inverse, then

    $A A^{-1} = A^{-1} A = I$

  • Matrix Transpose

    $(A^T)_{ij} = A_{ji}$: the rows of $A$ become the columns of $A^T$.
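
A sketch of both operations in NumPy (only square, non-singular matrices have an inverse; `np.linalg.inv` raises an error otherwise; the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [2.0, 16.0]])              # an arbitrary invertible square matrix

A_inv = np.linalg.inv(A)                 # fails with LinAlgError if A is singular
print(np.allclose(A @ A_inv, np.eye(2))) # True: A A^{-1} = I
print(np.allclose(A_inv @ A, np.eye(2))) # True: A^{-1} A = I

print(A.T)                               # transpose: (A^T)_{ij} = A_{ji}
```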
