CMU 11-785 L18 Representation

This post examines the principles of logistic regression: how, as a perceptron with a sigmoid activation, it computes the probability that an input belongs to class 1, and how its parameters are estimated by maximum likelihood. It then discusses what the layers of a multilayer perceptron do, particularly in non-linear classification tasks.


Logistic regression


  • This is the perceptron with a sigmoid activation
    • It actually computes the probability that the input belongs to class 1
    • Decision boundaries may be obtained by comparing the probability to a threshold
    • These boundaries will be lines (hyperplanes in higher dimensions)
    • The sigmoid perceptron is a linear classifier
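Concretely, comparing the probability to a threshold of 0.5 recovers the linear boundary:

$$
\frac{1}{1 + e^{-(w_0 + w^T X)}} \geq \frac{1}{2} \iff w_0 + w^T X \geq 0
$$

so the decision boundary is the hyperplane $w_0 + w^T X = 0$.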

Estimating the model

  • Given: Training data $(X_1, y_1), (X_2, y_2), \ldots, (X_N, y_N)$
  • $X_i$ are vectors, $y_i$ are binary (0/1) class values; the compact likelihood below is written with the labels recoded as $y_i \in \{-1, +1\}$
  • Total probability of data

$$
P\left((X_1, y_1), (X_2, y_2), \ldots, (X_N, y_N)\right) = \prod_i P(X_i, y_i) = \prod_i P(y_i \mid X_i)\, P(X_i) = \prod_i \frac{1}{1 + e^{-y_i (w_0 + w^T X_i)}}\, P(X_i)
$$

  • Likelihood

$$
P(\text{Training data}) = \prod_i \frac{1}{1 + e^{-y_i (w_0 + w^T X_i)}}\, P(X_i)
$$

  • Log likelihood

$$
\log P(\text{Training data}) = \sum_i \log P(X_i) - \sum_i \log\left(1 + e^{-y_i (w_0 + w^T X_i)}\right)
$$

  • Maximum Likelihood Estimate

$$
w_0, w = \underset{w_0, w}{\operatorname{argmax}} \, \log P(\text{Training data})
$$

  • Equivalently (note the argmin rather than argmax; the $\sum_i \log P(X_i)$ term does not depend on the weights and can be dropped)

$$
w_0, w = \underset{w_0, w}{\operatorname{argmin}} \sum_i \log\left(1 + e^{-y_i (w_0 + w^T X_i)}\right)
$$

  • Identical to minimizing the KL divergence between the desired output and the actual output $\frac{1}{1 + e^{-(w_0 + w^T X_i)}}$
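As a concrete illustration, here is a minimal NumPy sketch of this MLE by batch gradient descent on the loss above; the labels are assumed recoded to $y_i \in \{-1, +1\}$, and the learning rate, step count, and toy data are arbitrary choices, not from the lecture.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Minimize sum_i log(1 + exp(-y_i * (w0 + w.x_i))), with y_i in {-1, +1}."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(steps):
        z = y * (w0 + X @ w)           # margins y_i (w0 + w^T x_i)
        g = -y / (1.0 + np.exp(z))     # per-sample dLoss/d(w0 + w^T x_i)
        w -= lr * (X.T @ g) / N        # average gradient step on the weights
        w0 -= lr * g.mean()            # average gradient step on the bias
    return w0, w

# toy usage: two Gaussian blobs labeled -1 and +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w0, w = fit_logistic(X, y)
```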

MLP

Separable case


  • The rest of the network may be viewed as a transformation that maps data from non-linearly separable classes to linearly separable features
    • We can then attach any linear classifier on top of it for perfect classification
    • It need not be a perceptron
    • We could even train an SVM on top of the features (see the sketch after this list)!
  • If the network structure is insufficient, it will still attempt to transform the inputs to linearly separable features
    • It will fail to separate the classes exactly, but will try to minimize the error
  • The network up to the second-to-last layer is a non-linear function $f(X)$ that maps the input space $X$ into the feature space where the classes are maximally linearly separable
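To make the "SVM on top of the features" point concrete, here is a hedged sketch (PyTorch plus scikit-learn; the architecture, toy data, and training schedule are illustrative choices, not the lecture's): train a small MLP end to end, discard its final linear layer, and fit a linear SVM on the penultimate-layer features $f(X)$.

```python
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

# illustrative network: feature_net is everything up to the second-to-last
# layer, i.e. the non-linear feature transform f(X); head is the linear classifier
feature_net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 8), nn.Tanh())
head = nn.Linear(8, 1)
model = nn.Sequential(feature_net, head)

X = torch.randn(200, 2)                    # stand-in inputs
y = (X[:, 0] * X[:, 1] > 0).float()        # a non-linearly-separable toy labeling
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(500):                       # train the whole net end to end
    opt.zero_grad()
    loss_fn(model(X).squeeze(1), y).backward()
    opt.step()

# now discard the head: any linear classifier can sit on the learned features
with torch.no_grad():
    feats = feature_net(X).numpy()
svm = LinearSVC().fit(feats, y.numpy())    # an SVM trained on top of f(X)
```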

Lower layers


  • Manifold hypothesis: For separable classes, the classes are linearly separable on a non-linear manifold
  • Layers sequentially “straighten” the data manifold
  • The “feature extraction” layer transforms the data such that the posterior probability may now be modelled by a logistic

Weight as a template


  • In high dimensional space, all vectors are more or less the same length
    • Which means all inputs $x$ lie (approximately) on the surface of a sphere
  • The perceptron fires if the input is within a specified angle of the weight
    • Represents a convex region on the surface of the sphere!
    • The network is a Boolean function over these regions
  • Neuron fires if the input vector is close enough to the weight vector
    • If the input pattern matches the weight pattern closely enough
  • The perceptron is a correlation filter!
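A tiny NumPy illustration of this template-matching view (the cosine threshold is an arbitrary illustrative value): the unit "fires" exactly when the input is within a fixed angle of the weight vector.

```python
import numpy as np

def fires(x, w, cos_threshold=0.9):
    """Fire iff x is within a fixed angle of the template w.

    For roughly unit-length inputs, w . x is proportional to cos(angle),
    so thresholding the pre-activation thresholds the angle to the template.
    """
    cos = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
    return cos >= cos_threshold

w = np.array([1.0, 2.0, 0.5])                    # the weight vector is the template
x_match = w + 0.05 * np.random.default_rng(1).normal(size=3)  # near-copy of template
x_other = np.array([-2.0, 1.0, 0.0])             # orthogonal to w: no correlation
print(fires(x_match, w), fires(x_other, w))      # True False
```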

Autoencoder


  • The lowest layers of a network detect significant features in the signal
  • The signal could be (partially) reconstructed using these features
    • Will retain all the significant components of the signal

Simplest autoencoder


  • With a single linear hidden unit, this is just PCA!
  • The autoencoder finds the direction of maximum energy
  • Simply varying the hidden representation will result in an output that lies along the major axis
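A NumPy sketch checking this claim under illustrative assumptions (toy 2-D data, a single linear hidden unit, arbitrary step size and step count): gradient descent on the reconstruction error drives the decoder direction toward the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)        # rotation matrix
X = (rng.normal(size=(500, 2)) * np.array([3.0, 0.5])) @ C  # energy along (1, 1)

v = rng.normal(size=2)                  # encoder weights: h = v . x
w = rng.normal(size=2)                  # decoder weights: x_hat = h * w
lr = 1e-3
for _ in range(5000):
    h = X @ v                           # hidden representation (one linear unit)
    R = X - np.outer(h, w)              # reconstruction residuals
    gw = -2 * (R.T @ h) / len(X)        # gradient of mean ||R||^2 w.r.t. w
    gv = -2 * (X.T @ (R @ w)) / len(X)  # ... and w.r.t. v
    w -= lr * gw
    v -= lr * gv

# the decoder direction should (approximately) match the top eigenvector
evals, evecs = np.linalg.eigh(np.cov(X.T))
print(w / np.linalg.norm(w), evecs[:, -1])   # same direction, up to sign
```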

Terminology

  • Encoder
    • The “Analysis” net which computes the hidden representation
  • Decoder
    • The “Synthesis” which recomposes the data from the hidden representation

Nonlinearity


  • When the hidden layer has a linear activation, the decoder represents the best linear manifold to fit the data
    • Varying the hidden value will move along this linear manifold
  • When the hidden layer has non-linear activation, the net performs nonlinear PCA
    • The decoder represents the best non-linear manifold to fit the data
    • Varying the hidden value will move along this non-linear manifold
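A minimal PyTorch sketch of the nonlinear case (architecture, toy data, and training schedule are all illustrative): an autoencoder with a 1-D nonlinear bottleneck fit to data lying on a curve; sweeping the hidden value afterwards traces out the learned non-linear manifold.

```python
import torch
import torch.nn as nn

# toy data on a 1-D curved manifold embedded in 2-D: (t, sin t) plus noise
t = torch.linspace(-3.0, 3.0, 400).unsqueeze(1)
X = torch.cat([t, torch.sin(t)], dim=1) + 0.05 * torch.randn(400, 2)

encoder = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
decoder = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = ((decoder(encoder(X)) - X) ** 2).mean()   # reconstruction error
    loss.backward()
    opt.step()

# varying the hidden value moves along the learned non-linear manifold
with torch.no_grad():
    h = torch.linspace(-2.0, 2.0, 100).unsqueeze(1)
    curve = decoder(h)    # a sweep of points tracing the learned curve
```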


  • The model is specific to the training data
    • Varying the hidden layer value only generates data along the learned manifold
    • Any input will result in an output along the learned manifold
    • But may not generalize beyond the manifold
      • Unseen inputs may behave in unintuitive ways; there is no constraint off the manifold!
      • The decoder can only generate data on the manifold that the training data lie on
  • This also makes it an excellent “generator” of the distribution of the training data
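Reusing the `encoder`/`decoder` trained in the sketch above, two short probes of these points: decoding sampled hidden values generates typical data, and even an off-manifold input is reconstructed onto the decoder's learned manifold.

```python
with torch.no_grad():
    # "generator": sample hidden values, decode -> data along the learned manifold
    samples = decoder(torch.randn(10, 1))

    # off-manifold probe: the output lies on the decoder's 1-D image (the
    # learned manifold), not near the input itself; nothing constrains it there
    probe = torch.tensor([[0.0, 3.0]])
    print(decoder(encoder(probe)))
```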

Dictionary-based techniques

  • The decoder represents a source-specific generative dictionary
    • Exciting it will produce typical data from the source!

Signal separation


  • Separation: Identify the combination of entries from both dictionaries that compose the mixed signal


  • Given mixed signal and source dictionaries, find excitation that best recreates mixed signal
    • Simple backpropagation
  • Intermediate results are separated signals
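A hedged sketch of this procedure (PyTorch; both "dictionaries" here are stand-in frozen decoders, and every name and size is illustrative): freeze the two source decoders, make only their excitations trainable, and backpropagate the mixed-signal reconstruction error into the excitations. The individual decoder outputs are then the separated signals.

```python
import torch
import torch.nn as nn

# stand-ins for two pretrained source dictionaries (decoders); in practice these
# would be the decoder halves of autoencoders trained on clean data of each source
decoder_a = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 256))
decoder_b = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 256))
for p in list(decoder_a.parameters()) + list(decoder_b.parameters()):
    p.requires_grad_(False)                  # the dictionaries stay frozen

mixed = torch.randn(1, 256)                  # stand-in for the observed mixed signal

# only the excitations (hidden inputs to each decoder) are optimized
h_a = torch.zeros(1, 8, requires_grad=True)
h_b = torch.zeros(1, 8, requires_grad=True)
opt = torch.optim.Adam([h_a, h_b], lr=1e-2)

for _ in range(1000):                        # "simple backpropagation"
    opt.zero_grad()
    recon = decoder_a(h_a) + decoder_b(h_b)  # assumes additive mixing
    loss = ((recon - mixed) ** 2).mean()
    loss.backward()
    opt.step()

separated_a = decoder_a(h_a).detach()        # intermediate results are the
separated_b = decoder_b(h_b).detach()        # separated source signals
```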