CS224N notes_chapter4_Word window classification and Neural Network

These notes discuss word-window classification in NLP and how neural networks are applied to it, covering the softmax classifier, the construction and training of a single-layer neural network, regularization to prevent overfitting, and updating word vectors for classification tasks.


Lecture 4: Word Window Classification and Neural Networks

Classification background

Notations
input: $x_i$, e.g. words, context windows, sentences, documents, etc.
output: $y_i$, e.g. labels such as sentiment, NER tags, or other words
$i = 1, 2, \ldots, N$
(For convenience, 'cls' is used below as shorthand for 'classification'.)
General cls method: assume x is fixed and train the logistic regression weights W, i.e. only the decision boundary is modified.
Goal: $p(y \mid x) = \frac{\exp(W_y x)}{\sum_{c=1}^{C} \exp(W_c x)}$
We try to minimize the negative log probability of the True class
$-\log p(y \mid x) = -\log \frac{\exp(W_y x)}{\sum_{c=1}^{C} \exp(W_c x)}$
This is the cross-entropy loss: because $p$ is one-hot, the only term left is the negative log probability of the true class.
$H(p, q) = -\sum_{c=1}^{C} p(c) \log q(c)$
Thus, our final loss function can be written as
$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^{C} \exp(f_c)}$
In practice, we usually add a regularization term to prevent the model from overfitting.
$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^{C} \exp(f_c)} + \lambda \sum_{k} \theta_k^2$
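To make this concrete, here is a minimal NumPy sketch of the regularized softmax cross-entropy loss above (the function name `softmax_ce_loss` and the array shapes are illustrative assumptions, not from the lecture):

```python
import numpy as np

def softmax_ce_loss(W, X, y, lam=1e-4):
    """Regularized softmax cross-entropy loss (illustrative sketch).

    W: (C, d) class weights, X: (N, d) inputs, y: (N,) true class indices.
    """
    scores = X @ W.T                               # f_c = W_c x for each example
    scores -= scores.max(axis=1, keepdims=True)    # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax: p(y|x)
    data_loss = -np.log(probs[np.arange(len(y)), y]).mean()
    reg_loss = lam * np.sum(W ** 2)                # lambda * sum_k theta_k^2
    return data_loss + reg_loss
```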

Updating word vectors for classification

Assume we have some pretrained word vectors and we want to use them in some new tasks.
In the training set we have 'TV' and 'telly', while the test set contains 'television'.
In the pretrained model they might be very similar, i.e. close to each other in the vector space. But after training, 'TV' and 'telly' may move a lot in the vector space while 'television' stays where it was, so the pretrained similarity structure is broken.
So if the dataset is small, we usually keep the word vectors fixed. Otherwise, fine-tuning (retraining) the word vectors may lead to better results.
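As a sketch of this fix-vs-fine-tune choice, assuming PyTorch is used (the `pretrained` matrix here is a hypothetical placeholder):

```python
import torch
import torch.nn as nn

# Hypothetical pretrained word vectors: vocab_size x d.
pretrained = torch.randn(10000, 300)

# Small dataset: keep the word vectors fixed during training.
emb_fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Large dataset: allow the vectors to be fine-tuned along with the task.
emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
```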

Window classification & cross entropy error derivation tips

Window cls:

  • Idea: classify a word in its context window of neighboring words.
  • For example, named entity recognition (NER) into 4 classes:
    • person, location, organization, none

Method:

  • Train a softmax classifier that assigns a label to the center word, taking the concatenation of all word vectors in its window as input
  • example:
    • … museums in Paris are amazing …
    • $x_{window} = [x_{museums}\; x_{in}\; x_{Paris}\; x_{are}\; x_{amazing}]$
    • $x_{window} \in \mathbb{R}^{5d}$

Then we could use
$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^{C} \exp(f_c)} + \lambda \sum_{k} \theta_k^2$
to update $\theta$.
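A minimal sketch of building the concatenated window vector, assuming toy $d$-dimensional vectors (the `vocab` dictionary and $d = 4$ are made up for illustration):

```python
import numpy as np

# Toy d-dimensional vectors for each word in the window (hypothetical values).
d = 4
rng = np.random.default_rng(0)
window = ["museums", "in", "Paris", "are", "amazing"]
vocab = {w: rng.standard_normal(d) for w in window}

# Concatenate the 5 window vectors into a single vector x_window in R^{5d}.
x_window = np.concatenate([vocab[w] for w in window])
print(x_window.shape)   # (20,) = 5d with d = 4
```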

A single layer neural network

Neuron:
$h_{w,b}(x) = f(w^T x + b), \qquad f(z) = \frac{1}{1 + e^{-z}}$
A neural network = running several logistic regressions at the same time
For the first layer, we have
$a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + W_{14} x_4 + b_1)$
$a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + W_{24} x_4 + b_2)$
In matrix notation for a layer
$\begin{aligned} z &= Wx + b \\ a &= f(z) \end{aligned}$
$f$ is the activation function, usually nonlinear. If $f$ were linear, a multilayer NN would collapse into a single linear transform and would gain no extra expressive power.
Take window classification as an example.
$z = Wx + b, \qquad a = f(z), \qquad s = U^T a$
x: the concatenated window vector, e.g. $x_{window} = [x_{museums}\; x_{in}\; x_{Paris}\; x_{are}\; x_{amazing}]$; we assume $x \in \mathbb{R}^{20 \times 1}$ (5 words, each of dimension $d = 4$)
W: parameters, $W \in \mathbb{R}^{8 \times 20}$
U: parameters, $U \in \mathbb{R}^{8 \times 1}$
s is the resulting score (a scalar)
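Putting these dimensions together, a NumPy sketch of the forward pass $z = Wx + b$, $a = f(z)$, $s = U^T a$ (random values stand in for real parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Dimensions from the text: x in R^20, W in R^{8x20}, U in R^8.
rng = np.random.default_rng(0)
x = rng.standard_normal(20)        # concatenated window vector
W = rng.standard_normal((8, 20))   # layer weights
b = rng.standard_normal(8)         # bias
U = rng.standard_normal(8)         # scoring vector

z = W @ x + b    # z = Wx + b
a = sigmoid(z)   # a = f(z), elementwise nonlinearity
s = U @ a        # s = U^T a, a single scalar score
```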

Max-Margin loss and backprop

max-margin loss
$s$ = score(museums in Paris are amazing)
$s_c$ = score(not all museums in Paris)
We want $s$ to be high and $s_c$ to be low. Thus, we try to minimize
$J = \max(0, 1 - s + s_c)$
We call this **max-margin** loss.
Each window with a location at its center should have a score +1 higher than any window without a location at its center.
For each true window, we use negative sampling to obtain a false window, and then sum over all training windows to get the final $J$.
$s = U^T f(Wx + b), \qquad s_c = U^T f(W x_c + b)$
Next, we will do Backpropagation.
If $1 - s + s_c \le 0$, i.e. $J = 0$, the model already handles this true/false pair with the required margin, so we needn't do backprop. Otherwise, we update the parameters. After training for a while, most training samples yield $J = 0$, so we do less computation than at the start.
$\frac{\partial s}{\partial U} = a = f(Wx + b)$
$\frac{\partial s}{\partial W} = \delta\, x^T$, where $\delta = U \circ f'(z)$
$f'(z) = f(z)(1 - f(z))$ for the sigmoid
And
$\frac{\partial s}{\partial W_{ij}} = U_i f'(z_i) x_j = \delta_i x_j, \qquad \frac{\partial s}{\partial b_i} = \delta_i$
For the word vectors:
$\begin{aligned} \frac{\partial s}{\partial x_j} &= \sum_{i=1}^{2} \frac{\partial s}{\partial a_i} \frac{\partial a_i}{\partial x_j} = \sum_{i=1}^{2} \frac{\partial U^T a}{\partial a_i} \frac{\partial f(W_{i\cdot} x + b_i)}{\partial x_j} \\ &= \sum_{i=1}^{2} U_i\, f'(W_{i\cdot} x + b_i)\, W_{ij} = \sum_{i=1}^{2} \delta_i W_{ij} = W_{\cdot j}^{T} \delta \end{aligned}$
Thus,
$\frac{\partial s}{\partial x} = W^T \delta$
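The max-margin loss and the gradients derived above can be put together in a short NumPy sketch (the function names `score_and_grads` / `max_margin_loss` and the sampled corrupted window are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_and_grads(x, W, b, U):
    """Score s = U^T f(Wx + b) and the gradients derived above (sketch)."""
    z = W @ x + b
    a = sigmoid(z)
    s = U @ a
    delta = U * a * (1.0 - a)        # delta_i = U_i f'(z_i); sigmoid: f' = f(1 - f)
    grads = {
        "U": a,                      # ds/dU = a
        "W": np.outer(delta, x),     # ds/dW_ij = delta_i x_j
        "b": delta,                  # ds/db_i = delta_i
        "x": W.T @ delta,            # ds/dx = W^T delta
    }
    return s, grads

def max_margin_loss(x_true, x_corrupt, W, b, U):
    """J = max(0, 1 - s + s_c) for a true window and a sampled false window."""
    s, _ = score_and_grads(x_true, W, b, U)
    s_c, _ = score_and_grads(x_corrupt, W, b, U)
    return max(0.0, 1.0 - s + s_c)   # zero loss (and zero gradient) when the margin holds
```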
