Lecture 2: word2vec
1. Word meaning
Meaning: the idea that is represented by a word, phrase, writing, art, etc.
How do we have usable meaning in a computer?
Common answer: a taxonomy like WordNet that has hypernym (is-a) relations and synonym sets.
Problems with a taxonomy:
- missing nuances: e.g., "proficient" fits an expert better than "good" does, but in the taxonomy they are simply listed as synonyms
- missing new words
- subjective
- requires human labor to create and adapt
- hard to compute accurate word similarity
Problems with discrete representations: a one-hot vector has one dimension per vocabulary word, e.g.
[0,0,0,...,1,...,0]
and one-hot vectors don't encode any relation/similarity between words (the dot product of any two different words' vectors is 0).
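A minimal numpy sketch of this point (the 5-word toy vocabulary is a made-up example, not from the lecture): every pair of distinct one-hot vectors has dot product 0, so one-hot encoding carries no similarity information.

```python
import numpy as np

# Toy vocabulary of 5 words; each word gets a one-hot vector of length |V| = 5.
vocab = ["hotel", "motel", "banking", "crisis", "turning"]
V = len(vocab)
one_hot = {w: np.eye(V)[i] for i, w in enumerate(vocab)}

# The dot product of any two *different* one-hot vectors is 0, so
# "hotel" and "motel" look exactly as unrelated as "hotel" and "crisis".
print(one_hot["hotel"] @ one_hot["motel"])   # 0.0
print(one_hot["hotel"] @ one_hot["crisis"])  # 0.0
print(one_hot["hotel"] @ one_hot["hotel"])   # 1.0
```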
Distributional similarity: you can get a lot of value by representing a word by means of its neighbors.
Next, we want to use vectors to represent words.
distributional: understand word meaning from its context.
distributed: use dense vectors to represent the meaning of words.
2. Word2vec intro
Basic idea of learning Neural Network word embeddings
We define a model that predicts the context words given a center word $w_t$, in terms of the word vectors:
$$p(\text{context} \mid w_t)$$
which has a loss function like
$$J = 1 - p(w_{-t} \mid w_t)$$
where $w_{-t}$ means the neighbors of $w_t$, i.e., the surrounding context words excluding $w_t$ itself.
Main idea of word2vec: Predict between every word and its context words.
Two algorithms:
- Skip-grams (SG): predict context words given the target word (position independent)
… turning into banking crises as …
banking: center word
turning: $p(w_{t-2} \mid w_t)$
For each word $t = 1, \ldots, T$, we predict the surrounding words in a window of "radius" $m$ around it:
$$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \neq 0} P(w_{t+j} \mid w_t; \theta)$$
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \neq 0} \log P(w_{t+j} \mid w_t; \theta)$$
hyperparameter: window size m
we use $p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$, where $c$ indexes the center word $w_t$, $o$ indexes the outside (context) word $w_{t+j}$, and $v$ and $u$ are the "center" and "outside" vector representations of each word.
The dot product $u_o^T v_c$ is larger when the two words are more similar, and the softmax maps the scores to a probability distribution (see the numeric sketch after this list).
- Continuous Bag of Words (CBOW): predict the target word from the bag-of-words context.
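To make the softmax above concrete, here is a minimal numpy sketch (the toy vocabulary size, dimension, and random vectors are illustrative assumptions): given the center vector $v_c$ and an outside vector $u_w$ for every word, $p(o \mid c)$ is the softmax of the dot products $u_w^T v_c$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                    # assumed toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))    # "outside" vectors u_w, one row per vocabulary word
v_c = rng.normal(size=d)       # "center" vector v_c of the center word

scores = U @ v_c                               # u_w^T v_c for every word w
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

o = 2                          # index of one particular outside word
print(probs[o])                # p(o | c)
print(probs.sum())             # 1.0, a valid probability distribution
```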
3. Research highlight
(omitted)
4. Word2vec objective function gradients
All parameters of the model:
$$\theta = \begin{bmatrix} v_a \\ \vdots \\ v_{zebra} \\ u_a \\ \vdots \\ u_{zebra} \end{bmatrix}$$
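Since every word has a $d$-dimensional center vector $v$ and a $d$-dimensional outside vector $u$, $\theta$ has $2dV$ entries. A minimal sketch of flattening the two matrices into one parameter vector (the shapes are assumptions for illustration):

```python
import numpy as np

V, d = 10_000, 100                     # assumed vocabulary size and dimension
V_vecs = np.random.randn(V, d) * 0.01  # center vectors v_a, ..., v_zebra
U_vecs = np.random.randn(V, d) * 0.01  # outside vectors u_a, ..., u_zebra

theta = np.concatenate([V_vecs.ravel(), U_vecs.ravel()])
print(theta.shape)                     # (2000000,) == 2 * d * V parameters
```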
We optimize these parameters by training the model, using gradient descent.
$$\begin{aligned}
&\frac{\partial}{\partial v_c}\left[\log \exp(u_o^T v_c) - \log \sum_{w=1}^{V} \exp(u_w^T v_c)\right] \\
=&\; u_o - \frac{\sum_{x=1}^{V} u_x \exp(u_x^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \\
=&\; u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x
\end{aligned}$$
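The last line says the gradient is the observed outside vector minus the expectation of the outside vectors under the model. A quick numerical check of that identity (toy shapes and random vectors; an illustration, not the course's code):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, o = 6, 4, 2            # toy sizes; o = index of the observed outside word
U = rng.normal(size=(V, d))  # outside vectors u_w
v_c = rng.normal(size=d)     # center vector v_c

def log_p(v):
    """log p(o | c) = u_o^T v - log sum_w exp(u_w^T v)."""
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient from the derivation: u_o - sum_x p(x|c) u_x
probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
grad_analytic = U[o] - probs @ U

# Numerical gradient via central differences
eps = 1e-6
grad_numeric = np.array([
    (log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```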
5. Optimization refresher
We compute the gradient at the current point and then take a step along the negative gradient.
$$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta)$$
$\alpha$: step size.
In matrix notation for all parameters:
$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$$
Stochastic Gradient Descent:
- a global update over the whole corpus takes too much time per parameter update
- updating from a single window or a mini-batch is much cheaper and also works well (see the sketch below)
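A minimal sketch of the stochastic / mini-batch update (`sample_windows` and `skipgram_grad` are hypothetical placeholders, not the course's implementation): instead of computing the gradient over the whole corpus, update $\theta$ from one window or a small batch of windows at a time.

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, alpha=0.025):
    """One stochastic step: theta_new = theta_old - alpha * grad J(theta; batch)."""
    return theta - alpha * grad_fn(theta, batch)

# Hypothetical usage; `sample_windows` and `skipgram_grad` stand in for
# corpus sampling and the gradient of the skip-gram loss on one mini-batch:
#
# theta = np.random.randn(2 * d * V) * 0.01
# for step in range(num_steps):
#     batch = sample_windows(corpus, batch_size=32)
#     theta = sgd_step(theta, skipgram_grad, batch)
```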