Lecture 2: word2vec
1. Word meaning
Meaning: the idea that is represented by a word, phrase, writing, art, etc.
How do we have usable meaning in a computer?
Common answer: a taxonomy like WordNet that has hypernym (is-a) relations and synonym sets.
Problems with a taxonomy:
- missing nuances: e.g., "proficient" fits an expert better than "good" does, but in the taxonomy they are simply listed as synonyms
- missing new words
- subjective
- requires human labor to create and adapt
- hard to compute accurate word similarity
Problems with discrete representations: a one-hot vector has one dimension per vocabulary word, e.g.
[0,0,0,...,1,...,0]
and one-hot vectors don't encode any relation/similarity between words (the dot product of any two different words' vectors is 0).
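A minimal numpy sketch of this point (the 5-word toy vocabulary is a made-up example, not from the lecture): every pair of distinct one-hot vectors has dot product 0, so one-hot encoding carries no similarity information.

```python
import numpy as np

# Toy vocabulary of 5 words; each word gets a one-hot vector of length |V| = 5.
vocab = ["hotel", "motel", "banking", "crisis", "turning"]
V = len(vocab)
one_hot = {w: np.eye(V)[i] for i, w in enumerate(vocab)}

# The dot product of any two *different* one-hot vectors is 0, so
# "hotel" and "motel" look exactly as unrelated as "hotel" and "crisis".
print(one_hot["hotel"] @ one_hot["motel"])   # 0.0
print(one_hot["hotel"] @ one_hot["crisis"])  # 0.0
print(one_hot["hotel"] @ one_hot["hotel"])   # 1.0
```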
Distributional similarity: you can get a lot of value by representing a word by means of its neighbors.
Next, we want to use vectors to represent words.
distributional: understand word meaning from its context.
distributed: use dense vectors to represent the meaning of words.
2. Word2vec intro
Basic idea of learning Neural Network word embeddings
We define a model that predicts the context words given a center word $w_t$, in terms of the word vectors:
$$p(\text{context} \mid w_t)$$
which has a loss function like
$$J = 1 - p(w_{-t} \mid w_t)$$
where $w_{-t}$ means the neighbors of $w_t$, i.e., the surrounding context words excluding $w_t$ itself.
Main idea of word2vec: Predict between every word and its context words.
Two algorithms:
- Skip-grams (SG): predict context words given the target word (position independent)
… turning into banking crises as …
banking: center word
turning: $p(w_{t-2} \mid w_t)$
For each word $t = 1, \ldots, T$, we predict the surrounding words in a window of "radius" $m$ around it:
$$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \neq 0} P(w_{t+j} \mid w_t; \theta)$$
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \neq 0} \log P(w_{t+j} \mid w_t; \theta)$$
hyperparameter: window size m
we use $p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$, where $c$ indexes the center word $w_t$, $o$ indexes the outside (context) word $w_{t+j}$, and $v$ and $u$ are the "center" and "outside" vector representations of each word.
The dot product $u_o^T v_c$ is larger when the two words are more similar, and the softmax maps the scores to a probability distribution (see the numeric sketch after this list).
- Continuous Bag of Words (CBOW): predict the target word from the bag-of-words context.
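To make the softmax above concrete, here is a minimal numpy sketch (the toy vocabulary size, dimension, and random vectors are illustrative assumptions): given the center vector $v_c$ and an outside vector $u_w$ for every word, $p(o \mid c)$ is the softmax of the dot products $u_w^T v_c$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                    # assumed toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))    # "outside" vectors u_w, one row per vocabulary word
v_c = rng.normal(size=d)       # "center" vector v_c of the center word

scores = U @ v_c                               # u_w^T v_c for every word w
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

o = 2                          # index of one particular outside word
print(probs[o])                # p(o | c)
print(probs.sum())             # 1.0, a valid probability distribution
```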
3. Research highlight
(omitted)
4. Word2vec objective function gradients
All parameters of the model:
$$\theta = \begin{bmatrix} v_a \\ \vdots \\ v_{zebra} \\ u_a \\ \vdots \\ u_{zebra} \end{bmatrix}$$
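Since every word has a $d$-dimensional center vector $v$ and a $d$-dimensional outside vector $u$, $\theta$ has $2dV$ entries. A minimal sketch of flattening the two matrices into one parameter vector (the shapes are assumptions for illustration):

```python
import numpy as np

V, d = 10_000, 100                     # assumed vocabulary size and dimension
V_vecs = np.random.randn(V, d) * 0.01  # center vectors v_a, ..., v_zebra
U_vecs = np.random.randn(V, d) * 0.01  # outside vectors u_a, ..., u_zebra

theta = np.concatenate([V_vecs.ravel(), U_vecs.ravel()])
print(theta.shape)                     # (2000000,) == 2 * d * V parameters
```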
We optimize these parameters by training the model, using gradient descent.
$$\begin{aligned}
&\frac{\partial}{\partial v_c}\left[\log \exp(u_o^T v_c) - \log \sum_{w=1}^{V} \exp(u_w^T v_c)\right] \\
=&\; u_o - \frac{\sum_{x=1}^{V} u_x \exp(u_x^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \\
=&\; u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x
\end{aligned}$$
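The last line says the gradient is the observed outside vector minus the expectation of the outside vectors under the model. A quick numerical check of that identity (toy shapes and random vectors; an illustration, not the course's code):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, o = 6, 4, 2            # toy sizes; o = index of the observed outside word
U = rng.normal(size=(V, d))  # outside vectors u_w
v_c = rng.normal(size=d)     # center vector v_c

def log_p(v):
    """log p(o | c) = u_o^T v - log sum_w exp(u_w^T v)."""
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient from the derivation: u_o - sum_x p(x|c) u_x
probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
grad_analytic = U[o] - probs @ U

# Numerical gradient via central differences
eps = 1e-6
grad_numeric = np.array([
    (log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```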
5. Optimization refresher
We compute the gradient at the current point and then take a step along the negative gradient.
$$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta)$$
$\alpha$: step size.
In matrix notation for all parameters:
$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$$
Stochastic Gradient Descent:
- a global update over the whole corpus takes too much time per parameter update
- updating from a single window or a mini-batch is much cheaper and also works well (see the sketch below)
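A minimal sketch of the stochastic / mini-batch update (`sample_windows` and `skipgram_grad` are hypothetical placeholders, not the course's implementation): instead of computing the gradient over the whole corpus, update $\theta$ from one window or a small batch of windows at a time.

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, alpha=0.025):
    """One stochastic step: theta_new = theta_old - alpha * grad J(theta; batch)."""
    return theta - alpha * grad_fn(theta, batch)

# Hypothetical usage; `sample_windows` and `skipgram_grad` stand in for
# corpus sampling and the gradient of the skip-gram loss on one mini-batch:
#
# theta = np.random.randn(2 * d * V) * 0.01
# for step in range(num_steps):
#     batch = sample_windows(corpus, batch_size=32)
#     theta = sgd_step(theta, skipgram_grad, batch)
```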