Lecture 3: A Deeper Look at Word Vectors
Negative Sampling
First, let's review the Skip-gram model:
$$
p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}
$$
The numerator is cheap to compute, but the denominator requires summing over the entire vocabulary. Is there a way to simplify this?
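To make the cost concrete, here is a minimal numpy sketch of the naive softmax above (the array names and shapes are assumptions, not from the lecture); the denominator touches every one of the V vocabulary words:

```python
import numpy as np

def skipgram_softmax(U, v_c, o):
    """Naive skip-gram softmax p(o | c). U: (V, d) outside vectors, v_c: (d,) center vector."""
    scores = U @ v_c                          # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()                    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()   # the sum runs over all V words
```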
We can use negative sampling. First, we make a small change to the objective:
$$
\begin{aligned}
J_t(\theta) &= \log \sigma(u_o^T v_c) + \sum_{j \sim P(w)} \left[ \log \sigma(-u_j^T v_c) \right] \\
\sigma(x) &= \frac{1}{1 + e^{-x}}
\end{aligned}
$$
To compute the second term of $J_t(\theta)$, we sample only a few words (usually fewer than 10) from the vocabulary according to $P(w) = U(w)^{3/4}/Z$, where $U(w)$ is the unigram distribution and $Z$ is a normalizer. The 3/4 power makes less frequent words be sampled more often.
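A minimal numpy sketch of this objective for one (center, outside) pair; the function names, the k=5 default, and the unigram_counts array are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(U, v_c, o, unigram_counts, k=5, rng=np.random.default_rng(0)):
    """J_t(theta) for one (center, outside) pair; maximized (minimize its negative in practice)."""
    # P(w) = U(w)^{3/4} / Z -- the 3/4 power up-weights rare words
    probs = unigram_counts ** 0.75
    probs = probs / probs.sum()
    neg = rng.choice(len(probs), size=k, p=probs)    # k sampled negative words
    obj = np.log(sigmoid(U[o] @ v_c))                # true outside word term
    obj += np.sum(np.log(sigmoid(-U[neg] @ v_c)))    # negative-sample terms
    return obj
```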
Word2vec captures cooccurrence of words one at a time. Could we capture cooccurrence counts directly?
Cooccurrence matrix X
We could use a cooccurrence matrix X.
- 2 options: windows v.s. full doc.
- Window: similar to word2vec -> it captures both syntactic and semantic info.
- Word-document co-occurrence matrix will give general topics. -> Latent Semantic Analysis.
Example:
- I like deep learning.
- I like NLP.
- I enjoy flying.
Window size: 1
counts | I | like | enjoy | deep | learning | NLP | flying | . |
---|---|---|---|---|---|---|---|---|
I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
. | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
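A minimal sketch that reproduces the table above (window size 1, counting left and right neighbors):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    tokens = sent.split()
    for i in range(len(tokens)):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[idx[tokens[i]], idx[tokens[j]]] += 1
# X now contains exactly the counts shown in the table
```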
Problems with cooccurrence matrix
- Increases in size with the vocabulary
- High dimensional
- Sparse
-> The resulting models are less robust.
How do we get low-dimensional vectors?
SVD.
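A minimal sketch of the SVD step, applied to the matrix X built above; keeping k = 2 dimensions is an arbitrary choice for illustration:

```python
import numpy as np

k = 2  # number of dimensions to keep
U_svd, s, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
word_vectors = U_svd[:, :k] * s[:k]   # one low-dimensional vector per row/word of X
```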
Hacks to X
- Problem: function words (the, he, has) are too frequent.
- min(X, t), with t~100
- Ignore them all
- Ramped windows that count closer words more
- Use Pearson correlations instead of counts, and set negative values to 0 (a couple of these hacks are sketched below).
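A short sketch of two of these hacks; the Pearson-correlation variant is one possible reading (correlating the rows of X), not a detail given in the notes:

```python
import numpy as np

# Cap counts at t ~ 100 so overly frequent function words don't dominate: min(X, t)
t = 100
X_capped = np.minimum(X, t)

# Pearson correlations between word co-occurrence profiles instead of raw counts,
# with negative values set to 0 (one interpretation of the hack above).
C = np.maximum(np.corrcoef(X.astype(float)), 0.0)
```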
Problems of SVD
- Computational cost scales badly to matrices with millions of words or documents.
- Hard to incorporate new words.
Count-based vs. direct-prediction methods:

Count based | Direct prediction |
---|---|
LSA, HAL, COALS | SG/CBOW, NNLM, RNN |
Fast training | Scales with corpus size |
Efficient usage of statistics | Inefficient usage of statistics |
Primarily used to capture word similarity | Generates improved performance on other tasks |
Disproportionate importance given to large counts | Can capture complex patterns beyond word similarity |
GloVe: Global Vectors model
$$
\begin{aligned}
J(\theta) &= \frac{1}{2} \sum_{i,j=1}^{W} f(P_{ij}) (u_i^T v_j - \log P_{ij})^2 \\
f(x) &= \min(2x, 1)
\end{aligned}
$$
Finally, each word's vector is taken as the sum of its two learned vectors: X = U + V.
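A minimal numpy sketch of the objective above; U and V hold one vector per word, P is the co-occurrence matrix, and all names, shapes, and the masking of zero counts are illustrative assumptions:

```python
import numpy as np

def glove_objective(U, V, P):
    """J(theta) = 1/2 * sum_ij f(P_ij) (u_i^T v_j - log P_ij)^2 over observed pairs."""
    mask = P > 0                                      # only pairs with nonzero counts contribute
    f = np.minimum(2 * P, 1.0)                        # weighting as written in these notes
    err = U @ V.T - np.log(np.where(mask, P, 1.0))    # u_i^T v_j - log P_ij
    return 0.5 * np.sum(mask * f * err ** 2)

# After training, the final word vectors are the sum X = U + V.
```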
How to evaluate word vectors
- Intrinsic
  - Evaluate on a specific subtask
  - Fast to compute
  - Helps to understand the system
  - Not clear whether it is really helpful unless a correlation with a real task is established
- Extrinsic
  - Evaluate on a real task
  - Can take a long time to compute accuracy
  - Unclear whether the problem is the subsystem itself, its interactions, or other subsystems
  - If replacing exactly one subsystem with another improves accuracy -> winning!
An example of intrinsic evaluation: word vector analogies.
a : b :: c : ?
man : woman :: king : ?
(The intended meaning: man is to woman as king is to queen.)
$$
d = \arg\max_i \frac{(x_b - x_a + x_c)^T x_i}{\lVert x_b - x_a + x_c \rVert}
$$
That is, we pick the word whose (unit-normalized) vector has the highest cosine similarity with x_b - x_a + x_c; see the sketch after the category list below.
Analogy categories in this evaluation include:
- city - in - state
- capital - world
- verb - past - tense
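A minimal sketch of this analogy evaluation, assuming word_vectors is a (V, d) array and idx maps each word to its row (both hypothetical names):

```python
import numpy as np

def analogy(word_vectors, idx, a, b, c):
    """a : b :: c : ?  -- return the word whose vector is most cosine-similar to x_b - x_a + x_c."""
    X = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)  # unit-normalize rows
    target = X[idx[b]] - X[idx[a]] + X[idx[c]]
    sims = X @ (target / np.linalg.norm(target))     # cosine similarity with every word
    for w in (a, b, c):                              # exclude the query words themselves
        sims[idx[w]] = -np.inf
    inv = {i: w for w, i in idx.items()}
    return inv[int(np.argmax(sims))]

# e.g. analogy(word_vectors, idx, "man", "woman", "king") should ideally return "queen"
```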