Lecture 3: A Deeper Look at Word Vectors
Negative Sampling
First, let's review the Skip-gram model:
$$
p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}
$$
The numerator is cheap to compute, but the denominator requires summing over the entire vocabulary. Is there a way to simplify this?
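To make the cost concrete, here is a minimal numpy sketch of the naive softmax above (the array names and shapes are assumptions, not from the lecture); the denominator touches every one of the V vocabulary words:

```python
import numpy as np

def skipgram_softmax(U, v_c, o):
    """Naive skip-gram softmax p(o | c). U: (V, d) outside vectors, v_c: (d,) center vector."""
    scores = U @ v_c                          # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()                    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()   # the sum runs over all V words
```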
We can use negative sampling. First, we make a small change to the objective:
$$
\begin{aligned}
J_t(\theta) &= \log \sigma(u_o^T v_c) + \sum_{j \sim P(w)} \left[ \log \sigma(-u_j^T v_c) \right] \\
\sigma(x) &= \frac{1}{1 + e^{-x}}
\end{aligned}
$$
To compute the second term of $J_t(\theta)$, we sample only a few words (usually fewer than 10) from the vocabulary according to $P(w) = U(w)^{3/4}/Z$, where $U(w)$ is the unigram distribution and $Z$ is a normalizer. The 3/4 power makes less frequent words be sampled more often.
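A minimal numpy sketch of this objective for one (center, outside) pair; the function names, the k=5 default, and the unigram_counts array are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(U, v_c, o, unigram_counts, k=5, rng=np.random.default_rng(0)):
    """J_t(theta) for one (center, outside) pair; maximized (minimize its negative in practice)."""
    # P(w) = U(w)^{3/4} / Z -- the 3/4 power up-weights rare words
    probs = unigram_counts ** 0.75
    probs = probs / probs.sum()
    neg = rng.choice(len(probs), size=k, p=probs)    # k sampled negative words
    obj = np.log(sigmoid(U[o] @ v_c))                # true outside word term
    obj += np.sum(np.log(sigmoid(-U[neg] @ v_c)))    # negative-sample terms
    return obj
```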
Word2vec captures cooccurrence of words one at a time. Could we capture cooccurrence counts directly?
Cooccurrence matrix X
We could use a cooccurrence matrix X.
- 2 options: windows v.s. full doc.
- Window: similar to word2vec -> it captures both syntactic and semantic info.
- Word-document co-occurrence matrix will give general topics. -> Latent Semantic Analysis.
Example:
- I like deep learning.
- I like NLP.
- I enjoy flying.
Window size: 1
counts | I | like | enjoy | deep | learning | NLP | flying | . |
---|---|---|---|---|---|---|---|---|
I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
. | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
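A minimal sketch that reproduces the table above (window size 1, counting left and right neighbors):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    tokens = sent.split()
    for i in range(len(tokens)):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[idx[tokens[i]], idx[tokens[j]]] += 1
# X now contains exactly the counts shown in the table
```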
Problems with cooccurrence matrix
- Increases in size with the vocabulary
- High dimensional
- Sparse
-> The resulting models are less robust.
How do we get low-dimensional vectors?
SVD.
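A minimal sketch of the SVD step, applied to the matrix X built above; keeping k = 2 dimensions is an arbitrary choice for illustration:

```python
import numpy as np

k = 2  # number of dimensions to keep
U_svd, s, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
word_vectors = U_svd[:, :k] * s[:k]   # one low-dimensional vector per row/word of X
```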
Hacks to X
- Problem: function words (the, he, has) are too frequent.
- min(X, t), with t~100
- Ignore them all
- Ramped windows that count closer words more
- Use Pearson correlations instead of counts, and set negative values to 0 (a couple of these hacks are sketched below).
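A short sketch of two of these hacks; the Pearson-correlation variant is one possible reading (correlating the rows of X), not a detail given in the notes:

```python
import numpy as np

# Cap counts at t ~ 100 so overly frequent function words don't dominate: min(X, t)
t = 100
X_capped = np.minimum(X, t)

# Pearson correlations between word co-occurrence profiles instead of raw counts,
# with negative values set to 0 (one interpretation of the hack above).
C = np.maximum(np.corrcoef(X.astype(float)), 0.0)
```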
Problems of SVD
- Computational cost scales badly to matrices with millions of words or documents.
- Hard to incorporate new words.
Count-based vs. direct-prediction methods:

Count based | Direct prediction |
---|---|
LSA, HAL, COALS | SG/CBOW, NNLM, RNN |
Fast training | Scales with corpus size |
Efficient usage of statistics | Inefficient usage of statistics |
Primarily used to capture word similarity | Generates improved performance on other tasks |
Disproportionate importance given to large counts | Can capture complex patterns beyond word similarity |
GloVe: Global Vectors model
$$
\begin{aligned}
J(\theta) &= \frac{1}{2} \sum_{i,j=1}^{W} f(P_{ij}) (u_i^T v_j - \log P_{ij})^2 \\
f(x) &= \min(2x, 1)
\end{aligned}
$$
Finally, each word's vector is taken as the sum of its two learned vectors: X = U + V.
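A minimal numpy sketch of the objective above; U and V hold one vector per word, P is the co-occurrence matrix, and all names, shapes, and the masking of zero counts are illustrative assumptions:

```python
import numpy as np

def glove_objective(U, V, P):
    """J(theta) = 1/2 * sum_ij f(P_ij) (u_i^T v_j - log P_ij)^2 over observed pairs."""
    mask = P > 0                                      # only pairs with nonzero counts contribute
    f = np.minimum(2 * P, 1.0)                        # weighting as written in these notes
    err = U @ V.T - np.log(np.where(mask, P, 1.0))    # u_i^T v_j - log P_ij
    return 0.5 * np.sum(mask * f * err ** 2)

# After training, the final word vectors are the sum X = U + V.
```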
How to evaluate word vectors
- Intrinsic
  - Evaluate on a specific subtask
  - Fast to compute
  - Helps to understand the system
  - Not clear whether it is really helpful unless a correlation with a real task is established
- Extrinsic
  - Evaluate on a real task
  - Can take a long time to compute accuracy
  - Unclear whether the problem is the subsystem itself, its interactions, or other subsystems
  - If replacing exactly one subsystem with another improves accuracy -> winning!
An example of intrinsic evaluation: word vector analogies.
a : b :: c : ?
man : woman :: king : ?
(The intended meaning: man is to woman as king is to queen.)
$$
d = \arg\max_i \frac{(x_b - x_a + x_c)^T x_i}{\lVert x_b - x_a + x_c \rVert}
$$
That is, we pick the word whose (unit-normalized) vector has the highest cosine similarity with x_b - x_a + x_c; see the sketch after the category list below.
Analogy categories in this evaluation include:
- city - in - state
- capital - world
- verb - past - tense
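A minimal sketch of this analogy evaluation, assuming word_vectors is a (V, d) array and idx maps each word to its row (both hypothetical names):

```python
import numpy as np

def analogy(word_vectors, idx, a, b, c):
    """a : b :: c : ?  -- return the word whose vector is most cosine-similar to x_b - x_a + x_c."""
    X = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)  # unit-normalize rows
    target = X[idx[b]] - X[idx[a]] + X[idx[c]]
    sims = X @ (target / np.linalg.norm(target))     # cosine similarity with every word
    for w in (a, b, c):                              # exclude the query words themselves
        sims[idx[w]] = -np.inf
    inv = {i: w for w, i in idx.items()}
    return inv[int(np.argmax(sims))]

# e.g. analogy(word_vectors, idx, "man", "woman", "king") should ideally return "queen"
```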