CS224N notes_chapter3_Deeper Look at Word Vectors

This post takes a deeper look at word vectors: negative sampling for the Skip-gram model and how it simplifies the computation; the co-occurrence matrix X, built over windows or full documents, and the problems that arise when reducing its dimensionality with SVD; the GloVe (Global Vectors) model, its objective, and its advantages; and finally how to evaluate word vectors, split into intrinsic and extrinsic evaluation, with examples.


Lecture 3: A Deeper Look at Word Vectors

Negative Sampling

First, let's review the Skip-gram model:
$$p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}$$
The numerator is cheap to compute, but the denominator requires a sum over the entire vocabulary. Is there a way to simplify this?
We can use negative sampling. First, we make a small change to the objective function:
$$\begin{aligned} J_t(\theta) &= \log \sigma(u_o^T v_c) + \sum_{j \sim P(w)} \left[\log \sigma(-u_j^T v_c)\right] \\ \sigma(x) &= \frac{1}{1 + e^{-x}} \end{aligned}$$
We only sample a few words (usually fewer than 10) from the dictionary according to their unigram frequency $U(w)$ to compute the second term of $J_t(\theta)$, using $P(w) = U(w)^{3/4} / Z$. The 3/4 power makes less frequent words be sampled more often.
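A minimal NumPy sketch of this objective (the names `sigmoid`, `unigram_sampler`, `neg_sampling_objective`, `u_o`, `v_c`, and `U_neg` are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unigram_sampler(counts, k, rng=np.random.default_rng(0)):
    """Draw k negative word ids from P(w) = U(w)^{3/4} / Z."""
    p = counts.astype(float) ** 0.75
    p /= p.sum()                                   # Z is the normalizing constant
    return rng.choice(len(counts), size=k, p=p)

def neg_sampling_objective(u_o, v_c, U_neg):
    """J_t(theta) for a single (center, outside) word pair.

    u_o   : vector of the observed outside word
    v_c   : vector of the center word
    U_neg : matrix whose rows are vectors of the k sampled negative words
    """
    pos = np.log(sigmoid(u_o @ v_c))               # true pair: pushed together
    neg = np.sum(np.log(sigmoid(-(U_neg @ v_c))))  # sampled pairs: pushed apart
    return pos + neg                               # maximized during training
```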
Word2vec captures cooccurrence of words one at a time. Could we capture cooccurrence counts directly?

Cooccurrence matrix X

We can use a co-occurrence matrix X:

  • Two options: window-based counts vs. full-document counts.
  • Window: similar to word2vec -> captures both syntactic and semantic info.
  • A word-document co-occurrence matrix gives general topics -> Latent Semantic Analysis.

Example:

  • I like deep learning.
  • I like NLP.
  • I enjoy flying.

Window size: 1

counts   | I  like  enjoy  deep  learning  NLP  flying  .
I        | 0  2     1      0     0         0    0       0
like     | 2  0     0      1     0         1    0       0
enjoy    | 1  0     0      0     0         0    1       0
deep     | 0  1     0      0     1         0    0       0
learning | 0  0     0      1     0         0    0       1
NLP      | 0  1     0      0     0         0    0       1
flying   | 0  0     1      0     0         0    0       1
.        | 0  0     0      0     1         1    1       0
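As a sanity check, the matrix above can be reproduced with a few lines of Python (the variable names are illustrative; '.' is treated as its own token):

```python
corpus = ["I like deep learning .",
          "I like NLP .",
          "I enjoy flying ."]
window = 1

vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = [[0] * len(vocab) for _ in vocab]          # co-occurrence counts

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:                         # every neighbor within the window
                X[idx[w]][idx[words[j]]] += 1
```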

Problems with cooccurrence matrix

  • Increase in size with vocabulary
  • high dimensional
  • sparsity

-> less robust.

How do we get low-dimensional vectors?
SVD.
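For example, with the matrix X from above, a truncated SVD gives low-dimensional word vectors (keeping k = 2 dimensions here is an arbitrary illustrative choice):

```python
import numpy as np

X = np.array(X, dtype=float)                  # co-occurrence counts from above
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                         # keep only the top-k singular directions
word_vectors = U[:, :k] * s[:k]               # one low-dimensional row per word
```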

Hacks to X

  • Problem: function words (the, he, has) are too frequent. Two fixes:
    • Clip the counts: min(X, t), with t ≈ 100
    • Ignore function words entirely
  • Use ramped windows that weight closer words more heavily
  • Use Pearson correlations instead of raw counts, and set negative values to 0 (see the sketch after this list)
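A sketch of two of these hacks applied to the count matrix X (the Pearson-correlation variant shown here is one possible reading, correlating the count rows of word pairs):

```python
import numpy as np

X = np.array(X, dtype=float)       # co-occurrence counts from the example above

# Cap raw counts so that overly frequent function words do not dominate
t = 100
X_capped = np.minimum(X, t)

# Alternative: Pearson correlations between word count vectors, negatives set to 0
C = np.corrcoef(X)                 # row-wise correlations between count vectors
C = np.maximum(C, 0.0)
```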

Problems of SVD

  • Computationally expensive for matrices with millions of words (the cost of SVD scales quadratically).
  • Hard to incorporate new words.

Count based                                       | Direct prediction
LSA, HAL, COALS                                   | SG/CBOW, NNLM, RNN
Fast training                                     | Scales with corpus size
Efficient usage of statistics                     | Inefficient usage of statistics
Primarily used to capture word similarity         | Generate improved performance on other tasks
Disproportionate importance given to large counts | Can capture complex patterns beyond word similarity

GloVe: Global Vectors model

$$J(\theta) = \frac{1}{2} \sum_{i,j=1}^{W} f(P_{ij}) \left(u_i^T v_j - \log P_{ij}\right)^2, \qquad f(x) = \min(2x, 1)$$
Finally, the word vectors are taken to be X = U + V (the sum of the two sets of vectors).
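A direct NumPy transcription of this objective, as a sketch of the loss only (not the full GloVe training loop); `U`, `V`, and `P` are illustrative names for the two sets of vectors and the co-occurrence counts:

```python
import numpy as np

def glove_loss(U, V, P):
    """J(theta): sum over word pairs with a nonzero co-occurrence count."""
    f = np.minimum(2.0 * P, 1.0)              # weighting f(x) = min(2x, 1), as above
    total = 0.0
    rows, cols = np.nonzero(P)                # log P_ij is only defined for P_ij > 0
    for i, j in zip(rows, cols):
        err = U[i] @ V[j] - np.log(P[i, j])
        total += 0.5 * f[i, j] * err ** 2
    return total

# After minimizing J(theta), the final word vectors are taken as X = U + V.
```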

How to evaluate word vectors

  • Intrinsic
    • Eval on a specific subtask
    • Fast to compute
    • Helps to understand the system
    • Not clear if really helpful unless a correlation with the real task is established
  • Extrinsic
    • Eval on a real task
    • Can take a long time to compute accuracy
    • Unclear whether the problem is the subsystem itself, its interaction with other subsystems, or something else
    • If replacing exactly one subsystem with another improves accuracy -> winning!

An example of intrinsic evaluation: word vector analogies.
a:b :: c:?
man:woman :: king:?
# The idea: man is to woman as king is to queen.
$$d = \arg\max_i \frac{(x_b - x_a + x_c)^T x_i}{\lVert x_b - x_a + x_c \rVert}$$
i.e. we pick the word whose vector is most similar to $x_b - x_a + x_c$ by cosine distance.
Example categories from the analogy evaluation set:

  • city - in - state
  • capital - world
  • verb - past - tense
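A sketch of this analogy lookup using cosine similarity (here both the query vector and the rows of `vectors` are unit-normalized; `vectors` and `vocab` are illustrative names for a word-vector matrix and its word list):

```python
import numpy as np

def analogy(a, b, c, vectors, vocab):
    """Return the word d that best completes a : b :: c : d."""
    idx = {w: i for i, w in enumerate(vocab)}
    query = vectors[idx[b]] - vectors[idx[a]] + vectors[idx[c]]
    query /= np.linalg.norm(query)

    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed @ query                   # cosine similarity to every word
    for w in (a, b, c):                       # exclude the query words themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# analogy("man", "woman", "king", vectors, vocab) should ideally return "queen".
```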
