word2vec - cs224n class 1

This post, based on cs224n notes 1, introduces the background theory of word2vec, including distributional semantics and distributional similarity. It then describes the word2vec model in detail: its language models (CBOW and skip-gram) and training methods (negative sampling and hierarchical softmax), its iterative training process, and the concrete steps of CBOW and skip-gram.

Reference:
cs224n notes 1:
http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf

1. word2vec background theory

  • distributional semantics: represent the meaning of a word based on the context in which it usually appears. The obtained representations (word embeddings) are dense and capture word similarity better than sparse one-hot vectors.
  • distributional similarity: the idea that similar words appear in similar contexts.

2. word2vec

A language model assigns a probability to a sequence of tokens w_1, w_2, …, w_n.
e.g. unigram and bigram models
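
As a quick illustration (a minimal sketch; the toy corpus and the unsmoothed count-based estimates are my own), a bigram model scores a sequence by chaining conditional probabilities estimated from counts:

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def bigram_prob(sentence, unigrams, bigrams):
    """P(w_1, ..., w_n) ≈ P(w_1) * Π P(w_i | w_{i-1}); no smoothing,
    so every word and bigram must have been seen during counting."""
    p = unigrams[sentence[0]] / sum(unigrams.values())
    for a, b in zip(sentence, sentence[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
uni, bi = train_bigram(corpus)
print(bigram_prob(["the", "cat", "sat"], uni, bi))  # (2/6) * 1.0 * 0.5 ≈ 0.167
```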

word2vec contains:
language models:
CBOW: predict center word from the context
skip-gram: predict the context from the center word

training methods:
negative sampling: defines an objective by sampling negative examples
hierarchical softmax: defines an objective that uses an efficient tree structure to compute probabilities over the whole vocabulary.
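
For concreteness (my own summary of the standard skip-gram negative-sampling objective, written in the u/v notation used later in this post), the loss for a center word c, an observed context word o, and K sampled negative words is:

$$
J = -\log \sigma\left(u_o^{\top} v_c\right) - \sum_{k=1}^{K} \log \sigma\left(-u_{\tilde{w}_k}^{\top} v_c\right)
$$

where σ is the sigmoid and the negative words $\tilde{w}_k$ are drawn from a noise distribution; the first term pulls the true (center, context) pair together, the second pushes the sampled negatives apart.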

word2vec is iteration-based:

  • model parameters are the word vectors
  • train the model on a certain objective
  • at every iteration, evaluate the errors and follow an update rule that penalizes the model parameters that caused the errors; in this way, we learn the word vectors (the model parameters)
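
A minimal sketch of this iteration loop (the shapes, learning rate, and the placeholder objective are my own assumptions; the CBOW and skip-gram sections below spell out what the objective actually computes):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 10_000, 100, 0.05

# The model parameters are the word vectors themselves:
# V (n x |V|) holds the input vectors, U (|V| x n) the output vectors.
V = rng.normal(scale=0.01, size=(dim, vocab_size))
U = rng.normal(scale=0.01, size=(vocab_size, dim))

def loss_and_grads(center_id, context_ids, V, U):
    """Hypothetical placeholder: the CBOW / skip-gram objectives below
    define the actual loss and its gradients."""
    return 0.0, np.zeros_like(V), np.zeros_like(U)

training_windows = []  # hypothetical stream of (center_id, context_ids) pairs

for center_id, context_ids in training_windows:
    loss, dV, dU = loss_and_grads(center_id, context_ids, V, U)
    V -= lr * dV  # penalize the parameters that contributed to the error
    U -= lr * dU
```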

2.1 CBOW

steps:

  1. generate one-hot word vectors for the input context of window size m: (x_(c-m), x_(c-m+1), …, x_(c-1), x_(c+1), …, x_(c+m-1), x_(c+m))
    c: index of the center word, x: |V| x 1 one-hot vector
  2. get the embedded word vectors for the context: (v_(c-m) = V·x_(c-m), v_(c-m+1) = V·x_(c-m+1), …, v_(c-1), v_(c+1), …, v_(c+m-1), v_(c+m))
    v: n x 1 embedded vector, V: n x |V| input word matrix
  3. average these 2m vectors to get v_ave
  4. generate a score vector z = U·v_ave, where U is the |V| x n output word matrix
  5. turn the scores into probabilities: ŷ = softmax(z)
  6. we want the generated probabilities ŷ ∈ R^|V| to match the true probabilities y, the one-hot vector of the actual center word
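
A minimal numpy sketch of steps 1–5 (the shapes follow the steps above: V is n x |V|, U is |V| x n; the function and variable names are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_ids, V, U):
    """One CBOW forward pass for the 2m context-word indices in context_ids."""
    v_context = V[:, context_ids]   # steps 1-2: look up the 2m context vectors (n x 2m)
    v_ave = v_context.mean(axis=1)  # step 3: average them (n,)
    z = U @ v_ave                   # step 4: one score per vocabulary word (|V|,)
    y_hat = softmax(z)              # step 5: scores -> probabilities
    return y_hat, v_ave
```

y_hat is what step 6 compares against the one-hot vector of the true center word.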

how do we measure how well the generated ŷ matches the true distribution y? with cross-entropy.
So our optimization objective is the cross-entropy loss. Because y is one-hot, the cross-entropy reduces to −log ŷ_c, i.e. the negative log probability of the correct center word given its context; under the softmax, this probability is governed by the similarity (dot product) between the center word's output vector u_c and the averaged context vector v_ave.
So the goal of updating the word vectors of the center word and the context words is to maximize this probability, i.e. the similarity between the center word and its context.
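Written out (following the formulation in cs224n notes 1, using the v_ave notation above), the objective is:

$$
\text{minimize } J
= -\log P\left(w_c \mid w_{c-m}, \dots, w_{c-1}, w_{c+1}, \dots, w_{c+m}\right)
= -u_c^{\top} v_{\text{ave}} + \log \sum_{j=1}^{|V|} \exp\left(u_j^{\top} v_{\text{ave}}\right)
$$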

We use stochastic gradient descent (SGD) for the updates: at each step we compute the gradients for one window of size m (i.e. the 2m context words around a center word) and update the parameters:
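In the notation of the notes, the per-window SGD updates take the standard form

$$
U_{\text{new}} \leftarrow U_{\text{old}} - \alpha \nabla_U J, \qquad
V_{\text{new}} \leftarrow V_{\text{old}} - \alpha \nabla_V J
$$

where α is the learning rate.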

2.2 skip-gram

Given the center word, predict its context words within the window.

steps:

  1. generate one-hot input vector x of the center word
    x: |V| x 1 one-hot vector
  2. get our embedded word vector for the center word: v_c = Vx
    v_c: nx1 vector
  3. generate a score vector: z = Uv_c
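
A minimal numpy sketch of the skip-gram forward pass and its naive loss (same assumed shapes as in the CBOW sketch above; the score vector is pushed through a softmax and the resulting distribution is compared against each of the 2m observed context words, treated as independent):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_forward(center_id, V, U):
    """One skip-gram forward pass: score every vocabulary word against the center word."""
    v_c = V[:, center_id]  # steps 1-2: embedded vector of the center word (n,)
    z = U @ v_c            # step 3: score vector (|V|,)
    return softmax(z)      # probabilities over the vocabulary

def skipgram_loss(center_id, context_ids, V, U):
    """Negative log-likelihood of the 2m observed context words (independence assumed)."""
    y_hat = skipgram_forward(center_id, V, U)
    return -np.sum(np.log(y_hat[context_ids]))
```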