[Andrew Ng, Deep Learning] 05_week2_quiz Natural Language Processing & Word Embeddings

This post reviews the basic concepts of word embeddings, including the Word2Vec and GloVe models, and how their formulas capture semantic relationships between words. It also covers using pre-trained embeddings for small tasks versus training on large corpora, the role of the t-SNE dimensionality-reduction technique, and the importance of parameter initialization and the optimization method during training.


(1)Suppose you learn a word embedding for a vocabulary of 10000 words. Then the embedding vectors should be 10000 dimensional, so as to capture the full range of variation and meaning in those words.
[A]True
[B]False

Answer: B
Explanation: note the difference from one-hot vectors. The embedding dimension is chosen independently of the vocabulary size and is typically much smaller (e.g. 50–1000), not 10000.
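A minimal numpy sketch of the distinction (the vocabulary size comes from the question; the 300-dimensional embedding size is just an illustrative assumption): one-hot vectors have the vocabulary's dimension, while embedding vectors are far smaller.

```python
import numpy as np

vocab_size = 10000     # one-hot vectors live in this dimension
embedding_dim = 300    # embeddings are typically 50-1000 dimensional, not 10000

# Embedding matrix: one dense embedding_dim-dimensional column per word.
E = np.random.randn(embedding_dim, vocab_size) * 0.01

o_1234 = np.zeros(vocab_size)   # one-hot: 10000 numbers, all but one of them zero
o_1234[1234] = 1.0

e_1234 = E[:, 1234]             # embedding: only 300 numbers
print(o_1234.shape, e_1234.shape)   # (10000,) (300,)
```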

(2)What is t-SNE?
[A]A linear transformation that allows us to solve analogies on word vectors.
[B]A non-linear dimensionality reduction technique.
[C]A supervised learning algorithm for learning word embeddings.
[D]An open-source sequence modeling library.

Answer: B
Explanation: t-SNE is a non-linear dimensionality-reduction algorithm.
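A minimal sketch of how t-SNE is normally used with word embeddings, here via scikit-learn's `TSNE` and a random matrix standing in for real embeddings (both assumptions for illustration): it non-linearly maps high-dimensional vectors to 2D for visualization, which is why it is not a linear transformation you could use to solve analogies.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for 300-dimensional embeddings of 1000 words.
embeddings = np.random.randn(1000, 300)

# Non-linear dimensionality reduction down to 2D, for plotting only.
tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=0)
points_2d = tsne.fit_transform(embeddings)

print(points_2d.shape)  # (1000, 2): each word gets a point in the plane
```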

(3)Suppose you download a pre-trained word embedding which has been trained on a huge corpus of text. You then use this word embedding to train an RNN for a language task of recognizing if someone is happy from a short snippet of text, using a small training set.

| x (input text) | y (happy?) |
| --- | --- |
| I’m feeling wonderful today! | 1 |
| I’m bummed my cat is ill | 0 |
| Really enjoying this! | 1 |

Then even if the word “ecstatic” does not appear in your small training set, your RNN might reasonably be expected to recognize “I’m ecstatic” as deserving a label y=1.
[A]True
[B]False

Answer: A
Explanation: positive, upbeat words have similar embedding vectors, so the model can generalize to positive words it never saw in the small training set.
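A minimal sketch of the explanation above, with made-up toy vectors (the numbers are purely illustrative): words with similar sentiment such as "wonderful" and "ecstatic" have high cosine similarity in a good embedding space, which is what lets the classifier generalize to words absent from the small training set.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional "embeddings", made up for illustration only.
e_wonderful = np.array([0.9, 0.8, 0.1, 0.0])
e_ecstatic  = np.array([0.85, 0.75, 0.2, 0.05])
e_bummed    = np.array([-0.8, -0.7, 0.1, 0.0])

print(cosine_similarity(e_wonderful, e_ecstatic))  # close to 1: similar words
print(cosine_similarity(e_wonderful, e_bummed))    # much lower: opposite sentiment
```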

(4)Which of these equations do you think should hold for a good word embedding? (Check all that apply)
[A]$e_{boy} - e_{girl} \approx e_{brother} - e_{sister}$
[B]$e_{boy} - e_{girl} \approx e_{sister} - e_{brother}$
[C]$e_{boy} - e_{brother} \approx e_{girl} - e_{sister}$
[D]$e_{boy} - e_{brother} \approx e_{sister} - e_{girl}$

Answer: A, C
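A minimal sketch (with a hypothetical `solve_analogy` helper and toy vectors, not real trained embeddings) of how relations like A and C are checked in practice: the word whose embedding is closest, by cosine similarity, to $e_{boy} - e_{girl} + e_{sister}$ should be "brother".

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(word_a, word_b, word_c, embeddings):
    """Return the word d whose embedding is closest to e_a - e_b + e_c."""
    target = embeddings[word_a] - embeddings[word_b] + embeddings[word_c]
    candidates = (w for w in embeddings if w not in (word_a, word_b, word_c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# Toy embeddings constructed so that e_boy - e_girl ≈ e_brother - e_sister holds.
embeddings = {
    "boy":     np.array([1.0,  1.0, 0.0]),
    "girl":    np.array([1.0, -1.0, 0.0]),
    "brother": np.array([1.0,  1.0, 1.0]),
    "sister":  np.array([1.0, -1.0, 1.0]),
}
print(solve_analogy("boy", "girl", "sister", embeddings))  # -> "brother"
```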

(5)Let $E$ be an embedding matrix, and let $o_{1234}$ be a one-hot vector, corresponding to word 1234. Then to get the embedding of word 1234, why don’t we call $E^T * o_{1234}$ in Python?
[A]It is computationally wasteful.
[B]The correct formula is $E^T * e_{1234}$.
[C]This doesn’t handle unknown words (<UNK>).
[D]None of the above: Calling the Python snippet as described above is fine.

Answer: A
Explanation: a one-hot vector is high-dimensional and almost entirely zeros, so multiplying $E$ by $o_{1234}$ is very inefficient; in practice the embedding is obtained with a direct lookup.
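A minimal numpy sketch of the explanation (here $E$ is stored with one row per word, so $E^T o_{1234}$ matches the formula in the question; the sizes follow the later questions): the matrix-vector product and the direct row lookup give the same embedding, but the lookup skips all the multiplications by zero, which is what embedding-lookup layers actually do.

```python
import numpy as np

vocab_size, embedding_dim = 10000, 500
E = np.random.randn(vocab_size, embedding_dim)   # one embedding per row

o_1234 = np.zeros(vocab_size)
o_1234[1234] = 1.0

# Mathematically correct, but ~vocab_size * embedding_dim multiplications,
# almost all of them by zero:
e_slow = E.T @ o_1234

# Same result, obtained by simply selecting the row:
e_fast = E[1234]

print(np.allclose(e_slow, e_fast))   # True
```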

(6)When learning word embeddings, we create an artificial task of estimating $P(target \mid context)$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.
[A]True
[B]False

Answer: B
Explanation: the issue lies with the word "artificial".

(7)In the word2vec algorithm, you estimate $P(t \mid c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set? Pick the best answer.
[A]$c$ is the one word that comes immediately before $t$.
[B]$c$ is the sequence of all the words in the sentence before $t$.
[C]$c$ is a sequence of several words immediately before $t$.
[D]$c$ and $t$ are chosen to be nearby words.

Answer: D
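A minimal sketch of what "nearby words" means for skip-gram word2vec (the ±5-word window, the whitespace tokenization, and the `sample_context_target_pairs` helper are all illustrative assumptions, not the course's exact sampling procedure): pick a context word, then pick a target at random from within a window around it.

```python
import random

def sample_context_target_pairs(tokens, window=5, num_pairs=4):
    """Sample (context, target) pairs where the target is a nearby word."""
    pairs = []
    for _ in range(num_pairs):
        c_idx = random.randrange(len(tokens))                  # pick a context word
        lo = max(0, c_idx - window)
        hi = min(len(tokens) - 1, c_idx + window)
        nearby = [i for i in range(lo, hi + 1) if i != c_idx]  # indices within the window
        t_idx = random.choice(nearby)                          # pick a nearby target
        pairs.append((tokens[c_idx], tokens[t_idx]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".split()
print(sample_context_target_pairs(sentence, window=5))
```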

(8)Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function:
$$P(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{t'=1}^{10000} e^{\theta_{t'}^T e_c}}$$
Which of these statements are correct? Check all that apply.
[A]$\theta_t$ and $e_c$ are both 500-dimensional vectors.
[B]$\theta_t$ and $e_c$ are both 10000-dimensional vectors.
[C]$\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
[D]After training, we should expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word.

Answer: A, C
Explanation: the problem states that the embeddings are 500-dimensional, so both $\theta_t$ and $e_c$ have dimension 500.
Option D is somewhat debatable; for more discussion see:
Why does word2vec use 2 representations for each word?
Which matrix in Word2Vec holds the word vectors?
Why do word2vec's CBOW and skip-gram learn two sets of word vectors?
In my view, both the $\theta$ vectors and the $e$ vectors can serve as word vectors; they are simply different representations that capture different features, so their numerical values also differ.
"Different representations" can be understood like describing a circle by its radius of 1 versus by its area of $\pi$: the descriptions differ, but they denote the same circle. It can also be understood as the same vectors expressed in different bases.
"Different features" means that, for the same word, the two vectors may extract different aspects. Take "juice": one vector might capture that it is a liquid, while the other captures that it is made from fruit.
If there are any mistakes here, please point them out.
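A minimal numpy sketch of the softmax in question 8 (random parameters, and a hypothetical `p_t_given_c` helper): it makes A and C concrete, since $\theta_t$ and $e_c$ are both `embedding_dim`-dimensional rows of two separate parameter matrices, and both matrices are what the optimizer (gradient descent, Adam, etc.) updates.

```python
import numpy as np

vocab_size, embedding_dim = 10000, 500

# Two separate sets of trainable parameters, both updated by the optimizer.
theta = np.random.randn(vocab_size, embedding_dim) * 0.01  # theta_t lives in row t
E     = np.random.randn(vocab_size, embedding_dim) * 0.01  # e_c lives in row c

def p_t_given_c(t, c):
    """P(t|c) = exp(theta_t . e_c) / sum_{t'} exp(theta_t' . e_c)."""
    logits = theta @ E[c]              # one score theta_t' . e_c per candidate target
    logits -= logits.max()             # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t]

print(theta[42].shape, E[1234].shape)  # (500,) (500,): both 500-dimensional
print(p_t_given_c(t=42, c=1234))       # a probability; summing over all t gives 1
```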

(9)Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The GloVe model minimizes this objective:
$$\min \sum_{i=1}^{10000} \sum_{j=1}^{10000} f\left( X_{ij} \right) \left( \theta_i^T e_j + b_i + b_j' - \log X_{ij} \right)^2$$
Which of these statements are correct? Check all that apply.
[A]$\theta_i$ and $e_j$ should be initialized to 0 at the beginning of training.
[B]$\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
[C]$X_{ij}$ is the number of times word i appears in the context of word j.
[D]The weighting function $f(\cdot)$ must satisfy $f(0)=0$.

Answer: B, C, D
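A minimal numpy sketch of one term of the GloVe objective (the co-occurrence counts are made up, and the specific weighting function below is the common choice from the GloVe paper, which the quiz itself does not prescribe): it shows why the parameters are initialized randomly rather than to zero, and why $f(0)=0$ matters, since it removes pairs with $X_{ij}=0$, where $\log X_{ij}$ would be undefined.

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe-style weighting: f(0) = 0, grows with the count, capped at 1."""
    if x == 0:
        return 0.0
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(theta_i, e_j, b_i, b_j_prime, x_ij):
    """f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2 for a single (i, j) pair."""
    if x_ij == 0:
        return 0.0   # f(0) = 0, so log(X_ij) never needs to be evaluated
    return f(x_ij) * (theta_i @ e_j + b_i + b_j_prime - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
theta_i = rng.normal(scale=0.01, size=500)   # random initialization, not zeros
e_j     = rng.normal(scale=0.01, size=500)

print(glove_term(theta_i, e_j, b_i=0.0, b_j_prime=0.0, x_ij=7))   # finite positive value
print(glove_term(theta_i, e_j, b_i=0.0, b_j_prime=0.0, x_ij=0))   # 0.0
```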

(10)You have trained word embeddings using a text dataset of $m_1$ words. You are considering using these word embeddings for a language task, for which you have a separate labeled dataset of $m_2$ words. Keeping in mind that using word embeddings is a form of transfer learning, under which of these circumstances would you expect the word embeddings to be helpful?
[A]$m_1 \gg m_2$
[B]$m_1 \ll m_2$

Answer: A
Explanation: transfer learning helps when the corpus used to learn the embeddings ($m_1$) is much larger than the labeled dataset for the downstream task ($m_2$).
