[Andrew Ng, Deep Learning] 05_week2_quiz Natural Language Processing & Word Embeddings

This post reviews the basic concepts of word embeddings, including the Word2Vec and GloVe models, and how their formulas capture semantic relationships between words. It also covers using pre-trained embeddings for small tasks versus training on large corpora, the role of the t-SNE dimensionality-reduction technique, and the importance of parameter initialization and the optimization method during training.


(1)Suppose you learn a word embedding for a vocabulary of 10000 words. Then the embedding vectors should be 10000 dimensional, so as to capture the full range of variation and meaning in those words.
[A]True
[B]False

Answer: B
Explanation: note the difference from one-hot vectors. The embedding dimension is chosen independently of the vocabulary size and is typically much smaller (e.g. 50–1000), not 10000.
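A minimal numpy sketch of the distinction (the vocabulary size comes from the question; the 300-dimensional embedding size is just an illustrative assumption): one-hot vectors have the vocabulary's dimension, while embedding vectors are far smaller.

```python
import numpy as np

vocab_size = 10000     # one-hot vectors live in this dimension
embedding_dim = 300    # embeddings are typically 50-1000 dimensional, not 10000

# Embedding matrix: one dense embedding_dim-dimensional column per word.
E = np.random.randn(embedding_dim, vocab_size) * 0.01

o_1234 = np.zeros(vocab_size)   # one-hot: 10000 numbers, all but one of them zero
o_1234[1234] = 1.0

e_1234 = E[:, 1234]             # embedding: only 300 numbers
print(o_1234.shape, e_1234.shape)   # (10000,) (300,)
```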

(2)What is t-SNE?
[A]A linear transformation that allows us to solve analogies on word vectors.
[B]A non-linear dimensionality reduction technique.
[C]A supervised learning algorithm for learning word embeddings.
[D]An open-source sequence modeling library.

Answer: B
Explanation: t-SNE is a non-linear dimensionality-reduction algorithm.
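A minimal sketch of how t-SNE is normally used with word embeddings, here via scikit-learn's `TSNE` and a random matrix standing in for real embeddings (both assumptions for illustration): it non-linearly maps high-dimensional vectors to 2D for visualization, which is why it is not a linear transformation you could use to solve analogies.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for 300-dimensional embeddings of 1000 words.
embeddings = np.random.randn(1000, 300)

# Non-linear dimensionality reduction down to 2D, for plotting only.
tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=0)
points_2d = tsne.fit_transform(embeddings)

print(points_2d.shape)  # (1000, 2): each word gets a point in the plane
```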

(3)Suppose you download a pre-trained word embedding which has been trained on a huge corpus of text. You then use this word embedding to train an RNN for a language task of recognizing if someone is happy from a short snippet of text, using a small training set.

| x (input text) | y (happy?) |
| --- | --- |
| I’m feeling wonderful today! | 1 |
| I’m bummed my cat is ill | 0 |
| Really enjoying this! | 1 |

Then even if the word “ecstatic” does not appear in your small training set, your RNN might reasonably be expected to recognize “I’m ecstatic” as deserving a label y=1.
[A]True
[B]False

Answer: A
Explanation: positive, upbeat words have similar embedding vectors, so the model can generalize to positive words it never saw in the small training set.
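A minimal sketch of the explanation above, with made-up toy vectors (the numbers are purely illustrative): words with similar sentiment such as "wonderful" and "ecstatic" have high cosine similarity in a good embedding space, which is what lets the classifier generalize to words absent from the small training set.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional "embeddings", made up for illustration only.
e_wonderful = np.array([0.9, 0.8, 0.1, 0.0])
e_ecstatic  = np.array([0.85, 0.75, 0.2, 0.05])
e_bummed    = np.array([-0.8, -0.7, 0.1, 0.0])

print(cosine_similarity(e_wonderful, e_ecstatic))  # close to 1: similar words
print(cosine_similarity(e_wonderful, e_bummed))    # much lower: opposite sentiment
```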

(4)Which of these equations do you think should hold for a good word embedding? (Check all that apply)
[A]$e_{boy} - e_{girl} \approx e_{brother} - e_{sister}$
[B]$e_{boy} - e_{girl} \approx e_{sister} - e_{brother}$
[C]$e_{boy} - e_{brother} \approx e_{girl} - e_{sister}$
[D]$e_{boy} - e_{brother} \approx e_{sister} - e_{girl}$

Answer: A, C
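A minimal sketch (with a hypothetical `solve_analogy` helper and toy vectors, not real trained embeddings) of how relations like A and C are checked in practice: the word whose embedding is closest, by cosine similarity, to $e_{boy} - e_{girl} + e_{sister}$ should be "brother".

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(word_a, word_b, word_c, embeddings):
    """Return the word d whose embedding is closest to e_a - e_b + e_c."""
    target = embeddings[word_a] - embeddings[word_b] + embeddings[word_c]
    candidates = (w for w in embeddings if w not in (word_a, word_b, word_c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# Toy embeddings constructed so that e_boy - e_girl ≈ e_brother - e_sister holds.
embeddings = {
    "boy":     np.array([1.0,  1.0, 0.0]),
    "girl":    np.array([1.0, -1.0, 0.0]),
    "brother": np.array([1.0,  1.0, 1.0]),
    "sister":  np.array([1.0, -1.0, 1.0]),
}
print(solve_analogy("boy", "girl", "sister", embeddings))  # -> "brother"
```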

(5)Let $E$ be an embedding matrix, and let $o_{1234}$ be a one-hot vector, corresponding to word 1234. Then to get the embedding of word 1234, why don’t we call $E^T * o_{1234}$ in Python?
[A]It is computationally wasteful.
[B]The correct formula is $E^T * e_{1234}$.
[C]This doesn’t handle unknown words (<UNK>).
[D]None of the above: Calling the Python snippet as described above is fine.

Answer: A
Explanation: a one-hot vector is high-dimensional and almost entirely zeros, so multiplying $E$ by $o_{1234}$ is very inefficient; in practice the embedding is obtained with a direct lookup.
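A minimal numpy sketch of the explanation (here $E$ is stored with one row per word, so $E^T o_{1234}$ matches the formula in the question; the sizes follow the later questions): the matrix-vector product and the direct row lookup give the same embedding, but the lookup skips all the multiplications by zero, which is what embedding-lookup layers actually do.

```python
import numpy as np

vocab_size, embedding_dim = 10000, 500
E = np.random.randn(vocab_size, embedding_dim)   # one embedding per row

o_1234 = np.zeros(vocab_size)
o_1234[1234] = 1.0

# Mathematically correct, but ~vocab_size * embedding_dim multiplications,
# almost all of them by zero:
e_slow = E.T @ o_1234

# Same result, obtained by simply selecting the row:
e_fast = E[1234]

print(np.allclose(e_slow, e_fast))   # True
```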

(6)When learning word embeddings, we create an artificial task of estimating $P(target \mid context)$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.
[A]True
[B]False

Answer: B
Explanation: the issue lies with the word "artificial".

(7)In the word2vec algorithm, you estimate $P(t \mid c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set? Pick the best answer.
[A]$c$ is the one word that comes immediately before $t$.
[B]$c$ is the sequence of all the words in the sentence before $t$.
[C]$c$ is a sequence of several words immediately before $t$.
[D]$c$ and $t$ are chosen to be nearby words.

Answer: D
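A minimal sketch of what "nearby words" means for skip-gram word2vec (the ±5-word window, the whitespace tokenization, and the `sample_context_target_pairs` helper are all illustrative assumptions, not the course's exact sampling procedure): pick a context word, then pick a target at random from within a window around it.

```python
import random

def sample_context_target_pairs(tokens, window=5, num_pairs=4):
    """Sample (context, target) pairs where the target is a nearby word."""
    pairs = []
    for _ in range(num_pairs):
        c_idx = random.randrange(len(tokens))                  # pick a context word
        lo = max(0, c_idx - window)
        hi = min(len(tokens) - 1, c_idx + window)
        nearby = [i for i in range(lo, hi + 1) if i != c_idx]  # indices within the window
        t_idx = random.choice(nearby)                          # pick a nearby target
        pairs.append((tokens[c_idx], tokens[t_idx]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".split()
print(sample_context_target_pairs(sentence, window=5))
```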

(8)Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function:
$$P(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{t'=1}^{10000} e^{\theta_{t'}^T e_c}}$$
Which of these statements are correct? Check all that apply.
[A]$\theta_t$ and $e_c$ are both 500-dimensional vectors.
[B]$\theta_t$ and $e_c$ are both 10000-dimensional vectors.
[C]$\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
[D]After training, we should expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word.

Answer: A, C
Explanation: the problem states that the embeddings are 500-dimensional, so both $\theta_t$ and $e_c$ have dimension 500.
Option D is somewhat debatable; for more discussion see:
Why does word2vec use 2 representations for each word?
Which matrix in Word2Vec holds the word vectors?
Why do word2vec's CBOW and skip-gram learn two sets of word vectors?
In my view, both the $\theta$ vectors and the $e$ vectors can serve as word vectors; they are simply different representations that capture different features, so their numerical values also differ.
"Different representations" can be understood like describing a circle by its radius of 1 versus by its area of $\pi$: the descriptions differ, but they denote the same circle. It can also be understood as the same vectors expressed in different bases.
"Different features" means that, for the same word, the two vectors may extract different aspects. Take "juice": one vector might capture that it is a liquid, while the other captures that it is made from fruit.
If there are any mistakes here, please point them out.
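A minimal numpy sketch of the softmax in question 8 (random parameters, and a hypothetical `p_t_given_c` helper): it makes A and C concrete, since $\theta_t$ and $e_c$ are both `embedding_dim`-dimensional rows of two separate parameter matrices, and both matrices are what the optimizer (gradient descent, Adam, etc.) updates.

```python
import numpy as np

vocab_size, embedding_dim = 10000, 500

# Two separate sets of trainable parameters, both updated by the optimizer.
theta = np.random.randn(vocab_size, embedding_dim) * 0.01  # theta_t lives in row t
E     = np.random.randn(vocab_size, embedding_dim) * 0.01  # e_c lives in row c

def p_t_given_c(t, c):
    """P(t|c) = exp(theta_t . e_c) / sum_{t'} exp(theta_t' . e_c)."""
    logits = theta @ E[c]              # one score theta_t' . e_c per candidate target
    logits -= logits.max()             # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t]

print(theta[42].shape, E[1234].shape)  # (500,) (500,): both 500-dimensional
print(p_t_given_c(t=42, c=1234))       # a probability; summing over all t gives 1
```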

(9)Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The GloVe model minimizes this objective:
$$\min \sum_{i=1}^{10000} \sum_{j=1}^{10000} f\left( X_{ij} \right) \left( \theta_i^T e_j + b_i + b_j' - \log X_{ij} \right)^2$$
Which of these statements are correct? Check all that apply.
[A]$\theta_i$ and $e_j$ should be initialized to 0 at the beginning of training.
[B]$\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
[C]$X_{ij}$ is the number of times word i appears in the context of word j.
[D]The weighting function $f(\cdot)$ must satisfy $f(0)=0$.

Answer: B, C, D
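A minimal numpy sketch of one term of the GloVe objective (the co-occurrence counts are made up, and the specific weighting function below is the common choice from the GloVe paper, which the quiz itself does not prescribe): it shows why the parameters are initialized randomly rather than to zero, and why $f(0)=0$ matters, since it removes pairs with $X_{ij}=0$, where $\log X_{ij}$ would be undefined.

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe-style weighting: f(0) = 0, grows with the count, capped at 1."""
    if x == 0:
        return 0.0
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(theta_i, e_j, b_i, b_j_prime, x_ij):
    """f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2 for a single (i, j) pair."""
    if x_ij == 0:
        return 0.0   # f(0) = 0, so log(X_ij) never needs to be evaluated
    return f(x_ij) * (theta_i @ e_j + b_i + b_j_prime - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
theta_i = rng.normal(scale=0.01, size=500)   # random initialization, not zeros
e_j     = rng.normal(scale=0.01, size=500)

print(glove_term(theta_i, e_j, b_i=0.0, b_j_prime=0.0, x_ij=7))   # finite positive value
print(glove_term(theta_i, e_j, b_i=0.0, b_j_prime=0.0, x_ij=0))   # 0.0
```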

(10)You have trained word embeddings using a text dataset of $m_1$ words. You are considering using these word embeddings for a language task, for which you have a separate labeled dataset of $m_2$ words. Keeping in mind that using word embeddings is a form of transfer learning, under which of these circumstances would you expect the word embeddings to be helpful?
[A]$m_1 \gg m_2$
[B]$m_1 \ll m_2$

Answer: A
Explanation: transfer learning helps when the corpus used to learn the embeddings ($m_1$) is much larger than the labeled dataset for the downstream task ($m_2$).
