NLP-Task4: From one-hot to word2vec

Latest recommended article published on 2025-06-17 23:54:32

License: CC 4.0 BY-SA

Tags: #nlp #word2vec

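This post covers word representation and Word2Vec. On word representation, it notes that one-hot encoding has two drawbacks: it cannot express relationships between words, and its dimensionality is too high. Word2Vec comprises the CBOW and Skip-Gram models; the post details the network structure and parameter updates of the CBOW model, and the network structure and objective of the Skip-Gram model.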


Word representation

Consider a corpus $\mathbb{D}=\left\{\mathcal{D}_{1}, \mathcal{D}_{2}, \cdots, \mathcal{D}_{N}\right\}$ containing N documents.
Each document $\mathcal{D}_i$ is a word sequence $\left(\operatorname{word}_{I_{1}^{i}}, \operatorname{word}_{I_{2}^{i}}, \cdots, \operatorname{word}_{I_{n_{i}}^{i}}\right)$, where $I_{j}^{i} \in\{1,2, \cdots, V\}$ is the index of a word in the vocabulary:
- i indexes the document
- j indexes the word within the document
- $n_i$ is the number of words in the i-th document
- $v=I^i_j$ means that the j-th word of the i-th document is $\operatorname{word}_v$
All words come from the vocabulary $\mathbb{V}=\left\{\operatorname{word}_{1}, \operatorname{word}_{2}, \cdots, \operatorname{word}_{V}\right\}$, where V is the vocabulary size.
The word-representation task is to decide how to represent each word $\operatorname{word}_v$.
The simplest representation is one-hot encoding. The v-th word of the vocabulary, $\operatorname{word}_v$, is represented as: $\operatorname{word}_{v} \rightarrow(0,0, \cdots, 0,1,0, \cdots, 0)^{T}$
That is, position v is 1 and every other position is 0.
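This representation has two main drawbacks:
- It cannot express relationships between words: for any pair of distinct words $\left(\operatorname{word}_{i}, \operatorname{word}_{j}\right)$, the Euclidean distance between their vectors is always $\sqrt{2}$.
- The dimensionality is too high. A Chinese vocabulary may contain hundreds of thousands of words, so the one-hot vectors also have hundreds of thousands of dimensions, which is too costly for both storage and computation.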
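
The first drawback can be checked directly. The following is a minimal sketch, assuming a hypothetical four-word toy vocabulary (the words and the helper names `one_hot` and `euclidean_distance` are illustrative, not from the original post). It shows that every pair of distinct one-hot vectors is the same distance apart, so the encoding carries no similarity information.

```python
import math

def one_hot(v, vocab_size):
    """Return the one-hot vector of the word with 1-based index v."""
    vec = [0.0] * vocab_size
    vec[v - 1] = 1.0
    return vec

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vocabulary, used only for illustration.
vocab = ["cat", "dog", "apple", "banana"]
V = len(vocab)
vectors = {word: one_hot(i + 1, V) for i, word in enumerate(vocab)}

# "cat" is no closer to "dog" than it is to "apple".
d_cat_dog = euclidean_distance(vectors["cat"], vectors["dog"])
d_cat_apple = euclidean_distance(vectors["cat"], vectors["apple"])
print(d_cat_dog, d_cat_apple)
```

Both distances equal $\sqrt{2}$, whatever the two words are.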
BoW (bag of words): a representation that ignores the order of the words within a document is called the bag-of-words model.
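
As a small illustration of what ignoring word order means, the sketch below counts tokens with `collections.Counter`. The two example sentences and the helper name `bag_of_words` are assumptions made for this example.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each token to its count; the order of tokens is discarded."""
    return Counter(tokens)

# Two hypothetical documents with the same words in a different order.
doc_a = ["the", "dog", "bit", "the", "man"]
doc_b = ["the", "man", "bit", "the", "dog"]

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)
print(bow_a == bow_b)
```

The two documents mean different things, yet their bags of words are identical.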

Word2Vec

CBOW model

The CBOW (continuous bag-of-words) model predicts a word from its context.

One-word context

Network structure

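In the one-word-context CBOW model, the input is the preceding word and the output is the following word; the input serves as the context of the output. Because only a single word is used as input, this is called a one-word context.
The one-word-context CBOW model is structured as follows, where: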

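N is the size of the hidden layer, i.e. the hidden vector is $\overrightarrow{\mathbf{h}} \in \mathbb{R}^{N}$.
The network input $\overrightarrow{\mathbf{x}}=\left(x_{1}, x_{2}, \cdots, x_{V}\right)^{T} \in \mathbb{R}^{V}$ is the one-hot encoding of the input word (the context word): exactly one component is 1 and all others are 0.
The network output $\overrightarrow{\mathbf{y}}=\left(y_{1}, y_{2}, \cdots, y_{V}\right)^{T} \in \mathbb{R}^{V}$ gives, for each word in the vocabulary, the probability that it is the output word.
Adjacent layers are fully connected:
- The weight matrix between the input layer and the hidden layer is $\mathbf{W}$, of size $V \times N$.
- The weight matrix between the hidden layer and the output layer is $\mathbf{W}^{\prime}$, of size $N \times V$.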

Assume there is no activation function and no bias term. Given an input $\overrightarrow{\mathbf{x}} \in \mathbb{R}^{V}$, the corresponding hidden vector $\overrightarrow{\mathbf{h}} \in \mathbb{R}^{N}$ is $\overrightarrow{\mathbf{h}}=\mathbf{W}^{T} \overrightarrow{\mathbf{x}}$. Write $\mathbf{W}$ in terms of its rows: $\mathbf{W}=\left[ \begin{array}{c}{\overrightarrow{\mathbf{w}}_{1}^{T}} \\ {\overrightarrow{\mathbf{w}}_{2}^{T}} \\ {\vdots} \\ {\overrightarrow{\mathbf{w}}_{V}^{T}}\end{array}\right]$
Since $\overrightarrow{\mathbf{x}}$ is a one-hot encoding, suppose it encodes the k-th word $\operatorname{word}_k$ of the vocabulary $\mathbb{V}$, i.e.:
$x_{1}=0, x_{2}=0, \cdots, x_{k-1}=0, x_{k}=1, x_{k+1}=0, \cdots, x_{V}=0$
Then: $\overrightarrow{\mathbf{h}}=\overrightarrow{\mathbf{w}}_{k}$.
That is, the k-th row of $\mathbf{W}$, $\overrightarrow{\mathbf{w}}_k$, is the representation of the k-th word $\operatorname{word}_k$ of the vocabulary $\mathbb{V}$. It is called the input vector of $\operatorname{word}_k$.
Given the hidden vector $\overrightarrow{\mathbf{h}}$, the corresponding score vector (the output before the softmax) $\overrightarrow{\mathbf{u}} \in \mathbb{R}^{V}$ is: $\overrightarrow{\mathbf{u}}=\mathbf{W}^{\prime T} \overrightarrow{\mathbf{h}}$
Write $\mathbf{W}^{\prime}$ in terms of its columns: $\mathbf{W}^{\prime}=\left[\overrightarrow{\mathbf{w}}_{1}^{\prime}, \overrightarrow{\mathbf{w}}_{2}^{\prime}, \cdots, \overrightarrow{\mathbf{w}}_{V}^{\prime}\right]$
Then: $u_{j}=\overrightarrow{\mathbf{w}}_{j}^{\prime} \cdot \overrightarrow{\mathbf{h}}$, which is the score of the j-th word $\operatorname{word}_j$ of the vocabulary $\mathbb{V}$.
$\overrightarrow{\mathbf{w}}_{j}^{\prime}$ is the j-th column of the matrix $\mathbf{W}^{\prime}$ and is called the output vector of $\operatorname{word}_j$.
$\overrightarrow{\mathbf{u}}$ is then fed into a softmax layer, which gives:
$y_{j}=p\left(\operatorname{word}_{j} \mid \overrightarrow{\mathbf{x}}\right)=\frac{\exp \left(u_{j}\right)}{\sum_{j^{\prime}=1}^{V} \exp \left(u_{j^{\prime}}\right)}, \quad j=1,2, \cdots, V$
That is, $y_j$ is the probability that the j-th word $\operatorname{word}_j$ of the vocabulary $\mathbb{V}$ is the true output word.
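
The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the matrices are made-up toy values with V = 3 and N = 2, indices are 0-based (the text uses 1-based indices), and the names `softmax` and `forward` are chosen for this example.

```python
import math

def softmax(u):
    """Numerically stable softmax of a list of scores."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    total = sum(exps)
    return [x / total for x in exps]

def forward(W, W_out, k):
    """Forward pass of the one-word-context CBOW model.

    W     : V x N matrix; row k is the input vector of word k.
    W_out : N x V matrix (W' in the text); column j is the output vector of word j.
    k     : 0-based index of the input word.
    """
    h = W[k]  # h = W^T x reduces to row k because x is one-hot
    N, V = len(W_out), len(W_out[0])
    u = [sum(W_out[i][j] * h[i] for i in range(N)) for j in range(V)]  # u_j = w'_j . h
    y = softmax(u)
    return h, u, y

# Toy parameters (assumed values, V = 3 words, N = 2 hidden units).
W = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
W_out = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]

h, u, y = forward(W, W_out, k=1)
print(h, u, y)
```

No matrix product is needed for the hidden layer: selecting row k of W is exactly $\mathbf{W}^{T} \overrightarrow{\mathbf{x}}$ for a one-hot $\overrightarrow{\mathbf{x}}$.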
Suppose we are given a word $\operatorname{word}_{I}$ (called the context) and observe that the next word is $\operatorname{word}_{O}$.
Let $j^{*}$ be the output index of $\operatorname{word}_{O}$. The network's objective is to maximize $p\left(\operatorname{word}_{O} \mid \operatorname{word}_{I}\right)=y_{j^{*}}$, which is equivalent to maximizing its logarithm:
$\begin{array}{c}{\max _{\mathbf{W}, \mathbf{W}^{\prime}} \log p\left(\operatorname{word}_{O} \mid \operatorname{word}_{I}\right)=\max _{\mathbf{W}, \mathbf{W}^{\prime}} \log y_{j^{*}}=\max _{\mathbf{W}, \mathbf{W}^{\prime}} \log \frac{\exp \left(\overrightarrow{\mathbf{w}}_{j^{*}}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}\right)}{\sum_{i=1}^{V} \exp \left(\overrightarrow{\mathbf{w}}_{i}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}\right)}} \\ {=\max _{\mathbf{W}, \mathbf{W}^{\prime}}\left[\overrightarrow{\mathbf{w}}_{j^{*}}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}-\log \sum_{i=1}^{V} \exp \left(\overrightarrow{\mathbf{w}}_{i}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}\right)\right]}\end{array}$
where $\overrightarrow{\mathbf{w}}_I$ is the input vector of the input word $\operatorname{word}_I$.
Since $u_{j}=\overrightarrow{\mathbf{w}}_{j}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}$, define:
$\begin{array}{c}{E=-\log p\left(\operatorname{word}_{O} \mid \operatorname{word}_{I}\right)=-\left[\overrightarrow{\mathbf{w}}_{j^{*}}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}-\log \sum_{i=1}^{V} \exp \left(\overrightarrow{\mathbf{w}}_{i}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}\right)\right]} \\ {=-\left[u_{j^{*}}-\log \sum_{i=1}^{V} \exp \left(u_{i}\right)\right]}\end{array}$
The objective then becomes: $\min E$
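
To make the loss concrete, here is a minimal sketch that evaluates $E=-u_{j^{*}}+\log \sum_{i} \exp (u_{i})$ for a given score vector and cross-checks it against $-\log y_{j^{*}}$. The score values are assumed toy numbers, the index is 0-based, and the function name `loss` is chosen for this example.

```python
import math

def loss(u, j_star):
    """E = -u_{j*} + log sum_i exp(u_i), with a stable log-sum-exp."""
    m = max(u)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in u))
    return -u[j_star] + log_sum_exp

# Assumed toy scores for a vocabulary of three words.
u = [0.13, -0.13, 0.09]
j_star = 0

E = loss(u, j_star)

# Cross-check: E must equal -log of the softmax probability of the true word.
y_true = math.exp(u[j_star]) / sum(math.exp(x) for x in u)
print(E, -math.log(y_true))
```

Subtracting the maximum score before exponentiating does not change the result; it only avoids overflow when the scores are large.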

Parameter updates

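Define $t_j = \mathbb{I}(j=j^{*})$: it is 1 when the j-th output unit corresponds to the true output word $\operatorname{word}_O$, and 0 otherwise.
Define: $e_{j}=\frac{\partial E}{\partial u_{j}}=y_{j}-t_{j}$
It measures the prediction error of each output unit:

When $j=j^*$: $e_j=y_j-1$, the gap between the predicted probability $y_j$ and the true probability 1.
When $j \neq j^{*}$: $e_j=y_j$, the gap between the predicted probability $y_j$ and the true probability 0.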
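
The identity $e_j=y_j-t_j$ follows in one step from the definition of E. A short worked derivation, using only the symbols already defined:

$E=-u_{j^{*}}+\log \sum_{i=1}^{V} \exp \left(u_{i}\right)$
$\frac{\partial E}{\partial u_{j}}=-\mathbb{I}\left(j=j^{*}\right)+\frac{\exp \left(u_{j}\right)}{\sum_{i=1}^{V} \exp \left(u_{i}\right)}=-t_{j}+y_{j}$

The first term is nonzero only for the true output unit, and the second term is exactly the softmax output $y_j$.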

From:
$u_{j}=\overrightarrow{\mathbf{w}}_{j}^{\prime} \cdot \overrightarrow{\mathbf{h}} \quad \rightarrow \quad \frac{\partial u_{j}}{\partial \overrightarrow{\mathbf{w}}_{j}^{\prime}}=\overrightarrow{\mathbf{h}}$
we get:
$\frac{\partial E}{\partial \overrightarrow{\mathbf{w}}_{j}^{\prime}}=\frac{\partial E}{\partial u_{j}} \times \frac{\partial u_{j}}{\partial \overrightarrow{\mathbf{w}}_{j}^{\prime}}=e_{j} \overrightarrow{\mathbf{h}}$
So the update rule for $\overrightarrow{\mathbf{w}}_{j}^{\prime}$ is:
$\overrightarrow{\mathbf{w}}_{j}^{\prime(new)}=\overrightarrow{\mathbf{w}}_{j}^{\prime(old)}-\eta e_{j} \overrightarrow{\mathbf{h}}$
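Its interpretation is:

When the probability is overestimated ($e_{j}>0$, i.e. $y_{j}>t_{j}$), a fraction of $\overrightarrow{\mathbf{h}}$ is subtracted from $\overrightarrow{\mathbf{w}}_{j}^{\prime}$.
This happens when the j-th output unit does not correspond to the true output word.
When the probability is underestimated ($e_{j}<0$, i.e. $y_{j}<t_{j}$), a fraction of $\overrightarrow{\mathbf{h}}$ is added to $\overrightarrow{\mathbf{w}}_{j}^{\prime}$.
This happens when the j-th output unit corresponds exactly to the true output word.
When $y_{j} \simeq t_{j}$, the update is very small.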
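
The update of the output vectors can be sketched as follows. This is an illustrative sketch, not the author's implementation: the toy values, the learning rate `eta=0.5`, the 0-based indices and the names `output_probs` and `update_output_vectors` are all assumptions. It applies $\overrightarrow{\mathbf{w}}_{j}^{\prime} \leftarrow \overrightarrow{\mathbf{w}}_{j}^{\prime}-\eta e_{j} \overrightarrow{\mathbf{h}}$ to every column of $\mathbf{W}^{\prime}$ and checks that the probability of the true word rises.

```python
import math

def softmax(u):
    """Numerically stable softmax of a list of scores."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    total = sum(exps)
    return [x / total for x in exps]

def output_probs(W_out, h):
    """y = softmax(W'^T h), where column j of W_out is the output vector w'_j."""
    N, V = len(W_out), len(W_out[0])
    u = [sum(W_out[i][j] * h[i] for i in range(N)) for j in range(V)]
    return softmax(u)

def update_output_vectors(W_out, h, j_star, eta):
    """Apply w'_j <- w'_j - eta * e_j * h for every j (in place); return e."""
    y = output_probs(W_out, h)
    e = [y[j] - (1.0 if j == j_star else 0.0) for j in range(len(y))]
    for i in range(len(h)):
        for j in range(len(e)):
            W_out[i][j] -= eta * e[j] * h[i]
    return e

# Assumed toy values: N = 2 hidden units, V = 3 words, true word index 2.
h = [0.3, -0.1]
W_out = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
j_star = 2

p_before = output_probs(W_out, h)[j_star]
e = update_output_vectors(W_out, h, j_star, eta=0.5)
p_after = output_probs(W_out, h)[j_star]
print(e, p_before, p_after)
```

Only the true word has a negative error, so its output vector moves toward $\overrightarrow{\mathbf{h}}$ while all the others move away from it.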

Define: $\overrightarrow{\mathbf{EH}}=\frac{\partial E}{\partial \overrightarrow{\mathbf{h}}}=\left(\frac{\partial \overrightarrow{\mathbf{u}}}{\partial \overrightarrow{\mathbf{h}}}\right)^{T} \frac{\partial E}{\partial \overrightarrow{\mathbf{u}}}$
From: $\overrightarrow{\mathbf{u}}=\mathbf{W}^{\prime T} \overrightarrow{\mathbf{h}} \quad \rightarrow \quad\left(\frac{\partial \overrightarrow{\mathbf{u}}}{\partial \overrightarrow{\mathbf{h}}}\right)^{T}=\mathbf{W}^{\prime}$
we get: $\overrightarrow{\mathbf{EH}}=\mathbf{W}^{\prime} \overrightarrow{\mathbf{e}}=\sum_{j=1}^{V} e_{j} \overrightarrow{\mathbf{w}}_{j}^{\prime}$
$\overrightarrow{\mathbf{EH}}$ can be interpreted as the weighted sum of the output vectors of all words in the vocabulary $\mathbb{V}$, with weights $e_j$.
Since $\overrightarrow{\mathbf{h}}=\mathbf{W}^{T} \overrightarrow{\mathbf{x}}$, we have: $\frac{\partial E}{\partial w_{k, i}}=\frac{\partial E}{\partial h_{i}} \times \frac{\partial h_{i}}{\partial w_{k, i}}=E H_{i} \times x_{k}$
Because $\overrightarrow{\mathbf{x}}$ is a one-hot encoding, only one of its components is nonzero. Hence $\frac{\partial E}{\partial \mathbf{W}}$ has a single nonzero row, and that row equals $\overrightarrow{\mathbf{EH}}$ (written as a row). The update equation is therefore:
$\overrightarrow{\mathbf{w}}_{I}^{(new)}=\overrightarrow{\mathbf{w}}_{I}^{(old)}-\eta \overrightarrow{\mathbf{EH}}$
where $\overrightarrow{\mathbf{w}}_{I}$ is the row of $\mathbf{W}$ that corresponds to the nonzero component; all other rows of $\mathbf{W}$ stay unchanged in this update.
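Consider the k-th component of the updated row, where $w_{j, k}^{\prime}$ denotes the k-th component of $\overrightarrow{\mathbf{w}}_{j}^{\prime}$:
$w_{I, k}^{(new)}=w_{I, k}^{(old)}-\eta \sum_{j=1}^{V} e_{j} w_{j, k}^{\prime}$
When $y_{j} \simeq t_{j}$ for every j, the update is very small.
The larger the gap between $y_j$ and $t_j$, the larger the update.
Given many training samples, each consisting of two words, these updates are applied repeatedly and their effects accumulate.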
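
The input-vector update can be sketched in the same way. The code below is a minimal illustration with assumed toy values, 0-based indices and hypothetical helper names. It computes $\overrightarrow{\mathbf{EH}}=\sum_{j} e_{j} \overrightarrow{\mathbf{w}}_{j}^{\prime}$, updates only the row of W that belongs to the input word, and leaves every other row untouched.

```python
import math

def scores(W_out, h):
    """u_j = w'_j . h, where column j of W_out is the output vector w'_j."""
    N, V = len(W_out), len(W_out[0])
    return [sum(W_out[i][j] * h[i] for i in range(N)) for j in range(V)]

def softmax(u):
    """Numerically stable softmax of a list of scores."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    total = sum(exps)
    return [x / total for x in exps]

def loss(W_out, h, j_star):
    """E = -u_{j*} + log sum_i exp(u_i)."""
    u = scores(W_out, h)
    m = max(u)
    return -u[j_star] + m + math.log(sum(math.exp(x - m) for x in u))

def hidden_gradient(W_out, h, j_star):
    """EH = sum_j e_j * w'_j, the gradient of E with respect to h."""
    y = softmax(scores(W_out, h))
    e = [y[j] - (1.0 if j == j_star else 0.0) for j in range(len(y))]
    return [sum(e[j] * W_out[i][j] for j in range(len(e))) for i in range(len(h))]

def update_input_vector(W, W_out, k, j_star, eta):
    """w_I <- w_I - eta * EH for the input word's row only; return EH."""
    EH = hidden_gradient(W_out, W[k], j_star)
    W[k] = [w - eta * g for w, g in zip(W[k], EH)]
    return EH

# Assumed toy parameters: V = 3 words, N = 2 hidden units.
W = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
W_out = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
old_rows = [list(row) for row in W]

EH = update_input_vector(W, W_out, k=1, j_star=2, eta=0.5)
print(EH, W)
```

Because the gradient with respect to W has a single nonzero row, one training pair touches only N entries of W, regardless of the vocabulary size.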

Driven by the co-occurrence of words, the output vectors and the input vectors interact and reach a balance.
- The update of an output vector $\overrightarrow{\mathbf{w}}^{\prime}$ depends on the input vector $\overrightarrow{\mathbf{w}}_{I}$: $\overrightarrow{\mathbf{w}}_{j}^{\prime(new)}=\overrightarrow{\mathbf{w}}_{j}^{\prime(old)}-\eta e_{j} \overrightarrow{\mathbf{h}}$.
  Here the hidden vector $\overrightarrow{\mathbf{h}}$ equals the input vector $\overrightarrow{\mathbf{w}}_I$.
- The update of the input vector $\overrightarrow{\mathbf{w}}_I$ depends on the output vectors $\overrightarrow{\mathbf{w}}^{\prime}$: $\overrightarrow{\mathbf{w}}_{I}^{(new)}=\overrightarrow{\mathbf{w}}_{I}^{(old)}-\eta \overrightarrow{\mathbf{EH}}$.
  Here $\overrightarrow{\mathbf{EH}}=\sum_{j=1}^{V} e_{j} \overrightarrow{\mathbf{w}}_{j}^{\prime}$ is the weighted sum of the output vectors of all words in the vocabulary $\mathbb{V}$, with weights $e_j$.
The speed and quality of this balancing depend on the co-occurrence distribution of the words and on the learning rate.
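
The accumulation of updates over many samples can be sketched as a small stochastic gradient descent loop. Everything concrete here is an assumption made for illustration: the two (input, output) training pairs, the sizes V = 4 and N = 3, the learning rate 0.1, the 200 epochs, the random initialization and the function names. It is a sketch of the two update rules above, not a full word2vec implementation.

```python
import math
import random

def softmax(u):
    """Numerically stable softmax of a list of scores."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    total = sum(exps)
    return [x / total for x in exps]

def forward(W, W_out, k):
    """Return (h, y) for input word k; column j of W_out is w'_j."""
    h = list(W[k])
    N, V = len(W_out), len(W_out[0])
    u = [sum(W_out[i][j] * h[i] for i in range(N)) for j in range(V)]
    return h, softmax(u)

def pair_loss(W, W_out, k, j_star):
    """E = -log p(word_O | word_I) for one training pair."""
    _, y = forward(W, W_out, k)
    return -math.log(y[j_star])

def sgd_step(W, W_out, k, j_star, eta):
    """One update of both the output vectors and the input vector."""
    h, y = forward(W, W_out, k)
    N, V = len(W_out), len(W_out[0])
    e = [y[j] - (1.0 if j == j_star else 0.0) for j in range(V)]
    # EH is computed from the old W' before W' is modified.
    EH = [sum(e[j] * W_out[i][j] for j in range(V)) for i in range(N)]
    for i in range(N):
        for j in range(V):
            W_out[i][j] -= eta * e[j] * h[i]        # w'_j <- w'_j - eta * e_j * h
    W[k] = [h[i] - eta * EH[i] for i in range(N)]   # w_I <- w_I - eta * EH

random.seed(0)
V, N = 4, 3
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

# Hypothetical training pairs (input word index, output word index).
pairs = [(0, 1), (2, 3)]
never_an_input = list(W[1])  # word 1 never appears as an input word

initial_loss = sum(pair_loss(W, W_out, k, j) for k, j in pairs)
for _ in range(200):
    for k, j_star in pairs:
        sgd_step(W, W_out, k, j_star, eta=0.1)
final_loss = sum(pair_loss(W, W_out, k, j) for k, j in pairs)
print(initial_loss, final_loss)
```

The input vector of a word that never appears as an input is never updated, while every output vector is updated at every step. This is why the full softmax becomes expensive for large vocabularies.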

Skip-Gram

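The CBOW model predicts the next word from the preceding words (the context), whereas the Skip-Gram model predicts the preceding words (the context) from a single word.
In the CBOW model:

The representation of a given word (its input vector $\overrightarrow{\mathbf{w}}_I$) is always the same, because the parameter $\mathbf{W}$ is shared.
The output vector $\overrightarrow{\mathbf{w}}_{O}^{\prime}$ of a given word differs, because the input vector changes with the context.

In the Skip-Gram model:

The representation of a given word (its output vector $\overrightarrow{\mathbf{w}}_{O}^{\prime}$) is always the same, because the parameter $\mathbf{W}^{\prime}$ is shared.
The input vector $\overrightarrow{\mathbf{w}}_{I}$ of a given word differs, because the input vector changes with the context.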

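Network structure

The Skip-Gram network model is as follows, where:

The network input $\overrightarrow{\mathbf{x}}=\left(x_{1}, x_{2}, \cdots, x_{V}\right)^{T} \in \mathbb{R}^{V}$ is the one-hot encoding of the input word: exactly one component is 1 and all others are 0.
The network outputs are $\overrightarrow{\mathbf{y}}_{1}, \overrightarrow{\mathbf{y}}_{2}, \cdots, \overrightarrow{\mathbf{y}}_{C}$, where $\overrightarrow{\mathbf{y}}_{c}=\left(y_{1}^{c}, y_{2}^{c}, \cdots, y_{V}^{c}\right)^{T} \in \mathbb{R}^{V}$ gives, for each word in the vocabulary, the probability that it is the c-th output word.
Every output $\overrightarrow{\mathbf{y}}_{c}$ of the network uses the same weight matrix $\mathbf{W}^{\prime}$; this is called weight sharing.
Weight sharing here implies that the output vector of each word is fixed and unique, regardless of the other output words.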

In the Skip-Gram network, let the j-th component of the c-th output be $u_{j}^{c}=\overrightarrow{\mathbf{w}}_{j}^{\prime} \cdot \overrightarrow{\mathbf{h}}$. Then:
$y_{j}^{c}=p\left(\operatorname{word}_{j}^{c} \mid \overrightarrow{\mathbf{x}}\right)=\frac{\exp \left(u_{j}^{c}\right)}{\sum_{k=1}^{V} \exp \left(u_{k}^{c}\right)} ; \quad c=1,2, \cdots, C ; \quad j=1,2, \cdots, V$
$y_{j}^{c}$ is the probability that, in the c-th output, the j-th word $\operatorname{word}_j$ of the vocabulary $\mathbb{V}$ is the true output word.
Because $\mathbf{W}^{\prime}$ is shared across the output units, the score vector $\overrightarrow{\mathbf{u}}_{c}=\left(u_{1}^{c}, u_{2}^{c}, \cdots, u_{V}^{c}\right)^{T}$ is the same for every output of the network. This does not mean, however, that every output of the network is the same word.

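The predicted word for each output is not simply the word with the highest score, because the probability distribution is the same for every output, i.e. $\overrightarrow{\mathbf{y}}_{1}=\overrightarrow{\mathbf{y}}_{2}=\cdots=\overrightarrow{\mathbf{y}}_{C}$.
The objective of the Skip-Gram network is to maximize the joint probability of its multiple outputs.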

Suppose the input word is $\operatorname{word}_I$ and the output word sequence is $\operatorname{word}_{O_{1}}, \operatorname{word}_{O_{2}}, \cdots, \operatorname{word}_{O_{C}}$. Define the loss function as: $E=-\log p\left(\operatorname{word}_{O_{1}}, \operatorname{word}_{O_{2}}, \cdots, \operatorname{word}_{O_{C}} \mid \operatorname{word}_{I}\right)=-\log \prod_{c=1}^{C} \frac{\exp \left(u_{j_{c}^{*}}^{c}\right)}{\sum_{k=1}^{V} \exp \left(u_{k}^{c}\right)}$
where $j_{1}^{*}, j_{2}^{*}, \cdots, j_{C}^{*}$ are the indices in the vocabulary $\mathbb{V}$ of the words in the output sequence.
Since the score vector is the same for every output of the network, let $u_{k}=u_{k}^{c}=\overrightarrow{\mathbf{w}}_{k}^{\prime} \cdot \overrightarrow{\mathbf{h}}$. The expression above then simplifies to:
$E=-\sum_{c=1}^{C} u_{j_{c}^{*}}+C \log \sum_{k=1}^{V} \exp \left(u_{k}\right)$
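
The simplification can be checked numerically. The sketch below evaluates the Skip-Gram loss in both forms, the product of softmax probabilities and the simplified sum, and confirms that they agree. The toy matrices, the input word index, the target indices (0-based) and the function names are assumptions made for this example.

```python
import math

def scores(W, W_out, k):
    """u_k = w'_k . h with h = row k of W; shared by all C outputs."""
    h = W[k]
    N, V = len(W_out), len(W_out[0])
    return [sum(W_out[i][j] * h[i] for i in range(N)) for j in range(V)]

def skip_gram_loss(W, W_out, k, targets):
    """Simplified form: E = -sum_c u_{j_c*} + C * log sum_k exp(u_k)."""
    u = scores(W, W_out, k)
    m = max(u)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in u))
    return -sum(u[j] for j in targets) + len(targets) * log_sum_exp

def skip_gram_loss_product(W, W_out, k, targets):
    """Unsimplified form: E = -log prod_c softmax(u)[j_c*]."""
    u = scores(W, W_out, k)
    total = sum(math.exp(x) for x in u)
    joint = 1.0
    for j in targets:
        joint *= math.exp(u[j]) / total
    return -math.log(joint)

# Assumed toy parameters: V = 3 words, N = 2 hidden units, C = 2 context words.
W = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
W_out = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
targets = [0, 2]

E_simplified = skip_gram_loss(W, W_out, k=1, targets=targets)
E_product = skip_gram_loss_product(W, W_out, k=1, targets=targets)
print(E_simplified, E_product)
```

The log-sum-exp term is computed once and multiplied by C, which is the practical benefit of sharing $\mathbf{W}^{\prime}$ across all outputs.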