Chapter 4: Large Language Models (LLM)
Part 1: Embedding is all you need
Section 5: Word2vec Code Implementation and Applications
I. Introduction to Word2vec
Word2vec is a model proposed by Google for mapping words to dense vectors. It comes in two common architectures:
- CBOW (Continuous Bag of Words): predicts the target word from its surrounding context.
- Skip-Gram: predicts the context words from the target word.
By training a shallow neural network on word–context relationships, it places semantically similar words close together in the vector space.
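For reference, the standard Skip-Gram objective maximizes the average log-probability of the context words around each center word. The code below uses a simplified sigmoid score on positive pairs rather than this full softmax, so treat the formula as background rather than a literal description of the implementation:

$$
\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}
$$

where $T$ is the corpus length, $c$ the window size, $v$ the input ("center") vectors, $v'$ the output ("context") vectors, and $V$ the vocabulary size.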
II. Code Implementation (Skip-Gram Example)
1. Data Preparation and Preprocessing
import torch
import numpy as np

# Toy corpus: a single tokenized sentence
text = "we love deep learning and deep learning loves us".split()

# Build the vocabulary and the word <-> index mappings
vocab = list(set(text))
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for word, i in word2idx.items()}

# Encode the corpus as a sequence of word indices
corpus = [word2idx[word] for word in text]
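To make the preprocessing concrete, the mappings can be inspected directly. The exact indices vary from run to run because set() has no fixed order, so the printed values are purely illustrative:

# Inspect the mappings (index order varies per run)
print(word2idx)   # e.g. {'deep': 0, 'we': 1, ...}
print(corpus)     # the sentence re-encoded as a list of word indices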
2. Building Skip-Gram Training Samples
def generate_skipgram(corpus, window_size=2):
    # Collect (center, context) index pairs within a sliding window
    pairs = []
    for i, center in enumerate(corpus):
        for j in range(-window_size, window_size + 1):
            # Skip the center word itself and positions outside the corpus
            if j == 0 or i + j < 0 or i + j >= len(corpus):
                continue
            context = corpus[i + j]
            pairs.append((center, context))
    return pairs

pairs = generate_skipgram(corpus)
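A quick sanity check of the generated samples can help. The snippet below (illustrative, not part of the original code) maps the first few index pairs back to words:

# Inspect the first few (center, context) pairs as words
for center, context in pairs[:5]:
    print(idx2word[center], "->", idx2word[context])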
3. Building the Model
import torch.nn as nn

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Separate embedding tables for center ("input") and context ("output") words
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)
        self.out_embed = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, center, context):
        # Score each (center, context) pair by the dot product of their embeddings
        center_vec = self.in_embed(center)
        context_vec = self.out_embed(context)
        score = torch.mul(center_vec, context_vec).sum(dim=1)
        return torch.sigmoid(score)
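The forward pass therefore returns sigmoid(center · context), one probability per pair. A quick shape check (the variable names here are illustrative and not part of the original code):

# Batched shape check: 3 (center, context) index pairs -> 3 probabilities
demo = SkipGramModel(len(vocab), 8)
centers = torch.LongTensor([0, 1, 2])
contexts = torch.LongTensor([1, 2, 3])
print(demo(centers, contexts).shape)  # torch.Size([3])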
4. Training the Model
import torch.optim as optim

model = SkipGramModel(len(vocab), 100)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    total_loss = 0
    for center, context in pairs:
        # Each positive (center, context) pair is trained with label 1
        center = torch.LongTensor([center])
        context = torch.LongTensor([context])
        label = torch.ones(1)
        pred = model(center, context)
        loss = criterion(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss:.4f}")
Epoch 0, Loss: 40.0532
Epoch 10, Loss: 0.0395
Epoch 20, Loss: 0.0177
Epoch 30, Loss: 0.0102
Epoch 40, Loss: 0.0067
Epoch 50, Loss: 0.0047
Epoch 60, Loss: 0.0034
Epoch 70, Loss: 0.0026
Epoch 80, Loss: 0.0020
Epoch 90, Loss: 0.0016
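Note that the loop above only ever sees positive pairs with label 1, so the loss can be pushed toward zero without the embeddings having to discriminate between words; full Skip-Gram training also adds negative sampling, i.e. random non-context words with label 0. A minimal sketch of how such negatives could be generated (the helper name and sampling scheme here are assumptions for illustration, not part of the original code):

import random

def generate_negatives(pairs, vocab_size, k=2, seed=0):
    # For each positive (center, context) pair, draw k random word ids
    # as negative "contexts" for the same center word.
    rng = random.Random(seed)
    negatives = []
    for center, context in pairs:
        for _ in range(k):
            neg = rng.randrange(vocab_size)
            while neg == context:
                neg = rng.randrange(vocab_size)
            negatives.append((center, neg))
    return negatives

# These would be trained exactly like the positive pairs above,
# but with label = torch.zeros(1) in the same BCELoss.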
III. Word Vector Visualization
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the learned 100-dim input embeddings down to 2-D with PCA
embeddings = model.in_embed.weight.data.numpy()
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

# Scatter-plot each vocabulary word at its projected position
plt.figure(figsize=(8, 6))
for i, label in enumerate(vocab):
    x, y = reduced[i]
    plt.scatter(x, y)
    plt.text(x + 0.01, y + 0.01, label)
plt.title("Word2Vec Embedding Visualization")
plt.show()
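Since PCA compresses 100 dimensions into 2, it can be worth checking how much variance the plot actually preserves; scikit-learn exposes this directly:

# Fraction of variance captured by the two plotted components
print(pca.explained_variance_ratio_)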
IV. Practical Applications
- Similar-word recommendation: find semantically close words via cosine similarity between word vectors (a small helper is shown below).
- Serving as the underlying word representation in NLP tasks such as sentiment analysis, search and recommendation, and named entity recognition (NER).
def cosine_similarity(vec1, vec2):
    # Cosine of the angle between two vectors
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def most_similar(word, topn=3):
    # Compare the query word's input embedding against every word in the vocabulary
    vec = model.in_embed(torch.LongTensor([word2idx[word]])).detach().numpy()[0]
    sims = []
    for i in range(len(vocab)):
        other = model.in_embed(torch.LongTensor([i])).detach().numpy()[0]
        sims.append((idx2word[i], cosine_similarity(vec, other)))
    sims.sort(key=lambda x: -x[1])
    # Skip the first entry, which is the query word itself (similarity 1.0)
    return sims[1:topn + 1]

print(most_similar("deep"))
[('loves', 0.03819855), ('love', 0.019423524), ('we', -0.09409246)]
Summary

| Item | Details |
|---|---|
| Model structure | Two-layer network: an input layer and an output layer |
| Training scheme | CBOW / Skip-Gram |
| Strengths | Semantically interpretable, computationally efficient |
| Tools | PyTorch / Gensim / TensorFlow |
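For real corpora, the tools listed above are usually preferred over a hand-rolled loop. A minimal Gensim sketch (assuming gensim 4.x, where the embedding size parameter is named vector_size; the variable name w2v is illustrative):

from gensim.models import Word2Vec

# One tokenized sentence as the training corpus (same toy data as above)
sentences = [["we", "love", "deep", "learning", "and", "deep", "learning", "loves", "us"]]

# sg=1 selects Skip-Gram; sg=0 would select CBOW
w2v = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
print(w2v.wv.most_similar("deep", topn=3))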