[Chapter 4: Large Language Models (LLM)] 01. Embedding is all you need - (5) Word2vec Code Implementation and Applications

Chapter 4: Large Language Models (LLM)
Part 1: Embedding is all you need
Section 5: Word2vec Code Implementation and Applications


I. Word2vec Overview

Word2vec is a model proposed by Google for mapping words to vectors. Its two common architectures are:

  • CBOW (Continuous Bag of Words): predicts the target word from its surrounding context.

  • Skip-Gram: predicts the surrounding context from the target word.

It uses a neural network to learn the contextual relationships between words, so that semantically similar words end up close to each other in the vector space.
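
To make the two training setups concrete, here is a toy illustration; the sentence, target word, and window size below are chosen purely for demonstration:

sentence = ["we", "love", "deep", "learning"]
center_idx, window = 2, 1  # target word "deep", context window of 1

# CBOW example: the surrounding words jointly predict the center word
context = sentence[center_idx - window:center_idx] + sentence[center_idx + 1:center_idx + 1 + window]
cbow_example = (context, sentence[center_idx])                      # (['love', 'learning'], 'deep')

# Skip-Gram examples: the center word predicts each surrounding word
skipgram_examples = [(sentence[center_idx], w) for w in context]    # [('deep', 'love'), ('deep', 'learning')]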


II. Code Implementation (Skip-Gram Example)

1. Data Preparation and Preprocessing
import torch
import numpy as np

# Toy corpus: split on whitespace, then build the vocabulary and index mappings
text = "we love deep learning and deep learning loves us".split()
vocab = list(set(text))
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for word, i in word2idx.items()}
corpus = [word2idx[word] for word in text]   # the text as a sequence of indices
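
As a quick sanity check of the preprocessing (the exact ordering inside vocab and word2idx varies between runs, because set() does not guarantee order):

print(vocab)      # the 7 unique tokens, in arbitrary order
print(word2idx)   # word -> index mapping
print(corpus)     # the original text re-encoded as a list of indices
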
2. Building Skip-Gram Training Samples
def generate_skipgram(corpus, window_size=2):
    """Collect (center, context) index pairs within a sliding window."""
    pairs = []
    for i, center in enumerate(corpus):
        for j in range(-window_size, window_size + 1):
            if j == 0 or i + j < 0 or i + j >= len(corpus):
                continue   # skip the center word itself and out-of-range positions
            context = corpus[i + j]
            pairs.append((center, context))
    return pairs

pairs = generate_skipgram(corpus)
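
It can be helpful to inspect a few of the generated pairs decoded back into words; with window_size=2 this corpus yields 30 pairs:

print(len(pairs))   # 30
print([(idx2word[c], idx2word[ctx]) for c, ctx in pairs[:5]])
# [('we', 'love'), ('we', 'deep'), ('love', 'we'), ('love', 'deep'), ('love', 'learning')]
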
3. Building the Model
import torch.nn as nn

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)    # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, embedding_dim)   # context-word vectors

    def forward(self, center, context):
        center_vec = self.in_embed(center)
        context_vec = self.out_embed(context)
        # Dot product between center and context vectors, squashed into (0, 1)
        score = torch.mul(center_vec, context_vec).sum(dim=1)
        return torch.sigmoid(score)
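
Before training, a quick shape check confirms that the model returns one sigmoid score per (center, context) pair; the batch size and embedding dimension below are illustrative only:

m = SkipGramModel(len(vocab), 8)
scores = m(torch.LongTensor([0, 1]), torch.LongTensor([1, 2]))
print(scores.shape)   # torch.Size([2]) -- one score in (0, 1) per pair
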
4. Training the Model
import torch.optim as optim

model = SkipGramModel(len(vocab), 100)   # 100-dimensional word vectors
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    total_loss = 0
    for center, context in pairs:
        center = torch.LongTensor([center])
        context = torch.LongTensor([context])
        label = torch.ones(1)

        pred = model(center, context)
        loss = criterion(pred, label)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss:.4f}")
Sample output:
Epoch 0, Loss: 40.0532
Epoch 10, Loss: 0.0395
Epoch 20, Loss: 0.0177
Epoch 30, Loss: 0.0102
Epoch 40, Loss: 0.0067
Epoch 50, Loss: 0.0047
Epoch 60, Loss: 0.0034
Epoch 70, Loss: 0.0026
Epoch 80, Loss: 0.0020
Epoch 90, Loss: 0.0016
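
Note that this minimal loop trains only on positive (center, context) pairs, all labelled 1, which is why the loss can be driven so close to zero. The full Word2vec objective also pushes the score down for randomly drawn negative pairs (negative sampling). The sketch below shows one simplified way to mix negatives into the same loop; the value of K and the uniform sampler are assumptions made for brevity, whereas the original paper samples negatives from a 3/4-power unigram distribution:

import random

K = 3  # negative samples per positive pair (illustrative choice)
for center, context in pairs:
    # Uniformly sampled negatives; these may occasionally collide with the true
    # context word, which is ignored here for simplicity
    negatives = [random.randrange(len(vocab)) for _ in range(K)]
    centers = torch.LongTensor([center] * (1 + K))
    contexts = torch.LongTensor([context] + negatives)
    labels = torch.cat([torch.ones(1), torch.zeros(K)])

    pred = model(centers, contexts)   # one score per (center, context/negative) pair
    loss = criterion(pred, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()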

III. Word Vector Visualization

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Take the learned input-embedding matrix and project it to 2-D with PCA
embeddings = model.in_embed.weight.data.numpy()
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

plt.figure(figsize=(8,6))
for i, label in enumerate(vocab):
    x, y = reduced[i]
    plt.scatter(x, y)
    plt.text(x+0.01, y+0.01, label)
plt.title("Word2Vec Embedding Visualization")
plt.show()
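
If you want a rough idea of how much structure the 2-D projection preserves, PCA exposes the variance explained per component; on a corpus this small the numbers are not very meaningful, so treat this purely as a diagnostic:

print(pca.explained_variance_ratio_)   # fraction of variance kept by each of the 2 components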

 


IV. Practical Application Scenarios

  • Similar-word recommendation: compute cosine similarity between word vectors to find semantically close words (see the helper below).

  • A word-vector foundation for NLP tasks such as sentiment analysis, search and recommendation, and named entity recognition (NER).

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def most_similar(word, topn=3):
    # Look up the query word's input embedding
    vec = model.in_embed(torch.LongTensor([word2idx[word]])).detach().numpy()[0]
    sims = []
    for i in range(len(vocab)):
        other = model.in_embed(torch.LongTensor([i])).detach().numpy()[0]
        sims.append((idx2word[i], cosine_similarity(vec, other)))
    sims.sort(key=lambda x: -x[1])
    return sims[1:topn + 1]   # skip index 0, which is the word itself (similarity 1.0)

print(most_similar("deep"))
Sample output (the exact neighbours and scores depend on the random initialization):
[('loves', 0.03819855), ('love', 0.019423524), ('we', -0.09409246)]
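
For comparison, the same workflow is available off the shelf in Gensim. The sketch below assumes Gensim 4.x; the hyperparameters simply mirror the toy settings above and are not tuned:

from gensim.models import Word2Vec

sentences = ["we love deep learning and deep learning loves us".split()]
gensim_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)  # sg=1 selects Skip-Gram
print(gensim_model.wv.most_similar("deep", topn=3))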

Summary

  • Model structure: a shallow two-layer network (an input embedding layer and an output layer)
  • Usage modes: CBOW / Skip-Gram
  • Strengths: semantically interpretable vectors, computationally efficient
  • Tools: PyTorch / Gensim / TensorFlow
