PyTorch: Implementing the BERT Training Process by Hand (Condensed Version)

Douzi1024

Tags: python, nlp, natural language processing, deep learning, big data

Article link: https://blog.youkuaiyun.com/Xiao_CangTian/article/details/113706458

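This post shows how to implement the BERT training process by hand in PyTorch. It starts with data preprocessing: building the vocabulary and its index mappings, and setting the hyperparameters. It then walks through the data loading step, covering how the training samples are generated and how the DataLoader is written, with particular attention to the random masking that imitates BERT's training strategy.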

Video walkthrough

Or go straight to this --> GitHub

Imports:

import re
import math
import torch
import numpy as np
from random import *
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data

1. Data Preprocessing

1.1 Building the vocabulary and mappings

text = (
    'Hello, how are you? I am Romeo.\n'                   # R
    'Hello, Romeo My name is Juliet. Nice to meet you.\n' # J
    'Nice to meet you too. How are you today?\n'          # R
    'Great. My baseball team won the competition.\n'      # J
    'Oh Congratulations, Juliet\n'                        # R
    'Thank you Romeo\n'                                   # J
    'Where are you going today?\n'                        # R
    'I am going shopping. What about you?\n'              # J
    'I am going to visit my grandmother. she is not very well' # R
)
sentences = re.sub("[.,!?\\-]", '', text.lower()).split('\n')    # lowercase, strip '.', ',', '!', '?', '-', then split into lines

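# list of the unique words across all sentences (set order is arbitrary and can differ between runs)
word_list = list(set(" ".join(sentences).split()))               # e.g. ['hello', 'how', 'are', 'you',...]

# assign an index to every word in the vocabulary; 0-3 are reserved for the special tokens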
word2idx = {'[PAD]' : 0, '[CLS]' : 1, '[SEP]' : 2, '[MASK]' : 3}
for i, w in enumerate(word_list):
    word2idx[w] = i + 4

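# reverse mapping: index -> word (built from the stored indices rather than from dict iteration order)
idx2word = {i: w for w, i in word2idx.items()}
vocab_size = len(word2idx)         # 40 (36 unique words + 4 special tokens)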

# token: the index of each word in the vocabulary
token_list = list()                # token_list holds the token indices of every sentence
for sentence in sentences:
    arr = [word2idx[s] for s in sentence.split()]
    token_list.append(arr)

Let's take a look at the result:
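One way to inspect the mappings is sketched below. It rebuilds the same kind of vocabulary on a hypothetical two-sentence sample and decodes each token list back into words. All names prefixed with `sample_` are illustrative and not from the original code. The sketch also sorts the words, which is an assumption on my part rather than the author's approach: it keeps the indices identical across runs, whereas `list(set(...))` may order the words differently each time.

```python
import re

# Hypothetical two-sentence sample; the article's full `text` is handled the same way.
sample_text = (
    'Hello, how are you? I am Romeo.\n'
    'Hello, Romeo My name is Juliet. Nice to meet you.'
)
sample_sentences = re.sub(r"[.,!?\-]", '', sample_text.lower()).split('\n')

# sorted() fixes the word order, so every run produces the same indices
sample_words = sorted(set(" ".join(sample_sentences).split()))

# indices 0-3 are reserved for the special tokens, as in the article
sample_word2idx = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3}
for i, w in enumerate(sample_words):
    sample_word2idx[w] = i + 4
sample_idx2word = {i: w for w, i in sample_word2idx.items()}

# one list of token indices per sentence
sample_tokens = [[sample_word2idx[w] for w in s.split()] for s in sample_sentences]

# print each token list next to the sentence it decodes back to
for ids in sample_tokens:
    print(ids, '->', " ".join(sample_idx2word[i] for i in ids))
```

If the decoded text matches the cleaned sentence for every line, `word2idx` and `idx2word` are consistent with each other.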
