A Detailed Walkthrough of the Open-Source minGPT Project and Its Example Application (PyTorch)

Latest recommended article published on 2025-11-15 02:07:58

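Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.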

Tags: #pytorch #artificial-intelligence #GPT #generative-pre-trained-model #deep-learning #open-source


Video walkthrough of the code: minGPT example application, code explained

Open-source code 1: https://github.com/karpathy/minGPT

Open-source code 2: https://github.com/KeepTryingTo/DeepLearning/tree/main/LLM/minGPT

Related post: Transformer-Based Machine Translation, implemented with the PyTorch deep learning framework, plus a small web page built with gradio

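Interestingly, until a few days ago I had never come across the minGPT project. I found it while reading the code for the VQ-GAN paper, whose transformer component is built on minGPT. I was excited to discover it, and a little sorry that it took me this long.

Before reading this post, please read my earlier post "Transformer-Based Machine Translation". With that background, this post on minGPT should be easier to follow, because both models generate later content from earlier content. GPT is, after all, a generative pre-trained model.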

About the author: Andrej Karpathy. His open-source home page:

https://github.com/karpathy

The author's home page speaks for itself. Impressive!

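minGPT is a PyTorch re-implementation of GPT that covers both training and inference. It aims to be small, clean, interpretable, and educational, since most currently available GPT implementations can be somewhat sprawling. GPT is not a complicated model, and this implementation is about 300 lines of code. All it does is feed a sequence of indices into a Transformer and produce a probability distribution over the next index in the sequence. Most of the complexity lies in batching cleverly, both across examples and over sequence length, for efficiency. The author's later rewrite, nanoGPT, moves from a purely educational focus toward something that is still simple and hackable but has practical value (it reproduces medium-sized industrial benchmarks and accepts some trade-offs for runtime efficiency). (Excerpted from the open-source project.)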

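Contents

1. Instantiating GPT-2 and training it

2. GPT background

1. Why "generative"?

2. Why "pre-trained"?

3. Comparison with discriminative models

4. The evolution of GPT

3. Core model code

4. Core generation code

5. Core evaluation code (using the addition example)

6. Test code (addition example)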

1. Instantiating GPT-2 and training it

from mingpt.model import GPT
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2'
model_config.vocab_size = 50257 # openai's model vocabulary
model_config.block_size = 1024  # openai's model block_size (i.e. input context length)
model = GPT(model_config)

# your subclass of torch.utils.data.Dataset that emits example
# torch LongTensor of lengths up to 1024, with integers from [0,50257)
train_dataset = YourDataset()

from mingpt.trainer import Trainer
train_config = Trainer.get_default_config()
train_config.learning_rate = 5e-4 # many possible options, see the file
train_config.max_iters = 1000
train_config.batch_size = 32
trainer = Trainer(train_config, model, train_dataset)
trainer.run()
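
The snippet above leaves YourDataset as a placeholder. As a rough sketch of what such a dataset is expected to emit, the class below slices a string into overlapping windows and returns (x, y) pairs in which y is x shifted left by one token. CharWindows is a hypothetical name of my own, not part of minGPT, and it returns plain Python lists so the indexing is easy to read. For use with Trainer you would subclass torch.utils.data.Dataset and return torch.long tensors instead.

```python
# Hypothetical helper, not part of minGPT: it only illustrates the (x, y)
# pairs that a next-token dataset hands to the trainer.
class CharWindows:
    def __init__(self, text, block_size):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # char -> token id
        self.block_size = block_size
        self.ids = [self.stoi[ch] for ch in text]

    def get_vocab_size(self):
        return len(self.stoi)

    def __len__(self):
        # every item needs block_size inputs plus one extra token for the target
        return len(self.ids) - self.block_size

    def __getitem__(self, i):
        chunk = self.ids[i : i + self.block_size + 1]
        x = chunk[:-1]  # inputs
        y = chunk[1:]   # targets: the same window shifted left by one
        return x, y


ds = CharWindows("hello world", block_size=4)
x, y = ds[0]
print(x, y)  # [3, 2, 4, 4] [2, 4, 4, 5]  ("hell" -> "ello")
```

Each position in x therefore has its own target in y, which is how one window of length block_size yields block_size training examples at once.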

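2. GPT background

1. Why "generative"?

Generation-oriented: GPT produces text token by token in an autoregressive manner (for example, continuing an article or replying in a dialogue). In essence it models the probability distribution of sequence data, P(next token | preceding context).
Training method: it uses a language-modeling objective with a cross-entropy loss, learning the regularities of language by predicting the next token in a text, rather than a discriminative task such as classification or tagging.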
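
To make the training objective concrete, here is a small sketch of the next-token cross-entropy loss, written in plain Python so the arithmetic stays visible. The logits are made-up numbers for a 3-token vocabulary, and next_token_loss is an illustrative helper of my own rather than a minGPT function. In minGPT the same quantity is computed on tensors with a cross-entropy call.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def next_token_loss(logits_per_step, tokens):
    """Mean of -log P(tokens[t+1] | tokens[:t+1]) over all input positions."""
    losses = []
    for t, logits in enumerate(logits_per_step):
        target = tokens[t + 1]  # the token that actually follows position t
        losses.append(-math.log(softmax(logits)[target]))
    return sum(losses) / len(losses)


tokens = [0, 2, 1]
logits_per_step = [
    [0.0, 0.0, 0.0],  # after token 0: the model is undecided (uniform)
    [0.0, 5.0, 0.0],  # after tokens 0, 2: the model is confident in token 1
]
loss = next_token_loss(logits_per_step, tokens)
print(round(loss, 4))
```

The first position contributes log 3 because the prediction is uniform, while the second contributes almost nothing because the model puts nearly all its probability on the correct token.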

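2. Why "pre-trained"?

Two-stage training:
1. Pre-training: train on massive unlabeled text to learn general-purpose language representations (GPT-3, for example, was trained on hundreds of billions of tokens).
2. Fine-tuning: adapt the model to a specific task (such as question answering or summarization) with a small amount of labeled data.
Frozen parameters: the pre-trained parameters can serve as a strong feature extractor, and in some settings no fine-tuning is needed at all (zero-shot or prompt-based inference).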

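3. Comparison with discriminative models

Property	Generative (e.g. GPT)	Discriminative (e.g. BERT)
Goal	Generate new text	Classify or label existing text
Training task	Language modeling (predict the next token)	Masked language modeling + next-sentence prediction
Typical applications	Dialogue, creative writing, translation	Text classification, named-entity recognition
Data dependence	Mostly unlabeled text	Unlabeled text for pre-training; task-specific labeled data for fine-tuning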

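4. The evolution of GPT

GPT-1 (2018): demonstrated that generative pre-training works.
GPT-2 (2019): scaled up the parameter count and showed zero-shot ability.
GPT-3 (2020): about 175 billion parameters, a breakthrough in few-shot learning.
GPT-4 (2023): multimodal input and stronger reasoning.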

3. Core model code

import math

import torch
import torch.nn as nn
from torch.nn import functional as F


class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        # positions to the right of the current token are set to -inf, so softmax gives them zero weight
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            act     = NewGELU(),  # tanh-approximation GELU module defined in mingpt/model.py
            dropout = nn.Dropout(config.resid_pdrop),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x)))) # MLP forward

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x
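
The key step in CausalSelfAttention is the masked_fill with -inf followed by softmax. The sketch below reproduces just that step in plain Python for a single head, using a made-up 3x3 score matrix. causal_attention_weights is an illustrative helper of my own, not minGPT code. It relies on the fact that exp(-inf) is 0, so masking and then applying softmax is the same as normalizing over the visible positions only.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][:t+1]; future positions get weight 0."""
    T = len(scores)
    weights = []
    for t in range(T):
        visible = scores[t][: t + 1]  # position t may only look at 0..t
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (T - t - 1))
    return weights


# Large scores are placed on "future" positions on purpose: the mask must ignore them.
scores = [
    [0.0, 9.0, 9.0],
    [1.0, 1.0, 9.0],
    [0.0, 0.0, 0.0],
]
w = causal_attention_weights(scores)
for row in w:
    print([round(p, 3) for p in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

The resulting matrix is lower triangular, matching the torch.tril buffer registered in the constructor: the first token attends only to itself, and the large scores of 9.0 on later positions have no effect.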

4. Core generation code

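    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
            # forward the model to get the logits for every position in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by the desired temperature;
            # the shape goes from (batch_size, seq_len, vocab_size) to (batch_size, vocab_size)
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf') # set logits outside the top k to -inf
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # either sample from the distribution or take the most likely element
            if do_sample:
                idx_next = torch.multinomial(probs, num_samples=1)
            else:
                # greedy decoding: pick the token with the highest probability
                _, idx_next = torch.topk(probs, k=1, dim=-1)
            # append the chosen index to the running sequence and continue until max_new_tokens is reached
            idx = torch.cat((idx, idx_next), dim=1)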

        return idx
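
To see what temperature and top_k do to the distribution inside generate, here is a plain-Python sketch of the same two steps applied to a single made-up logit vector. next_token_probs is an illustrative helper of my own, and it assumes top_k is no larger than the vocabulary size.

```python
import math


def next_token_probs(logits, temperature=1.0, top_k=None):
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        kth = sorted(scaled, reverse=True)[top_k - 1]  # k-th largest logit
        # keep logits >= the k-th largest, push everything else to -inf
        scaled = [v if v >= kth else float("-inf") for v in scaled]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]  # exp(-inf) evaluates to 0.0
    z = sum(exps)
    return [e / z for e in exps]


logits = [2.0, 1.0, 0.0, -1.0]
p = next_token_probs(logits)
p_sharp = next_token_probs(logits, temperature=0.5)
p_top2 = next_token_probs(logits, top_k=2)

print([round(v, 3) for v in p])
print([round(v, 3) for v in p_sharp])
print([round(v, 3) for v in p_top2])
```

A temperature below 1 sharpens the distribution toward the largest logit, while top_k=2 zeroes out everything except the two best candidates and renormalizes over them. With do_sample=False neither setting changes the result, because the argmax stays the same.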

5. Core evaluation code (using the addition example)

    def eval_split(trainer, split, max_batches=None):
        dataset = {'train':train_dataset, 'test':test_dataset}[split]
        ndigit = config.data.ndigit
        results = []
        mistakes_printed_already = 0

        factors = torch.tensor([[10**i for i in range(ndigit+1)][::-1]]).to(trainer.device)
        loader = DataLoader(dataset, batch_size=100, num_workers=0, drop_last=False)
        for b, (x, y) in enumerate(loader):
            x = x.to(trainer.device)
            # isolate the two operands: the first 2*ndigit tokens of the input sequence
            d1d2 = x[:, :ndigit*2]
            # feed the two operands to the model and let it generate the rest of the sequence
            d1d2d3 = model.generate(d1d2, ndigit+1, do_sample=False) # using greedy argmax, not sampling
            # isolate the last ndigit+1 tokens of the generated sequence, i.e. the predicted sum
            d3 = d1d2d3[:, -(ndigit+1):]
            d3 = d3.flip(1) # reverse the digits to their "normal" order
            # decode the two operands from individual digits to integers
            d1i = (d1d2[:,:ndigit] * factors[:,1:]).sum(dim = 1)
            d2i = (d1d2[:,ndigit:ndigit*2] * factors[:,1:]).sum(dim = 1)
            # decode the generated result in the same way
            d3i_pred = (d3 * factors).sum(1)
            d3i_gt = d1i + d2i # manually calculate the ground truth
            # evaluate the correctness of the results in this batch
            correct = (d3i_pred == d3i_gt).cpu() # Software 1.0 vs. Software 2.0 fight RIGHT on this line haha
            for i in range(x.size(0)):
                results.append(int(correct[i]))
                if not correct[i] and mistakes_printed_already < 5: # only print up to 5 mistakes
                    mistakes_printed_already += 1
                    print("GPT claims that %d + %d = %d but gt is %d" % (d1i[i], d2i[i], d3i_pred[i], d3i_gt[i]))
            if max_batches is not None and b+1 >= max_batches:
                break
        rt = torch.tensor(results, dtype=torch.float)
        print("%s final score: %d/%d = %.2f%% correct" % (split, rt.sum(), len(results), 100*rt.mean()))
        return rt.sum()
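
The decoding arithmetic in eval_split is easier to follow on one concrete row. The sketch below walks through it in plain Python for ndigit = 2. The token sequence is made up to look like a correct answer for 85 + 50, it is not real model output, and digits_to_int is an illustrative helper that plays the role of the multiply-by-factors-and-sum on tensors.

```python
ndigit = 2
factors = [10 ** i for i in range(ndigit + 1)][::-1]  # place values [100, 10, 1]


def digits_to_int(digits, weights):
    # same idea as (digits * factors).sum(dim=1) on tensors
    return sum(d * w for d, w in zip(digits, weights))


# "85" + "50" followed by the sum 135 written in reverse as 5, 3, 1
seq = [8, 5, 5, 0, 5, 3, 1]

d1 = digits_to_int(seq[:ndigit], factors[1:])            # operands have only ndigit digits
d2 = digits_to_int(seq[ndigit:2 * ndigit], factors[1:])
d3 = seq[-(ndigit + 1):][::-1]                           # flip back to most-significant-first
pred = digits_to_int(d3, factors)

print(d1, d2, pred)  # 85 50 135
```

The operands use factors[1:] because they have ndigit digits, while the sum can carry into one more digit and therefore uses all ndigit + 1 place values.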

6. Test code (addition example)

def demo():
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    config = get_config()

    train_dataset = AdditionDataset(config.data, split='train')
    config.model.vocab_size = train_dataset.get_vocab_size()  # the vocabulary is just the digits 0-9
    config.model.block_size = train_dataset.get_block_size()
    ndigit = int(config.data.ndigit)
    model = GPT(config.model).to(device)

    weight_path = r'outputs/adder/model.pt'
    checkpoint = torch.load(weight_path, map_location='cpu')
    model.load_state_dict(checkpoint)
    model.eval()  # disable dropout so that greedy decoding is deterministic

    def add(a, b, ndigit):
        c = a + b
        # encode the digits of a, b, c into strings
        astr = f'%0{ndigit}d' % a
        bstr = f'%0{ndigit}d' % b
        cstr = (f'%0{ndigit + 1}d' % c)[::-1]  # reverse c to make addition easier
        render = astr + bstr + cstr
        dix = [int(s) for s in render]  # convert each character to its token index
        # x will be input to GPT and y will be the associated expected outputs
        # (GPT is decoder-only, so there is no separate encoder input as in the translation model)
        x = torch.tensor(dix[:-1], dtype=torch.long)  # drop the last token
        y = torch.tensor(dix[1:], dtype=torch.long)  # drop the first token: predict the next token in the sequence
        # we will only train in the output locations; -1 will mask the loss to zero
        y[:ndigit * 2 - 1] = -1

        return x,y
    # place values used for decoding, e.g. [[100, 10, 1]] when ndigit == 2
    factors = torch.tensor([[10 ** i for i in range(ndigit + 1)][::-1]]).to(device)
    while True:
        a = int(input("Enter the value of a: "))
        b = int(input("Enter the value of b: "))
        x, y = add(a,b,ndigit)
        x = x.unsqueeze(dim = 0).to(device)

        # isolate the two operands: the first 2*ndigit tokens of the input sequence
        d1d2 = x[:, :ndigit * 2]
        # feed the two operands to the model and let it generate the rest of the sequence
        d1d2d3 = model.generate(d1d2, ndigit + 1, do_sample=False)  # using greedy argmax, not sampling
        # isolate the last ndigit+1 tokens of the generated sequence, i.e. the predicted sum
        d3 = d1d2d3[:, -(ndigit + 1):]
        d3 = d3.flip(1)  # reverse the digits to their "normal" order
        # decode the two operands from individual digits to integers
        d1i = (d1d2[:, :ndigit] * factors[:, 1:]).sum(dim=1)
        d2i = (d1d2[:, ndigit:ndigit * 2] * factors[:, 1:]).sum(dim=1)
        # decode the generated result in the same way
        d3i_pred = (d3 * factors).sum(1)
        d3i_gt = d1i + d2i  # manually calculate the ground truth

        print(f'{a} + {b} prediction = {d3i_pred.item()} and gt is {d3i_gt.item()}')
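
For reference, here is what the add helper produces for one pair of numbers, rewritten as a plain-Python sketch that returns lists instead of tensors. encode_addition is my own name for it, it uses f-string format specs in place of the % formatting above, and it assumes that a and b each fit in ndigit digits; larger inputs would produce a longer sequence than the model expects.

```python
def encode_addition(a, b, ndigit):
    c = a + b
    astr = f"{a:0{ndigit}d}"               # zero-padded operand, e.g. 7 -> "07"
    bstr = f"{b:0{ndigit}d}"
    cstr = f"{c:0{ndigit + 1}d}"[::-1]     # the sum, one digit wider, written in reverse
    dix = [int(ch) for ch in astr + bstr + cstr]
    x = dix[:-1]                           # model input
    y = dix[1:]                            # next-token targets
    # only the digits of the sum contribute to the loss; -1 marks ignored positions
    y[: ndigit * 2 - 1] = [-1] * (ndigit * 2 - 1)
    return x, y


x, y = encode_addition(85, 50, ndigit=2)
print(x)  # [8, 5, 5, 0, 5, 3]
print(y)  # [-1, -1, -1, 5, 3, 1]
```

Writing the sum in reverse lets the model emit the least significant digit first, which is the order in which carries are resolved in manual addition. In the demo only the first 2 * ndigit tokens of x are fed to generate, so the known answer inside x is never shown to the model.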
