Llama 2 code experiment notes

Original article

Last modified 2024-03-24 10:27:10

License: CC 4.0 BY-SA

Tags: artificial intelligence, machine learning, deep learning, algorithms

First published 2024-03-13 13:16:14

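The code is launched in distributed mode with torchrun, so debugging it from a local IDE against a cloud environment needs some extra setup; see here for details. Every path argument passed in should be an absolute path.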
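As a rough sketch of what such a setup can involve (an assumption for illustration, not the steps of the guide referenced above): torchrun normally exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to each worker, so a single-process debug run can supply them itself before the script initializes torch.distributed. The helper names, the port value and the example file name below are hypothetical.

```python
import os


def set_single_process_env(port: int = 29500) -> dict:
    """Fill in the variables torchrun would normally export, for one process.

    Existing values are kept, so a real torchrun launch is not overridden.
    """
    defaults = {
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),  # hypothetical choice; any free port works
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)
    return {key: os.environ[key] for key in defaults}


def to_absolute(path: str) -> str:
    """Resolve a path argument to an absolute path, expanding '~'."""
    return os.path.abspath(os.path.expanduser(path))


if __name__ == "__main__":
    print(set_single_process_env())
    print(to_absolute("tokenizer.model"))  # example file name only
```

Calling set_single_process_env() at the top of the debug entry point, and passing every path through to_absolute(), lets the IDE start the script as an ordinary single Python process.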

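1. The input sentences

2. Tokenization by the tokenizer

3. Main components of the model

4. My own small experiments along the way

Summary

Model structure (the test only builds a 16-layer model)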

1. The input sentences

These are the prompts. The ones used in the example are:

['I believe the meaning of life is', 'Simply put, the theory of relativity states that ', 'A brief message congratulating the team on the launch:\n\n        Hi everyone,\n        \n        I just ', 'Translate English to French:\n        \n        sea otter => loutre de mer\n        peppermint => menthe poivrée\n        plush girafe => girafe peluche\n        cheese =>']

2. Tokenization by the tokenizer

Examples:

# input
'I believe the meaning of life is'

# tokenizer output
[306, 4658, 278, 6593, 310, 2834, 338]

# after prepending BOS
[1, 306, 4658, 278, 6593, 310, 2834, 338]

# input
'Simply put, the theory of relativity states that '

# tokenizer output
[3439, 17632, 1925, 29892, 278, 6368, 310, 14215, 537, 5922, 393, 29871]

# after prepending BOS
[1, 3439, 17632, 1925, 29892, 278, 6368, 310, 14215, 537, 5922, 393, 29871]
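The BOS step in these examples amounts to prepending a single ID. A minimal sketch, assuming BOS ID 1 as in the outputs above; the EOS ID of 2 and the function name are assumptions for illustration, and the token IDs are copied from the example rather than produced by a real tokenizer.

```python
from typing import List

BOS_ID = 1  # matches the leading 1 in the outputs above
EOS_ID = 2  # assumption: EOS is not shown in this article


def add_special_tokens(ids: List[int], bos: bool = True, eos: bool = False) -> List[int]:
    """Return a new list with BOS prepended and/or EOS appended."""
    out = list(ids)
    if bos:
        out = [BOS_ID] + out
    if eos:
        out = out + [EOS_ID]
    return out


if __name__ == "__main__":
    ids = [306, 4658, 278, 6593, 310, 2834, 338]  # 'I believe the meaning of life is'
    print(add_special_tokens(ids))  # [1, 306, 4658, 278, 6593, 310, 2834, 338]
```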

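The tokenizer turns each sentence into a sequence of integer token IDs. The mapping is not one ID per word, though: some words map to two IDs. This comes down to how sentencepiece works; see here. Out of curiosity, I explored a little further: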

'I believe the meaning of life is'
[306, 4658, 278, 6593, 310, 2834, 338]

'I believe, the meaning of life is '
[306, 4658, 29892, 278, 6593, 310, 2834, 338, 29871]

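So, besides words, punctuation and spaces also count as part of the sentence: in the second example, 29892 appears where the comma was added and 29871 where the trailing space was added.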

'Simply' >> [3439, 17632]
'simply' >> [3763]

'Big' >> [7997]
'big' >> [4802]

I will stop exploring here. For details, look into how sentencepiece works, and into byte-pair encoding (BPE).
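To make the BPE idea concrete, here is a toy sketch of merge learning on a tiny made-up corpus. It is not how sentencepiece is implemented (the real tokenizer also handles spaces, unknown bytes and a much larger vocabulary), and the function names and corpus are hypothetical. It only shows why a frequent lowercase word can end up as one piece while its rarer capitalized form is split in two.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def count_pairs(words: Dict[Tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words: Counter = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word: str, merges: List[Pair]) -> List[str]:
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_merges(["simply simply simply Simply"], num_merges=5)
    print(encode("simply", merges))  # ['simply']
    print(encode("Simply", merges))  # ['S', 'imply']
```

With only five merges available, the frequent form 'simply' is fully merged while 'Simply' stays in two pieces, which is the same kind of effect as the 'Simply' versus 'simply' result above.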

3. Main components of the model

1. RMSNorm

A normalization method; see here for details.
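For orientation, RMSNorm divides a vector by its root mean square and multiplies by a learnable per-dimension gain; unlike LayerNorm it does not subtract the mean or add a bias. A minimal pure-Python sketch follows. The function name and the eps value are assumptions for illustration, and the model code itself operates on tensors rather than lists.

```python
import math
from typing import List, Optional


def rms_norm(x: List[float], weight: Optional[List[float]] = None, eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by a per-dimension gain."""
    if weight is None:
        weight = [1.0] * len(x)  # assumption: gain of ones, i.e. no learned scaling yet
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)  # eps guards against division by zero
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    y = rms_norm([3.0, 4.0])
    print(y)
    print(sum(v * v for v in y) / len(y))  # close to 1.0
```

After normalization with a gain of ones, the mean of the squared outputs is approximately 1, whatever the scale of the input.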

2. Attention computation

A transformer decoder with a mask applied; the main point is that it uses RoPE positional encoding.
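The two ingredients named here can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the code under study: the function names are hypothetical, the base theta of 10000 is assumed, and consecutive elements are paired up and rotated as complex numbers. RoPE rotates each pair of a query or key by an angle proportional to the token position, and the causal mask blocks attention to later positions.

```python
import cmath
import math
from typing import List


def rope(x: List[float], position: int, theta: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles."""
    dim = len(x)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out: List[float] = []
    for i in range(0, dim, 2):
        freq = theta ** (-i / dim)  # pair index k = i / 2 gives theta^(-2k / dim)
        rotated = complex(x[i], x[i + 1]) * cmath.exp(1j * position * freq)
        out.extend([rotated.real, rotated.imag])
    return out


def causal_mask(seq_len: int) -> List[List[float]]:
    """Additive mask: 0 where a query may attend, -inf for later (future) positions."""
    return [[0.0 if k <= q else -math.inf for k in range(seq_len)] for q in range(seq_len)]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product, standing in for one attention score before scaling."""
    return sum(u * v for u, v in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.25]
    # The score depends only on the relative offset between positions (2 in both cases).
    print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 2), rope(k, 0)))
    print(causal_mask(3))
```

The mask is added to the attention scores before the softmax, so masked positions receive zero weight. The printed pair of scores agrees up to floating-point error, which is the property that makes RoPE encode relative position.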
