Preparation:
Training sample (prompt + target):
context: Instruction: 你现在是一个很厉害的阅读理解器,严格按照人类指令进行回答。
Input: 句子中包含了哪些信息,输出json:
如何演好自己的角色,请读《演员自我修养》《喜剧之王》周星驰崛起于穷困潦倒之中的独门秘笈。
Answer:
target: ```json
[{"predicate": "主演", "object_type": "人物", "subject_type": "影视作品", "object": "周星驰", "subject": "喜剧之王"}]
```
Special token IDs from the model's config.json:
"bos_token_id": 130004,
"eos_token_id": 130005,
"mask_token_id": 130000,
"gmask_token_id": 130001,
"pad_token_id": 3,
Assembling the input
Step 1: encode the prompt
The prompt text is:
Instruction: 你现在是一个很厉害的阅读理解器,严格按照人类指令进行回答。\nInput: 句子中包含了哪些信息,输出json:\n\n如何演好自己的角色,请读《演员自我修养》《喜剧之王》周星驰崛起于穷困潦倒之中的独门秘笈。\nAnswer:
prompts_ids:
[37010, 12, 5, 76331, 83362, 92831, 103593, 64464, 6, 77115, 65077, 72863, 63891, 66207, 63823, 4, 3430, 12, 5, 74351, 63833, 77756, 66263, 64086, 6, 66835, 2031, 12, 4, 4, 64185, 66686, 92394, 66316, 6, 64157, 64406, 63845, 65721, 64936, 70796, 66050, 73649, 75305, 63846, 87470, 72919, 63888, 66597, 67680, 117819, 64871, 65171, 63825, 65575, 64066, 105400, 63823, 4, 13049, 12]
Note that the prompt ends with "Answer:", which is there to cue the model to produce the continuation.
Step 2: encode the target
The target text is:
```json\n[{"predicate": "主演", "object_type": "人物", "subject_type": "影视作品", "object": "周星驰", "subject": "喜剧之王"}]\n```
target_ids:
[5, 125827, 2031, 4, 127903, 38861, 83, 28, 72426, 57, 28, 1932, 24, 317, 83, 28, 65210, 57, 28, 9832, 24, 317, 83, 28, 92609, 57, 28, 1932, 83, 28, 87470, 57, 28, 9832, 83, 28, 73649, 75305, 127731, 4, 125827]
Step 3: merge the prompt and target into input_ids
Run input_ids = tokenizer.build_inputs_with_special_tokens(prompts_ids, target_ids); this method concatenates the prompt and the target and inserts the special tokens, yielding input_ids.
input_ids layout: [prompt tokens, <gmask>, <bos>, target tokens, <eos>]
[37010, 12, 5, 76331, 83362, 92831, 103593, 64464, 6, 77115, 65077, 72863, 63891, 66207, 63823, 4, 3430, 12, 5, 74351, 63833, 77756, 66263, 64086, 6, 66835, 2031, 12, 4, 4, 64185, 66686, 92394, 66316, 6, 64157, 64406, 63845, 65721, 64936, 70796, 66050, 73649, 75305, 63846, 87470, 72919, 63888, 66597, 67680, 117819, 64871, 65171, 63825, 65575, 64066, 105400, 63823, 4, 13049, 12, 130001, 130004, 5, 125827, 2031, 4, 127903, 38861, 83, 28, 72426, 57, 28, 1932, 24, 317, 83, 28, 65210, 57, 28, 9832, 24, 317, 83, 28, 92609, 57, 28, 1932, 83, 28, 87470, 57, 28, 9832, 83, 28, 73649, 75305, 127731, 4, 125827, 130005]
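Steps 1-3 can be reproduced with a short snippet. A minimal sketch, assuming the same ChatGLM-6B tokenizer as above; `prompt` and `target` are placeholders standing for the exact strings shown in steps 1 and 2:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

prompt = "Instruction: ...\nInput: ...\nAnswer:"  # placeholder for the step-1 prompt text
target = "```json\n[{...}]\n```"                  # placeholder for the step-2 target text

# Step 1/2: encode prompt and target separately, without any special tokens.
prompts_ids = tokenizer.encode(prompt, add_special_tokens=False)
target_ids = tokenizer.encode(target, add_special_tokens=False)

# Step 3: concatenate and insert the special tokens:
# [prompt tokens, <gmask>, <bos>, target tokens, <eos>]
input_ids = tokenizer.build_inputs_with_special_tokens(prompts_ids, target_ids)
```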
Step 4: derive label_ids from input_ids
label_ids layout: [-100 at the prompt and <gmask> positions, <bos>, target tokens, <eos>]
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 125827, 2031, 4, 127903, 38861, 83, 28, 72426, 57, 28, 1932, 24, 317, 83, 28, 65210, 57, 28, 9832, 24, 317, 83, 28, 92609, 57, 28, 1932, 83, 28, 87470, 57, 28, 9832, 83, 28, 73649, 75305, 127731, 4, 125827, 130005]
Note: locate the <gmask> position in input_ids and set everything up to and including it (equivalently, everything before the <bos>) to -100.
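Continuing from the snippet above, a minimal sketch of how label_ids can be derived, using the bos_token_id (130004) from config.json:
```python
bos_token_id = 130004  # from config.json

# Everything before <bos> (the prompt tokens plus the <gmask>) is masked out of the loss.
context_length = input_ids.index(bos_token_id)
label_ids = [-100] * context_length + input_ids[context_length:]
```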
Step 5: right-pad input_ids and label_ids
input_ids:
[37010, 12, 5, 76331, 83362, 92831, 103593, 64464, 6, 77115, 65077, 72863, 63891, 66207, 63823, 4, 3430, 12, 5, 74351, 63833, 77756, 66263, 64086, 6, 66835, 2031, 12, 4, 4, 64185, 66686, 92394, 66316, 6, 64157, 64406, 63845, 65721, 64936, 70796, 66050, 73649, 75305, 63846, 87470, 72919, 63888, 66597, 67680, 117819, 64871, 65171, 63825, 65575, 64066, 105400, 63823, 4, 13049, 12, 130001, 130004, 5, 125827, 2031, 4, 127903, 38861, 83, 28, 72426, 57, 28, 1932, 24, 317, 83, 28, 65210, 57, 28, 9832, 24, 317, 83, 28, 92609, 57, 28, 1932, 83, 28, 87470, 57, 28, 9832, 83, 28, 73649, 75305, 127731, 4, 125827, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 125827, 2031, 4, 127903, 38861, 83, 28, 72426, 57, 28, 1932, 24, 317, 83, 28, 65210, 57, 28, 9832, 24, 317, 83, 28, 92609, 57, 28, 1932, 83, 28, 87470, 57, 28, 9832, 83, 28, 73649, 75305, 127731, 4, 125827, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
Note: the maximum input length here is 200, so both input_ids and label_ids are padded to length 200: input_ids is padded with 3 (the pad_token_id) and label_ids with -100.
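A minimal sketch of the right-padding step, assuming a fixed maximum length of 200 and the pad_token_id of 3 from config.json:
```python
max_seq_length = 200
pad_token_id = 3  # from config.json

pad_len = max_seq_length - len(input_ids)
input_ids = input_ids + [pad_token_id] * pad_len  # pad inputs with pad_token_id
label_ids = label_ids + [-100] * pad_len          # pad labels with -100 so padding is ignored by the loss
```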
Notes:
- top_k during generation: when generating each token, only the top-k (here the top 50) highest-probability tokens are considered, and the next token is resampled from that subset (a toy sampling sketch follows the list).
- tokenizer.special_tokens_map: records the special tokens registered in this tokenizer, such as bos_token, eos_token, and their encodings.
- ChatGLM-6B: its vocabulary contains roughly 130K tokens, covering Chinese characters, Chinese words, English word pieces, punctuation, and so on. The prompts in this project cover both a classification task and a keyword-extraction task, and each type of task uses 3-4 different prompt phrasings (presumably to prevent overfitting?).
- SPO extraction: similar to NER. Subject entity (subject), inter-entity relation (predicate), and object entity (object); the relationship between the three can be summarized as "the subject's predicate is the object".
- Masks in ChatGLM come in two kinds: a single word is masked with [MASK], while a span of multiple consecutive words is masked with [gMASK].
- BPE tokenization: Byte-Pair Encoding builds subword units by repeatedly merging the most frequent pair of adjacent symbols, which keeps the vocabulary size manageable and handles unseen words effectively. It starts at the character level and merges adjacent symbols into new tokens according to their co-occurrence frequency to build the vocabulary. My take: split at the character level first, then merge frequent adjacent pairs into subwords, which keeps the vocabulary small (a toy merge example follows the list).
- With LoRA applied, ChatGLM-6B still has about 6.1B parameters, of which only about 3M are trainable, i.e. roughly 0.05% of the total. In the training code, parameters whose names contain ['bias', 'LayerNorm.weight'] are excluded from weight decay, while all other parameters use weight decay to help prevent overfitting (see the training sketch after this list).
- If LoRA is used, the LoRA weights and the base-model weights need to be merged when saving: call merge_and_unload() to obtain the merged model and then save it (also shown in the sketch after this list).
- The peft package makes it very convenient to add a LoRA structure to a model.
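On the top_k note: a toy sketch of top-k sampling over a probability vector, just to illustrate the "keep only the k most probable tokens and resample" idea (not ChatGLM's actual decoding code); in transformers this corresponds to passing do_sample=True, top_k=50 to generate():
```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int = 50) -> int:
    """Sample the next token id after keeping only the k most probable tokens."""
    top_idx = np.argsort(probs)[-k:]                    # indices of the k largest probabilities
    top_probs = probs[top_idx] / probs[top_idx].sum()   # renormalize over the kept tokens
    return int(np.random.choice(top_idx, p=top_probs))
```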
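On the BPE note: a toy sketch of the merge loop described above (count adjacent symbol pairs, merge the most frequent one, repeat); the tiny corpus and the number of merges are made up purely for illustration:
```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word (tuple of symbols): frequency} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Start from character-level words, then repeatedly merge the most frequent pair.
vocab = {tuple("lower"): 2, tuple("lowest"): 6, tuple("newer"): 3}
for _ in range(5):
    best_pair = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best_pair, vocab)
print(vocab)  # frequent adjacent pairs have been merged into longer subword units
```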
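On the last three notes: a minimal training-setup sketch using peft, showing LoRA adapters, the weight-decay grouping, and the merge-before-save step; all hyperparameter values (r, lora_alpha, target_modules, learning rate) and the output path are illustrative assumptions, not values taken from the original project:
```python
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# Wrap the base model with LoRA adapters; only the adapter weights are trainable.
lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1,
                         target_modules=["query_key_value"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a few million trainable params out of ~6.1B

# Exclude bias and LayerNorm weights from weight decay (with pure LoRA most of
# these are frozen anyway; the grouping matters once extra layers are unfrozen).
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {"params": [p for n, p in model.named_parameters()
                if p.requires_grad and not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters()
                if p.requires_grad and any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=2e-5)

# ... training loop ...

# Before saving, merge the LoRA weights back into the base model.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("chatglm-6b-lora-merged")
```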