Text Decoding Principles: A MindNLP Example

一 Background

This post covers decoding (token-selection) strategies for text generation. A core task of GPT models (and other Transformer-based generative models) is to predict the probability distribution of the next word (or token) given the input context. Generation proceeds one token at a time:

① Input: a text sequence (usually called the prompt or context).

② Processing: the model passes the input through multiple Transformer layers and produces a probability for every possible token.

③ Sampling: a decoding strategy (such as greedy search, beam search, or a sampling method) selects one token from this probability distribution.

④ Output: the selected token is appended to the current sequence, and the loop either continues or terminates.
At every step, then, there is the question of which token to pick next. Each chosen token influences all subsequent predictions, so the choice of decoding strategy has a large impact on generation quality. MindNLP provides the strategies described below.
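The four-step loop above can be sketched with a toy stand-in for the model. The bigram table `TOY_MODEL` below is purely hypothetical (it is not GPT-2 or any MindNLP API); it simply maps the last token to a next-token distribution:

```python
# A minimal sketch of the token-by-token decoding loop, using a
# hypothetical bigram table in place of a real Transformer model.
TOY_MODEL = {
    "<bos>":   {"I": 0.6, "We": 0.4},
    "I":       {"enjoy": 0.7, "like": 0.3},
    "enjoy":   {"walking": 0.8, "<eos>": 0.2},
    "walking": {"<eos>": 1.0},
}

def generate_greedy(prompt, max_length=10):
    """Repeat: get the next-token distribution, pick the argmax, append."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        dist = TOY_MODEL[tokens[-1]]           # step ②: next-token probabilities
        next_token = max(dist, key=dist.get)   # step ③: greedy selection
        tokens.append(next_token)              # step ④: extend the sequence
        if next_token == "<eos>":
            break
    return tokens

print(generate_greedy(["<bos>"]))
# → ['<bos>', 'I', 'enjoy', 'walking', '<eos>']
```

The same loop structure underlies every strategy in this post; only the selection rule in step ③ changes.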

二 Decoding Methods

1 Greedy Search

# greedy_search

from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
# load the GPT-2 tokenizer (text-to-token-id conversion)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Calling generate on GPT2LMHeadModel directly performs greedy decoding: at each step it selects the highest-probability token, with max_length capping the output length:

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. 
I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll

Pros: simple.

Cons: it misses high-probability words hidden behind a low-probability word; that is, it settles for a local optimum and does not consider the sequence as a whole.
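A tiny numeric example (with hypothetical probabilities) makes the flaw concrete: greedy commits to the most probable first token and can never reach the globally best two-token sequence hiding behind a lower-probability start:

```python
# Hypothetical two-step probability tree illustrating greedy's local optimum.
step1 = {"nice": 0.5, "dog": 0.4, "car": 0.1}
step2 = {"nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
         "dog":  {"has": 0.9, "runs": 0.05, "and": 0.05},
         "car":  {"drives": 0.6, "is": 0.4}}

# Greedy picks "nice" first (0.5 > 0.4), then its best continuation:
greedy_first = max(step1, key=step1.get)
greedy_prob = step1[greedy_first] * max(step2[greedy_first].values())

# But the highest-probability two-token sequence starts with "dog":
best_prob = max(step1[w] * max(step2[w].values()) for w in step1)

print(greedy_prob, best_prob)  # greedy path ≈ 0.2, best path ≈ 0.36
```

"dog has" (0.4 × 0.9 = 0.36) beats greedy's "nice woman" (0.5 × 0.4 = 0.2), but greedy discarded "dog" at the first step and cannot recover.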

2 Beam Search

Beam search keeps the num_beams most likely sequences at each time step and finally returns the highest-probability one, lowering the risk of losing a high-probability sequence.
The model is again the built-in GPT-2 loaded with model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id), and behavior is controlled through generation parameters: num_beams (how many candidates to keep per step), no_repeat_ngram_size (a repetition constraint on the output), and num_return_sequences (how many sequences to return).
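Before the MindNLP calls, here is a minimal self-contained sketch of the beam-search idea over a hypothetical bigram table (an illustration, not the MindNLP implementation): each step expands every surviving beam, scores candidates by summed log-probability, and keeps the best num_beams.

```python
import math

# Hypothetical next-token distributions standing in for a real model.
PROBS = {
    "<bos>": {"The": 0.6, "A": 0.4},
    "The":   {"dog": 0.5, "cat": 0.5},
    "A":     {"dog": 0.9, "cat": 0.1},
    "dog":   {"<eos>": 1.0},
    "cat":   {"<eos>": 1.0},
}

def beam_search(start, num_beams=2, max_steps=4):
    # Each beam is (log-probability, token list); keep the top num_beams.
    beams = [(0.0, [start])]
    for _ in range(max_steps):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == "<eos>":
                candidates.append((logp, seq))   # finished beam carries over
                continue
            for tok, p in PROBS[seq[-1]].items():
                candidates.append((logp + math.log(p), seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1]

print(beam_search("<bos>"))
# → ['<bos>', 'A', 'dog', '<eos>']
```

Note that greedy decoding on this table would start with "The" (0.6) and end at probability 0.3, while the beam keeps "A" alive long enough to find the better sequence "A dog" (0.4 × 0.9 = 0.36).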

# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# set return_num_sequences > 1
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

The different settings produce noticeably different results:

I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll
----------------------------------------------------------------------------------------------------
Beam search with ngram, Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
----------------------------------------------------------------------------------------------------
return_num_sequences, Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to move on.
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to get back to
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it might be time for me to take a
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a step
----------------------------------------------------------------------------------------------------

Pros: it preserves, to some extent, the best-scoring paths through the sequence.

Cons: 1. it does not solve the repetition problem and still produces repeated output; 2. it performs poorly in open-ended generation.
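The no_repeat_ngram_size constraint used above mitigates repetition via n-gram blocking. A minimal sketch of the idea (illustrative, not the MindNLP implementation): before choosing the next token, ban any token that would recreate an n-gram already present in the generated sequence.

```python
# Sketch of the n-gram blocking behind no_repeat_ngram_size=2.
def banned_next_tokens(tokens, ngram_size=2):
    """Return tokens that would complete an already-seen n-gram."""
    if len(tokens) < ngram_size:
        return set()
    prefix = tuple(tokens[-(ngram_size - 1):])   # last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        # If an earlier n-gram starts with the same prefix, ban its last token.
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned

seq = ["I", "am", "sure", "I", "am"]
print(banned_next_tokens(seq))  # {'sure'}: the bigram "am sure" already occurred
```

Blocking is a blunt instrument: with no_repeat_ngram_size=2, a phrase like a city name can only ever appear once, which is why the value must be chosen with care.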

3 Sampling

Sampling selects the output word w_t at random according to the current conditional probability distribution, introducing randomness.
Top-K sampling restricts each step's output distribution to the n most probable tokens, renormalizes the probabilities over that set, and then samples from it.
Pros: high diversity in the generated text.
Cons: the text can be incoherent, and the model may produce nonsense.
To avoid excessive deviation, Top-P (nucleus) sampling instead samples from the smallest set of words whose cumulative probability exceeds p, again renormalizing.
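A minimal sketch of how Top-K and Top-P filtering can be combined on a next-token distribution (illustrative only; real frameworks apply these filters to logits before the softmax):

```python
# Sketch of combined Top-K and Top-P (nucleus) filtering.
def top_k_top_p_filter(dist, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p; renormalize the survivors."""
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        items = items[:top_k]
    if top_p < 1.0:
        kept, cum = [], 0.0
        for tok, p in items:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:          # smallest set reaching the threshold
                break
        items = kept
    total = sum(p for _, p in items)  # renormalize over the surviving tokens
    return {tok: p / total for tok, p in items}

dist = {"dog": 0.5, "cat": 0.3, "car": 0.15, "is": 0.05}
print(top_k_top_p_filter(dist, top_k=3, top_p=0.75))
# keeps "dog" and "cat", renormalized to sum to 1
```

Sampling then draws from the filtered, renormalized distribution, so low-probability tail tokens can never be chosen, while the surviving tokens keep their relative likelihoods.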

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

import mindspore
mindspore.set_seed(0)
# set top_k = 5, top_p = 0.95, and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=5,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

The output is indeed more free-form:

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog.

My dog loves to run, and when I was in the gym I had my dog play with a toy box and a box of treats. I had to make a decision. My wife and I were
1: I enjoy walking with my cute dog, but I also love to watch a movie and have a great time. I love to watch the movies, so I'm always looking forward to seeing what they have planned for me.

My dog is a
2: I enjoy walking with my cute dog.


三 Reflections

Building a language model like GPT is no longer a single, simple modeling task; it is more of an engineering effort in which every component can be tuned, and different configurations yield different generation behavior. This post covered three classic selection methods, greedy search, beam search, and probabilistic sampling, each of which has variants worth exploring further. I also learned about caching designs that accelerate repeated model inputs; techniques like these are exactly what make deploying large models practical.
