mindspore打开第十四天文本解码原理1-优快云博客

本文链接：https://blog.youkuaiyun.com/qq_39306047/article/details/140193657

## __文本解码原理\-\-以MindNLP为例__

### 回顾：自回归语言模型

__根据前文预测下一个单词__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8f6e55c907ef7d2b616e8e3c4da76b065633c2ae/Season2.step_into_llm/03.Decoding/example.png"></div>

__一个文本序列的概率分布可以分解为每个词基于其上文的条件概率的乘积__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8d9110b0d7b8840d536880b7fceaf0df1b3e8354/Season2.step_into_llm/03.Decoding/P%e5%85%ac%e5%bc%8f.png"></div>

* 𝑊_0:初始上下文单词序列
* 𝑇: 时间步
* 当生成EOS标签时，停止生成。

__MindNLP/huggingface Transformers提供的文本生成方法__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8f6e55c907ef7d2b616e8e3c4da76b065633c2ae/Season2.step_into_llm/03.Decoding/3possible.png"></div>

__Greedy search__

在每个时间步𝑡都简单地选择概率最高的词作为当前输出词:

𝑤_𝑡=𝑎𝑟𝑔𝑚𝑎𝑥_𝑤 𝑃(𝑤|𝑤_(1:𝑡−1))

按照贪心搜索输出序列("The","nice","woman") 的条件概率为：0.5 x 0.4 = 0.2

缺点: 错过了隐藏在低概率词后面的高概率词，如：dog=0.5, has=0.9
![image.png](attachment:image.png =600x600)

#greedy_search

from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# activate beam search and early_stopping
beam_output = model.generate(
input_ids,
max_length=50,
num_beams=5,
early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# set no_repeat_ngram_size to 2
beam_output = model.generate(
input_ids,
max_length=50,
num_beams=5,
no_repeat_ngram_size=2,
early_stopping=True
)

print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# set return_num_sequences > 1
beam_outputs = model.generate(
input_ids,
max_length=50,
num_beams=5,
no_repeat_ngram_size=2,
num_return_sequences=5,
early_stopping=True
)

# now we have 3 output sequences
print("return_num_sequences, Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
print(100 * '-')