BERT Study Notes, Part 2: create_pretraining_data.py

Latest recommended article published on 2024-09-06 17:00:43

陌筱北

License: CC 4.0 BY-SA

Category: Notes. Tags: BERT

Article link: https://blog.youkuaiyun.com/moxiaobeiMM/article/details/84545788

The following three flags are required:

flags.DEFINE_string("input_file", None,
                    "Input raw text file (or comma-separated list of files).")

flags.DEFINE_string(
    "output_file", None,
    "Output TF example file (or comma-separated list of files).")

flags.DEFINE_string("vocab_file", None,
                    "The vocabulary file that the BERT model was trained on.")

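input_file: the raw text file used for pretraining (or a comma-separated list of files).
The official repository includes an example: sample_text.txt
Input file format:
(1) One sentence per line. These should be actual sentences, not whole paragraphs or arbitrary spans of text, because the "next sentence prediction" task relies on sentence boundaries.
(2) A blank line between documents. Document boundaries are needed so that the "next sentence prediction" task does not span across documents.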
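The two formatting rules above can be illustrated with a minimal sketch that writes a corpus in the expected layout. The helper name write_pretraining_text and the sample sentences are hypothetical and are not part of the BERT repository; the sketch only assumes that the corpus is already split into documents and sentences.

```python
import io


def write_pretraining_text(documents, stream):
    """Write documents as one sentence per line, with a blank line between documents.

    `documents` is a list of documents; each document is a list of sentence strings.
    This helper is hypothetical and only illustrates the expected file layout.
    """
    for doc_number, sentences in enumerate(documents):
        if doc_number > 0:
            stream.write("\n")  # a blank line marks the document boundary
        for sentence in sentences:
            stream.write(sentence.strip() + "\n")  # one sentence per line


# Made-up sample corpus with two unrelated documents.
documents = [
    ["The first document starts here.", "It has a second sentence."],
    ["Another document begins.", "It is unrelated to the first."],
]

buffer = io.StringIO()
write_pretraining_text(documents, buffer)
formatted_text = buffer.getvalue()
print(formatted_text)
```

In practice the same function could be pointed at a real file opened with open(path, "w", encoding="utf-8") instead of the in-memory buffer.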

output_file: the output file. The script writes TF example files here (or to a comma-separated list of files).

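vocab_file: the vocabulary file. You can use the one shipped with the official pretrained models:
your_path\uncased_L-12_H-768_A-12\vocab.txt (English, uncased)
your_path\chinese_L-12_H-768_A-12\vocab.txt (Chinese only)

You can also supply your own vocabulary file.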
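As a rough illustration of what loading the vocabulary amounts to, the sketch below reads a vocab file with one token per line and maps each token to its line index. It uses plain open() rather than the TensorFlow file API used by the repository, and the seven-token vocabulary is a made-up stand-in for a real vocab.txt.

```python
import collections
import os
import tempfile


def load_vocab(vocab_path):
    """Map each token (one per line) to its zero-based line index."""
    vocab = collections.OrderedDict()
    with open(vocab_path, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


# Write a tiny hypothetical vocabulary so the example is self-contained.
sample_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as writer:
        writer.write("\n".join(sample_tokens) + "\n")
    vocab = load_vocab(vocab_path)

# The list of vocabulary words is later used to pick random replacement tokens.
vocab_words = list(vocab.keys())
print(vocab["[CLS]"], vocab_words[5])
```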

[Screenshot: loading the vocab]

Creating the training data

def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  ...

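After the data in input_file is read, it is stored in all_documents: a list of documents, where each document is a list of sentences and each sentence is a list of tokens.
[Screenshot: the stored data]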
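The nested structure can be sketched as follows. This is a simplified stand-in, not the repository code: read_documents is a hypothetical name, the input is a list of lines rather than files, and a lowercase whitespace split replaces the WordPiece tokenizer. Dropping empty documents and shuffling the result follow what the original script does.

```python
import random


def read_documents(lines, tokenize, rng):
    """Group tokenized sentences into documents, splitting on blank lines."""
    all_documents = [[]]
    for line in lines:
        line = line.strip()
        if not line:
            all_documents.append([])  # blank line: start a new document
            continue
        tokens = tokenize(line)
        if tokens:
            all_documents[-1].append(tokens)
    # Remove empty documents, then shuffle the document order.
    all_documents = [doc for doc in all_documents if doc]
    rng.shuffle(all_documents)
    return all_documents


# Made-up input: two documents separated by a blank line.
sample_lines = [
    "The cat sat down.",
    "It fell asleep.",
    "",
    "Stock prices rose today.",
    "Analysts were surprised.",
]

# Assumption: a lowercase whitespace split stands in for the real tokenizer.
all_documents = read_documents(
    sample_lines, lambda text: text.lower().split(), random.Random(12345))

for document in all_documents:
    print(document)
```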

The data is then tokenized, [MASK] tokens are inserted, and the markers [CLS] and [SEP] are added to delimit the sentence segments.

instances.extend(
    create_instances_from_document(
        all_documents, document_index, max_seq_length, short_seq_prob,
        masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
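To make the steps inside this call more concrete, here is a heavily simplified sketch of building one training instance. It is not the repository implementation: the real function packs several sentences into each segment and truncates to max_seq_length, whereas this sketch uses a single adjacent sentence pair. The helper names build_pair, add_special_tokens, and mask_tokens and the toy data are hypothetical.

```python
import random


def build_pair(all_documents, doc_index, sent_index, rng):
    """Return (tokens_a, tokens_b, is_random_next) for one sentence pair."""
    document = all_documents[doc_index]
    tokens_a = document[sent_index]
    if len(all_documents) > 1 and rng.random() < 0.5:
        # Random next: sample sentence B from a different document.
        other_index = rng.choice(
            [i for i in range(len(all_documents)) if i != doc_index])
        return tokens_a, rng.choice(all_documents[other_index]), True
    # Actual next: sentence B follows sentence A in the same document.
    return tokens_a, document[sent_index + 1], False


def add_special_tokens(tokens_a, tokens_b):
    """Build [CLS] A [SEP] B [SEP] and the matching segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


def mask_tokens(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
    """Choose positions to predict: 80% [MASK], 10% unchanged, 10% random word."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    output = list(tokens)
    labels = {}  # position -> original token
    for index in sorted(candidates[:num_to_predict]):
        labels[index] = tokens[index]
        roll = rng.random()
        if roll < 0.8:
            output[index] = "[MASK]"
        elif roll >= 0.9:
            output[index] = rng.choice(vocab_words)
        # otherwise (10% of the time) the original token is kept
    return output, labels


rng = random.Random(12345)
# Made-up tokenized documents and vocabulary words.
all_documents = [
    [["the", "cat", "sat"], ["it", "fell", "asleep"]],
    [["prices", "rose"], ["analysts", "were", "surprised"]],
]
vocab_words = ["the", "cat", "sat", "it", "fell", "asleep", "prices", "rose"]

tokens_a, tokens_b, is_random_next = build_pair(all_documents, 0, 0, rng)
tokens, segment_ids = add_special_tokens(tokens_a, tokens_b)
masked_tokens, masked_lm_labels = mask_tokens(tokens, 0.15, 20, vocab_words, rng)
print(masked_tokens, segment_ids, is_random_next, masked_lm_labels)
```

In the original script this per-document step is repeated dupe_factor times over all documents, so the same text yields several instances with different masks.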
