Step-by-Step BERT Explanation & Implementation, Part 1: Preprocessing

BERT Text Preprocessing

In the field of Natural Language Processing (NLP), the shortage of training data is one of the biggest challenges. Because NLP problems are very diverse and distinct, task-specific datasets have few human-labeled examples. Deep-learning-based models, however, benefit greatly from millions, if not billions, of examples to achieve accuracy close to or on par with humans. To work around this lack of data, researchers have developed numerous techniques for training general-purpose language representation models on the enormous amount of unlabeled text available on the web. This is known as pre-training, and pre-trained models can then be fine-tuned with much smaller datasets for specific NLP tasks such as question answering and sentiment analysis. Recent results on many such tasks have shown huge improvements over the traditional approach of training from scratch, which requires huge amounts of task-specific data.

In November 2018, Google open-sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. BERT builds upon previous pre-trained representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT. Unlike these models, however, BERT is the first deeply bidirectional, unsupervised language representation.

Pre-trained representations can be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. For instance, in the sentence "I would like to sell my investment account", a context-free model would represent the word investment the same way as it would in "top of an investment club". A unidirectional contextual model would represent investment based only on "I would like to sell my", but not on "account". BERT, however, uses both the previous and the following context to represent the word investment, and it does so starting from the very bottom of a deep neural network, making it deeply bidirectional.

The image below shows a visualization of the BERT representation compared with OpenAI GPT and ELMo.

Unidirectional models are trained by predicting each word conditioned on the previous words. This technique cannot be used directly for bidirectional models, however, because conditioning on both the preceding and the following words would let each word indirectly "see itself" in a multi-layer model. To overcome this problem, BERT masks a random 15% of the tokens in the training data and learns to predict them. For example:

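Since the original figure is not reproduced here, below is a minimal sketch of the idea. The mask_tokens helper is hypothetical and only for illustration; it is not part of the original code, and BERT's full scheme also sometimes substitutes a random token or keeps the original token, which is omitted here.

import random

def mask_tokens(tokens, mask_prob=0.15):
    # Replace roughly 15% of the tokens with [MASK] (simplified illustration)
    return [tok if random.random() > mask_prob else "[MASK]" for tok in tokens]

tokens = "I would like to sell my investment account".split()
print(mask_tokens(tokens))
# e.g. ['I', 'would', 'like', 'to', '[MASK]', 'my', 'investment', 'account']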

Now that we understand why one would implement BERT to solve a task-specific NLP problem, let’s dive right in.

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tqdm import tqdm, trange
import pandas as pd
import pickle  # used below to save the label dictionaries
import io
import numpy as np
import matplotlib.pyplot as plt

MAX_LEN = 128  # maximum sequence length for padding/truncation (see Part 2)

First, let's import the necessary libraries. If you don't have one of them, simply run pip install {library name}, or conda install {library name} if you are using conda. We set the maximum truncation length to 128 tokens; we will come back to "truncation" in Part 2. We are using PyTorch for this implementation; a TensorFlow implementation will be covered in a separate post.

def bert_id_cat(df):
    # Map each label in the 'Scenario' column to a numerical id
    df['category_id'] = df['Scenario'].factorize()[0]
    category_id_df = df[['Scenario', 'category_id']].drop_duplicates().sort_values('category_id')
    # Lookup dictionaries in both directions: label -> id and id -> label
    category_to_id = dict(category_id_df.values)
    id_to_category = dict(category_id_df[['category_id', 'Scenario']].values)
    return category_to_id, id_to_category

Here, I'm using a dataset in which the Scenario column serves as the label column. We need to build dictionaries that convert labels into numerical ids and vice versa for easy conversion.

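As a quick sanity check, here is a minimal usage sketch with a small, made-up DataFrame; only the Scenario column name comes from the function above, and the label values are hypothetical:

import pandas as pd

df = pd.DataFrame({"Scenario": ["billing", "refund", "billing", "shipping"]})
category_to_id, id_to_category = bert_id_cat(df)
print(category_to_id)   # {'billing': 0, 'refund': 1, 'shipping': 2}
print(id_to_category)   # {0: 'billing', 1: 'refund', 2: 'shipping'}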

def save_dict(category_to_id, id_to_category):
    # Persist both lookup dictionaries so they can be reused at inference time
    with open("bert_model/category_to_id_bert.pickle", "wb") as f:
        pickle.dump(category_to_id, f)
    with open("bert_model/id_to_category_bert.pickle", "wb") as f:
        pickle.dump(id_to_category, f)

Because we want to reuse the created dictionaries later, we define a save function that pickles them to disk.

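For completeness, here is a minimal sketch of loading the dictionaries back at inference time (this loader is not part of the original code):

def load_dict():
    # Read back the pickled label dictionaries saved by save_dict
    with open("bert_model/category_to_id_bert.pickle", "rb") as f:
        category_to_id = pickle.load(f)
    with open("bert_model/id_to_category_bert.pickle", "rb") as f:
        id_to_category = pickle.load(f)
    return category_to_id, id_to_category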

def bert_preprocess(sentences):
    sentences_list = []
    for sent in sentences:
        # Drop everything from the first occurrence of "www" onward (strips URLs)
        sent_find = sent.find("www")
        if sent_find != -1:
            sent = sent[:sent_find - 1]
        # Clean the text with the author's own pretext helper library
        sent = pretext.remove_special_characters(str(sent))
        sent = pretext.expand_contractions(sent)
        # Wrap each example with BERT's special tokens
        sent = "[CLS] " + str(sent) + " [SEP]"
        sentences_list.append(sent)
    return sentences_list

In order to feed data into the BERT model in the correct form, we wrap each example so that it starts with [CLS] and ends with [SEP]. In addition, I'm using my own pretext library to clean each sentence. You don't need this library to preprocess the texts, but if you would like the file, let me know in the comments below.

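To make the expected format concrete, here is a minimal sketch of the URL stripping and [CLS]/[SEP] wrapping on their own, leaving out the pretext cleaning (the example sentence and URL are made up):

sentence = "I would like to sell my investment account www.example.com"
sent_find = sentence.find("www")
if sent_find != -1:
    sentence = sentence[:sent_find - 1]  # cut off the URL and the space before it
print("[CLS] " + sentence + " [SEP]")
# [CLS] I would like to sell my investment account [SEP]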

def tokenizedTexts(tokenizer, sentences):
    tokenized_texts = []
    for sent in sentences:
        sent = tokenizer.tokenize(sent)
        # BERT handles at most 512 tokens: truncate longer sequences
        # and make sure they still end with [SEP]
        if len(sent) > 512:
            sent = sent[:511]
            sent.extend(['[SEP]'])
        tokenized_texts.append(sent)
    return tokenized_texts

We tokenize the preprocessed texts into WordPiece tokens. The pre-trained BERT model can only handle sequences of up to 512 tokens, so we limit our sentences to 512 tokens, truncating longer ones and re-appending the final [SEP].

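The code above assumes a tokenizer object but never shows how it is created. Here is a minimal sketch, assuming the Hugging Face transformers package and the bert-base-uncased vocabulary (both are assumptions; the original post may have loaded the tokenizer differently):

from transformers import BertTokenizer  # assumption: Hugging Face transformers is installed

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# A sentence already wrapped by bert_preprocess
sentences = ["[CLS] i would like to sell my investment account [SEP]"]
print(tokenizedTexts(tokenizer, sentences)[0])
# e.g. ['[CLS]', 'i', 'would', 'like', 'to', 'sell', 'my', 'investment', 'account', '[SEP]']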

Now we are ready to create input ids and start on the attention masking process. We will continue with this in Part 2 of the series.

Translated from: https://medium.com/swlh/step-by-step-bert-explanation-implementation-part-1-preprocessing-dbfa19787302
