NLP Fundamentals: Contextual Word Representations and Pretraining

This article introduces three deep learning models for natural language processing: ELMo, GPT, and BERT. ELMo obtains context-sensitive word embeddings by pretraining a bidirectional LSTM; GPT uses the Transformer architecture for unsupervised pretraining; BERT introduces a bidirectional Transformer and is pretrained with masked-token prediction and next-sentence prediction, significantly improving performance on NLP tasks.

1. ELMo
The basic idea of ELMo is to pretrain a bidirectional LSTM on a large corpus with a language-modeling objective and take contextual embeddings from the LSTM layers: the lower LSTM layers capture relatively simple syntactic information, while the upper layers capture context-dependent semantic information. ELMo stands for Embeddings from Language Models. For a downstream task, the vectors from the different layers are linearly combined and then used for supervised learning.
The ELMo procedure is:

  1. First, train a bidirectional LSTM on a large corpus with a language-modeling objective;
  2. Then, use the LSTM to produce word representations.
     The ELMo model contains multiple layers of bidirectional LSTMs, which can be understood as follows: the states of the higher LSTM layers capture the context-dependent aspects of a word's meaning (useful, for example, for word-sense disambiguation), while the lower layers capture syntactic features (useful, for example, for part-of-speech tagging).
More concretely, for N tokens $(t_1, t_2, \ldots, t_N)$, the forward language model estimates the probability of $t_k$ given $(t_1, t_2, \ldots, t_{k-1})$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

The backward language model estimates the probability of $t_k$ given $(t_{k+1}, \ldots, t_N)$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, \ldots, t_N)$$
The bidirectional LSTM combines the two; its objective is to maximize the joint log-likelihood of both directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$

where $\Theta_x$ denotes the token-representation parameters and $\Theta_s$ the softmax-layer parameters, both shared between the forward and backward LSTMs.
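To make this objective concrete, the following is a minimal, self-contained PyTorch sketch of the idea — not the original ELMo implementation (which uses character-level CNN inputs and much larger models); the sizes, module names, and the random toy batch are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes; real ELMo is far larger.
vocab_size, embed_dim, hidden_dim, num_layers = 1000, 64, 128, 2

embedding = nn.Embedding(vocab_size, embed_dim)                            # shared token representation (Theta_x)
fwd_lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)    # forward language model
bwd_lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)    # backward language model
softmax_head = nn.Linear(hidden_dim, vocab_size)                           # shared softmax layer (Theta_s)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (4, 20))   # fake batch: 4 sequences of 20 token ids
embeds = embedding(tokens)

# Forward LM: predict t_k from t_1 .. t_{k-1}.
fwd_out, _ = fwd_lstm(embeds[:, :-1])
fwd_loss = loss_fn(softmax_head(fwd_out).reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))

# Backward LM: predict t_k from t_{k+1} .. t_N, implemented by reversing the sequence.
rev_embeds = torch.flip(embeds, dims=[1])
bwd_out, _ = bwd_lstm(rev_embeds[:, :-1])
bwd_targets = torch.flip(tokens, dims=[1])[:, 1:]
bwd_loss = loss_fn(softmax_head(bwd_out).reshape(-1, vocab_size), bwd_targets.reshape(-1))

# Maximizing the joint log-likelihood == minimizing the sum of the two cross-entropy losses.
loss = fwd_loss + bwd_loss
loss.backward()
```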
With the contextual embeddings produced by ELMo pretraining, accuracy improves significantly across a wide range of downstream NLP tasks.
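The linear combination of the different layers mentioned earlier can be illustrated with a small "scalar mix" sketch; the layer weights and the scale below are hypothetical task-specific parameters that would be learned during supervised training on the downstream task:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: embedding layer + 2 biLSTM layers, batch of 4, 20 tokens, 256-dim vectors.
num_layers, batch, seq_len, dim = 3, 4, 20, 256

layer_reps = torch.randn(num_layers, batch, seq_len, dim)   # stacked per-layer representations h_{k,j}
s = nn.Parameter(torch.zeros(num_layers))                   # task-specific layer weights (softmax-normalized)
gamma = nn.Parameter(torch.ones(1))                         # task-specific global scale

weights = torch.softmax(s, dim=0)                           # s_j
elmo = gamma * (weights.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)   # ELMo vector for each token
print(elmo.shape)   # torch.Size([4, 20, 256])
```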
2. GPT
GPT stands for Generative Pre-Training. Like the later BERT model, it is built on the Transformer architecture.
The core idea of GPT is to use a Transformer for unsupervised learning on large amounts of text, with a language-modeling objective that maximizes the probability of the observed token sequences; the language model here is only forward (unidirectional), not bidirectional. Once these embeddings are obtained, supervised fine-tuning is performed on the downstream task.
GPT training has two stages: 1) unsupervised pretraining of the language model; 2) fine-tuning for each task.
[Figure: GPT model architecture]
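The two stages can be sketched with Hugging Face's `transformers` library, using the public GPT-2 checkpoint as a convenient stand-in for the original GPT (the checkpoint and class choices below are assumptions for illustration, not the original training code):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Unsupervised pretraining maximizes the probability of observed text."
inputs = tokenizer(text, return_tensors="pt")

# Stage 1: the causal (left-to-right) LM objective. Passing the input ids as labels makes
# the model return the average -log p(t_k | t_1, ..., t_{k-1}) over the sequence.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"causal LM loss: {outputs.loss.item():.3f}")

# Stage 2 (fine-tuning) would reuse the same pretrained Transformer body with a small
# task-specific head (e.g. a sequence-classification head) trained on labeled data.
```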
3. BERT
BERT is similar in principle to GPT, but it exploits bidirectional information, hence its full name: Bidirectional Encoder Representations from Transformers.

BERT's unsupervised pre-training has two objectives:
One is to mask k% of the words in the input text and predict the masked words.
The other is to predict whether one sentence appears immediately after another.
During pretraining, these two objectives are optimized on large amounts of text, and the model is then fine-tuned for the specific task. Because BERT's Transformer architecture captures global information well and it exploits bidirectional context, it outperforms earlier approaches and substantially improved performance across NLP tasks.
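The masked-prediction objective can be sketched with Hugging Face's `transformers` as follows; this is a simplified illustration (randomly masking about 15% of the non-special tokens and computing the loss only at masked positions), not BERT's exact 80/10/10 masking recipe, and the example sentence and forced mask position are arbitrary:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pretraining learns contextual representations from raw text.",
                   return_tensors="pt")
input_ids = inputs["input_ids"]

# Randomly choose roughly 15% of the non-special tokens to mask.
labels = input_ids.clone()
candidates = (input_ids != tokenizer.cls_token_id) & (input_ids != tokenizer.sep_token_id)
mask = (torch.rand(input_ids.shape) < 0.15) & candidates
mask[0, 3] = True                         # ensure at least one masked position in this tiny example
labels[~mask] = -100                      # only masked positions contribute to the loss
masked_ids = input_ids.clone()
masked_ids[mask] = tokenizer.mask_token_id

# The model predicts the original tokens at the masked positions.
outputs = model(input_ids=masked_ids, attention_mask=inputs["attention_mask"], labels=labels)
print(f"masked LM loss: {outputs.loss.item():.3f}")
```

The next-sentence prediction objective is handled analogously at pretraining time; in `transformers` it is exposed through classes such as `BertForNextSentencePrediction` and `BertForPreTraining`.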

### Mask Prediction in Computer Vision and NLP

Mask prediction is a critical concept in both computer vision and natural language processing (NLP), where it involves predicting missing or masked parts of data. The details of mask prediction in these two domains are outlined below.

#### Mask Prediction in Computer Vision

In computer vision, mask prediction often refers to techniques used for tasks such as image segmentation, object detection, and inpainting. One prominent example is the use of masks in instance segmentation models like Mask R-CNN. These models predict pixel-level masks for each detected object in an image, enabling precise localization and classification of objects[^1]. The process typically involves:

- Detecting regions of interest (RoIs) in the image.
- Predicting class labels and bounding boxes for these regions.
- Generating binary masks that correspond to the object shapes within the bounding boxes.

#### Mask Prediction in Natural Language Processing

In NLP, mask prediction is central to transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers). In this context, mask prediction involves predicting masked tokens in a sentence during pretraining. This is achieved through:

- Randomly masking some tokens in the input sequence.
- Using the transformer architecture to predict the original values of these masked tokens based on contextual information from surrounding tokens[^4].

The bidirectional nature of transformers allows them to effectively capture left-to-right and right-to-left contexts, improving the accuracy of predictions.

#### Techniques and Applications

Both fields employ advanced techniques to enhance mask prediction performance. For instance, gradient-based methods or leave-one-out approaches can be utilized to determine the importance of specific inputs in the prediction process[^3]. Additionally, large language models have demonstrated capabilities in various predictive tasks, including those involving masked inputs, showcasing their versatility in handling complex data patterns[^2].

```python
# Example of mask prediction in NLP using Hugging Face's transformers library
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

text = "The capital of France is [MASK]."
input_ids = tokenizer.encode(text, return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

outputs = model(input_ids)
predictions = outputs.logits

predicted_token_id = torch.argmax(predictions[0, mask_token_index]).item()
predicted_word = tokenizer.decode([predicted_token_id])
print(f"Predicted word: {predicted_word}")
```
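For the computer-vision side described above, a comparable sketch uses torchvision's pretrained Mask R-CNN; the `weights="DEFAULT"` argument assumes a recent torchvision release, and the random tensor merely stands in for a real image:

```python
import torch
import torchvision

# Load a Mask R-CNN pretrained on COCO ("DEFAULT" weights assume torchvision >= 0.13).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)   # placeholder RGB image tensor with values in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]

# Each detection comes with a class label, a bounding box, a confidence score,
# and a per-pixel soft mask for that instance.
print(prediction["labels"].shape, prediction["boxes"].shape, prediction["masks"].shape)
```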