POS Tagging (Part 1)


POS tagging is the task of associating each word in a given text with its correct part of speech. It is essentially a sequence labeling problem, that is, it assigns a label to every element of a sequence (e.g., "She sells seashells" → PRON VERB NOUN). Other examples of sequence labeling are tokenization, Semantic Role Labeling (SRL), and Word Sense Disambiguation.

RNN

So far, when computing the representation of a review, we have treated it as a bag of words: we did not take into account the order in which the words appear in the text. But the same words in a different order can mean something completely different:

  1. 1-star review: This LG television is very bad, not as the previous model that was very good
  2. 5-star review: This LG television is very good, not as the previous model that was very bad

Since the order in which words appear is crucial for understanding meaning, we now introduce some sequence models.

RNNs are a family of neural networks for modeling sequential data. Since the length of a sequence is not known in advance, the problem cannot be handled by a plain feed-forward network; instead, an RNN uses a memory mechanism to accumulate information from the sequence up to the current time step. That is, when processing the word at position i, we have access to the information at position i-1 and at all the positions before it.

To implement the memory mechanism, an RNN produces two outputs: one for the current time step and one that summarizes the part of the sequence processed so far.

The RNN uses two different weight matrices to model the memory and the input x; the output at time t is:

RNN(x_{t}) = act(x_{t}W_{x} + b_{x} + m_{t-1}W_{m} + b_{m})
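
A minimal sketch of this recurrence in PyTorch (hypothetical code, not from the course notebook; W_x, W_m and the biases mirror the formula above, and tanh is assumed as the activation):

import torch
from torch import nn

class VanillaRNNCell(nn.Module):
    def __init__(self, input_dim, memory_dim):
        super().__init__()
        self.W_x = nn.Parameter(torch.randn(input_dim, memory_dim) * 0.1)
        self.b_x = nn.Parameter(torch.zeros(memory_dim))
        self.W_m = nn.Parameter(torch.randn(memory_dim, memory_dim) * 0.1)
        self.b_m = nn.Parameter(torch.zeros(memory_dim))

    def forward(self, x_t, m_prev):
        # m_t is both the output at time t and the memory passed to time t+1
        return torch.tanh(x_t @ self.W_x + self.b_x + m_prev @ self.W_m + self.b_m)

# scan a sequence of 5 word vectors, carrying the memory along
cell = VanillaRNNCell(input_dim=8, memory_dim=16)
m = torch.zeros(16)
for x_t in torch.randn(5, 8):
    m = cell(x_t, m)
print(m.shape)  # torch.Size([16])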

POS tagging with Neural Networks

We can split a sentence into sub-sequences of a given window size. For example, the sentence

She sells seashells on the seashore.

is split into:

She sells seashells

on the seashore

. [PAD] [PAD]

The stride of the moving window is determined by its length by default, but it can also be set to a smaller value such as 1, advancing one token at a time instead of window-size tokens and thus producing more (overlapping) sub-sequences.
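
A small illustrative snippet (hypothetical, separate from the dataset class shown later) of how the shift controls the number of sub-sequences:

tokens = ["she", "sells", "seashells", "on", "the", "seashore", "."]

def windows(tokens, window_size, window_shift):
    out = []
    for i in range(0, len(tokens), window_shift):
        w = tokens[i:i + window_size]
        out.append(w + ["[PAD]"] * (window_size - len(w)))  # pad the last window
    return out

print(windows(tokens, window_size=3, window_shift=3))       # 3 disjoint windows
print(len(windows(tokens, window_size=3, window_shift=1)))  # 7 overlapping windows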

SETUP

!pip install conllu
!pip install torchtext==0.6.0

Download the data we need:


!git clone https://github.com/pasinit/nlp2020_POStagging_data
!unzip nlp2020_POStagging_data/r2.2.zip  > /dev/null
training_file = "UD_English-EWT-r2.2/en_ewt-ud-train.conllu"
dev_file = "UD_English-EWT-r2.2/en_ewt-ud-dev.conllu"
test_file = "UD_English-EWT-r2.2/en_ewt-ud-test.conllu"

Imports

# here go all the imports
import torch
from torch import nn
from torch.utils.data import Dataset
from torchtext import data
from torchtext.vocab import Vectors

from conllu import parse as conllu_parse
from pprint import pprint
from tqdm import tqdm
from torchtext.vocab import Vocab
from collections import Counter
import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Data Preparation

We use CoNLL-annotated data:

!head UD_English-EWT-r2.2/en_ewt-ud-train.conllu

Each row is a token, and each column is a feature of that token:

  1. index, starting from 1
  2. word form
  3. word lemma
  4. Universal Part-of-Speech tag
  5. Language-specific Part-of-Speech tag
  6. many others ...
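
As a quick check (a hypothetical snippet, not part of the original setup), we can parse the file with the conllu package and look at the two fields the tagger will use, "form" and "upostag":

from conllu import parse as conllu_parse

with open(training_file) as reader:
    sentences = conllu_parse(reader.read())

# each sentence is a list of token dictionaries parsed from the CoNLL lines
print([(token["form"], token["upostag"]) for token in sentences[0]])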

We use the conllu package to build a POSTaggingDataset class that reads the training data:

class POSTaggingDataset(Dataset):

    def __init__(self,
                 input_file:str,
                 window_size:int,
                 window_shift:int=-1,
                 lowercase=True,
                 device="cuda"):
        """
        We assume that the dataset pointed by input_file is already tokenized
        and can fit in memory.
        Args:
            input_file (string): The path to the dataset to be loaded.
            window_size (integer): The maximum length of a sentence in terms of
            number of tokens.
            window_shift (integer): The number of tokens we shift the window
            over the sentence. Default value is -1 meaning that the window will
            be shifted by window_size.
            lowercase (boolean): whether the text has to be lowercased or not.
            device (string): device where to put tensors (cpu or cuda).
        """

        self.input_file = input_file
        self.window_size = window_size
        self.window_shift = window_shift if window_shift > 0 else window_size
        self.lowercase = lowercase
        with open(input_file) as reader:
            # read the entire file with reader.read() and parse it
            sentences = conllu_parse(reader.read())
        self.device = device
        self.data = self.create_windows(sentences)
        self.encoded_data = None

    def index_dataset(self, l_vocabulary, l_label_vocabulary):
        self.encoded_data = list()
        for i in range(len(self.data)):
            # for each window
            elem = self.data[i]
            # Hello, my name is Andrea
            # {hello:0, my: 1, name:2, ...}
            # encode_text() will return [0, 1, 2]
            encoded_elem = torch.LongTensor(self.encode_text(elem, l_vocabulary)).to(self.device)
            # for each element d in the elem window (d is a dictionary with the various fields from the CoNLL line)
            encoded_labels = torch.LongTensor([l_label_vocabulary[d["upostag"]] if d is not None
                              else l_label_vocabulary["<pad>"] for d in elem]).to(self.device)
            self.encoded_data.append({"inputs":encoded_elem,
                                      "outputs":encoded_labels})

    def create_windows(self, sentences):
        """
        Args:
            sentences (list of lists of dictionaries,
                          where each dictionary represents a word occurrence parsed from a CoNLL line)
        """
        data = []
        for sentence in sentences:
          if self.lowercase:
              for d in sentence:
                  # lowers the inflected form
                  d["form"] = d["form"].lower()
          for i in range(0, len(sentence), self.window_shift):
            window = sentence[i:i+self.window_size]
            if len(window) < self.window_size:
              window = window + [None]*(self.window_size - len(window))
            assert len(window) == self.window_size
            data.append(window)
        return data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if self.encoded_data is None:
            raise RuntimeError("""Trying to retrieve elements but index_dataset
            has not been invoked yet! Be sure to invoke index_dataset on this object
            before trying to retrieve elements. In case you want to retrieve raw
            elements, use the method get_raw_element(idx)""")
        return self.encoded_data[idx]

    def get_raw_element(self, idx):
        return self.data[idx]

    @staticmethod
    def encode_text(sentence:list,
                l_vocabulary:Vocab):
        """
        Args:
            sentence (list): list of OrderedDicts, each carrying the information about
            one token.
            l_vocabulary (Vocab): vocabulary with mappings from words to indices and viceversa.
        Return:
            The method returns a list of indices corresponding to the input tokens.
        """
        indices = list()
        for w in sentence:
            if w is None:
                indices.append(l_vocabulary["<pad>"])
            elif w["form"] in l_vocabulary.stoi: # vocabulary string to integer
                indices.append(l_vocabulary[w["form"]])
            else:
                indices.append(l_vocabulary.unk_index)
        return indices

    @staticmethod
    def decode_output(outputs:torch.Tensor,
                    l_label_vocabulary: Vocab):
        """
        Args:
            outputs (Tensor): a Tensor with shape (batch_size, max_len, label_vocab_size)
                containing the logits output by the neural network.
            l_label_vocabulary (Vocab): is the vocabulary containing the mapping from
            a string label to its corresponding index and vice versa
        Output:
            The method returns a list of batch_size length where each element is a list
            of labels, one for each input token.
        """
        max_indices = torch.argmax(outputs, -1).tolist() # resulting shape = (batch_size, max_len)
        predictions = list()
        for indices in max_indices:
            # vocabulary integer to string is used to obtain the corresponding word from the max index
            predictions.append([l_label_vocabulary.itos[i] for i in indices])
        return predictions

We use torchtext.vocab.Vocab to map each word to an ID:

def build_vocab(dataset, min_freq=1):
    counter = Counter()
    for i in tqdm(range(len(dataset))):
        # for each token in the sentence viewed as a dictionary of items from the CoNLL line
        for token in dataset.get_raw_element(i):
            if token is not None:
                counter[token["form"]]+=1
    # we add special tokens for handling padding and unknown words at testing time.
    return Vocab(counter, specials=['<pad>', '<unk>'], min_freq=min_freq)

def build_label_vocab(dataset):
    counter = Counter()
    for i in tqdm(range(len(dataset))):
        for token in dataset.get_raw_element(i):
            if token is not None:
                counter[token["upostag"]]+=1
    # No <unk> token for labels.
    return Vocab(counter, specials=['<pad>'])

#input_file = "/content/drive/My Drive/conll_2018_ud/ud-en-treebank-v2.2/UD_English-EWT/en_ewt-ud-train.conllu"
window_size, window_shift = 100, 100
dataset = POSTaggingDataset(training_file, window_size, window_shift)
vocabulary = build_vocab(dataset, min_freq=2)
label_vocabulary = build_label_vocab(dataset)
dataset.index_dataset(vocabulary, label_vocabulary)
import torchtext
print(torchtext.__version__)
pprint(["{}:{}".format(x, y) for x, y in list(vocabulary.stoi.items())[:10]])
print("home index: ", vocabulary["home"])
print("<pad> index: ", vocabulary["<pad>"])
print("<unk> index", vocabulary["<unk>"])
print("word at index 52: ", vocabulary.itos[52])
print("unknown words are indexed at: ", vocabulary["alskfj"])

 

print("vocab size:", len(label_vocabulary))
print("string to index")
pprint(label_vocabulary.stoi)
print()
print("index to string")
pprint(label_vocabulary.itos)

We need to encode the input as IDs and decode the output back into text:

# inputs from the first sentence
input_tensor = dataset[0]["inputs"]

# forms of the first sentence
print([d["form"] if d is not None else "None" for d in dataset.get_raw_element(0)])
pprint(list(zip([d["form"] if d is not None else "None" for d in dataset.get_raw_element(0)], input_tensor.tolist())))
print("american has index: ", vocabulary["american"])

Model Definition

We will build the model by stacking BiLSTM blocks, with a final linear layer on top for classification.

All neural network classes should inherit from torch.nn.Module, which takes care of registering the parameters assigned to self attributes in __init__; only registered parameters are optimized during training.
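
A small illustration (hypothetical example): sub-modules and nn.Parameter attributes are registered and show up in parameters(), while plain tensor attributes are invisible to the optimizer:

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)             # registered: weight and bias
        self.scale = nn.Parameter(torch.ones(1))  # registered explicitly
        self.plain = torch.zeros(4)               # NOT registered

print([name for name, _ in Toy().named_parameters()])
# ['scale', 'linear.weight', 'linear.bias'] -- 'plain' does not appear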

Embedding Layer

An embedding is a matrix with vocab_size rows and embedding_dim columns; each row is the representation of one word in the vocabulary. For example, if the vocabulary is {"hello": 0, "the": 1, "dog": 2}, then the first row of the matrix is the representation of "hello".

In PyTorch, you can declare an embedding layer as follows:

embedding = nn.Embedding(vocab_size, embedding_dim)

You can retrieve the embeddings of the words in x by calling the embedding layer:

x_embeddings = embedding(x)
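
Putting the pieces together, here is a minimal sketch of the tagger described above, a stacked BiLSTM followed by a linear classifier (the hyperparameter values are illustrative assumptions, not the course's official configuration):

class POSTaggerModel(nn.Module):
    def __init__(self, vocab_size, num_labels,
                 embedding_dim=100, hidden_dim=128, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # bidirectional=True runs a forward and a backward LSTM over each window
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        # both directions are concatenated, hence 2 * hidden_dim input features
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, x):
        embeddings = self.embedding(x)       # (batch, window, embedding_dim)
        lstm_out, _ = self.lstm(embeddings)  # (batch, window, 2 * hidden_dim)
        return self.classifier(lstm_out)     # logits: (batch, window, num_labels)

model = POSTaggerModel(len(vocabulary), len(label_vocabulary)).to("cuda")
logits = model(dataset[0]["inputs"].unsqueeze(0))  # add a batch dimension
# decode the (untrained, hence essentially random) predictions back to tags
print(POSTaggingDataset.decode_output(logits, label_vocabulary))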
