POS tagging is the task of associating each word in a given text with its correct part of speech. It is essentially a sequence labeling problem, i.e., a label is assigned to every element of a sequence. Other examples of sequence labeling are tokenization, Semantic Role Labeling (SRL), and Word Sense Disambiguation (WSD).
RNN
So far, when computing the representation of a review, we have treated it as a bag of words. We did not take into account the order in which the words appear in the text, yet the same words in a different order can mean something completely different:
- 1-star review: This LG television is very bad, not as the previous model that was very good
- 5-star review: This LG television is very good, not as the previous model that was very bad
Since the order in which words appear is essential to understanding meaning, we now introduce sequence models.
RNNs are a class of neural networks for modeling sequences (time series). Since the length of a sequence is not known in advance, the problem cannot be handled by a plain feed-forward network; instead, an RNN uses a memory mechanism that accumulates the information seen in the sequence up to the current time step. That is, when processing the word at position i, we have access to the information from position i-1 and all the positions before it.
To implement this memory mechanism, an RNN produces two outputs: one for the current time step and one that summarizes the portion of the sequence processed so far.
An RNN uses two different weight matrices, one for the memory and one for the input x; the output at time t is computed as follows.
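In the standard Elman formulation (the one implemented by torch.nn.RNN), the hidden state at time t combines the current input with the previous hidden state:

h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)

The hidden state h_t serves both as the output for time t and as the memory passed on to time t+1. As a minimal sketch with hypothetical sizes, just to show the interface:

import torch
from torch import nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)   # (batch_size, sequence_length, input_size)
output, h_n = rnn(x)
print(output.shape)         # torch.Size([4, 10, 16]): one hidden state per time step
print(h_n.shape)            # torch.Size([1, 4, 16]): the last hidden state of each sequence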
POS tagging with Neural Networks
We can split a sentence into sub-sequences of a given window size. For example, with a window size of 3,
She sells seashells on the seashore.
is split into:
She sells seashells
on the seashore
. [PAD] [PAD]
By default the stride of the moving window equals its length; the stride can also be set to a smaller value, such as 1, so that the window advances one token at a time instead of window_size tokens, producing more (overlapping) sub-sequences. A sketch of this windowing follows.
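Here is a minimal sketch of the windowing, assuming an already tokenized sentence and a <pad> filler token (the dataset class defined later does the same thing on CoNLL tokens):

def windows(tokens, window_size, window_shift):
    # slide a window of window_size tokens over the sentence, moving
    # window_shift tokens at a time, and pad the last window if needed
    result = []
    for i in range(0, len(tokens), window_shift):
        window = tokens[i:i + window_size]
        window = window + ["<pad>"] * (window_size - len(window))
        result.append(window)
    return result

tokens = ["she", "sells", "seashells", "on", "the", "seashore", "."]
print(windows(tokens, window_size=3, window_shift=3))
# [['she', 'sells', 'seashells'], ['on', 'the', 'seashore'], ['.', '<pad>', '<pad>']]
print(windows(tokens, window_size=3, window_shift=1))
# overlapping windows: ['she', 'sells', 'seashells'], ['sells', 'seashells', 'on'], ...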
SETUP
!pip install conllu
!pip install torchtext==0.6.0
Download the data we need:
!git clone https://github.com/pasinit/nlp2020_POStagging_data
!unzip nlp2020_POStagging_data/r2.2.zip > /dev/null
training_file = "UD_English-EWT-r2.2/en_ewt-ud-train.conllu"
dev_file = "UD_English-EWT-r2.2/en_ewt-ud-dev.conllu"
test_file = "UD_English-EWT-r2.2/en_ewt-ud-test.conllu"
Imports
# here go all the imports
import torch
from torch import nn
from torch.utils.data import Dataset
from torchtext import data
from torchtext.vocab import Vectors
from conllu import parse as conllu_parse
from pprint import pprint
from tqdm import tqdm
from torchtext.vocab import Vocab
from collections import Counter
import random
import numpy as np
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
Data Preparation
We use CoNLL-U annotated data (the Universal Dependencies English EWT treebank):
!head UD_English-EWT-r2.2/en_ewt-ud-train.conllu
Each row is a token and each column is a feature of that token:
- index starting from 1
- word form
- word lemma
- Universal Part-of-Speech tag
- Language-specific Part-of-Speech tag
- many others ...
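As a small sketch of how the conllu package exposes these fields, here is a hypothetical two-token sentence (the key names, e.g. "form" and "upostag", are the ones the dataset class below relies on; newer versions of the conllu package may expose "upos" instead):

from conllu import parse

sample = "\n".join([
    "# text = Dogs bark",
    "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_",
    "",
])
for token in parse(sample)[0]:
    print(token["form"], token["upostag"])
# Dogs NOUN
# bark VERB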
We use the conllu package to build a POSTaggingDataset class that reads the training data.
class POSTaggingDataset(Dataset):
def __init__(self,
input_file:str,
window_size:int,
window_shift:int=-1,
lowercase=True,
device="cuda"):
"""
We assume that the dataset pointed by input_file is already tokenized
and can fit in memory.
Args:
input_file (string): The path to the dataset to be loaded.
window_size (integer): The maximum length of a sentence in terms of
number of tokens.
window_shift (integer): The number of tokens we shift the window
over the sentence. Default value is -1 meaning that the window will
be shifted by window_size.
lowercase (boolean): whether the text has to be lowercased or not.
device (string): device where to put tensors (cpu or cuda).
"""
self.input_file = input_file
self.window_size = window_size
self.window_shift = window_shift if window_shift > 0 else window_size
self.lowercase = lowercase
with open(input_file) as reader:
# read the entire file with reader.read() and parse it
sentences = conllu_parse(reader.read())
self.device = device
self.data = self.create_windows(sentences)
self.encoded_data = None
def index_dataset(self, l_vocabulary, l_label_vocabulary):
self.encoded_data = list()
for i in range(len(self.data)):
# for each window
elem = self.data[i]
# Hello, my name is Andrea
# {hello:0, my: 1, name:2, ...}
# encode_text() will return [0, 1, 2]
encoded_elem = torch.LongTensor(self.encode_text(elem, l_vocabulary)).to(self.device)
# for each element d in the elem window (d is a dictionary with the various fields from the CoNLL line)
encoded_labels = torch.LongTensor([l_label_vocabulary[d["upostag"]] if d is not None
else l_label_vocabulary["<pad>"] for d in elem]).to(self.device)
self.encoded_data.append({"inputs":encoded_elem,
"outputs":encoded_labels})
def create_windows(self, sentences):
"""
Args:
sentences (list of lists of dictionaries,
where each dictionary represents a word occurrence parsed from a CoNLL line)
"""
data = []
for sentence in sentences:
if self.lowercase:
for d in sentence:
# lowers the inflected form
d["form"] = d["form"].lower()
for i in range(0, len(sentence), self.window_shift):
window = sentence[i:i+self.window_size]
if len(window) < self.window_size:
window = window + [None]*(self.window_size - len(window))
assert len(window) == self.window_size
data.append(window)
return data
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
if self.encoded_data is None:
raise RuntimeError("""Trying to retrieve elements but index_dataset
has not been invoked yet! Be sure to invoke index_dataset on this object
before trying to retrieve elements. In case you want to retrieve raw
elements, use the method get_raw_element(idx)""")
return self.encoded_data[idx]
def get_raw_element(self, idx):
return self.data[idx]
@staticmethod
def encode_text(sentence:list,
l_vocabulary:Vocab):
"""
Args:
sentence (list): a list of OrderedDicts, each carrying the information about
one token.
l_vocabulary (Vocab): vocabulary with mappings from words to indices and viceversa.
Return:
The method returns a list of indices corresponding to the input tokens.
"""
indices = list()
for w in sentence:
if w is None:
indices.append(l_vocabulary["<pad>"])
elif w["form"] in l_vocabulary.stoi: # vocabulary string to integer
indices.append(l_vocabulary[w["form"]])
else:
indices.append(l_vocabulary.unk_index)
return indices
@staticmethod
def decode_output(outputs:torch.Tensor,
l_label_vocabulary: Vocab):
"""
Args:
outputs (Tensor): a Tensor with shape (batch_size, max_len, label_vocab_size)
containing the logits output by the neural network.
l_label_vocabulary (Vocab): is the vocabulary containing the mapping from
a string label to its corresponding index and vice versa
Output:
The method returns a list of batch_size length where each element is a list
of labels, one for each input token.
"""
max_indices = torch.argmax(outputs, -1).tolist() # resulting shape = (batch_size, max_len)
predictions = list()
for indices in max_indices:
# vocabulary integer to string is used to obtain the corresponding word from the max index
predictions.append([l_label_vocabulary.itos[i] for i in indices])
return predictions
We use torchtext.vocab.Vocab to map each word to an id.
def build_vocab(dataset, min_freq=1):
counter = Counter()
for i in tqdm(range(len(dataset))):
# for each token in the sentence viewed as a dictionary of items from the CoNLL line
for token in dataset.get_raw_element(i):
if token is not None:
counter[token["form"]]+=1
# we add special tokens for handling padding and unknown words at testing time.
return Vocab(counter, specials=['<pad>', '<unk>'], min_freq=min_freq)
def build_label_vocab(dataset):
counter = Counter()
for i in tqdm(range(len(dataset))):
for token in dataset.get_raw_element(i):
if token is not None:
counter[token["upostag"]]+=1
# No <unk> token for labels.
return Vocab(counter, specials=['<pad>'])
window_size, window_shift = 100, 100
dataset = POSTaggingDataset(training_file, window_size, window_shift)
vocabulary = build_vocab(dataset, min_freq=2)
label_vocabulary = build_label_vocab(dataset)
dataset.index_dataset(vocabulary, label_vocabulary)
import torchtext
print(torchtext.__version__)
pprint(["{}:{}".format(x, y) for x, y in list(vocabulary.stoi.items())[:10]])
print("home index: ", vocabulary["home"])
print("<pad> index: ", vocabulary["<pad>"])
print("<unk> index", vocabulary["<unk>"])
print("word at index 52: ", vocabulary.itos[52])
print("unknown words are indexed at: ", vocabulary["alskfj"])
print("vocab size:", len(label_vocabulary))
print("string to index")
pprint(label_vocabulary.stoi)
print()
print("index to string")
pprint(label_vocabulary.itos)
We need to encode the inputs into ids and decode the outputs back into text.
# inputs from the first sentence
input_tensor = dataset[0]["inputs"]
# forms of the first sentence
print([d["form"] if d is not None else "None" for d in dataset.get_raw_element(0)])
pprint(list(zip([d["form"] if d is not None else "None" for d in dataset.get_raw_element(0)], input_tensor.tolist())))
print("american has index: ", vocabulary["american"])
Model Definition
We now build the model by stacking BiLSTM blocks, with a final linear layer on top for classification.
Every neural network class should inherit from torch.nn.Module, which takes care of registering the parameters assigned to self attributes in __init__; only registered parameters are optimized during training, as the sketch below illustrates.
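A minimal sketch of this registration behavior (a hypothetical toy module, not the tagger we are about to build): submodules and nn.Parameter attributes are registered automatically, while plain tensors are not.

import torch
from torch import nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)              # registered: it is a submodule
        self.scale = nn.Parameter(torch.ones(2))   # registered: it is an nn.Parameter
        self.not_learned = torch.zeros(2)          # NOT registered: a plain tensor

    def forward(self, x):
        return self.linear(x) * self.scale + self.not_learned

toy = Toy()
print([name for name, _ in toy.named_parameters()])
# ['scale', 'linear.weight', 'linear.bias'] -> not_learned will never be optimized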
Embedding Layer
An embedding layer is a matrix with vocab_size rows and embedding_dim columns, where each row is the representation of one word in the vocabulary. For example, if the vocabulary is {"hello": 0, "the": 1, "dog": 2}, then the first row of the matrix is the representation of "hello".
In PyTorch, you can declare an embedding layer as follows:
embedding = nn.Embedding(vocab_size, embedding_dim)
You can retrieve the embeddings of words by calling the embedding layer on a tensor of word indices:
x_embeddings = embedding(x)
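For instance, a minimal sketch with hypothetical sizes, where x is a LongTensor of word indices:

import torch
from torch import nn

vocab_size, embedding_dim = 10, 5
embedding = nn.Embedding(vocab_size, embedding_dim)

# a batch of 2 "sentences", each made of 3 word indices
x = torch.LongTensor([[0, 1, 2],
                      [2, 2, 0]])
x_embeddings = embedding(x)
print(x_embeddings.shape)  # torch.Size([2, 3, 5]): one 5-dimensional vector per index

nn.Embedding also accepts a padding_idx argument, which keeps the vector of the padding token at zero and excludes it from gradient updates.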