PyTorch Sentiment Analysis Basics 5 - Multi-class Sentiment Analysis

Article link: https://blog.youkuaiyun.com/m0_61688615/article/details/128846827

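This section uses the same model as the previous one; what changes is the task. The previous section was a binary classification task (positive or negative), whereas this one is a multi-class task: classifying questions by type. We take the previous section's code as the baseline and briefly go over the differences.

For data preprocessing, the first difference is that for multi-class problems PyTorch expects the labels to be LongTensors, so we no longer need to set dtype on the label field. The second is that the dataset is TREC rather than IMDB. Its fine_grained argument lets us choose between the fine-grained labels (50 classes) and the coarse labels (6 classes); here we use the coarse ones.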

import torch
from torchtext.legacy import data
from torchtext.legacy import datasets
import random

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm')

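# No dtype is set: the default integer (long) labels are what we need here.
LABEL = data.LabelField()

# fine_grained=False selects the 6 coarse classes rather than the 50 fine-grained ones.
train_data, test_data = datasets.TREC.splits(TEXT, LABEL, fine_grained=False)

# random.seed() returns None, so seed first and pass the generator state explicitly.
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())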
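To make the fine_grained switch more concrete: each raw TREC line starts with a label of the form COARSE:fine, followed by the question. The plain-Python sketch below shows how the two label granularities can be derived from such a line. The helper name parse_trec_line and the sample line are illustrative assumptions, not the torchtext loader itself.

```python
def parse_trec_line(line, fine_grained=False):
    """Split one raw TREC line into (label, tokens).

    Hypothetical helper for illustration; torchtext does its own parsing.
    """
    label, _, text = line.partition(' ')
    if not fine_grained:
        # Keep only the coarse part before the colon, e.g. 'DESC'.
        label = label.split(':')[0]
    return label, text.split()

# Illustrative sample line in the COARSE:fine format.
sample = "DESC:manner How did serfdom develop in and then leave Russia ?"
coarse = parse_trec_line(sample)
fine = parse_trec_line(sample, fine_grained=True)
print(coarse[0])  # DESC
print(fine[0])    # DESC:manner
```

With fine_grained=False many fine labels collapse into the same coarse label, which is why the number of classes drops from 50 to 6.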

Let's look at an example from the training set.

vars(train_data[-1])

{'text': ['What', 'is', 'a', 'Cartesian', 'Diver', '?'], 'label': 'DESC'}

Next, inspect the labels (once the vocabulary has been built as in the previous section).

print(LABEL.vocab.stoi)

defaultdict(None, {'HUM': 0, 'ENTY': 1, 'DESC': 2, 'NUM': 3, 'LOC': 4, 'ABBR': 5})

These six labels correspond to the six question types in the dataset:

HUM for questions about humans

ENTY for questions about entities

DESC for questions asking you for a description

NUM for questions where the answer is numerical

LOC for questions where the answer is a location

ABBR for questions asking about abbreviations
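The integer assigned to each label comes from the label vocabulary. The sketch below illustrates one way such a mapping can be built, assuming labels are numbered by how often they occur in the training split and that no <unk> or <pad> entries are needed for labels. The function build_label_vocab and the toy counts are made up for illustration; they are not the real TREC distribution.

```python
from collections import Counter

def build_label_vocab(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort alphabetically first, then by descending count. Python's sort is
    # stable, so labels with equal counts stay in alphabetical order.
    ordered = sorted(sorted(counts), key=lambda lab: counts[lab], reverse=True)
    stoi = {lab: idx for idx, lab in enumerate(ordered)}
    return ordered, stoi

# Toy label list with made-up counts.
toy_labels = ['HUM'] * 4 + ['ENTY'] * 3 + ['DESC'] * 2 + ['ABBR']
itos, stoi = build_label_vocab(toy_labels)
print(stoi)  # {'HUM': 0, 'ENTY': 1, 'DESC': 2, 'ABBR': 3}
```

The itos list is the inverse mapping, which is what turns a predicted index back into a readable label.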

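The model is the same as in the previous section. The only difference is that output_dim is C rather than 1, where C is the number of classes, i.e. the size of the LABEL vocab. The input dimension is, as before, the size of the TEXT vocab.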

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [2,3,4]
OUTPUT_DIM = len(LABEL.vocab)
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)
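To see where OUTPUT_DIM enters the picture, here is a small shape calculation in plain Python. It assumes the CNN from the previous section applies unpadded convolutions over the token dimension, max-pools each filter over time, and feeds the concatenated result to a single linear layer; that architecture is an assumption here, since the class is not repeated in this section. The name conv_output_length is hypothetical.

```python
def conv_output_length(seq_len, filter_size):
    """Number of positions an unpadded filter of this width can slide over."""
    return seq_len - filter_size + 1

N_FILTERS = 100
FILTER_SIZES = [2, 3, 4]
NUM_CLASSES = 6  # len(LABEL.vocab) for the coarse TREC labels

seq_len = 6  # e.g. ['What', 'is', 'a', 'Cartesian', 'Diver', '?']
lengths = {fs: conv_output_length(seq_len, fs) for fs in FILTER_SIZES}

# After max-pooling over time each filter contributes one number, so the
# final linear layer maps this many features to NUM_CLASSES scores.
fc_in_features = len(FILTER_SIZES) * N_FILTERS

print(lengths)                            # {2: 5, 3: 4, 4: 3}
print(fc_in_features, '->', NUM_CLASSES)  # 300 -> 6
```

Under these assumptions, switching from binary to multi-class only changes the width of that last layer, from 1 output to 6.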

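For the loss function, we previously used BCEWithLogitsLoss; here we use CrossEntropyLoss. CrossEntropyLoss applies a softmax (internally a log-softmax) to the model's outputs and then computes the loss as the cross entropy between that result and the label.

In general:

CrossEntropyLoss: for multi-class problems, where each example belongs to exactly one of C classes.

BCEWithLogitsLoss: for binary problems, where each example belongs to one of two classes, and also for multi-label classification, where an example can carry several labels at once.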
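For a single example, the cross-entropy loss described above reduces to the negative log of the softmax probability assigned to the true class. The following is a minimal plain-Python sketch of that formula, written without PyTorch so each step is visible; the function name cross_entropy and the scores are made up for illustration.

```python
import math

def cross_entropy(logits, target):
    """Cross entropy for one example: -log(softmax(logits)[target])."""
    # Subtract the max logit before exponentiating for numerical stability;
    # this does not change the result.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5, -1.0]  # made-up raw scores for C = 3 classes
loss_right = cross_entropy(logits, target=0)  # label is the top-scoring class
loss_wrong = cross_entropy(logits, target=2)  # label is the lowest-scoring class
print(loss_right, loss_wrong)
```

The loss is small when the true class already has the highest score and grows as the true class's score falls behind the others. By default, nn.CrossEntropyLoss averages this per-example value over the batch.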

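import torch.nn as nn
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.CrossEntropyLoss()

# Same device setup as in the previous section; falls back to the CPU when no GPU is present.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)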

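The model outputs a C-dimensional vector in which each element is a score expressing how strongly the model believes the example belongs to that class. For example, our labels are 'HUM' = 0, 'ENTY' = 1, 'DESC' = 2, 'NUM' = 3, 'LOC' = 4 and 'ABBR' = 5. If the model outputs [5.1, 0.3, 0.1, 2.1, 0.2, 0.6], it strongly believes the example belongs to class 0 (a question about a human) and slightly believes it belongs to class 3 (a numerical question).

To compute accuracy, we first use argmax to get, for each example in the batch, the index of the largest predicted score, then count how many times this equals the actual label. Finally we divide by the batch size to get the accuracy for the batch.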
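To put numbers on "strongly" and "slightly", the example scores can be passed through a softmax to turn them into probabilities. This is a plain-Python sketch for illustration; the model itself returns raw scores, and the softmax helper below is not part of the tutorial code.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ['HUM', 'ENTY', 'DESC', 'NUM', 'LOC', 'ABBR']
logits = [5.1, 0.3, 0.1, 2.1, 0.2, 0.6]  # the example output from the text
probs = softmax(logits)

# Print classes from most to least likely.
for name, p in sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True):
    print(f'{name}: {p:.3f}')

predicted = labels[probs.index(max(probs))]
```

Here HUM receives roughly 0.92 of the probability mass and NUM roughly 0.05, with the remaining classes sharing what is left.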

def categorical_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc
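The same computation can be traced by hand on a toy batch. The sketch below mirrors categorical_accuracy using plain Python lists instead of tensors, so that the argmax and comparison steps are explicit; the function name categorical_accuracy_py and the scores are made up for illustration.

```python
def categorical_accuracy_py(preds, y):
    """Fraction of rows whose highest score sits at the true class index."""
    correct = 0
    for scores, label in zip(preds, y):
        top_pred = scores.index(max(scores))  # argmax over the class dimension
        correct += int(top_pred == label)
    return correct / len(y)

# Four made-up examples, six class scores each.
preds = [
    [5.1, 0.3, 0.1, 2.1, 0.2, 0.6],  # argmax is 0
    [0.2, 3.0, 0.1, 0.4, 0.3, 0.1],  # argmax is 1
    [0.1, 0.2, 0.3, 0.2, 2.5, 0.1],  # argmax is 4
    [1.0, 0.9, 0.2, 0.1, 0.3, 0.4],  # argmax is 0
]
y = [0, 1, 4, 3]  # the last prediction is wrong

acc = categorical_accuracy_py(preds, y)
print(acc)  # 0.75
```

Three of the four predictions match, so the batch accuracy is 0.75, in line with the docstring's "8/10 right returns 0.8" convention.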

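For user input, we write a function that predicts the class of a given question. Instead of squashing the output to between 0 and 1 with a sigmoid, we use argmax to get the index of the highest-scoring class, and then look that index up in the label vocab to get a human-readable label.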

import spacy
nlp = spacy.load('en_core_web_sm')

def predict_class(model, sentence, min_len = 4):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    # Pad short inputs so the widest filter (size 4) still fits.
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)  # add a batch dimension: [sent len, 1]
    with torch.no_grad():  # inference only, no gradients needed
        preds = model(tensor)
    max_preds = preds.argmax(dim = 1)
    return max_preds.item()

pred_class = predict_class(model, "Who is Keyser Söze?")
print(f'Predicted class is: {pred_class} = {LABEL.vocab.itos[pred_class]}')

Predicted class is: 0 = HUM
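The preprocessing inside predict_class (pad to a minimum length, then map tokens to indices) can be tried in isolation. The sketch below uses a toy vocabulary in place of TEXT.vocab.stoi and assumes that tokens missing from the vocabulary fall back to the <unk> index; prepare_indices, toy_stoi and the index values are hypothetical.

```python
def prepare_indices(tokens, stoi, min_len=4, pad_token='<pad>', unk_index=0):
    """Pad a token list to min_len and convert it to vocabulary indices."""
    padded = list(tokens)
    if len(padded) < min_len:
        padded += [pad_token] * (min_len - len(padded))
    # Tokens that are not in the vocabulary map to the unknown-token index.
    return [stoi.get(tok, unk_index) for tok in padded]

# Toy vocabulary; a real one is built from the training data.
toy_stoi = {'<unk>': 0, '<pad>': 1, 'Who': 2, 'is': 3, '?': 4}

short = prepare_indices(['Who', '?'], toy_stoi)
unknown = prepare_indices(['Who', 'is', 'Keyser', 'Söze', '?'], toy_stoi)
print(short)    # [2, 4, 1, 1]
print(unknown)  # [2, 3, 0, 0, 4]
```

The default min_len of 4 matches the largest entry in FILTER_SIZES, since a filter of width 4 needs at least four tokens to produce an output.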