A Surname Classification System based on MLP
This notebook serves as my learning journey into the Multilayer Perceptron (MLP), which is a fundamental type of Feedforward Neural Network. Throughout this article, I will be undertaking the following tasks and documenting my learning process:
- Master the Application of the Multilayer Perceptron in Multi-class Classification: using the example of “Surname Classification with a Multilayer Perceptron” to understand the practical implementation.
- Understand the Impact of Different Neural Network Layers: analyzing how each type of neural network layer affects the size and shape of the data tensors it processes.
- Experiment with the SurnameClassifier Model: introducing dropout into the model and observing the changes in the results.
In addition to these tasks, I will also attempt to explain some of the challenges I encountered and clarify the underlying principles as best I can. Even though my English is not that good, I have chosen English as this article’s language so that I can improve it.
0 Preface
This blog post started out as the report for one of my course labs. My ability and energy are genuinely limited, so the quality leaves something to be desired; I will publish more practical and substantial posts later, and you are welcome to follow along. Please go easy on me. The post is written in English for two reasons: first, I have been studying English recently and want to create some opportunities to practice; second, in the environment where the lab was carried out, some keyboard shortcuts could not be used in Chinese input mode. On top of completing the lab, I have done my best to explain the necessary principles and have pasted some of the URLs that I consulted or recommend. Now, let the main text begin!
1 Look into MLP
1.1 Perceptron (which we learnt in Lab1)
The perceptron is a fundamental building block of neural networks. It’s a type of artificial neuron that can perform binary classifications. The perceptron takes a set of inputs, applies weights to them, sums them up, and passes the result through an activation function to produce an output.
Here’s a simple diagram of a perceptron:
x1 ----w1----\
x2 ----w2-----\
x3 ----w3------> Σ (sum) ----> Activation Function ----> Output
 ...          /
xn ----wn----/
1.2 How a Perceptron Works
- Inputs: Each input $(x_1, x_2, \ldots, x_n)$ represents a feature of the data.
- Weights: Each input has an associated weight $(w_1, w_2, \ldots, w_n)$. These weights are learned during the training process.
- Summation: The perceptron computes a weighted sum of the inputs:

$$z = \sum_{i=1}^{n} w_i x_i + b$$

where $b$ is the bias term.
- Activation Function: The summation result $z$ is then passed through an activation function $f$. For a simple perceptron, a common choice is the step function:

$$f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

The activation function determines the perceptron’s output, which is typically binary (0 or 1).
The whole perceptron can be mathematically represented as:

$$\hat{y} = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

where $\hat{y}$ is the predicted output, $x_i$ are the input features, $w_i$ are the weights, $b$ is the bias term, and $f$ is the activation function.
Consider a simple binary classification problem where we want to determine if an email is spam (1) or not spam (0). The perceptron takes features of the email as inputs (e.g., presence of certain keywords, length of the email, etc.), computes the weighted sum, applies the activation function, and produces the output: spam or not spam. Of course, because the structure is so simple, the results may not be very good. There is an article on Towards Data Science that covers the perceptron far better than mine; here is the link: https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53
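To make this concrete, here is a minimal NumPy sketch of a single perceptron’s forward pass; the feature values, weights, and bias below are made up purely for illustration:

import numpy as np

def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

# Made-up feature vector for one email: [keyword present, length score, link present]
x = np.array([1.0, 0.3, 1.0])
# Made-up weights and bias (in practice these would be learned during training)
w = np.array([0.8, -0.2, 0.5])
b = -1.0

z = np.dot(w, x) + b   # weighted sum plus bias
y_hat = step(z)        # z ≈ 0.24 >= 0, so the prediction is spam (1)
print(z, y_hat)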
1.3 Drawbacks of the Perceptron
While the perceptron is a fundamental concept in neural networks, it has some significant limitations:
- Linear Separability: The perceptron can only solve problems that are linearly separable. This means it can only classify data points that can be separated by a straight line (in two dimensions), a plane (in three dimensions), or a hyperplane (in higher dimensions). For example, the XOR problem, which is not linearly separable, cannot be solved by a single perceptron.
- Limited Expressiveness: Because it involves only a single layer of computation, a single-layer perceptron cannot capture complex patterns or relationships in the data. It lacks the ability to learn higher-order features.
1.4 How MLP Solves These Drawbacks
The Multilayer Perceptron (MLP), also known as a Feedforward Neural Network, addresses these drawbacks by introducing multiple layers of neurons. Here’s how MLP overcomes the limitations of a single-layer perceptron:
- Non-linear Activation Functions: MLPs use non-linear activation functions (such as ReLU, sigmoid, or tanh) in hidden layers. These non-linearities allow the network to learn complex, non-linear relationships between the input and output.
- Multiple Layers (Hidden Layers): By stacking multiple layers of neurons (hidden layers) between the input and output layers, MLPs can learn hierarchical representations of the data. Each layer captures a different level of abstraction:
  - First Layer: Captures simple patterns or features.
  - Subsequent Layers: Combine these simple patterns to capture more complex features.
- Universal Approximation: An MLP with at least one hidden layer and non-linear activation functions can approximate any continuous function to any desired degree of accuracy, given sufficient neurons in the hidden layer. This is known as the Universal Approximation Theorem.
1.5 Architecture of an MLP
An MLP consists of an input layer, one or more hidden layers, and an output layer:
Input Layer --> Hidden Layer(s) --> Output Layer
x1  x2  ...  xn       input layer
  \  |      /
 h1  h2  h3  ...      hidden layer(s): fully connected, non-linear activations
   \  |   /
  y1  y2  ...         output layer: one unit per class
The XOR problem is a classic example that a single-layer perceptron cannot solve because the data points are not linearly separable. However, an MLP with one hidden layer can solve the XOR problem.
While the perceptron is limited to solving only linearly separable problems, the MLP overcomes these limitations through its layered architecture and non-linear activation functions. This allows MLPs to learn and approximate complex, non-linear functions, making them much more powerful and versatile for a wide range of classification and regression tasks.
For instance, an MLP can solve problems like XOR, whereas a perceptron cannot, as the sketch below demonstrates.
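Here is a small sanity check of that claim (my own sketch, not part of the lab code): a tiny PyTorch MLP with one hidden layer and a tanh nonlinearity can fit XOR, which no single perceptron can.

import torch
import torch.nn as nn

# The four XOR input/output pairs
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# One hidden layer with a non-linear activation is the key ingredient
model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(),
                      nn.Linear(4, 1), nn.Sigmoid())

optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCELoss()

for _ in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Usually converges to [[0.], [1.], [1.], [0.]] (exact result depends on the random init)
print(model(X).detach().round())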
2 Implementing MLPs in PyTorch
This section serves as a guide to getting started with MLPs in PyTorch, a machine learning library based on the Torch library. PyTorch is widely recognized as one of the two most popular machine learning libraries, alongside TensorFlow; it is free and open-source software released under the modified BSD license. According to our professor, PyTorch is currently prevailing in the field. In this part, I will focus more on providing code examples than on extensive explanations.
# import package
import torch.nn as nn
import torch.nn.functional as F

class MultilayerPerceptron(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(MultilayerPerceptron, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the MLP

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation,
                which should be False if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate = F.relu(self.fc1(x_in))
        output = self.fc2(intermediate)

        if apply_softmax:
            output = F.softmax(output, dim=1)
        return output
Let’s take an example:
batch_size = 2 # number of samples input at once
input_dim = 3
hidden_dim = 100
output_dim = 4
# Initialize model
mlp = MultilayerPerceptron(input_dim, hidden_dim, output_dim)
print(mlp)
MultilayerPerceptron(
(fc1): Linear(in_features=3, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=4, bias=True)
)
From the output, we can tell that this PyTorch MLP implementation consists of two fully connected layers: the first layer takes a 3-dimensional input and produces a 100-dimensional output, while the second layer takes this 100-dimensional input and generates a 4-dimensional output, representing the number of classification classes.
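As a quick check of how those dimensions translate into parameters (my own addition, reusing the mlp instance from above):

for name, p in mlp.named_parameters():
    print(name, tuple(p.shape))
# fc1.weight (100, 3) and fc1.bias (100,) -> 3*100 + 100 = 400 parameters
# fc2.weight (4, 100) and fc2.bias (4,)  -> 100*4 + 4   = 404 parameters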
import torch

def describe(x):
    """Print a tensor's type, shape, and values."""
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))
x_input = torch.rand(batch_size, input_dim)
describe(x_input)
Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values:
tensor([[0.4838, 0.0619, 0.5794],
[0.9018, 0.9110, 0.3688]])
Let’s feed the tensor x_input into our MLP model and see what happens.
y_output = mlp(x_input, apply_softmax=False)
describe(y_output)
Type: torch.FloatTensor
Shape/size: torch.Size([2, 4])
Values:
tensor([[-0.3002, -0.0441, -0.0726, -0.1772],
[-0.3877, 0.1662, 0.0653, 0.0621]], grad_fn=<AddmmBackward>)
We can conclude:
An MLP is a stack of linear layers that transform tensors into other tensors, with nonlinearities placed between the linear layers so that the model can deform the vector space. In a classification scenario, this deformation should make the classes linearly separable. Alternatively, an MLP’s outputs can be interpreted as probabilities by applying a softmax function; however, softmax should not be combined with PyTorch’s cross-entropy loss, because that loss already applies it internally and exploits mathematical/computational shortcuts for numerical stability.
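To make the probability interpretation concrete, we can re-run the same input with apply_softmax=True; each output row then sums to 1 (a quick check of my own, reusing the describe helper from above):

y_prob = mlp(x_input, apply_softmax=True)
describe(y_prob)
print(y_prob.sum(dim=1))  # each row sums to 1.0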
3 Our Project: Surname Classification with an MLP
In this section, an MLP will be implemented to predict an individual’s nationality from their surname, using the surname dataset.
3.1 About The Surname Dataset
The surname dataset collects 10,000 surnames from 18 different nationalities, gathered by the authors from various name sources on the Internet. Its first notable property is that it is rather imbalanced. The second is that there is a valid and intuitive relationship between nationality and surname orthography: some spelling variants are linked very strongly to their country of origin.
First, we will munge the dataset.
import collections
import numpy as np
import pandas as pd
import re
from argparse import Namespace
args = Namespace(
    raw_dataset_csv="surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="surnames_with_splits.csv",
    seed=1337
)
# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0)
surnames.head()
|   | surname  | nationality |
|---|----------|-------------|
| 0 | Woodford | English     |
| 1 | Coté     | French      |
| 2 | Kore     | English     |
| 3 | Koury    | Arabic      |
| 4 | Lebzak   | Russian     |
The first few rows of the raw data are shown above.
# Unique classes
set(surnames.nationality)
{'Arabic',
'Chinese',
'Czech',
'Dutch',
'English',
'French',
'German',
'Greek',
'Irish',
'Italian',
'Japanese',
'Korean',
'Polish',
'Portuguese',
'Russian',
'Scottish',
'Spanish',
'Vietnamese'}
The 18 nationality classes are shown above.
# Splitting train by nationality
# Create dict
by_nationality = collections.defaultdict(list)
for _, row in surnames.iterrows():
    by_nationality[row.nationality].append(row.to_dict())

# Create split data
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_proportion * n)
    n_val = int(args.val_proportion * n)
    n_test = int(args.test_proportion * n)

    # Give each data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train + n_val]:
        item['split'] = 'val'
    for item in item_list[n_train + n_val:]:
        item['split'] = 'test'

    # Add to final list
    final_list.extend(item_list)

# Collect the split data into a DataFrame and inspect the split sizes
final_surnames = pd.DataFrame(final_list)
final_surnames.split.value_counts()
train 7680
test 1660
val 1640
Name: split, dtype: int64
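A quick sanity check of my own that the counts match the requested 70/15/15 split:

counts = final_surnames.split.value_counts()
print(counts / counts.sum())
# train ~0.70, test ~0.15, val ~0.15; the small deviations come from the
# per-nationality integer truncation in the loop above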
# Write munged data to CSV
final_surnames.to_csv(args.output_munged_csv, index=False)
Now we have the dataset split by nationality.
3.2 Vocabulary, Vectorizer, and DataLoader
To classify surnames using characters, we convert the surname strings into vectorized minibatches using a vocabulary, a vectorizer, and a DataLoader.
3.2.1 The Vocabulary Class
The Vocabulary class is a utility for handling text data in NLP tasks. It manages the mapping between tokens and indices, handles unknown tokens, and supports serialization and deserialization for saving and loading the vocabulary. This class is essential for converting text data into a numerical format suitable for machine learning models.
# import package
from argparse import Namespace
from collections import Counter
import json
import os
import string
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook
class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""

    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {idx: token
                              for token, idx in self._token_to_idx.items()}

        self._add_unk = add_unk
        self._unk_token = unk_token

        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx,
                'add_unk': self._add_unk,
                'unk_token': self._unk_token}

    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)

    def add_token(self, token):
        """Update mapping dicts based on the token.

        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        try:
            index = self._token_to_idx[token]
        except KeyError:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary

        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """Retrieve the index associated with the token
          or the UNK index if token isn't present.

        Args:
            token (str): the token to look up
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary)
            for the UNK functionality
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token associated with the index

        Args:
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)
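Before moving on, here is a tiny usage sketch of my own to show the class in action:

vocab = Vocabulary(add_unk=True, unk_token="@")
vocab.add_many(list("smith"))      # add each character as a token
print(len(vocab))                  # 6: the "@" UNK token plus 5 unique letters
print(vocab.lookup_token("s"))     # 1 (indices are assigned in insertion order)
print(vocab.lookup_token("z"))     # 0, the unk_index, since "z" was never added
print(vocab.lookup_index(2))       # "m"
print(vocab)                       # <Vocabulary(size=6)>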
3.2.2 The SurnameVectorizer Class
The SurnameVectorizer class is designed to handle the conversion of surnames and nationalities into numerical formats suitable for machine learning models. It coordinates two Vocabulary instances:
- One for converting surname characters into indices.
- One for converting nationalities into indices.
The vectorize method converts surnames into one-hot encoded vectors, making them ready for model input. The class methods from_dataframe and from_serializable provide ways to create a SurnameVectorizer from a pandas DataFrame and a serialized dictionary, respectively. The to_serializable method allows the SurnameVectorizer to be easily saved and loaded.
class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""

    def __init__(self, surname_vocab, nationality_vocab):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot (np.ndarray): a collapsed one-hot encoding
        """
        vocab = self.surname_vocab
        one_hot = np.zeros(len(vocab), dtype=np.float32)
        for token in surname:
            one_hot[vocab.lookup_token(token)] = 1
        return one_hot

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)

        for index, row in surname_df.iterrows():
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)

        return cls(surname_vocab, nationality_vocab)

    @classmethod
    def from_serializable(cls, contents):
        """ Instantiate the vectorizer from a serialized dictionary """
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab = Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab)
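The notebook was cut off at this point. Based on the to_serializable description above and the pattern used in the Vocabulary class, the missing method (which belongs inside the SurnameVectorizer class) presumably looked like the following; this is a reconstruction, not the verbatim lab code:

    def to_serializable(self):
        """ Create the serializable dictionary for caching (reconstructed) """
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable()}

With the vectorizer in place, here is a quick usage sketch of my own, assuming the munged CSV from section 3.1:

surname_df = pd.read_csv("surnames_with_splits.csv")
vectorizer = SurnameVectorizer.from_dataframe(surname_df)
one_hot = vectorizer.vectorize("Woodford")
print(one_hot.shape)  # (vocabulary size,): a collapsed one-hot over the surname's characters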