Building a surname classifier using PyTorch

python
nlp
pytorch
Author

Pankaj Chejara

Published

January 20, 2024


This post presents an NLP (Natural Language Processing) example based on the book NLP with PyTorch: Building Intelligent Language Applications Using Deep Learning. In this example, we will build a neural network that predicts nationality from a surname.

The post is targeted at beginners in the field of NLP. It offers hands-on experience of building a simple NLP application.

We will cover the following in this post.

  • Basics of transforming textual data into a numeric form
  • Preparing the dataset for training
  • Building a neural network using PyTorch
  • Training and testing the neural network model

Surname Dataset

The dataset is a preprocessed version of the original, prepared for model building. In this version, we have nationality, nationality_index, split, and surname columns.

  • nationality: the nationality associated with the surname
  • nationality_index: a number representing the nationality's index in the list of nationalities (18 in this dataset)
  • split: determines which set the record belongs to (e.g., training, testing, validation)
  • surname: the surname entry

Below, we look at some samples from the dataset to gain a better understanding of it.

The dataset used in this example can be downloaded from this link.

import pandas as pd
import numpy as np
df = pd.read_csv('surnames_with_splits.csv')

print('Columns:\n\t',df.columns)
print('Nationality:\n\t',df.nationality.unique())
df.head()
Columns:
     Index(['nationality', 'nationality_index', 'split', 'surname'], dtype='object')
Nationality:
     ['Arabic' 'Chinese' 'Czech' 'Dutch' 'English' 'French' 'German' 'Greek'
 'Irish' 'Italian' 'Japanese' 'Korean' 'Polish' 'Portuguese' 'Russian'
 'Scottish' 'Spanish' 'Vietnamese']
   nationality  nationality_index  split   surname
0       Arabic                 15  train     Totah
1       Arabic                 15  train    Abboud
2       Arabic                 15  train  Fakhoury
3       Arabic                 15  train     Srour
4       Arabic                 15  train    Sayegh

We have surnames from 18 different nationalities in the dataset.

Basics of transforming textual data into a numeric form

The first step towards building a machine (or deep) learning model using textual data is to transform the data into a numeric form. This step is known as Vectorization.

There are multiple ways to perform vectorization; in this post, we use the simplest one, the one-hot encoding technique. In one-hot encoding, text data is transformed into a vector of 0s and 1s.

Let’s take an example to understand one-hot encoding. Suppose we have a dictionary of 500 words, where each word is defined with examples on its own page.

Now, we have the task of encoding a sentence formulated from the dictionary’s words into a numeric form. The sentence is “This is a good example”. This sentence can be transformed into a series of 500 zeros, where the position of each zero represents the corresponding page in the dictionary. For example, the 5th position corresponds to the word on the 5th page.

Now, we can check the page number of each word in the sentence. Let’s say the word This is on the 120th page. We would then put a 1 at the 120th position in the series. We repeat this process for the rest of the words, and that’s it.

We have successfully transformed the sentence into a numeric form.
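
Before moving to surnames, here is a minimal sketch of the same idea in Python, using a hypothetical five-word dictionary in place of the 500-page one above:

# a hypothetical toy dictionary: word -> page (i.e., position)
toy_dictionary = {'this': 0, 'is': 1, 'a': 2, 'good': 3, 'example': 4}

sentence = 'This is a good example'

# start with one zero per dictionary entry
vector = [0] * len(toy_dictionary)

# set a 1 at the position of every word present in the sentence
for word in sentence.lower().split():
    vector[toy_dictionary[word]] = 1

print(vector)  # [1, 1, 1, 1, 1]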

Tip

In our example, the dictionary had an entry for each word. However, this might change depending on the nature of the task at hand.

In the case of surnames, our dictionary consists of characters rather than words.

We will now programmatically achieve what is explained above.

In our dataset, we have surnames which will be fed to the neural network. We need to transform the surnames into numeric form. To achieve that, we will need the following:

  • Dictionary: a dictionary containing an entry for each character that appears in the surnames in the dataset
  • Vectorizer: a component that performs the transformation process

Dictionary

We will start with the dictionary. In simple terms, the dictionary stores each character appearing in the surnames at a particular index. In addition, it offers mapping functionality between a position index and the corresponding character.

The following class implements the dictionary’s functionality for our dataset. The class has three crucial functions: add_token, lookup_token, and lookup_index. We also have a classmethod, build_from_dataframe, which processes the data frame and returns an object of the implemented class.

The following provides a simple and short explanation for each function in the class.

  • Initialization: During initialization, we create two mapping dictionaries, one for index-to-token mapping and another for token-to-index mapping. We also set the specified unk_token and its flag.
    During testing, a surname may contain a character that did not appear in the training dataset. To handle such cases, we can specify whether to add an unk_token, which stands in for any character missing from the dictionary.

  • Add a new token: Here, we first check whether the token we are going to add already exists in the dictionary. If it does, we simply return its corresponding index. Otherwise, we add an entry to both mappings.

  • Looking up an index: This is where we return the token stored at the specified index.

  • Looking up a token: This is similar to the previous function except that this time we return the index of the specified token.

class SurnameVocabulary(object):
    
    def __init__(self,index_to_token=None,add_unk = True, unk_token='@'):
        super().__init__()
        
        self._index_to_token = {}
        self._token_to_index = {}
        
        if index_to_token is None:
            index_to_token = {}
            
        
        self._add_unk = add_unk
            
        self._unk_token = unk_token
            
        if self._add_unk:
            self.add_token(unk_token)
        
        for key,value in index_to_token.items():
            self._index_to_token[key] = value
            self._token_to_index[value] = key
            
    @classmethod
    def build_from_dataframe(cls,df,cutoff=5,add_unk=True):
        temp = {}
        token_counts = {}
        for row in df.itertuples():
            # count the frequency of each character (i.e., token) in the surnames
            for token in list(row.surname.lower()):
                if token in token_counts:
                    token_counts[token] += 1
                else:
                    token_counts[token] = 1
                    
        for token in token_counts.keys():
            # keep only tokens appearing at least as often as the cutoff
            if token_counts[token] >= cutoff:
                # start indices at 1; index 0 is reserved for the unk token
                # added during initialization
                idx = len(temp)
                temp[idx+1] = token
        
        return cls(temp,add_unk)
        
    def add_token(self,token):
        if token in self._token_to_index:
            idx = self._token_to_index[token]
        else:
            idx = len(self._token_to_index)
            self._token_to_index[token] = idx
            self._index_to_token[idx] = token
        return idx
        
    def lookup_token(self,token):
        if token in self._token_to_index:
            return self._token_to_index[token]
        else:
            return self._token_to_index[self._unk_token]
    
    def lookup_index(self,index):
        if index in self._index_to_token:
            return self._index_to_token[index]
        else:
            raise KeyError('index (%d) is not in the vocabulary' % index)
            
    def __len__(self):
        return len(self._index_to_token)
    
    def __str__(self):
        return 'SurnameVocabulary object(size={})'.format(len(self._index_to_token))
        

Dictionary example

Now that our class is prepared, we will initialize the dictionary using our surname dataset.

We can see below the dictionary’s mapping, where each index has a corresponding token. We also have @ as an unknown token (for handling new characters).

s = SurnameVocabulary.build_from_dataframe(df,cutoff=1)

print(s._index_to_token)
{0: '@', 1: 't', 2: 'o', 3: 'a', 4: 'h', 5: 'b', 6: 'u', 7: 'd', 8: 'f', 9: 'k', 10: 'r', 11: 'y', 12: 's', 13: 'e', 14: 'g', 15: 'c', 16: 'm', 17: 'i', 18: 'n', 19: 'w', 20: 'l', 21: 'z', 22: 'q', 23: 'j', 24: '-', 25: 'p', 26: 'x', 27: ':', 28: 'v', 29: '1', 30: '/', 31: 'é', 32: "'", 33: 'ç', 34: 'ê', 35: 'ß', 36: 'ö', 37: 'ä', 38: 'ü', 39: 'ú', 40: 'à', 41: 'ò', 42: 'è', 43: 'ó', 44: 'ù', 45: 'ì', 46: 'ś', 47: 'ą', 48: 'ń', 49: 'á', 50: 'ż', 51: 'ł', 52: 'õ', 53: 'ã', 54: 'í', 55: 'ñ'}
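
Using the object s built above, we can quickly verify the mapping functions. A character that never appeared in the surnames (here the digit 9, absent from the mapping above) falls back to the index of the unknown token @:

print(s.lookup_token('t'))  # 1
print(s.lookup_index(1))    # 't'
print(s.lookup_token('9'))  # 0, the index of the unk token '@'
print(len(s))               # 56 entries in total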

Let’s now move to the second component of the vectorization process, i.e., the vectorizer.

Vectorizer

The vectorizer utilizes the dictionary to transform text data into a vector of numbers. The vectorization process can be summarized as follows:

  • The vectorizer first breaks the text into tokens (e.g., breaking a surname into characters)
  • It then creates a vector consisting of zeros of the same size as the dictionary.
  • Finally, the vectorizer takes a surname and iterates over it; for each character, it first fetches the character’s index from the dictionary and then sets a 1 in the vector at that index.

We have implemented the vectorization process in the form of a class given below.

The vectorizer class maintains two dictionaries, one for surnames and another for labels (nationalities). We also included a classmethod, build_vectorizer, that initializes the vectorizer. This is just for the sake of keeping all the vectorizer-related code in one place; build_vectorizer could equally be written as external code.

class SurnameVectorizer(object):
    def __init__(self,surname_vocab,label_vocab):
        super().__init__()
        self._surname_vocab = surname_vocab
        self._label_vocab = label_vocab
        
        
    def vectorize(self,surname):
        vector = [0]*len(self._surname_vocab)
        for token in list(surname.lower()):
            
            idx = self._surname_vocab.lookup_token(token)
            # uncomment to trace each token and its index
            # (the demonstration output below was produced with this print enabled)
            #print('Tok:',token,'Index:',idx)
            vector[idx] = 1
        
        return np.array(vector,dtype=np.float32)
    
    @classmethod
    def build_vectorizer(cls,df,cutoff):
        vocab1 = SurnameVocabulary.build_from_dataframe(df,add_unk=True,cutoff=cutoff)
        vocab2 = SurnameVocabulary({},add_unk=False)
        
        for label in df['nationality'].unique():
            vocab2.add_token(label)
        return cls(vocab1,vocab2)
        

Vectorizer example

We will now see how the vectorizer transforms a surname into a vector. For the demonstration, we will use the surname Totah. The vectorizer starts by breaking the surname down into characters and then fetching the index of each character from the SurnameVocabulary. It then sets a 1 at the corresponding position in a vector of the same size as the vocabulary (56 here, since we build it with cutoff=0).

print('Surname:Totah\n\n')

vec = SurnameVectorizer.build_vectorizer(df,cutoff=0)
surname_vector = vec.vectorize('Totah')
print('\n\nVector:',surname_vector)
Surname:Totah


Tok: t Index: 1
Tok: o Index: 2
Tok: t Index: 1
Tok: a Index: 3
Tok: h Index: 4


Vector: [0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
Important

Notice that the resulting vector only uses 1 to represent the presence and 0 the absence of a character. We therefore lose information when a character occurs more than once (as t does in the example above). One solution is to use character frequencies in the vector instead, as sketched below.
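
For illustration, here is a minimal sketch of such a frequency-based variant (a hypothetical vectorize_counts method on SurnameVectorizer; it is our own addition, not part of the post’s final code):

    def vectorize_counts(self, surname):
        # count occurrences instead of recording mere presence
        vector = [0] * len(self._surname_vocab)
        for token in list(surname.lower()):
            idx = self._surname_vocab.lookup_token(token)
            vector[idx] += 1  # 'Totah' would now store a 2 at the index of 't'
        return np.array(vector, dtype=np.float32)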

Preparing the dataset

This is the step where we prepare our dataset for training. In particular, we take care of transforming the data into tensors and of handling the indexing while fetching data in smaller batches.

PyTorch provides two classes, Dataset and DataLoader, to abstract these small but crucial stages of data preparation. We will utilize both to prepare our data for training.

In the following class, we inherit from the Dataset class and must define two of its functions, __len__ and __getitem__. The first returns the size of the dataset, while the second fetches and returns the data instance stored at a specified index.

In the initialization function (i.e., __init__), we also add some code to fetch data per split (e.g., train, test, val).

Important

The __getitem__ function returns a vector for the specified surname but, for the nationality, simply returns its index. This is because PyTorch’s cross-entropy loss accepts class indices directly when computing the loss (we will see this later).
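
As a quick illustration of this point: nn.CrossEntropyLoss takes raw scores for each class together with integer class indices, so no one-hot encoding of the labels is required (random tensors below, for demonstration only):

import torch
from torch import nn

logits = torch.randn(4, 18)           # raw scores for a batch of 4 surnames over 18 classes
labels = torch.tensor([15, 0, 7, 3])  # integer class indices, not one-hot vectors
print(nn.CrossEntropyLoss()(logits, labels))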

import torch
from torch.utils.data import Dataset, DataLoader

class SurnameDataset(Dataset):
    def __init__(self,df,split="train",cutoff=2):
        super().__init__()
        self._df = df
        self._vectorizer = SurnameVectorizer.build_vectorizer(df,cutoff=cutoff)     
        self._target_store = {}
        
        for split_type in df.split.unique():
            tdf = df.loc[df['split'] == split_type,:]
            self._target_store[split_type]={'dataframe':tdf,'size':tdf.shape[0]}
        
        self._target_df = self._target_store[split]['dataframe']
        self._target_size = self._target_store[split]['size']
        
        # compute class weights (one per nationality) to counter class imbalance
        counts = df['nationality'].value_counts().to_dict()
        
        def sort_key(k):
            return self._vectorizer._label_vocab.lookup_token(k[0])
        
        sorted_counts = sorted(counts.items(),key=sort_key)
        
        weights = [c for _, c in sorted_counts]
        
        self.class_weights = 1 / torch.tensor(weights,dtype=torch.float32)
        
    def set_split(self,split):
        self._target_df = self._target_store[split]['dataframe']
        self._target_size = self._target_store[split]['size']
    
    def __len__(self):
        return self._target_size
    
    def __getitem__(self,index):
        data = self._target_df.iloc[index,:].to_dict()
    
        x = self._vectorizer.vectorize(data['surname'])
        y = self._vectorizer._label_vocab.lookup_token(data['nationality'])
        
        return {'input_x':x,
               'label':y}

Dataset Example

We will now see an example of using our SurnameDataset class. We create an object of the class and then fetch a particular data instance along with its label.

data = SurnameDataset(df)

#dictionary size
print('Size:',len(data._vectorizer._surname_vocab))
# fetching a data instance
data[1214]
Size: 46
{'input_x': array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32),
 'label': 1}
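
As a sanity check, we can decode this vector back into characters by mapping the positions of the 1s through the vocabulary (note that character order and repetition are lost, since the vector only encodes presence):

sample = data[1214]

# indices of the 1s, mapped back to characters
chars = [data._vectorizer._surname_vocab.lookup_index(i)
         for i, v in enumerate(sample['input_x']) if v == 1]
print(chars)

# the label index mapped back to its nationality
print(data._vectorizer._label_vocab.lookup_index(sample['label']))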

During training, we need to feed data in batches for efficiency reasons. We can either group data instances into batches ourselves or use the DataLoader class.

The following code creates a batch iterator using the DataLoader class. The iterator returns a batch consisting of a specified number of data instances (e.g., 100). We can specify the size in the batch_size argument. The drop_last argument drops the last batch if it is smaller than the batch size.

data_loader = DataLoader(data,batch_size=100,drop_last=True)

# fetching the first batch 
first_batch = next(iter(data_loader))

print(first_batch['input_x'].size())
torch.Size([100, 46])

Building a surname classifier using neural networks with PyTorch

We will now build the architecture of our surname classifier. For the sake of simplicity, we use an architecture with a single hidden layer.

To build a neural network, we need to inherit from the Module class and specify a forward function. In the following class, we create two layers during initialization. In the forward function, we pass the input data through the first layer, apply the ReLU activation function to its output, and pass the result to the second layer.

Finally, we apply softmax which transforms the output of the second layer into probabilities across 18 nationalities.

One thing to note here is that we apply the softmax function only when the flag apply_softmax is True. We don’t need it during training because the cross-entropy loss works directly on the second layer’s raw output (it applies log-softmax internally). During prediction, however, we need the probabilities.

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim


class SurnameClassifier(nn.Module):
    def __init__(self,input_dim,hidden_dim,output_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_features=input_dim,out_features = hidden_dim)
        self.fc2 = nn.Linear(in_features=hidden_dim,out_features = output_dim)
        
    def forward(self,input_x, apply_softmax=False):
        intermediate = F.relu(self.fc1(input_x))
        output = self.fc2(intermediate)
        final = output
        if apply_softmax:
            final = F.softmax(output, dim=1)
        return final
      
# a variant of the classifier that adds dropout on the hidden layer (not used below)
class SurnameClassifierV1(nn.Module):
    def __init__(self,input_dim,hidden_dim,output_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_features=input_dim,out_features = hidden_dim)
        self.fc2 = nn.Linear(in_features=hidden_dim,out_features = output_dim)
        
    def forward(self,input_x, apply_softmax=False):
        intermediate = F.relu(self.fc1(input_x))
        # pass training=self.training so dropout is disabled in eval mode
        output = self.fc2(F.dropout(intermediate, p=0.05, training=self.training))
        final = output
        if apply_softmax:
            final = F.softmax(output, dim=1)
        return final
        
def compute_accuracy(y_pred,y_true):
    # index of the highest score for each sample
    _, y_pred_indices = y_pred.max(dim=1)
    
    # number of correct predictions
    acc = torch.eq(y_pred_indices,y_true).sum().item()
    
    return acc*100/len(y_true)
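
As a quick sanity check of compute_accuracy on toy tensors (an example of our own, not drawn from the dataset):

y_pred = torch.tensor([[0.1, 2.0, 0.3],
                       [1.5, 0.2, 0.1]])  # scores for 2 samples across 3 classes
y_true = torch.tensor([1, 2])             # the second prediction (class 0) is wrong

print(compute_accuracy(y_pred, y_true))   # 50.0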

Training and Testing the surname classifier

We now move to the final part of this post, i.e., training the neural network and testing it.

For the training, we will follow this procedure:

  • Initialize the classifier, loss function, and optimizer
  • Iterate over several epochs (one epoch means one complete pass over the dataset for training)
  • In each epoch, set the classifier in training mode and iterate over the entire dataset by fetching data in the form of batches
  • For each batch, first zero the optimizer’s gradients (from the previous run), perform a forward pass by computing the output from the input, compute the loss, make a backward pass, and update the weight parameters.

input_layer_size = 46 # equals the size of the surname dictionary (with cutoff=2)
hidden_layer_size = 300 
output_layer_size = 18 # equals the number of nationalities in the dataset


data = SurnameDataset(df,cutoff=2)

# train split
data.set_split('train')
dataloader_train = DataLoader(data,batch_size=64)

# initializing the classifier:
# 46 is the size of the dictionary (with cutoff=2), so the input layer size matches it
# 18 is the total number of nationalities in the dataset, so the output layer size matches it
# hidden layer size
classifier = SurnameClassifier(input_layer_size,hidden_layer_size,output_layer_size)

# loss function: cross-entropy, weighted by inverse class frequency
loss_fnc = nn.CrossEntropyLoss(weight=data.class_weights)

# optimizer to update the weight parameters
adam = optim.Adam(classifier.parameters(),lr=.001)

# to store accuracy and loss metrics
acc_data = []
loss_data = []


for e in range(100):
    running_loss = 0.0
    running_acc = 0.0
    classifier.train()

    for batch_index, batch in enumerate(dataloader_train):
        
        # zeros the gradients
        adam.zero_grad()
        
        # forward pass
        output = classifier(batch['input_x']) 

        # compute the loss
        loss = loss_fnc(output,batch['label'])       
        
        # fetch the loss value as a Python float (loss is a scalar tensor)
        loss_t = loss.item()
        
        # compute accuracy
        acc_t = compute_accuracy(output,batch['label'])
        
        # update the running loss and running accuracy
        running_loss += (loss_t - running_loss)/(batch_index + 1)
        running_acc += (acc_t - running_acc)/(batch_index + 1)
        
        # backward pass
        loss.backward()
        
        # update the parameters
        adam.step()
    
    
    # add running loss and running accuracy to the list
    acc_data.append(running_acc)
    loss_data.append(running_loss)
    
print('Final:',running_loss,'  ',running_acc)
Final: 0.9167288458595673    68.61979166666667
import matplotlib.pyplot as plt

plt.plot(list(range(1,100+1)),acc_data,label='Training Accuracy')
plt.legend()
plt.title('Training accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
[Figure: training accuracy over the 100 training epochs]

Testing the classifier using validation set

Here, we will use the validation set for evaluation. The code is similar to the training code, except that we set the classifier to evaluation mode and, since we make no backward pass or parameter update, we also disable gradient tracking with torch.no_grad().

# setting split for validation data
data.set_split('val')
dataloader_val = DataLoader(data,batch_size=64)
print('Val data size:',len(data))
running_loss = 0.0
running_acc = 0.0

classifier.eval()

with torch.no_grad():  # gradients are not needed during evaluation
    for batch_index,batch in enumerate(dataloader_val):
        
        output = classifier(batch['input_x'])
        
        loss = loss_fnc(output,batch['label'])
        
        loss_t = loss.item()
        
        acc_t = compute_accuracy(output,batch['label'])
        
        running_loss += (loss_t - running_loss)/(batch_index + 1)
        running_acc  += (acc_t - running_acc)/(batch_index + 1)
    
print('Validation Loss:%.2f'%running_loss)
print('Validation Accuracy:%.2f'%running_acc)
Val data size: 1640
Validation Loss:1.82
Validation Accuracy:60.07

Testing the classifier using test set

Here, we will apply our classifier to the test set, following the same procedure as with the validation set.

# setting split for test data
data.set_split('test')
print('Test data size:',len(data))
dataloader_test = DataLoader(data,batch_size=100)

running_loss = 0.0
running_acc = 0.0
classifier.eval()
with torch.no_grad():  # gradients are not needed during evaluation
    for batch_index,batch in enumerate(dataloader_test):
        
        output = classifier(batch['input_x'])
        
        loss = loss_fnc(output,batch['label'])
        
        loss_t = loss.item()
        
        acc_t = compute_accuracy(output,batch['label'])
        
        running_loss += (loss_t - running_loss)/(batch_index + 1)
        running_acc  += (acc_t - running_acc)/(batch_index + 1)
    
print('Loss:%.2f'%running_loss)
print('Accuracy:%.2f'%running_acc)
Test data size: 1660
Loss:1.89
Accuracy:59.88

Predicting nationality of a given surname

Now, we will illustrate how to use our classifier to predict the nationality of a given surname.

The following function first applies vectorization to the given surname and then converts the vector into a tensor with a batch dimension of 1 (using view(1, -1)). Next, the classifier performs its computation on the tensor. This time we want probabilities, which is why we set apply_softmax to True.

From the obtained probabilities, we extract the highest one along with its index, and use that index to look up the associated nationality.

def predict_nationality(surname,vectorizer,classifier):
    # apply vectorization
    surname_vector = vectorizer.vectorize(surname)
    
    # convert into a tensor with a batch dimension of 1
    surname_input = torch.tensor(surname_vector).view(1,-1)
    
    # perform prediction
    output = classifier(surname_input,apply_softmax=True)
    
    # get highest probability and its index
    val,ind = torch.max(output,dim=1)
    
    
    print('Prediction:',vectorizer._label_vocab.lookup_index(ind.item()),'({:.2f}%)'.format(100*val.item()))
predict_nationality('María',data._vectorizer,classifier)
Prediction: Spanish (99.88%)
names_to_predict = ['García','Linda']

for name in names_to_predict:
    print('\nName:',name)
    predict_nationality(name,data._vectorizer,classifier)
    print('-'*30)

Name: García
Prediction: Spanish (99.88%)
------------------------------

Name: Linda
Prediction: Italian (42.89%)
------------------------------
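
As an optional extension, a hypothetical top-k helper (our own sketch, built on torch.topk rather than torch.max) could report the k most likely nationalities instead of just the single best guess:

def predict_topk(surname, vectorizer, classifier, k=3):
    surname_input = torch.tensor(vectorizer.vectorize(surname)).view(1, -1)
    output = classifier(surname_input, apply_softmax=True)
    
    # k highest probabilities and their indices
    probs, indices = torch.topk(output, k=k, dim=1)
    for p, i in zip(probs[0], indices[0]):
        print('{} ({:.2f}%)'.format(
            vectorizer._label_vocab.lookup_index(i.item()), 100 * p.item()))

predict_topk('Linda', data._vectorizer, classifier)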