Introduction to word embeddings with hands-on exercises
python
nlp
pytorch
Author
Pankaj Chejara
Published
January 22, 2024
A word embedding is a numeric representation of text in natural language processing. The representation is in the form of a vector where each number represents a dimension (or a specific attribute). Let’s take an example to understand it further.
Consider the following two sentences
A likes to drink coffee in the morning
B likes to drink tea in the morning
These two sentences are syntactically and semantically similar. To transform them into vector form, we can consider attributes such as likes to drink, time of the day, etc.
We can see that A and B both like to drink some beverages in the morning. Therefore, in our vector representations of these two imaginary persons (i.e., A, B), the values for the attribute like to drink should be close to each other. Similarly, there could be another attribute of time of the day when they like to drink. Such attributes can be decided based on the text data.
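To make this concrete, here is a minimal sketch with made-up attribute values (the numbers are purely illustrative, not learned from data) showing what such hand-crafted vectors could look like and how their closeness can be measured:

import numpy as np

# hypothetical hand-crafted vectors with two attributes:
# [likes to drink a beverage, time of the day (morning)]
person_a = np.array([0.90, 0.80])  # A likes to drink coffee in the morning
person_b = np.array([0.85, 0.80])  # B likes to drink tea in the morning

# cosine similarity close to 1 means the two vectors are very similar
similarity = np.dot(person_a, person_b) / (np.linalg.norm(person_a) * np.linalg.norm(person_b))
print(similarity)

The two vectors point in almost the same direction, reflecting the similarity of the two sentences.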
However, manually crafting such attributes is a very resource-intensive process and does not scale to massive text corpora.
To solve this problem, machine learning is used to learn these multi-dimensional vector representations, widely known as word embeddings. Word embeddings aim to capture the semantic meaning in addition to the syntactic information of the text data, and they have been found to be highly successful in solving a variety of NLP tasks, e.g., language modeling, text generation, etc.
Word embeddings can be developed from scratch in the context of the task at hand. Additionally, some pre-trained word embeddings are publicly available, including Google's Word2Vec, Stanford's GloVe, and Facebook's FastText.
In this post, we will see how to use these pre-trained word embeddings in Python. We will also build a small application based on word embeddings to illustrate their use.
Pre-trained word embeddings
To use the pre-trained word embeddings, we will use Gensim, a Python library that simplifies working with the aforementioned word embeddings. The library also offers downloading functionality for a number of word embeddings, which can be listed as shown below.
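As a quick sketch of how to see what is available (using Gensim's downloader module, gensim.downloader), we can list the names of all downloadable pre-trained models:

import gensim.downloader as api

# print the names of all pre-trained models offered by Gensim's downloader,
# e.g., word2vec-google-news-300, glove-wiki-gigaword-100, fasttext-wiki-news-subwords-300
for model_name in api.info()['models'].keys():
    print(model_name)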
Building an automatic word analogy task solver using word embeddings
The analogy task tests the relationships between words. It requires predicting a word that completes the analogy implied by a given pair of words.
It is given in the form of a:b::c:?.
Here, a and b have some kind of relationship.
The task is to find a word that has a similar relationship with `c`.
For this task, we will use the pre-trained GloVe embeddings. We can download them using the load function (which we employed in our earlier example).
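For example, assuming we use the 100-dimensional glove-wiki-gigaword-100 model (other GloVe variants are also available through the downloader), the embeddings can be loaded as follows; the variable name embeds matches the code below:

import gensim.downloader as api

# download (on first use) and load the pre-trained GloVe word embeddings
embeds = api.load('glove-wiki-gigaword-100')

# each word is mapped to a 100-dimensional vector
print(embeds['coffee'].shape)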
We now have our word embeddings downloaded and loaded for use. Next, we will take the user's input for a, b, and c and predict d. We will follow these steps:
We will first capture the relationship between a and b by subtracting the vector of word a from the vector of word b.
We will use this difference vector as the relationship and add it to the word embedding of c.
The resultant word embedding will then be used to find the most similar word embeddings.
We will print the first word with the highest similarity measure that is not one of the input words.
# Taking input from the user
a = input('Enter a: ')
b = input('Enter b: ')
c = input('Enter c: ')
Enter a: King
Enter b: Man
Enter c: Queen
# Capturing the relationship between `a` and `b`
embed_a = embeds[a.lower()]
embed_b = embeds[b.lower()]
embed_c = embeds[c.lower()]
rel_a_b = embed_b - embed_a
embed_d = embed_c + rel_a_b
In the above code, we first extracted the word embeddings for the words entered by the user. Then, we subtracted the word embedding of a from that of b. Here, we assume that the relationship captured through subtraction (i.e., b - a) is similar to the relationship between c and d (i.e., d - c).
We used the relationship vector and added it to the word embedding of c. This gave us the word embedding for our resultant word.
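As a quick sanity check of this assumption, we can compare the relationship vectors of two pairs whose analogy we already know (e.g., king:man and queen:woman); a cosine similarity close to 1 would indicate that the two relationships point in a similar direction:

import numpy as np

# relationship vectors captured from two known analogous pairs
rel_king_man = embeds['man'] - embeds['king']
rel_queen_woman = embeds['woman'] - embeds['queen']

# cosine similarity between the two relationship vectors
cos_sim = np.dot(rel_king_man, rel_queen_woman) / (np.linalg.norm(rel_king_man) * np.linalg.norm(rel_queen_woman))
print(cos_sim)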
Finally, we search through all word embeddings and extract the one with the highest similarity measure. For that, we will use the similar_by_vector function from the Gensim library.
# finding similar words
pred_d = embeds.similar_by_vector(embed_d, topn=5)
print(pred_d)
We extracted the top 5 words with the highest similarity measure. We will next iterate over this list of words and print the first word which is not among the words entered by the user (i.e., a, b, c).
for word in pred_d:
    if word[0] not in [a.lower(), b.lower(), c.lower()]:
        print(word[0])
        break
woman
Let’s put together our code and run it again.
# Taking input from the user
a = input('Enter a: ')
b = input('Enter b: ')
c = input('Enter c: ')

print('\n\nGiven task:')
print('-'*40)
print('{}:{}::{}:?'.format(a, b, c))
print('-'*40)

# getting word embeddings for a, b, c
embed_a = embeds[a.lower()]
embed_b = embeds[b.lower()]
embed_c = embeds[c.lower()]

# compute the relationship between b and a
rel_a_b = embed_b - embed_a

# approximate the word embedding of d using the captured relationship
embed_d = embed_c + rel_a_b

# extract the most similar words
pred_d = embeds.similar_by_vector(embed_d, topn=5)

d = ''
# find the most similar word that is not one of the input words
for word in pred_d:
    if word[0] not in [a.lower(), b.lower(), c.lower()]:
        d = word[0]
        break

print('\n\nSolution:')
print('='*20)
print('{}:{}::{}:{}'.format(a, b, c, d))
Enter a: father
Enter b: grandfather
Enter c: mother
Given task:
----------------------------------------
father:grandfather::mother:?
----------------------------------------
Solution:
====================
father:grandfather::mother:grandmother