Introduction to word embeddings with hands-on exercises

python
nlp
pytorch
Author

Pankaj Chejara

Published

January 22, 2024

A word embedding is a numeric representation of a word used in natural language processing. The representation takes the form of a vector where each number corresponds to a dimension (or a specific attribute). Let’s take an example to understand it further.

Consider the following two sentences:

A likes to drink coffee in the morning

B likes to drink tea in the morning

These two sentences are syntactically and semantically similar. To transform these sentences into vector form, we can consider attributes such as like to drink, morning, etc.

We can see that A and B both like to drink some beverage in the morning. Therefore, in our vector representations of these two imaginary persons (i.e., A and B), the values for the attribute like to drink should be close to each other. Similarly, there could be another attribute for the time of day when they like to drink. Such attributes can be decided based on the text data.
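As a minimal sketch (with made-up attribute names and values, purely for illustration), such hand-crafted vectors could look like the following.

# hypothetical hand-crafted attribute vectors for persons A and B
# dimensions: [likes to drink a beverage, drinks it in the morning]
person_a = [0.9, 1.0]   # A likes to drink coffee in the morning
person_b = [0.9, 1.0]   # B likes to drink tea in the morning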

However, manually crafting such attributes is a very resource-intensive process and may take forever to cover massive text corpora.

To solve this problem, machine learning is used to learn these multi-dimensional vector representations, widely known as word embeddings. Word embeddings aim to capture the semantic meaning in addition to the syntactic information of the text data, and they have been found to be highly successful in solving a variety of NLP tasks, e.g., language modeling, text generation, etc.

Word embeddings can be trained from scratch in the context of the task at hand. Additionally, there are pre-trained word embeddings that are publicly available, such as Google’s Word2Vec, Stanford’s GloVe, and Facebook’s FastText.

In this post, we will see how to use these pre-trained word embeddings in Python. We will also build a small application based on word embeddings to illustrate their use.

Pre-trained word embeddings

To use the pre-trained word embeddings, we will use Gensim, a Python library that simplifies working with the aforementioned word embeddings. The library also offers functionality to download a number of word embeddings, which are listed below.

import gensim.downloader

# list all pre-trained models available through Gensim's downloader
print("\n".join(list(gensim.downloader.info()['models'].keys())))
fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis

We can download any of the aforementioned word embeddings using gensim.downloader. For example, the following code downloads the Word2Vec embeddings trained on the Google News corpus.

# let's download the word2vec embeddings
embeds = gensim.downloader.load('word2vec-google-news-300')

# get the word embedding for the word 'king'
king_embeddings = embeds['king']

print('Word embedding size:', len(king_embeddings))
Word embedding size: 300

The above vector contains 300 floating-point numbers (i.e., 300 dimensions) representing the word king.
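You can also peek at the raw values to confirm that the dimensions hold floating-point numbers (the exact values depend on the pre-trained model).

# inspect the first few dimensions of the 'king' vector
print(king_embeddings[:5])
print(king_embeddings.dtype)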

Let’s now see how close this word is to another word, queen. We can use the similarity function, which computes the cosine similarity between two word vectors.

print('Similarity:',embeds.similarity('king','queen'))
Similarity: 0.6510957
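To see what similarity computes under the hood, here is a quick sketch that recomputes the same value directly with NumPy; it should match the value printed by Gensim above.

import numpy as np

# cosine similarity = dot product divided by the product of the L2 norms
v1, v2 = embeds['king'], embeds['queen']
print('Cosine similarity:', np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))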

Building an automatic word analogy task solver using word embeddings

The analogy task aims to test relationships between words. It requires predicting a word that completes the same relationship as a given pair of words.

It is given in the form of a:b::c:?

  • Here, a and b have some kind of relationship.
  • The task is to find a word d that has a similar relationship with c.

To do this task, we will use pre-trained GloVe embeddings. We can download the GloVe embeddings using the load function (which we employed in the example above).

embeds = gensim.downloader.load('glove-wiki-gigaword-100')
[==================================================] 100.0% 128.1/128.1MB downloaded

We have our word embeddings downloaded and loaded for use. Next, we will take the user’s input for a, b, and c, and predict d. We will follow these steps:

  • We will first capture the relationship between a and b. We will do that by subtracting the vector of word a from the vector of word b.
  • We will use the difference vector as the relationship and add it to the word embedding of c.
  • The resultant word embedding will then be used to find the most similar word embeddings.
  • We will print the word with the highest similarity measure that is not one of the input words.
# Taking input from the user
a = input('Enter a: ')
b = input('Enter b: ')
c = input('Enter c: ')
Enter a: King
Enter b: Man
Enter c: Queen
# Capturing the relationship between `a` and `b`

embed_a = embeds[a.lower()]
embed_b = embeds[b.lower()]
embed_c = embeds[c.lower()]

rel_a_b = embed_b - embed_a

embed_d = embed_c + rel_a_b

In the above code, we first extracted the word embeddings for the words entered by the user. Then, we subtracted the word embedding of a from that of b. The assumption is that the relationship captured through this subtraction (i.e., b - a) is similar to the relationship between c and d (i.e., d - c).

We then added this relationship vector to the word embedding of c, which gave us an approximate word embedding for the resultant word d.

Finally, we search through all word embeddings and extract the one with the highest similarity measure. For that, we will use the similar_by_vector function from the Gensim library.

# finding similar words 
pred_d = embeds.similar_by_vector(embed_d, topn=5)

print(pred_d)
[('woman', 0.8039792776107788), ('man', 0.7791377305984497), ('girl', 0.7349346280097961), ('she', 0.6817952394485474), ('her', 0.6592202186584473)]

We extracted the top 5 words with the highest similarity measure. We will next iterate over this list of words and print the first word that is not among the words entered by the user (i.e., a, b, c).

# print the first similar word that is not one of the (lowercased) input words
for word in pred_d:
    if word[0] not in [a.lower(), b.lower(), c.lower()]:
        print(word[0])
        break
woman

Let’s put together our code and run it again.

# Taking input from the user
a = input('Enter a: ')
b = input('Enter b: ')
c = input('Enter c: ')

print('\n\nGiven task:')
print('-'*40)
print('{}:{}::{}:?'.format(a,b,c))
print('-'*40)

# getting word embeddings for a,b,c
embed_a = embeds[a.lower()]
embed_b = embeds[b.lower()]
embed_c = embeds[c.lower()]

# compute relationship between b and a
rel_a_b = embed_b - embed_a

# approximate the word embedding of d using the captured relationship
embed_d = embed_c + rel_a_b

# extract most similar words
pred_d = embeds.similar_by_vector(embed_d, topn=5)

d = ''

# find the first similar word that is not one of the (lowercased) input words
for word in pred_d:
    if word[0] not in [a.lower(), b.lower(), c.lower()]:
        d = word[0]
        break
        
print('\n\nSolution:')
print('='*20)
print('{}:{}::{}:{}'.format(a,b,c,d))
Enter a: father
Enter b: grandfather
Enter c: mother


Given task:
----------------------------------------
father:grandfather::mother:?
----------------------------------------


Solution:
====================
father:grandfather::mother:grandmother
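The same logic can also be wrapped into a small reusable function. The sketch below (the function name solve_analogy is our own) additionally checks that all three words exist in the embedding vocabulary, since a missing word would otherwise raise a KeyError.

def solve_analogy(embeds, a, b, c, topn=5):
    """Solve a:b::c:? using vector arithmetic on word embeddings."""
    a, b, c = a.lower(), b.lower(), c.lower()

    # make sure every word is present in the vocabulary
    for word in (a, b, c):
        if word not in embeds:
            raise ValueError(word + ' is not in the embedding vocabulary')

    # d is approximated as c + (b - a)
    embed_d = embeds[c] + (embeds[b] - embeds[a])

    # return the most similar word that is not one of the inputs
    for word, score in embeds.similar_by_vector(embed_d, topn=topn):
        if word not in (a, b, c):
            return word
    return None

print(solve_analogy(embeds, 'father', 'grandfather', 'mother'))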
