Basics of Natural Language Processing in Python

python
nlp
nltk
Author

Pankaj Chejara

Published

October 23, 2023

Natural language processing is an interdisciplinary field combining computer science, artificial intelligence, and linguistics. It focuses on the processing of natural languages (i.e., human languages such as English and Hindi), with the aim of comprehending those languages and generating text in them.

Natural language processing thus enables computers to understand and process human languages, and it offers the enormous potential of building intelligent machines capable of understanding human input in their own languages.

In this post, we will become familiar with the basics of natural language processing in Python. We will use the NLTK library for the tutorial. The post is targeted at beginners who are just starting to gain first-hand experience with natural language processing.

Front image by Andrea De Santis on Unsplash.

Installation

First, we need to install the nltk library. The following command can be used to do that.

pip install nltk
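NLTK keeps its models and corpora (tokenizer models, taggers, WordNet, etc.) as separate downloads. If you want to fetch everything used in this tutorial up front, here is a small sketch (each download only needs to run once):

import nltk

# Tokenizer models used by sent_tokenize/word_tokenize
nltk.download('punkt')
# Model used by pos_tag
nltk.download('averaged_perceptron_tagger')
# Lexical database used by WordNetLemmatizer
nltk.download('wordnet')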

Basic concepts

Now we will get familiar with the basic concepts of natural language processing. This processing takes place in multiple steps, and each step achieves a higher level of abstraction. Let’s understand some basic steps first.

When a text is presented to a computer, it sees the text only as a sequence of characters. So the first step focuses on breaking that sequence of characters into sentences, and then each sentence into words. This step is known as Tokenization.

Next, the sentence structure is understood from a grammatical point of view. This involves identifying nouns, verbs, objects, etc. This step is known as Parts-of-Speech Tagging.

Once the relevant grammatical tags are identified for each word in the sentence, the next step applies a transformation that replaces words with their base form (for example, transforming running into run). This step is known as Stemming/Lemmatization. We will discuss the differences between the two later.

The final step converts the text data into numbers for the computer’s usage. This step is known as Vectorization.

Now we will cover these topics one by one. The list of topics is provided below.

  • Tokenization
  • Parts of Speech Tagging
  • Stemming/Lemmatization
  • Vectorization

Tokenization

Let’s start with tokenization, the first step in the process. Tokenization simply breaks text data down into smaller units, such as sentences, words, and numbers, for analysis purposes. These units are also known as tokens.

The following program performs tokenization, first breaking the text into a group of sentences and then breaking each sentence into a group of words.

from nltk import word_tokenize, sent_tokenize
import nltk

# Uncomment the statement below if you get a tokenizer-not-found error
# nltk.download('punkt')

text = """This post offers basics of natural language processing (NLP) in Python. 
    NLP enables computers to understand human languages. It combines linguistics with statistical modeling.
    """

# Breaking the text data into sentences
sentences = sent_tokenize(text)
print('Sentences:\n',sentences)

# Breaking each sentence into words
words = [word_tokenize(sentence) for sentence in sentences]
print('Words:\n',words)
Sentences:
 ['This post offers basics of natural language processing (NLP) in Python.', 'NLP enables computers to understand human languages.', 'It combines linguistics with statistical modeling.']
Words:
 [['This', 'post', 'offers', 'basics', 'of', 'natural', 'language', 'processing', '(', 'NLP', ')', 'in', 'Python', '.'], ['NLP', 'enables', 'computers', 'to', 'understand', 'human', 'languages', '.'], ['It', 'combines', 'linguistics', 'with', 'statistical', 'modeling', '.']]
Tip
You only need to run nltk.download('punkt') once.

POS tagging

Let’s now move to understanding the grammatical structure of sentences. This step involves identifying whether a word in the sentence is a noun, verb, adverb, etc. It is known as Parts-of-Speech or POS tagging.

Let’s see our first example of tagging.

from nltk import word_tokenize, sent_tokenize
from nltk import pos_tag
import nltk

# Uncomment the statement below if you get a tagger-not-found error
# nltk.download('averaged_perceptron_tagger')

# using only word tokenization because there is only one sentence in the text data.
words = word_tokenize('Estonia is a leading country in digital space.')

# applying parts-of-speech tagging
tags = pos_tag(words)

print(tags)
[('Estonia', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'JJ'), ('country', 'NN'), ('in', 'IN'), ('digital', 'JJ'), ('space', 'NN'), ('.', '.')]
Important
You can check the meaning of each tag using the nltk.help.upenn_tagset() function.
# checking the meaning of NN
nltk.help.upenn_tagset('NN')
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...

The result is provided in the form of a list of tuples. Each tuple contains a word and the corresponding word category (e.g., VBZ for verb). Such a tuple is also known as a tagged token. You can read about it in more detail here.

The complete list is given below.

Tag   Description
CC    coordinating conjunction
CD    cardinal number
DT    determiner
EX    existential there (as in “there is”; think of it like “there exists”)
FW    foreign word
IN    preposition/subordinating conjunction
JJ    adjective (‘big’)
JJR   adjective, comparative (‘bigger’)
JJS   adjective, superlative (‘biggest’)
LS    list marker
MD    modal (‘could’, ‘will’)
NN    noun, singular (‘desk’)
NNS   noun, plural (‘desks’)
NNP   proper noun, singular (‘Harrison’)
NNPS  proper noun, plural (‘Americans’)
PDT   predeterminer (‘all the kids’)
POS   possessive ending (‘parent’s’)
PRP   personal pronoun (‘I’, ‘he’, ‘she’)
PRP$  possessive pronoun (‘my’, ‘his’, ‘hers’)
RB    adverb (‘very’, ‘silently’)
RBR   adverb, comparative (‘better’)
RBS   adverb, superlative (‘best’)
RP    particle (‘give up’)
TO    to (‘go to the store’)
UH    interjection (‘errrrrrrrm’)
VB    verb, base form (‘take’)
VBD   verb, past tense (‘took’)
VBG   verb, gerund/present participle (‘taking’)
VBN   verb, past participle (‘taken’)
VBP   verb, non-3rd person singular present (‘take’)
VBZ   verb, 3rd person singular present (‘takes’)
WDT   wh-determiner (‘which’)
WP    wh-pronoun (‘who’, ‘what’)
WP$   possessive wh-pronoun (‘whose’)
WRB   wh-adverb (‘where’, ‘when’)
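As a quick usage example, these tags make it easy to filter words by category. Here is a small sketch that keeps only the nouns (every tag starting with ‘NN’) from a tagged sentence:

from nltk import word_tokenize, pos_tag

words = word_tokenize('Estonia is a leading country in digital space.')

# keep words whose tag starts with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in pos_tag(words) if tag.startswith('NN')]
print(nouns)   # ['Estonia', 'country', 'space']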

Stemming and Lemmatization

This step transforms a word into its root word or base form. For example, car, cars, and car’s all share the common root word car. In the field of linguistics, such words are known as words with inflectional endings or derivationally related words. There is a nice post you can refer to if you want to understand more about this and about morphological analysis.

Here we briefly discuss the inflectional and derivational forms of words.

Inflectional forms

These word forms are used to distinguish tense, person, gender, etc. For example, words like go, going, gone, and goes. If you notice, these words have different endings. These are called inflectional endings in the field of linguistics.

Tip

Words with inflectional endings do not have a separate entry in the dictionary. You will find all such word forms under a single entry, i.e., go.

Derivational forms

These word forms are derived from root words and create a new meaning. For example, react and actor are both derived from the word act.

Tip

Words in derivational forms have separate entries in the dictionary. You will find a separate entry in the dictionary for each of act, react, and actor.

Now that we have a preliminary understanding of the different forms of words, we can move to stemming and lemmatization. These are two techniques to transform a word from its inflectional form (and sometimes derivational form) into its base form.

Stemming

Stemming is a technique which simply chops off the ending of a word to obtain its base form, for example, removing ing from eating to obtain the base form eat. By default, NLTK uses a rule-based stemmer (the Porter stemmer). There are other stemmers as well; you can check this page for more information on them.
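For instance, NLTK also provides the Snowball stemmer (sometimes called Porter2), an improved version of the Porter algorithm. A minimal sketch of how the two can differ:

from nltk.stem import SnowballStemmer

# the Snowball stemmer supports several languages; here we use English
snowball = SnowballStemmer('english')

print(snowball.stem('running'))      # run
print(snowball.stem('generously'))   # generous (the Porter stemmer gives 'gener')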

Let’s take a look at the following code, which performs stemming.

from nltk import word_tokenize
from nltk.stem import PorterStemmer

# Initialize the Porter stemmer
ps = PorterStemmer()

# Tokenize the text data
words = word_tokenize('Estonia is a leading country in digital space, and on its way to become the leader.')

# Print each word with its base form after the stemming operation
for word in words:
    print('Word:{:10} Stem:{}'.format(word,ps.stem(word)))
Word:Estonia    Stem:estonia
Word:is         Stem:is
Word:a          Stem:a
Word:leading    Stem:lead
Word:country    Stem:countri
Word:in         Stem:in
Word:digital    Stem:digit
Word:space      Stem:space
Word:,          Stem:,
Word:and        Stem:and
Word:on         Stem:on
Word:its        Stem:it
Word:way        Stem:way
Word:to         Stem:to
Word:become     Stem:becom
Word:the        Stem:the
Word:leader     Stem:leader
Word:.          Stem:.

In the output, we can see that the words leading, country, and digital are transformed into lead, countri, and digit, respectively. Now we will move on to lemmatization.

Lemmatization

Lemmatization is another technique which performs a similar task to stemming, i.e., transforming words into their base forms. However, it differs in its approach: lemmatization uses morphological analysis, that is, an understanding of words and their parts, to achieve the goal.

You can refer to this post to gain more information on different lemmatization approaches in Python.

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Uncomment the statement below if running for the first time
# nltk.download('wordnet')

# Lemmatizer 
wordnet_lemmatizer = WordNetLemmatizer()

# Tokenization
words = word_tokenize('Estonia is a leading country in digital space, and on its way to become the leader.')

for word in words:
    print('Word:{:10} Lemma:{}'.format(word,wordnet_lemmatizer.lemmatize(word)))
Word:Estonia    Lemma:Estonia
Word:is         Lemma:is
Word:a          Lemma:a
Word:leading    Lemma:leading
Word:country    Lemma:country
Word:in         Lemma:in
Word:digital    Lemma:digital
Word:space      Lemma:space
Word:,          Lemma:,
Word:and        Lemma:and
Word:on         Lemma:on
Word:its        Lemma:it
Word:way        Lemma:way
Word:to         Lemma:to
Word:become     Lemma:become
Word:the        Lemma:the
Word:leader     Lemma:leader
Word:.          Lemma:.

If you look at the results, you will notice there are no changes after applying lemmatization. The reason is that if a word cannot be found in WordNet (a publicly available English lexical database), then the word remains unchanged. This can be corrected by providing the POS tag of the word when calling the lemmatize() function.

Now we will supply the POS tag of each word when calling the lemmatize() function. However, the function only takes a single character for the POS tag: n for nouns, v for verbs, a for adjectives, and r for adverbs.

So we need to prepare a mapping which translates the POS tags obtained from the nltk.pos_tag() function into a, r, n, or v (depending on the tag).

We know from the POS tag table above that tags for adjectives start with ‘J’. So what we can do is take the first character of the POS tag and use it to determine which tag to supply to the lemmatize() function.

from nltk import pos_tag

def get_pos(word):
    # the function returns a list with one tagged tuple, e.g., [('riding', 'VBG')]
    tagged_tuple_list = pos_tag([word])
    
    # fetching the first item in the list
    tagged_tuple = tagged_tuple_list[0]
    
    # fetching the tag from the tagged tuple
    tag = tagged_tuple[1]   # this index will fetch the tag (e.g., 'VBG')
    
    # extracting the first character
    tag_char = tag[0]
    
    # all three statements above can be combined into the single statement below
    # tag_char = pos_tag([word])[0][1][0]
    
    # mapping from the first character of a POS tag to a WordNet tag
    pos_to_lemma_tag = {
        'J': 'a',
        'N': 'n',
        'R': 'r',
        'V': 'v'
    }
    
    # return the tag for use in the lemmatize() function
    return pos_to_lemma_tag.get(tag_char,'n')   # get() returns 'n' if the tag is something else
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Lemmatizer 
wordnet_lemmatizer = WordNetLemmatizer()

# Tokenization
words = word_tokenize('Estonia is a leading country in digital space, and on its way to become the leader.')

for word in words:
    print('Word:{:10} Lemma:{}'.format(word,
                                       wordnet_lemmatizer.lemmatize(word,get_pos(word))))
    
Word:Estonia    Lemma:Estonia
Word:is         Lemma:be
Word:a          Lemma:a
Word:leading    Lemma:lead
Word:country    Lemma:country
Word:in         Lemma:in
Word:digital    Lemma:digital
Word:space      Lemma:space
Word:,          Lemma:,
Word:and        Lemma:and
Word:on         Lemma:on
Word:its        Lemma:it
Word:way        Lemma:way
Word:to         Lemma:to
Word:become     Lemma:become
Word:the        Lemma:the
Word:leader     Lemma:leader
Word:.          Lemma:.

And now it works like a charm :-)

Vectorization

Now we will move to the final step in the aforementioned list of preprocessing steps in natural language processing: the vectorization step. This step translates text into numbers so that computers can use it for further analysis.

There are multiple techniques of vectorization. In this post, we are going to discuss two basic techniques: count vectorization and tf-idf vectorization.

Count Vectorization

In this technique, the input text is first broken down into a set of unique words, and then each word is assigned a number representing how many times that word occurs in the text.

Let’s see a working example. For the example, we will use CountVectorizer from scikit-learn.

from sklearn.feature_extraction.text import CountVectorizer

# input data
input_text = ["In this post, we will become familiar with the basics of natural language processing with Python. We will use the NLTK library for the tutorial."] 

# initialization
vect = CountVectorizer()

# applying count vectorization on the text
result = vect.fit_transform(input_text)

# printing the shape and the resulting count vector
print('Shape:',result.shape, '\n Vector:',result.toarray())
Shape: (1, 20) 
 Vector: [[1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 2 2 2]]
Tip

The result is a single vector of length 20. Our input list contained a single document, therefore we got a single vector. In the case of multiple documents, we get one vector per document.
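The columns of the vector correspond to the unique words in the vocabulary. To see which column belongs to which word, you can inspect the fitted vectorizer; a quick sketch, reusing the vect object from above (get_feature_names_out() is available in recent scikit-learn versions):

# words corresponding to each column of the count vector, in order
print(vect.get_feature_names_out())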

TF-IDF

The next vectorization technique is TF-IDF (Term Frequency-Inverse Document Frequency). Let’s understand what these two parts mean.

Term Frequency (TF)

This counts the number of times a term occurs in a document.

Inverse Document Frequency (IDF)

This is the inverse of the fraction of documents that contain the specified term, so terms that appear in fewer documents get a higher weight.
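To make this concrete, here is a small sketch of the IDF formula that scikit-learn’s TfidfVectorizer uses by default (smooth_idf=True); each document vector of tf * idf values is then L2-normalized:

import numpy as np

# scikit-learn's default (smoothed) IDF:
# idf(t) = ln((1 + n_documents) / (1 + document_frequency(t))) + 1
n_documents = 3
document_frequency = 2   # e.g., a term appearing in 2 of the 3 documents

idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1
print(idf)   # ~1.2877, lower than for a term appearing in only 1 document (~1.6931)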

from sklearn.feature_extraction.text import TfidfVectorizer

# input data
input_text = ["This is an amazing field with huge potential of building intelligent machine.",
              "Those machine would be able to transform people's life.",
              "This transformation would significantly improve the quality of life."]
# TF-IDF initialization
tf = TfidfVectorizer()

# applying vectorizer
result = tf.fit_transform(input_text)

# print results
print('Shape:',result.shape, '\n Vector:',result.toarray())
Shape: (3, 25) 
 Vector: [[0.         0.30520733 0.30520733 0.         0.30520733 0.30520733
  0.30520733 0.         0.30520733 0.30520733 0.         0.23211804
  0.23211804 0.         0.30520733 0.         0.         0.
  0.23211804 0.         0.         0.         0.         0.30520733
  0.        ]
 [0.35955412 0.         0.         0.35955412 0.         0.
  0.         0.         0.         0.         0.27345018 0.27345018
  0.         0.35955412 0.         0.         0.         0.
  0.         0.35955412 0.35955412 0.35955412 0.         0.
  0.27345018]
 [0.         0.         0.         0.         0.         0.
  0.         0.36977238 0.         0.         0.28122142 0.
  0.28122142 0.         0.         0.36977238 0.36977238 0.36977238
  0.28122142 0.         0.         0.         0.36977238 0.
  0.28122142]]
You can check this blog post for further information on vectorization: https://www.turing.com/kb/guide-on-word-embeddings-in-nlp

References
1. https://www.learntek.org/blog/categorizing-pos-tagging-nltk-python/
2. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
3. https://www.nltk.org/book/
4. Morphological analysis: https://www.education.vic.gov.au/school/teachers/teachingresources/discipline/english/literacy/readingviewing/Pages/litfocuswordmorph.aspx
5. https://www.datacamp.com/tutorial/stemming-lemmatization-python
6. CountVectorizer in NLP: https://pianalytix.com/countvectorizer-in-nlp/