Natural language processing (NLP) is an interdisciplinary field combining computer science, artificial intelligence, and linguistics. It focuses on the processing of natural languages, i.e., human languages such as English and Hindi, with the aim of comprehending those languages and generating text in them.
Natural language processing thus enables computers to understand and process human languages, and it presents enormous potential for building intelligent machines capable of understanding human input in their own languages.
In this post, we will become familiar with the basics of natural language processing in Python. We will use the NLTK library for the tutorial. The post is targeted at beginners who are just starting to gain first-hand experience with natural language processing.
Front image by Andrea De Santis on Unsplash.
Installation
First, we need to install the NLTK library. The following command can be used to do that.
pip install nltk
Basic concepts
Now, we will get familiar with the basic concepts of natural language processing. This processing takes place through multiple steps. With each step, a higher level of abstraction is achieved. Let’s understand some basic steps first.
When a text is presented to a computer, the computer only sees it as a sequence of characters. So, the first step focuses on breaking this sequence of characters into sentences, and then each sentence into words. This step is known as Tokenization.
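Before reaching for a library, the idea can be sketched in plain Python with regular expressions. This is only a toy illustration; real tokenizers such as NLTK's also handle abbreviations, contractions, and other edge cases:

```python
import re

text = "This is a sentence. Here is another one."

# naive sentence split: break after ., ! or ? followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# naive word split: words and punctuation marks become separate tokens
words = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

print(sentences)
print(words)
```

Note that this sketch would mis-split text like "Dr. Smith", which is exactly why dedicated tokenizers exist.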
Next, the sentence structure is understood from a grammatical point of view. This involves identifying nouns, verbs, objects, etc. This step is known as Parts-of-Speech Tagging.
Once the grammatically relevant tags are identified for each word in the sentence, the next step applies a transformation which replaces words with their base form, for example, transforming running into run. This step is known as Stemming/Lemmatization. We will discuss the differences between these two later.
The final step converts the text data into numbers for the computer’s usage. This step is known as Vectorization.
Now we will cover these topics one by one. The list of topics is provided below as well.
- Tokenization
- Parts of Speech Tagging
- Stemming/Lemmatization
- Vectorization
Tokenization
Let’s start with tokenization, the first step in the process. The tokenization step simply breaks down text data into smaller units, such as sentences, words, and numbers, for analysis purposes. These units are also known as tokens.
The following program performs tokenization, first breaking the text into a group of sentences, and second, breaking each sentence into a group of words.
from nltk import word_tokenize, sent_tokenize
import nltk

# Uncomment the following statement if you get a resource-not-found error
# nltk.download('punkt')

text = """This post offers basics of natural langauge processing (NLP) in Python.
NLP enables computer to human languages. It combines linguistic with statistical modeling.
"""

# Breaking the text data into sentences
sentences = sent_tokenize(text)
print('Sentences:\n', sentences)

# Breaking each sentence into words
words = [word_tokenize(sentence) for sentence in sentences]
print('Words:\n', words)
Sentences:
['This post offers basics of natural langauge processing (NLP) in Python.', 'NLP enables computer to human languages.', 'It combines linguistic with statistical modeling.']
Words:
[['This', 'post', 'offers', 'basics', 'of', 'natural', 'langauge', 'processing', '(', 'NLP', ')', 'in', 'Python', '.'], ['NLP', 'enables', 'computer', 'to', 'human', 'languages', '.'], ['It', 'combines', 'linguistic', 'with', 'statistical', 'modeling', '.']]
You need to run nltk.download('punkt') only once.
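Since a missing resource raises a LookupError, one common pattern is to guard the download so it only happens when needed. A small sketch (the helper name ensure_punkt is made up for illustration):

```python
import nltk

def ensure_punkt():
    # check whether the 'punkt' tokenizer models are already on disk
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        # download only when the resource is missing
        nltk.download('punkt')
```

Calling ensure_punkt() at the top of a script keeps repeated runs from re-downloading the models.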
POS tagging
Let’s now move on to understanding the grammatical structure of sentences. This step involves identifying whether a word in a sentence is a noun, verb, adverb, etc. It is known as Parts-of-Speech (POS) tagging.
Let’s see our first example of tagging.
from nltk import word_tokenize, pos_tag
import nltk

# Uncomment the following statement if you get a tagger-not-found error
# nltk.download('averaged_perceptron_tagger')

# using only word tokenization because there is only one sentence in the text data
words = word_tokenize('Estonia is a leading country in digital space.')

# applying parts-of-speech tagging
tags = pos_tag(words)

print(tags)
[('Estonia', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'JJ'), ('country', 'NN'), ('in', 'IN'), ('digital', 'JJ'), ('space', 'NN'), ('.', '.')]
You can check the meaning of each tag using the nltk.help.upenn_tagset() function.
# checking the meaning of NN
nltk.help.upenn_tagset('NN')
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
The result is provided in the form of a list of tuples. Each tuple contains a word and its corresponding word category (e.g., VBZ for a verb). Such a tuple is also known as a tagged token. You can read about it in more detail here.
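Because each tagged token is an ordinary (word, tag) tuple, the result can be processed with plain Python. As a small sketch, here is how to pull out the nouns from the tagged sentence above (the list is copied from the printed output):

```python
# tagged tokens as printed by pos_tag above
tags = [('Estonia', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'JJ'),
        ('country', 'NN'), ('in', 'IN'), ('digital', 'JJ'), ('space', 'NN'), ('.', '.')]

# all noun tags (NN, NNS, NNP, NNPS) start with 'NN'
nouns = [word for word, tag in tags if tag.startswith('NN')]

print(nouns)  # ['Estonia', 'country', 'space']
```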
The complete list is given below.
Tag | Description |
---|---|
CC | coordinating conjunction |
CD | cardinal digit |
DT | determiner |
EX | existential there (like: “there is” . think of it like “there exists”) |
FW | foreign word |
IN | preposition/subordinating conjunction |
JJ | adjective ‘big’ |
JJR | adjective, comparative ‘bigger’ |
JJS | adjective, superlative ‘biggest’ |
LS | list marker |
MD | modal could, will |
NN | noun, singular ‘desk’ |
NNS | noun plural ‘desks’ |
NNP | proper noun, singular ‘Harrison’ |
NNPS | proper noun, plural ‘Americans’ |
PDT | predeterminer ‘all the kids’ |
POS | possessive ending parent’s |
PRP | personal pronoun I, he, she |
PRP$ | possessive pronoun my, his, hers |
RB | adverb very, silently, |
RBR | adverb, comparative better |
RBS | adverb, superlative best |
RP | particle give up |
TO | to go ‘to’ the store |
UH | interjection, errrrrrrrm |
VB | verb, base form take |
VBD | verb, past tense took |
VBG | verb, gerund/present participle taking |
VBN | verb, past participle taken |
VBP | verb, sing. present, non-3rd person take |
VBZ | verb, third person sing. present takes |
WDT | wh-determiner which |
WP | wh-pronoun who, what |
WP$ | possessive wh-pronoun whose |
WRB | wh-adverb where, when |
Stemming and Lemmatization
This step transforms a word into its root word or base form. For example, car, cars, and car’s all share the common root word car. In the field of linguistics, such words are known as words with inflectional endings or derivationally related words. There is a nice post which you can refer to in order to understand more about this and about morphological analysis.
Here we are briefly discussing inflection and derivational forms of words.
Inflectional forms
These word forms are used to distinguish tenses, person, gender, etc., for example, words like go, going, gone, and goes. If you look closely, these words have different endings; these are called inflectional endings in the field of linguistics.
Words with inflectional endings do not have a separate entry in the dictionary. You will find all of them under a single entry, i.e., go.
Derivational forms
These word forms are derived from root words and create a new meaning. For example, react and actor are both derived from the word act.
Words with derivational forms have a separate entry in the dictionary. You will find a separate entry in the dictionary for act, react, and actor.
Now that we have a preliminary understanding of the different forms of words, we move on to stemming and lemmatization. These are two techniques to transform a word from its inflectional form (and sometimes derivational form) into its base form.
Stemming
Stemming is a technique which simply chops off the ending of a word to obtain its base form, for example, removing ing from eating to obtain the base form, i.e., eat. By default, NLTK uses a rule-based stemmer (the Porter stemmer). There are other stemmers as well; you can check this page for more information on them.
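To illustrate the chopping idea, here is a toy suffix-stripping stemmer in plain Python. This is only a sketch of the general approach, not the Porter algorithm; the suffix list and the minimum-length check are arbitrary choices for illustration:

```python
def toy_stem(word):
    # try a few common inflectional endings, longest first
    for suffix in ('ing', 'ed', 's'):
        # keep at least a 3-character stem so short words stay intact
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for w in ['eating', 'jumped', 'cars', 'go']:
    print(w, '->', toy_stem(w))  # eating -> eat, jumped -> jump, cars -> car, go -> go
```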
Let’s take a look at the following code, which performs stemming.
from nltk import word_tokenize
from nltk.stem import PorterStemmer

# Initialize the Porter stemmer
ps = PorterStemmer()

# Tokenize the text data
words = word_tokenize('Estonia is a leading country in digital space, and on its way to become the leader.')

# Printing each word with its base form after the stemming operation
for word in words:
    print('Word:{:10} Stem:{}'.format(word, ps.stem(word)))
Word:Estonia Stem:estonia
Word:is Stem:is
Word:a Stem:a
Word:leading Stem:lead
Word:country Stem:countri
Word:in Stem:in
Word:digital Stem:digit
Word:space Stem:space
Word:, Stem:,
Word:and Stem:and
Word:on Stem:on
Word:its Stem:it
Word:way Stem:way
Word:to Stem:to
Word:become Stem:becom
Word:the Stem:the
Word:leader Stem:leader
Word:. Stem:.
In the output, we can see that the words leading, country, and digital are transformed into lead, countri, and digit, respectively. Now, we will move on to lemmatization.
Lemmatization
Lemmatization is another technique which performs a similar task to stemming, i.e., transforming words into their base forms. However, it differs in its approach: lemmatization uses morphological analysis, that is, an understanding of words and their parts, to achieve this goal.
You can refer to this post to gain more information on different lemmatization approaches in Python.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Uncomment the following statement if running it for the first time
# nltk.download('wordnet')

# Lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Tokenization
words = word_tokenize('Estonia is a leading country in digital space, and on its way to become the leader.')

for word in words:
    print('Word:{:10} Lemma:{}'.format(word, wordnet_lemmatizer.lemmatize(word)))
Word:Estonia Lemma:Estonia
Word:is Lemma:is
Word:a Lemma:a
Word:leading Lemma:leading
Word:country Lemma:country
Word:in Lemma:in
Word:digital Lemma:digital
Word:space Lemma:space
Word:, Lemma:,
Word:and Lemma:and
Word:on Lemma:on
Word:its Lemma:it
Word:way Lemma:way
Word:to Lemma:to
Word:become Lemma:become
Word:the Lemma:the
Word:leader Lemma:leader
Word:. Lemma:.
If you look at the results, you will notice that there are no changes after applying lemmatization. The reason is that, by default, the lemmatizer treats every word as a noun; if a matching form cannot be found in WordNet (a publicly available English lexical database), the word remains unchanged. This can be corrected by providing the POS tag of the word when calling the lemmatize function.
Now, we will supply the POS tag of each word when calling the lemmatize function. However, the function only takes a single character for the POS tag, for example, n for nouns, v for verbs, a for adjectives, and r for adverbs.
So, we need to prepare a mapping which translates the POS tags obtained from the nltk.pos_tag() function into a, r, n, or v (depending on the tag).
We know from the POS tag table above that the tags for adjectives start with ‘J’. So what we can do is take the first character of the POS tag and determine which tag to supply to the lemmatize() function.
from nltk import pos_tag

def get_pos(word):
    # pos_tag returns a list with one tagged tuple, e.g., [('riding', 'VBG')]
    tagged_tuple_list = pos_tag([word])

    # fetching the item in the list and then the tag in the tuple
    tagged_tuple = tagged_tuple_list[0]  # the first index fetches the first item in the list

    # fetching the tag from the tagged tuple
    tag = tagged_tuple[1]  # this index fetches the tag (e.g., 'VBG')

    # extracting the first character
    tag_char = tag[0]

    # all three statements above can be combined into a single statement:
    # tag_char = pos_tag([word])[0][1][0]

    # mapping from the tag's first character to the lemmatizer's POS codes
    pos_to_lemma_tag = {
        'J': 'a',
        'N': 'n',
        'R': 'r',
        'V': 'v'
    }

    # return the tag for use in the lemmatize function
    return pos_to_lemma_tag.get(tag_char, 'n')  # get() returns 'n' if the tag is something else
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Tokenization
words = word_tokenize('Estonia is a leading country in digital space, and on its way to become the leader.')

for word in words:
    print('Word:{:10} Lemma:{}'.format(word,
          wordnet_lemmatizer.lemmatize(word, get_pos(word))))
Word:Estonia Lemma:Estonia
Word:is Lemma:be
Word:a Lemma:a
Word:leading Lemma:lead
Word:country Lemma:country
Word:in Lemma:in
Word:digital Lemma:digital
Word:space Lemma:space
Word:, Lemma:,
Word:and Lemma:and
Word:on Lemma:on
Word:its Lemma:it
Word:way Lemma:way
Word:to Lemma:to
Word:become Lemma:become
Word:the Lemma:the
Word:leader Lemma:leader
Word:. Lemma:.
This time it worked like a charm :-)
Vectorization
Now, we will move to the final step in the aforementioned list of preprocessing steps in natural language processing: vectorization. This step translates text into numbers so that computers can use them for further analysis.
There are multiple techniques for vectorization. In this post, we are going to discuss two basic techniques: count vectorization and TF-IDF vectorization.
Count Vectorization
In this technique, the input text is first broken down into a set of unique words, and next, each word is assigned a number which represents the frequency of that word.
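The core idea can be sketched in plain Python with collections.Counter before turning to scikit-learn (this simplification ignores scikit-learn's tokenization rules and its alphabetical vocabulary ordering):

```python
from collections import Counter

# a single toy document
doc = "we will use the nltk library and we will use python"

# unique words mapped to their frequencies
counts = Counter(doc.split())

print(counts['we'], counts['use'], counts['python'])  # 2 2 1
```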
Let’s see a working example. For this example, we will use the CountVectorizer class from scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

# input data
input_text = ["In this post, we will become familiar with the basics of natural language processing with Python. We will use the NLTK library for the tutorial."]

# initialization
vect = CountVectorizer()

# applying count vectorization on the text
result = vect.fit_transform(input_text)

# printing the shape and the frequency vector
print('Shape:', result.shape, '\n Vector:', result.toarray())
Shape: (1, 20)
Vector: [[1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 2 2 2]]
The result is a single vector of length 20. In our input, we had only a single document, therefore, we got a single vector. In the case of multiple documents, we get one vector for each document.
TF-IDF
The next vectorization technique is TF-IDF (Term Frequency-Inverse Document Frequency). Let’s understand what these terms mean.
Term Frequency (TF) This counts the number of times a word occurs in a document.
Inverse Document Frequency (IDF) This is the inverse of the number of documents that contain the specified term.
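These two quantities can be combined by hand using the plain textbook formula tf × log(N/df), sketched below on toy pre-tokenized documents. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers will differ from this sketch:

```python
import math

# three toy documents, already tokenized
docs = [["this", "is", "amazing"], ["this", "is", "fine"], ["amazing", "work"]]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term)               # term frequency in this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    return tf * math.log(N / df)       # rare terms get a higher weight

print(round(tfidf("amazing", docs[0]), 3))  # 0.405 -- appears in 2 of 3 documents
print(round(tfidf("work", docs[2]), 3))     # 1.099 -- appears in only 1 document
```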
from sklearn.feature_extraction.text import TfidfVectorizer

# input data
input_text = ["This is an amazing field with huge potential of building intelligent machine.",
              "Those machine would be able to transform people's life.",
              "This transformation would significantly improve the quality of life."]

# TF-IDF initialization
tf = TfidfVectorizer()

# applying the vectorizer
result = tf.fit_transform(input_text)

# print results
print('Shape:', result.shape, '\n Vector:', result.toarray())
Shape: (3, 25)
Vector: [[0. 0.30520733 0.30520733 0. 0.30520733 0.30520733
0.30520733 0. 0.30520733 0.30520733 0. 0.23211804
0.23211804 0. 0.30520733 0. 0. 0.
0.23211804 0. 0. 0. 0. 0.30520733
0. ]
[0.35955412 0. 0. 0.35955412 0. 0.
0. 0. 0. 0. 0.27345018 0.27345018
0. 0.35955412 0. 0. 0. 0.
0. 0.35955412 0.35955412 0.35955412 0. 0.
0.27345018]
[0. 0. 0. 0. 0. 0.
0. 0.36977238 0. 0. 0.28122142 0.
0.28122142 0. 0. 0.36977238 0.36977238 0.36977238
0.28122142 0. 0. 0. 0.36977238 0.
0.28122142]]
You can check [this blog post](https://www.turing.com/kb/guide-on-word-embeddings-in-nlp) for further information on vectorization.
References
1. Categorizing and POS tagging with NLTK: https://www.learntek.org/blog/categorizing-pos-tagging-nltk-python/
2. Stemming and lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
3. NLTK Book: https://www.nltk.org/book/
4. Morphological analysis: https://www.education.vic.gov.au/school/teachers/teachingresources/discipline/english/literacy/readingviewing/Pages/litfocuswordmorph.aspx
5. Stemming and lemmatization in Python: https://www.datacamp.com/tutorial/stemming-lemmatization-python
6. CountVectorizer in NLP: https://pianalytix.com/countvectorizer-in-nlp/#:~:text=CountVectorizer%20means%20breaking%20down%20a,data%20needs%20to%20be%20vectorized.