Introduction to Regular Expressions

This post explains the basics of regular expressions. A regular expression (regex or regexp for short) is a pattern that describes a set of strings, and it is a powerful tool for searching and manipulating text. If you are planning a career in natural language processing, regular expressions are a must-have skill.

Example: Searching a Word document with Ctrl+F is the simplest kind of pattern matching. The text you type is matched literally, which is exactly what the most basic regular expression does.

Let’s take an example to understand this further.

text = "abc abcd abcde"

We want to check whether the string abc is present in our text data. To do that, we will use the most basic form of a regular expression, which specifies the search query as the literal text itself. Since we want to search for abc, we simply use abc as our regular expression.

The following code shows how to search for abc in our text data. We will use Python's built-in re module.

import re

# Search for the pattern 'abc' in text. The r prefix makes the pattern a raw string,
# so backslashes reach the re module unchanged.
result = re.findall(r'abc', text)

print(result)

['abc', 'abc', 'abc']

We can see in the result above that there were three occurrences of the pattern `abc`. We have now seen the most basic form of a regular expression (i.e., simply using the literal text).
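As a small aside, a literal pattern behaves much like a plain substring search. The sketch below (the variable names `matches` and `starts` are illustrative, not from the original example) compares re.findall with str.count and uses re.finditer to show where each match begins.

```python
import re

text = "abc abcd abcde"

# A literal pattern finds every non-overlapping occurrence of the substring.
matches = re.findall(r'abc', text)
print(matches)              # ['abc', 'abc', 'abc']

# For a plain literal, the number of matches equals str.count.
print(text.count('abc'))    # 3

# re.finditer yields match objects, which also carry the match position.
starts = [m.start() for m in re.finditer(r'abc', text)]
print(starts)               # [0, 4, 9]
```

For purely literal searches, plain string methods are often enough; regular expressions pay off once the pattern needs to describe more than one fixed string.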

Now, we will move on to a more advanced form of regular expressions. Let's change our text data to the following:

text = "abc abcd abcde bcd apple ddeffe eef ggh"

Let’s now search for all three-character words (letters only, no numbers). Our first approach, specifying the word itself, only works if we already know every three-character word in the text data. However, we may not necessarily know all of them.

How do we search for three-character-long words in the text?

To answer this, we will use the special character ., which matches any single character (except a newline, by default). For our search, we can specify . three times to match any three characters.

Let’s apply it first to see how it works.

import re

text = "abc abcd abcde bcd apple ddeffe eef ggh"

result = re.findall(r'...', text)

print(result)

['abc', ' ab', 'cd ', 'abc', 'de ', 'bcd', ' ap', 'ple', ' dd', 'eff', 'e e', 'ef ', 'ggh']

What went wrong? The results are not what we expected (i.e., all three-character-long words). The reason is that the text is treated as a sequence of characters, and characters are not limited to letters and digits. Blank spaces are characters too, so . matches any single character, including a blank space.
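To make this behaviour concrete, here is a minimal sketch using a made-up string (`sample` is an illustrative name). It shows that . matches a space, skips a newline by default, and matches the newline too when the re.DOTALL flag is passed.

```python
import re

sample = "a b\nc"

# '.' matches letters and the space, but skips the newline by default.
default_matches = re.findall(r'.', sample)
print(default_matches)   # ['a', ' ', 'b', 'c']

# With re.DOTALL, '.' matches the newline as well.
dotall_matches = re.findall(r'.', sample, re.DOTALL)
print(dotall_matches)    # ['a', ' ', 'b', '\n', 'c']
```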

To correct this, we will use another special sequence, \b, which matches a word boundary: the zero-width position between a word character and a non-word character (in our case, the position between a word and the adjacent blank space, or the start or end of the text).

Now, we will change our regular expression to \b...\b, which matches three-character-long words that stand on their own. Let’s look at the results.

import re

text = "abc abcd abcde bcd apple ddeffe eef ggh"

result = re.findall(r'\b...\b', text)

print(result)

['abc', 'bcd', 'eef', 'ggh']

It now works as expected.
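One caveat is worth a quick sketch. Because . matches any character, \b...\b can also match a three-character span that contains a space, as long as it starts and ends at a word boundary. A stricter variant, offered here as a suggestion rather than as part of the original example, is \b\w{3}\b, which requires exactly three word characters. The string `tricky` is made up for illustration.

```python
import re

tricky = "abc a b bcd"

# ' a ' starts and ends at a word boundary, so the dot-based pattern accepts it.
loose = re.findall(r'\b...\b', tricky)
print(loose)    # ['abc', ' a ', 'bcd']

# \w{3} only accepts three word characters, so spaces can no longer sneak in.
strict = re.findall(r'\b\w{3}\b', tricky)
print(strict)   # ['abc', 'bcd']

# On the text from the example above, both patterns agree.
text = "abc abcd abcde bcd apple ddeffe eef ggh"
print(re.findall(r'\b\w{3}\b', text))   # ['abc', 'bcd', 'eef', 'ggh']
```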

Like the special characters we have seen so far, other characters also have special meanings. These characters make it easier to create search patterns. The most commonly used ones are listed below.

Pattern	Description
[abc]	Matches a single character among a,b,c
[^abc]	Matches a single character except a or b or c
[a-z]	Matches a single character in a-z
[^a-z]	Matches a single character except a-z
.	Matches any single character except a newline
`\d`	Matches any single digit
`\w`	Matches any single word character (i.e., a-z, A-Z, 0-9, or _; in Python 3, Unicode letters and digits also match by default)
`\s`	Matches any white space character (i.e., space, tab, new line)
`\b`	Matches word boundary
^	Matches start of string
$	Matches end of string
a?	Matches zero or one occurrence of character a
a+	Matches one or more occurrences of character a
a*	Matches zero or more occurrences of character a
a{4}	Matches exactly four occurrences of character a
a{2,4}	Matches two to four occurrences of character a
a{2,}	Matches two or more occurrences of character a
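
A few of the patterns in the table can be tried out as follows. This is a minimal sketch: the sample strings and the names `examples` and `results` are made up for illustration.

```python
import re

# (pattern, sample text) pairs; the sample strings are made up for illustration.
examples = [
    (r'[bc]at',  "cat bat rat"),    # character set        -> ['cat', 'bat']
    (r'[^bc]at', "cat bat rat"),    # negated set          -> ['rat']
    (r'colou?r', "color colour"),   # optional character   -> ['color', 'colour']
    (r'a+',      "caaat at"),       # one or more          -> ['aaa', 'a']
    (r'\d{4}',   "2023-05-17"),     # exact repetition     -> ['2023']
    (r'^\w+',    "hello world"),    # start of string      -> ['hello']
    (r'\w+$',    "hello world"),    # end of string        -> ['world']
]

# Run every pattern against its sample and collect the matches.
results = {pattern: re.findall(pattern, sample) for pattern, sample in examples}

for pattern, found in results.items():
    print(f"{pattern!r:12} -> {found}")
```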

Examples

Let’s now see some examples of using regular expressions.

Example-1: Find all the numbers in the text below.

import re

text = 'Apple Bus 123 Air 34 Data 33 45Egg'

# \d matches any digit, and + matches one or more occurrences of the preceding pattern (i.e., a digit)
results = re.findall(r'\d+', text)

# print results
print(results)

['123', '34', '33', '45']

Example-2: Match standalone numbers only.

Remember to use \b in the expression if you want to extract only standalone numbers, not those that are part of a longer string (e.g., 45 in 45Egg).

import re

text = 'Apple Bus 123 Air 34 Data 33 45Egg'

# \b on each side requires a word boundary, so digits attached to letters (e.g., 45 in 45Egg) are skipped
results = re.findall(r'\b\d+\b', text)

# print results
print(results)

['123', '34', '33']
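
As a possible follow-up, the two patterns can be compared side by side, and the standalone numbers converted to integers. This is only a sketch: the names `all_numbers`, `standalone`, `attached`, and `total` are illustrative, and the value-based comparison is a simplification that works for this sample because no number appears both standalone and attached.

```python
import re

text = 'Apple Bus 123 Air 34 Data 33 45Egg'

all_numbers = re.findall(r'\d+', text)       # every run of digits
standalone = re.findall(r'\b\d+\b', text)    # only digits that form a whole token

# Numbers found without \b but not with it are attached to other characters.
# (Compared by value, which is fine for this sample.)
attached = [n for n in all_numbers if n not in standalone]
print(attached)   # ['45']

# findall returns strings, so convert before doing arithmetic.
total = sum(int(n) for n in standalone)
print(total)      # 190
```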
