4 minute read

Regular expression, shortened to “regex”, is a powerful tool for text processing and manipulation. They allow you to search, extract, and replace patterns in strings. In Python, the re module provides access to this powerful functionality, making it an essential tool for any Python developer.

Getting Started with the re Module:

To use the re module, simply import it into your Python code

import re

Once imported, you can utilize various functions like match, search, findall, sub, and compile to work with regular expressions.

print(re.match('a', 'apple'))
# output: <re.Match object; span=(0, 1), match='a'>

print(re.match('b', 'apple'))
# output: None

In the next section, we will delve deeper into the re module’s functionalities and explore practical examples of its applications.

Metacharacters

Metacharacters . ^ $ * + ? { } [ ] \ | ( ) are special characters in regular expressions that have a specific meaning beyond their literal representation. They provide powerful tools for building complex and efficient patterns for text matching and manipulation.

.
Matches any single character (except newline)
a.a MATCHES “aaa”, “aba”, “a a”, …
a.a does NOT MATCHES “abba”, “a\na”, …
^
Matches the beginning of the string
^abc MATCHES “abcd”, “abc_”, …
^abc does NOT MATCHES “aabc”, “_abc”, …
$
Matches the end of the string
abc$ MATCHES “aabc”, “_abc”, …
abc$ does NOT MATCHES “abcd”, “abc_”, …
*
Matches 0 or more repetitions
ab*c MATCHES “abbc”, “ac”, …
ab*c does NOT MATCHES “abdc”, “a c”, …
+
Matches 1 or more repetitions
ab+c MATCHES “abc”, “abbc”, …
ab+c does NOT MATCHES “ac”, “abdc”, …
?
Matches 0 or 1 repetitions
ab?c MATCHES “ac”, “abc”, …
ab?c does NOT MATCHES “abbc”, “abdc”, …
{m}
where m is a number
Matches exactly m repetitions
ab{3}c MATCHES “abbbc”, “abbbcde”, …
ab{3}c does NOT MATCHES “abbc”, “abb_c”, …
{m,n}
where m and n are numbers
Matches exactly from m to n repetitions
ab{3,5}c MATCHES “abbbc”, “abbbbbc”, …
ab{3,5}c does NOT MATCHES “abbc”, “abbbbbbc”, …
[ ]
Mathes any characters within the specified range
[abc] MATCHES “a”, “b”, “c”, “az”, …
[a-c] MATCHES “a”, “b”, “c”, “zb”, …
[^a-c] MATCHES any characters EXCEPT “a”, “b”, or “c”
\
Escapes special characters or signals a special sequence
a\^b MATCHES “a^b”, “a^bc”, …
a\^b does NOT MATCHES “a\^b”, …
A|B
where A and B are expressions
Matches either A or B
ab|cd MATCHES “abc”, “bcd”, …
ab|cd does NOT MATCHES “aacc”, “bc”, …
( )
Matches and captures expression inside parentheses
(bcd) MATCHES “abcd”, “bcde”, …
(bcd) does NOT MATCHES “abc”, “cde”, …

Special Sequences

Beyond basic characters, Python’s re module offers special sequences like \d (digits), \s (whitespace), and \w (words). These tools let you build patterns to match specific text formats, extract data, and manipulate strings with precision.

\d
Matches any digit (0-9)
\d MATCHES “1234”, “12 eggs”, …
\d does NOT MATCHES “abcd”, “dozen”, …
\D
Matches any non-digit
\D MATCHES “abcd”, “dozen”, …
\d does NOT MATCHES “1234”, “12”, …
\s
Matches any whitespace character (space, tab, newline)
ab\scd MATCHES “ab cd”, “ab\tcd”, “ab\xa0cd” …
ab\scd does NOT MATCHES “abcd”, “ab_cd”, …
\S
Matches any non-whitespace character
a\Sbcd MATCHES “abcd”, “a_bcd”, “aabcd” …
a\Sbcd does NOT MATCHES “a bcd”, “a\tcd”, …
\w
Matches any word character (a-z, A-Z, 0-9, and )
a\wb MATCHES “aab”, “a5b”, “a_b
” …
a\wb does NOT MATCHES “a b”, “a+b”, …
\W
Matches any non-word character
a\Wb MATCHES “a@b”, “a b”, “a.b” …
a\Wb does NOT MATCHES “aab”, “a_b”, …

Functions

Now, let us look at some of the common Python Regex functions.

re.search(pattern, string)
Find the pattern within the string returns re.Match object
re.search('\d','I am 12 years old')
# output: <re.Match object; span=(5, 6), match='1'>
re.match(pattern, string)
Find the pattern at the beginning of a string returns re.Match object
re.match('\d','12 years old')
# output: <re.Match object; span=(0, 1), match='1'>
re.split(pattern, string)
Divides a string based on the pattern returns list of string
result = re.split(':','Question:Answer')
print(result)
# output: ['Question', 'Answer']
re.findall(pattern, string)
Extracts all occurrences of the pattern returns list of string
result = re.findall('\w+','#pizza, #coke')
print(result)
# output: ['pizza', 'coke']
re.sub(pattern, replace, string)
Replaces matched patterns with new text returns string
result = re.sub('\d+','2020', 'I was born in 2000')
print(result)
# output: I was born in 2020
re.compile(pattern)
Compile a regular expression pattern returns re.Pattern object
pattern = re.compile('\d+')
result = pattern.search('I was born in 2000')
print(result)
# output: <re.Match object; span=(14, 18), match='2000'>

re.Match Objects

Match.group()
Returns captured string that matches the pattern inside ( )
pattern = re.compile(r'(.*?)@(\w+\.\w+)')
email = 'username@company.com'
match = pattern.search(email)
print(match)
# output: <re.Match object; span=(0, 20), match='username@company.com'>

From the above example, match captured two strings from email address. To return the match strings;

match.group(0)        # The entire match
# output: 'username@company.com'
match.group(1)        # The first subgroup
# output: 'username'
match.group(2)       # The second subgroup
# output: 'company.com'  
match.group(1,2)      # Multiple arguments give us a tuple
# output: ('username', 'company.com')

If the regular expression uses the (?P<name>...) syntax, the captured string can also be identifed by their group name.

pattern = re.compile(r'(?P<name>.*?)@(?P<domain>\w+\.\w+)')
email = 'username@company.com'
match = pattern.search(email)
match.group('name')
# output: 'username'
match.group('domain')
# output: 'company.com' 
Boolean
Match objects always have a boolean value of True if match, and None if no match. You can use if statement with the Match object
match = re.search('\d','I am 12 years old')
if match:
  print('Match found')
else:
  print('No match')
# output: Match found

Updated: