Guide to Python’s re Module
Regular expression, shortened to “regex”, is a powerful tool for text processing and manipulation. They allow you to search, extract, and replace patterns in strings. In Python, the re module provides access to this powerful functionality, making it an essential tool for any Python developer.
Getting Started with the re Module:
To use the re module, simply import it into your Python code
import re
Once imported, you can utilize various functions like match, search, findall, sub, and compile to work with regular expressions.
print(re.match('a', 'apple'))
# output: <re.Match object; span=(0, 1), match='a'>
print(re.match('b', 'apple'))
# output: None
In the next section, we will delve deeper into the re module’s functionalities and explore practical examples of its applications.
Metacharacters
Metacharacters . ^ $ * + ? { } [ ] \ | ( ) are special characters in regular expressions that have a specific meaning beyond their literal representation. They provide powerful tools for building complex and efficient patterns for text matching and manipulation.
- .
- Matches any single character (except newline)
a.aMATCHES “aaa”, “aba”, “a a”, …
a.adoes NOT MATCHES “abba”, “a\na”, … - ^
- Matches the beginning of the string
^abcMATCHES “abcd”, “abc_”, …
^abcdoes NOT MATCHES “aabc”, “_abc”, … - $
- Matches the end of the string
abc$MATCHES “aabc”, “_abc”, …
abc$does NOT MATCHES “abcd”, “abc_”, … - *
- Matches 0 or more repetitions
ab*cMATCHES “abbc”, “ac”, …
ab*cdoes NOT MATCHES “abdc”, “a c”, … - +
- Matches 1 or more repetitions
ab+cMATCHES “abc”, “abbc”, …
ab+cdoes NOT MATCHES “ac”, “abdc”, … - ?
- Matches 0 or 1 repetitions
ab?cMATCHES “ac”, “abc”, …
ab?cdoes NOT MATCHES “abbc”, “abdc”, … - {m}
- where m is a number
Matches exactly m repetitions
ab{3}cMATCHES “abbbc”, “abbbcde”, …
ab{3}cdoes NOT MATCHES “abbc”, “abb_c”, … - {m,n}
- where m and n are numbers
Matches exactly from m to n repetitions
ab{3,5}cMATCHES “abbbc”, “abbbbbc”, …
ab{3,5}cdoes NOT MATCHES “abbc”, “abbbbbbc”, … - [ ]
- Mathes any characters within the specified range
[abc]MATCHES “a”, “b”, “c”, “az”, …
[a-c]MATCHES “a”, “b”, “c”, “zb”, …
[^a-c]MATCHES any characters EXCEPT “a”, “b”, or “c” - \
- Escapes special characters or signals a special sequence
a\^bMATCHES “a^b”, “a^bc”, …
a\^bdoes NOT MATCHES “a\^b”, … - A|B
- where A and B are expressions
Matches either A or B
ab|cdMATCHES “abc”, “bcd”, …
ab|cddoes NOT MATCHES “aacc”, “bc”, … - ( )
- Matches and captures expression inside parentheses
(bcd)MATCHES “abcd”, “bcde”, …
(bcd)does NOT MATCHES “abc”, “cde”, …
Special Sequences
Beyond basic characters, Python’s re module offers special sequences like \d (digits), \s (whitespace), and \w (words). These tools let you build patterns to match specific text formats, extract data, and manipulate strings with precision.
- \d
- Matches any digit (0-9)
\dMATCHES “1234”, “12 eggs”, …
\ddoes NOT MATCHES “abcd”, “dozen”, … - \D
- Matches any non-digit
\DMATCHES “abcd”, “dozen”, …
\ddoes NOT MATCHES “1234”, “12”, … - \s
- Matches any whitespace character (space, tab, newline)
ab\scdMATCHES “ab cd”, “ab\tcd”, “ab\xa0cd” …
ab\scddoes NOT MATCHES “abcd”, “ab_cd”, … - \S
- Matches any non-whitespace character
a\SbcdMATCHES “abcd”, “a_bcd”, “aabcd” …
a\Sbcddoes NOT MATCHES “a bcd”, “a\tcd”, … - \w
- Matches any word character (a-z, A-Z, 0-9, and )
a\wbMATCHES “aab”, “a5b”, “a_b” …
a\wbdoes NOT MATCHES “a b”, “a+b”, … - \W
- Matches any non-word character
a\WbMATCHES “a@b”, “a b”, “a.b” …
a\Wbdoes NOT MATCHES “aab”, “a_b”, …
Functions
Now, let us look at some of the common Python Regex functions.
- re.search(pattern, string)
- Find the pattern within the string
returns
re.Matchobjectre.search('\d','I am 12 years old') # output: <re.Match object; span=(5, 6), match='1'> - re.match(pattern, string)
- Find the pattern at the beginning of a string
returns
re.Matchobjectre.match('\d','12 years old') # output: <re.Match object; span=(0, 1), match='1'> - re.split(pattern, string)
- Divides a string based on the pattern
returns
listof stringresult = re.split(':','Question:Answer') print(result) # output: ['Question', 'Answer'] - re.findall(pattern, string)
- Extracts all occurrences of the pattern
returns
listof stringresult = re.findall('\w+','#pizza, #coke') print(result) # output: ['pizza', 'coke'] - re.sub(pattern, replace, string)
- Replaces matched patterns with new text
returns
stringresult = re.sub('\d+','2020', 'I was born in 2000') print(result) # output: I was born in 2020 - re.compile(pattern)
- Compile a regular expression pattern
returns
re.Patternobjectpattern = re.compile('\d+') result = pattern.search('I was born in 2000') print(result) # output: <re.Match object; span=(14, 18), match='2000'>
re.Match Objects
- Match.group()
- Returns captured string that matches the pattern inside
( )pattern = re.compile(r'(.*?)@(\w+\.\w+)') email = 'username@company.com' match = pattern.search(email) print(match) # output: <re.Match object; span=(0, 20), match='username@company.com'>From the above example,
matchcaptured two strings from email address. To return the match strings;match.group(0) # The entire match # output: 'username@company.com' match.group(1) # The first subgroup # output: 'username' match.group(2) # The second subgroup # output: 'company.com' match.group(1,2) # Multiple arguments give us a tuple # output: ('username', 'company.com')If the regular expression uses the
(?P<name>...)syntax, the capturedstringcan also be identifed by their group name.pattern = re.compile(r'(?P<name>.*?)@(?P<domain>\w+\.\w+)') email = 'username@company.com' match = pattern.search(email) match.group('name') # output: 'username' match.group('domain') # output: 'company.com' - Boolean
- Match objects always have a boolean value of
Trueif match, andNoneif no match. You can useifstatement with theMatchobjectmatch = re.search('\d','I am 12 years old') if match: print('Match found') else: print('No match') # output: Match found