
Recently, I have been working with text extracted from different sources such as DOCX and PDF files. Reading that text into Python often leaves ASCII control characters behind. Removing them often gives better results in text cleaning and Natural Language Processing (NLP) tasks.

What are ASCII control characters?

ASCII control characters are non-printable characters within the ASCII character set that are designed to control devices or data flow rather than represent visible symbols. There are 33 of them in total: codes 0 to 31, plus 127. While some control characters are still used in modern computing (like tabs, line feeds, and carriage returns), many are obsolete due to advancements in device control and data handling.
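
To make the code ranges concrete, here is a minimal sketch that builds the set of control characters in Python and counts them. The names `control_codes` and `control_chars` are only illustrative and are not used elsewhere in this post.

```python
# Build the ASCII control characters: codes 0-31 plus 127 (DEL).
control_codes = list(range(0x00, 0x20)) + [0x7F]
control_chars = [chr(code) for code in control_codes]

print(len(control_chars))  # 33

# repr() shows each character in its escaped form, because printing a
# control character directly would not display anything readable.
for code in (0, 9, 10, 13, 127):
    print(code, hex(code), repr(chr(code)))
```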

The ASCII control character chart from Wikipedia is shown below.

Dec Hex Abbreviation Escape sequence Name
0 00 NUL \0 Null
1 01 SOH   Start of Heading
2 02 STX   Start of Text
3 03 ETX   End of Text
4 04 EOT   End of Transmission
5 05 ENQ   Enquiry
6 06 ACK   Acknowledgement
7 07 BEL \a Bell
8 08 BS \b Backspace
9 09 HT \t Horizontal Tab
10 0A LF \n Line Feed
11 0B VT \v Vertical Tab
12 0C FF \f Form Feed
13 0D CR \r Carriage Return
14 0E SO   Shift Out
15 0F SI   Shift In
16 10 DLE   Data Link Escape
17 11 DC1   Device Control 1 (often XON)
18 12 DC2   Device Control 2
19 13 DC3   Device Control 3 (often XOFF)
20 14 DC4   Device Control 4
21 15 NAK   Negative Acknowledgement
22 16 SYN   Synchronous Idle
23 17 ETB   End of Transmission Block
24 18 CAN   Cancel
25 19 EM   End of Medium
26 1A SUB   Substitute
27 1B ESC \e Escape
28 1C FS   File Separator
29 1D GS   Group Separator
30 1E RS   Record Separator
31 1F US   Unit Separator
127 7F DEL   Delete

How to remove them

We can remove all of these control characters using Python's built-in re module.

import re

def remove_ctrl_chars(text):
    """Removes all ASCII control characters from a string."""
    return re.sub(r'[\x00-\x1F\x7F]', '', text)

How the regular expression works:

  • \x00-\x1F: Matches any character with a hexadecimal code between 0x00 (NUL) and 0x1F (US), covering the first 32 control characters.
  • \x7F: Matches the DEL character (0x7F), the last control character in the ASCII table.

For example:

text = '1.\tThis is the first line of \x07string.\n\x0C2.\tThis is the \x08second line of string.'
cleaned_text = remove_ctrl_chars(text)
print(cleaned_text)
# output: 
# 1.This is the first line of string.2.This is the second line of string.
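
As an alternative to the regex, the same removal can be sketched with `str.translate`, which deletes every character whose code point maps to `None`. This is a different technique from the one above, and the names `CTRL_TABLE` and `remove_ctrl_chars_translate` are hypothetical, chosen only for this example.

```python
# Map every ASCII control code point (0-31 and 127) to None,
# so that str.translate() drops those characters.
CTRL_TABLE = dict.fromkeys(list(range(0x20)) + [0x7F])

def remove_ctrl_chars_translate(text):
    """Removes all ASCII control characters using str.translate (no regex)."""
    return text.translate(CTRL_TABLE)

text = '1.\tThis is the first line of \x07string.\n\x0C2.\tThis is the \x08second line of string.'
print(remove_ctrl_chars_translate(text))
```

Because the table is built once, this version may be convenient when the same cleanup is applied to many strings.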

Optionally, we can preserve some characters, such as tab (\t) and newline (\n).

import re

def remove_ctrl_chars_except_tab_newline(text):
    """Removes all ASCII control characters except tab and newline from a string."""
    return re.sub(r'[\x00-\x1F\x7F](?<![\x0A\x09])', '', text)

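Here, \x09 and \x0A match the tab and newline characters. The negative lookbehind (?<![\x0A\x09]) is evaluated right after a control character has been matched, so it inspects that same character and rejects the match when it is a tab or a newline.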

text = '1.\tThis is the first line of \x07string.\n\x0C2.\tThis is the \x08second line of string.'
cleaned_text = remove_ctrl_chars_except_tab_newline(text)
print(cleaned_text)
# output: 
# 1.    This is the first line of string. 
# 2.    This is the second line of string.
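
The lookbehind pattern above can also be written as a plain character class that simply skips \x09 and \x0A. The sketch below shows that form. The names `CTRL_EXCEPT_TAB_NEWLINE` and `remove_ctrl_chars_keep_tab_newline` are hypothetical, and the version is offered as an equivalent alternative rather than a replacement.

```python
import re

# \x00-\x08 and \x0B-\x1F cover every control code except
# \x09 (tab) and \x0A (newline); \x7F is DEL.
CTRL_EXCEPT_TAB_NEWLINE = re.compile(r'[\x00-\x08\x0B-\x1F\x7F]')

def remove_ctrl_chars_keep_tab_newline(text):
    """Removes ASCII control characters but keeps tab and newline."""
    return CTRL_EXCEPT_TAB_NEWLINE.sub('', text)

text = '1.\tThis is the first line of \x07string.\n\x0C2.\tThis is the \x08second line of string.'
cleaned = remove_ctrl_chars_keep_tab_newline(text)
print(repr(cleaned))  # repr() makes the preserved \t and \n visible
```

Note that the carriage return (\x0D) is still removed by both versions. Keeping it as well would mean excluding it from the character class in the same way.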
