How to remove ASCII control characters from a string
Recently, I have been working with text extracted from different sources such as DOCX and PDF files. Reading this text into a Python environment often leaves ASCII control characters behind. Removing them frequently gives better results in text cleaning and in Natural Language Processing (NLP) tasks.
What are ASCII control characters?
ASCII control characters are non-printable characters within the ASCII character set that are designed to control devices or data flow rather than represent visible symbols. There are 33 of them in total: code points 0 to 31, plus 127. While some control characters are still used in modern computing (like tabs, line feeds, and carriage returns), many are obsolete due to advancements in device control and data handling.
The ASCII control character chart, from Wikipedia, is below.
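To make the definition concrete, here is a minimal sketch that builds the set of 33 control characters and reports where they occur in a string. The names `ASCII_CONTROL_CHARS` and `find_ctrl_chars` are made up for this illustration and do not come from any library.

```python
# The 33 ASCII control characters: code points 0-31 plus 127 (DEL).
ASCII_CONTROL_CHARS = frozenset(chr(i) for i in range(32)) | {chr(127)}


def find_ctrl_chars(text):
    """Return (index, code point) pairs for each ASCII control character in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ch in ASCII_CONTROL_CHARS]


print(len(ASCII_CONTROL_CHARS))      # 33
print(find_ctrl_chars('a\tb\x07c'))  # [(1, 9), (3, 7)]
```

Running something like this on freshly extracted text is a quick way to see which control characters are actually present before deciding what to strip.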
| Dec | Hex | Abbreviation | Escape sequence | Name |
|---|---|---|---|---|
| 0 | 00 | NUL | \0 | Null |
| 1 | 01 | SOH | | Start of Heading |
| 2 | 02 | STX | | Start of Text |
| 3 | 03 | ETX | | End of Text |
| 4 | 04 | EOT | | End of Transmission |
| 5 | 05 | ENQ | | Enquiry |
| 6 | 06 | ACK | | Acknowledgement |
| 7 | 07 | BEL | \a | Bell |
| 8 | 08 | BS | \b | Backspace |
| 9 | 09 | HT | \t | Horizontal Tab |
| 10 | 0A | LF | \n | Line Feed |
| 11 | 0B | VT | \v | Vertical Tab |
| 12 | 0C | FF | \f | Form Feed |
| 13 | 0D | CR | \r | Carriage Return |
| 14 | 0E | SO | | Shift Out |
| 15 | 0F | SI | | Shift In |
| 16 | 10 | DLE | | Data Link Escape |
| 17 | 11 | DC1 | | Device Control 1 (often XON) |
| 18 | 12 | DC2 | | Device Control 2 |
| 19 | 13 | DC3 | | Device Control 3 (often XOFF) |
| 20 | 14 | DC4 | | Device Control 4 |
| 21 | 15 | NAK | | Negative Acknowledgement |
| 22 | 16 | SYN | | Synchronous Idle |
| 23 | 17 | ETB | | End of Transmission Block |
| 24 | 18 | CAN | | Cancel |
| 25 | 19 | EM | | End of Medium |
| 26 | 1A | SUB | | Substitute |
| 27 | 1B | ESC | \e | Escape |
| 28 | 1C | FS | | File Separator |
| 29 | 1D | GS | | Group Separator |
| 30 | 1E | RS | | Record Separator |
| 31 | 1F | US | | Unit Separator |
| 127 | 7F | DEL | | Delete |
How to remove them
We can remove all of these control characters using Python's built-in re module.
import re

def remove_ctrl_chars(text):
    """Removes all ASCII control characters from a string."""
    return re.sub(r'[\x00-\x1F\x7F]', '', text)
The regular expression breaks down as follows:
\x00-\x1F: Matches any character with a hexadecimal code between 0x00 (NUL) and 0x1F (US), covering the first 32 control characters.
\x7F: Matches the DEL character (0x7F), the last control character in the ASCII table.
text = '1.\tThis is the first line of \x07string.\n\x0C2.\tThis is the \x08second line of string.'
cleaned_text = remove_ctrl_chars(text)
print(cleaned_text)
# output:
# 1.This is the first line of string.2.This is the second line of string.
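The same result can be had without a regular expression. The sketch below uses the built-in str.translate with a table that maps each control code point to None, which deletes it. The names `_CTRL_TABLE` and `remove_ctrl_chars_translate` are hypothetical, chosen only for this example.

```python
# Map every ASCII control code point (0-31 and 127) to None.
# str.translate deletes any character whose code point maps to None.
_CTRL_TABLE = dict.fromkeys(list(range(32)) + [127])


def remove_ctrl_chars_translate(text):
    """Removes all ASCII control characters from a string using str.translate."""
    return text.translate(_CTRL_TABLE)


text = '1.\tThis is the first line of \x07string.\n\x0C2.\tThis is the \x08second line of string.'
print(remove_ctrl_chars_translate(text))
# output:
# 1.This is the first line of string.2.This is the second line of string.
```

Which variant to use is mostly a matter of taste; the regex version is easier to adjust by editing the character class, while the translate version avoids the re module entirely.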
Optionally, we can preserve some characters, such as tab (\t) and newline (\n).
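import re

def remove_ctrl_chars_except_tab_newline(text):
    """Removes all ASCII control characters except tab and newline from a string."""
    return re.sub(r'[\x00-\x1F\x7F](?<![\x0A\x09])', '', text)

Here, \x09 and \x0A match the tab and newline characters. The negative lookbehind (?<![\x0A\x09]) is checked right after a control character has been matched and rejects the match if that character is a tab or a newline, so those two are left in place. Note that the carriage return (\x0D) is still removed.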
text = '1.\tThis is the first line of \x07string.\n\x0C2.\tThis is the \x08second line of string.'
cleaned_text = remove_ctrl_chars_except_tab_newline(text)
print(cleaned_text)
# output:
# 1. This is the first line of string.
# 2. This is the second line of string.
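If the set of characters to preserve varies between inputs, the character class can be generated instead of hard-coded. The following is a rough sketch under that assumption; `build_ctrl_pattern`, `remove_ctrl_chars_keeping`, and the `keep` parameter are hypothetical names introduced here, not part of the re module.

```python
import re


def build_ctrl_pattern(keep='\t\n'):
    """Compile a character class matching every ASCII control character not in keep."""
    code_points = [cp for cp in list(range(32)) + [127] if chr(cp) not in keep]
    if not code_points:
        return None  # nothing left to remove
    # Spell each character as a \xNN escape so the pattern stays readable and safe.
    return re.compile('[' + ''.join(f'\\x{cp:02x}' for cp in code_points) + ']')


def remove_ctrl_chars_keeping(text, keep='\t\n'):
    """Removes ASCII control characters from text, preserving those listed in keep."""
    pattern = build_ctrl_pattern(keep)
    return text if pattern is None else pattern.sub('', text)


# Keep carriage returns as well, e.g. for text with Windows line endings.
print(repr(remove_ctrl_chars_keeping('a\r\nb\x07', keep='\t\n\r')))  # 'a\r\nb'
```

With the default keep value this should behave like the lookbehind version above, just with the excluded characters left out of the class rather than filtered after the match.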