Text Normalization — NLP Basics — part 4 of 10
There’s no denying that NLP is one of the hottest fields of AI these days. Especially with LLMs, the field is advancing rapidly.
To understand the basics of NLP, I started a series of 10 articles dedicated to basic concepts.
Today, we will explore a crucial step in the overall NLP process: text normalization.
What is Text Normalization?
Text normalization is a key step in NLP that cleans and preprocesses raw text into a usable, standard, and “less-random” format.
It involves various techniques such as lowercasing, special character removal, and stop word removal.
Here’s a sample text:
Dealing with today's fast-paced lifestyle isn't easy - it's full of challenges
and surprises. We're constantly bombarded with information from all directions,
making it hard to focus on what truly matters. However, by taking small steps
towards mindfulness and self-care, we can navigate through life's ups and downs
with more ease and grace. It's important to remember that self-care isn't
selfish; it's a necessity for maintaining our overall well-being and happiness.
Here’s the same text but normalized:
dealing today fast pace lifestyle easy full challenge surprise constantly
bombard information direction making hard focus truly matter however taking
small step towards mindfulness self care navigate life up down ease grace
important remember self care selfish necessity maintaining overall wellbeing
happiness
In this article, I will explain why and how to get from the original to the normalized text.
Why do we need text normalization?
Here are two main reasons why we need text normalization:
1. Reduces complexity:
Human language is full of complexities such as slang, abbreviations, and different grammatical forms of the same word.
Text normalization helps reduce these complexities by transforming the text into a standard and consistent format.
2. Improves efficiency:
By reducing the number of unique forms that a word can take, text normalization improves the efficiency of NLP models.
For instance, a model doesn’t need to learn the difference between “play” and “playing” if it understands they both convey the same core meaning.
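To make this concrete, here’s a tiny illustration (my own, using NLTK’s PorterStemmer, which is introduced in the stemming section below):

from nltk.stem import PorterStemmer

words = ["play", "plays", "played", "playing"]
print(len(set(words)))  # 4 unique forms before normalization

stemmer = PorterStemmer()
print(len({stemmer.stem(w) for w in words}))  # 1 unique form ("play") after stemming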
Techniques of text normalization:
Following are some of the main techniques used for text normalization:
1. Lowercasing:
Lowercasing is a technique that transforms all text into lowercase so that variants like “Apple”, “APPLE”, and “apple” share a single, standard form.
Here’s a simple function that implements lowercasing in Python:
def lowercase_text(text):
    """
    This function takes text and returns the text in lowercase
    """
    return text.lower()
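For example, with a made-up string:

print(lowercase_text("Text Normalization Is IMPORTANT!"))
# Output: text normalization is important!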
2. Removing punctuation:
There are cases where we need to get rid of punctuation.
For example, if your word embedding vocabulary doesn’t include special characters, you need to remove them from the text.
Here’s a short function that implements punctuation removal:
import string

punctuations = list(string.punctuation)

def remove_punctuations(text, punctuations):
    for punctuation in punctuations:
        if punctuation in text:
            text = text.replace(punctuation, '')
    return text.strip()
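A quick check with a made-up sentence:

print(remove_punctuations("Hello, world! Isn't NLP fun?", punctuations))
# Output: Hello world Isnt NLP fun

Note that removing the apostrophe turns “Isn’t” into “Isnt”, which is one reason to expand contractions (technique 5 below) before removing punctuation.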
3. Stemming & lemmatization:
Stemming and lemmatization are techniques that reduce a word to its base form.
For example, “playing”, “played”, and “plays” are all reduced to “play”, converting all these forms to a single standard format.
Here’s Python code that implements stemming with NLTK:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
sentence = "The quick brown foxes are jumping over the lazy dogs"
words = word_tokenize(sentence)
for word in words:
    print(f"{word}: {stemmer.stem(word)}")
The: the
quick: quick
brown: brown
foxes: fox
are: are
jumping: jump
over: over
the: the
lazy: lazi
dogs: dog
Here’s Python code that implements lemmatization with spaCy:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "The quick brown foxes are jumping over the lazy dogs"
text = nlp(text)
lemmatized_tokens = [token.lemma_ for token in text]
for original, lemmatized in zip(text, lemmatized_tokens):
    print(str(original) + ": " + lemmatized)
The: the
quick: quick
brown: brown
foxes: fox
are: be
jumping: jump
over: over
the: the
lazy: lazy
dogs: dog
If you want to learn more about stemming and lemmatization, here’s a detailed article.
4. Stop word removal:
For a variety of NLP tasks, words like “are”, “the”, “an”, or “on” do not carry any useful information.
Hence, we remove these stop words to improve efficiency and reduce complexity.
Here’s a sample python function that accomplishes this:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_stopwords(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence)
    filtered_sentence = [word for word in word_tokens if word.lower() not in stop_words]
    return ' '.join(filtered_sentence)
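For example, reusing the sentence from the stemming section:

print(remove_stopwords("The quick brown foxes are jumping over the lazy dogs"))
# Output: quick brown foxes jumping lazy dogs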
Do you think we should remove stop words in all cases? Follow this link to find out.
5. Expanding contractions:
Contractions are words like “I’m”, “we’re”, or “doesn’t”.
They are short ways of writing “I am”, “we are”, and “does not”, respectively.
There are two main reasons why we should expand such contractions:
- A computer doesn’t understand that “I’m” and “I am” mean the same thing.
- Contractions increase the dimensionality of the document-term matrix, since “I’m” and “I am” end up in separate columns.
Here’s a Python function that expands contractions using the contractions library (install it first with pip install contractions):
import contractions

def expand_contractions(text):
    expanded_text = []
    for word in text.split():
        expanded_text.append(contractions.fix(word))
    return ' '.join(expanded_text)
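A quick check with a made-up sentence:

print(expand_contractions("I'm sure we're ready, but it doesn't matter"))
# Output: I am sure we are ready, but it does not matter

Finally, here’s a minimal sketch of how these techniques can be chained into one pipeline, reusing the functions defined above. The combination and ordering are my own suggestion, and the exact output depends on the tokenizer, lemmatizer, and stop word list you use, so it comes close to (but may not exactly match) the normalized example at the top of this article:

import spacy
from nltk.corpus import stopwords

nlp = spacy.load('en_core_web_sm')
stop_words = set(stopwords.words('english'))

def normalize(text):
    # Expand contractions first so apostrophes don't glue words together,
    # then lowercase everything
    text = lowercase_text(expand_contractions(text))
    # Lemmatize with spaCy, then keep only alphabetic tokens that are
    # not stop words (this also drops punctuation tokens)
    lemmas = [token.lemma_ for token in nlp(text)]
    return ' '.join(t for t in lemmas if t.isalpha() and t not in stop_words)

print(normalize("We're constantly bombarded with information from all directions."))
# Roughly: constantly bombard information direction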
Conclusion:
In this part of our NLP basics series, we explored text normalization: what it is, why we need it, and the main techniques for applying it, from lowercasing to expanding contractions.
You can get the code here.
Stay tuned for the next part.