Text Normalization — NLP Basics — part 4 of 10

Shariq Hameed
4 min read · Mar 15, 2024


There’s no denying that NLP is one of the hottest fields of AI these days. Especially with LLMs, the field is advancing rapidly.

To understand the basics of NLP, I started a series of 10 articles dedicated to basic concepts.

Today, we will explore a crucial step in the overall NLP process: text normalization.

What is Text Normalization?

Text normalization is a key step in NLP that cleans and preprocesses data into a usable, standard and “less-random” format.

Text normalization involves various techniques such as lowercasing, punctuation removal and stop word removal.

Here’s a sample text:

Dealing with today's fast-paced lifestyle isn't easy - it's full of challenges
and surprises. We're constantly bombarded with information from all directions,
making it hard to focus on what truly matters. However, by taking small steps
towards mindfulness and self-care, we can navigate through life's ups and downs
with more ease and grace. It's important to remember that self-care isn't
selfish; it's a necessity for maintaining our overall well-being and happiness.

Here’s the same text but normalized:

dealing today fast pace lifestyle easy full challenge surprise constantly 
bombard information direction making hard focus truly matter however taking
small step towards mindfulness self care navigate life up down ease grace
important remember self care selfish necessity maintaining overall wellbeing
happiness

In this article, I will explain why and how to get from the original to the normalized text.

Why do we need text normalization?

Here are two main reasons why we need text normalization:

1. Reduces complexity:

Human language is full of complexities such as slang, abbreviations and different grammatical forms of the same word.

Text normalization helps reduce these complexities by transforming the text into a standard and consistent format.

2. Improves Efficiency:

By reducing the number of unique forms that a word can take, text normalization improves the efficiency of NLP models.

For instance, a model doesn’t need to learn the difference between “play” and “playing” if it understands they both convey the same core meaning.
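
To make that concrete, here’s a small illustrative sketch (the toy word list is mine) of how stemming collapses several surface forms into far fewer vocabulary entries:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Toy example: five surface forms built on the same root word
words = ["play", "plays", "played", "playing", "player"]
stems = {stemmer.stem(word) for word in words}

print(len(words), "surface forms ->", len(stems), "stems:", stems)
# With NLTK's default Porter implementation this prints something like:
# 5 surface forms -> 2 stems: {'play', 'player'}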

Techniques of text normalization:

Following are some of the main techniques used for text normalization:

1. Lowercasing:

Lowercasing is a technique that transforms all text into lowercase to ensure standard formats for all characters.

Here’s a simple function that implements lowercasing in Python:

def lowercase_text(text):
    """
    This function takes text and returns the text in lowercase
    """
    return text.lower()
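
For example, applied to a mixed-case string:

print(lowercase_text("Dealing with TODAY'S Fast-Paced Lifestyle"))
# dealing with today's fast-paced lifestyle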

2. Removing punctuation:

There are cases where we need to get rid of punctuation.

For example, if your word embedding vocabulary doesn’t include special characters, they need to be removed before lookup.

Here’s a short function that implements punctuation removal:

import string

punctuations = list(string.punctuation)

def remove_punctuations(text, punctuations):
    # Replace every punctuation character found in the text with an empty string
    for punctuation in punctuations:
        if punctuation in text:
            text = text.replace(punctuation, '')
    return text.strip()
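
As a side note, the same result can usually be obtained in a single pass with Python’s built-in str.translate. The function below (remove_punctuations_fast) is just my own label for this alternative sketch:

import string

def remove_punctuations_fast(text):
    # Build a translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table).strip()

print(remove_punctuations_fast("Isn't life full of ups-and-downs, surprises, etc.?"))
# Isnt life full of upsanddowns surprises etc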

3. Stemming & lemmatization:

Stemming and lemmatization are techniques that reduce a word to its base form.

For example, “playing”, “played” and “plays” are all reduced to “play”, converting all these forms into a single standard one.

Here’s some Python code that implements stemming using NLTK’s PorterStemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Note: word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
stemmer = PorterStemmer()

sentence = "The quick brown foxes are jumping over the lazy dogs"
words = word_tokenize(sentence)

for word in words:
    print(word + ": " + stemmer.stem(word))

Output:

The: the
quick: quick
brown: brown
foxes: fox
are: are
jumping: jump
over: over
the: the
lazy: lazi
dogs: dog

Here’s some Python code that implements lemmatization using spaCy:

import spacy

# Note: requires the small English model, e.g. python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

text = "The quick brown foxes are jumping over the lazy dogs"
doc = nlp(text)

lemmatized_tokens = [token.lemma_ for token in doc]

for original, lemmatized in zip(doc, lemmatized_tokens):
    print(str(original) + ": " + lemmatized)

Output:

The: the
quick: quick
brown: brown
foxes: fox
are: be
jumping: jump
over: over
the: the
lazy: lazy
dogs: dog

If you want to learn more about stemming and lemmatization, here’s a detailed article.

4. Stop words Removal:

For a variety of NLP tasks, words like “are”, “the”, “an” or “on” do not carry much useful information.

Hence, we remove these stop words to reduce complexity and improve efficiency.

Here’s a sample Python function that accomplishes this:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Note: requires nltk.download('stopwords') and nltk.download('punkt')

def remove_stopwords(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence)

    # Keep only the tokens that are not in the stop word list
    filtered_sentence = [word for word in word_tokens if word.lower() not in stop_words]

    return ' '.join(filtered_sentence)
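
Trying it on a sentence similar to the sample text at the top:

print(remove_stopwords("We are constantly bombarded with information from all directions"))
# constantly bombarded information directions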

Do you think we should remove stop words in all cases? Follow this link to find out.

5. Expanding contractions:

Contractions are words like “I’m”, “We’re” or “doesn’t”.

These are basically a short way of writing “I am”, “We are” and “does not” respectively.

There are two main reasons why we should expand such contractions:

  1. A computer doesn’t inherently understand that “I’m” and “I am” mean the same thing.
  2. Contractions increase the dimensionality of the document-term matrix, since “I’m” and “I am” end up in separate columns.

Here’s a Python function that expands contractions:

# Install the library first: pip install contractions
import contractions

def expand_contractions(text):
    expanded_text = []

    # Expand each word individually, e.g. "isn't" -> "is not"
    for word in text.split():
        expanded_text.append(contractions.fix(word))

    return ' '.join(expanded_text)
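
Putting it all together, here’s a minimal end-to-end sketch that chains the techniques above (contraction expansion, lowercasing, lemmatization, punctuation and stop word removal). The normalize function is my own illustration; depending on the spaCy model and NLTK stop word list you use, its output will approximate, rather than exactly reproduce, the normalized sample at the top of this article:

import contractions
import spacy
from nltk.corpus import stopwords

# Assumes: python -m spacy download en_core_web_sm and nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')
stop_words = set(stopwords.words('english'))

def normalize(text):
    # 1. Expand contractions ("isn't" -> "is not")
    text = contractions.fix(text)
    # 2. Lowercase everything
    text = text.lower()
    # 3. Lemmatize, keeping only alphabetic tokens that are not stop words
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc
              if token.is_alpha and token.lemma_ not in stop_words]
    return ' '.join(tokens)

print(normalize("Dealing with today's fast-paced lifestyle isn't easy - it's full of challenges and surprises."))
# roughly: deal today fast pace lifestyle easy full challenge surprise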

Conclusion:

In this part of our NLP basics series, we explored text normalization: what it is, why we need it and the main techniques used to perform it, from lowercasing and punctuation removal to stemming, lemmatization, stop word removal and contraction expansion.

You can get the code here.

Stay tuned for the next part.
