Stemming and Lemmatization — NLP Basics — Part 2 of 10
When you search for ‘thinking’ on Google, should the results match only the exact word ‘thinking’ and ignore ‘think’, ‘thought’, and ‘thinks’?
Of course not!
The answer lies in two powerful text processing techniques: stemming and lemmatization.
Both aim to reduce inflected words (variations of a base word) to their common base form, but they achieve this in different ways.
What is Stemming?
Stemming is a rule-based approach that removes suffixes from words to obtain a morphological stem.
This stem might not necessarily be a real word, but it captures the core meaning of the inflected word.
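For example, a suffix-stripping stemmer such as the Porter stemmer reduces “running” and “runs” to the stem “run”, but it leaves the irregular form “ran” unchanged because there is no suffix to strip.
It also reduces “changing”, “changes”, and “changed” to “chang”, which is not a real word.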
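The suffix-stripping idea can be sketched with a toy stemmer. This is a deliberately simplified illustration, not the Porter algorithm: the suffix list, the minimum stem length, and the name toy_stem are all assumptions made for the example.

```python
# Toy suffix-stripping stemmer: a simplified illustration only,
# NOT the Porter algorithm. The suffix list is an assumption.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word  # no suffix matched, e.g. the irregular form "ran"


for w in ["running", "runs", "ran", "changing", "changes", "changed"]:
    print(f"{w}: {toy_stem(w)}")
# running: run
# runs: run
# ran: ran
# changing: chang
# changes: chang
# changed: chang
```

Even this tiny rule set shows both sides of stemming: it is fast and needs no dictionary, but it cannot handle irregular forms and it happily produces non-words.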
What is Lemmatization?
Lemmatization, on the other hand, takes a more linguistic approach.
It uses dictionaries and morphological analysis to map inflected words to their canonical form, also known as the lemma.
Unlike stemming, lemmatization aims to return a valid dictionary word, which makes its output more consistent and accurate.
For instance, lemmatizing “changing”, “changed”, and “changes” would all result in the lemma “change”.
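To make the dictionary idea concrete, here is a minimal sketch of a lookup-based lemmatizer. The table entries, the part-of-speech labels, and the name toy_lemmatize are hand-written assumptions for illustration; real lemmatizers rely on much larger lexicons and morphological rules.

```python
# Toy lemmatizer: a dictionary lookup keyed on (word, part of speech).
# The entries below are hand-written for illustration only.
LEMMA_TABLE = {
    ("changing", "VERB"): "change",
    ("changed", "VERB"): "change",
    ("changes", "VERB"): "change",
    ("changes", "NOUN"): "change",
    ("ran", "VERB"): "run",
    ("are", "AUX"): "be",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(toy_lemmatize("changed", "VERB"))   # change
print(toy_lemmatize("ran", "VERB"))       # run
print(toy_lemmatize("meeting", "VERB"))   # meet
print(toy_lemmatize("meeting", "NOUN"))   # meeting
```

Note how the same surface form “meeting” maps to different lemmas depending on its part of speech. That context sensitivity is something a pure suffix-stripping stemmer cannot offer.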
Why Use Stemming and Lemmatization?
These techniques offer several benefits in natural language processing (NLP) tasks:
- Improved information retrieval: By reducing words to their base forms, stemming and lemmatization allow search engines to match queries with relevant documents regardless of the specific word forms used.
- Enhanced text analysis: These techniques help in tasks like sentiment analysis and topic modeling by grouping similar words together, leading to a more accurate understanding of the text.
- Reduced data sparsity: By merging different forms of a word into a single term, stemming and lemmatization shrink the vocabulary, so each remaining term is backed by more occurrences in the data.
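The retrieval and sparsity benefits listed above can be illustrated with a small sketch. The documents, the BASE_FORMS mapping, and the normalize and search helpers are all hypothetical stand-ins; in particular, mapping “thought” to “think” is what a lemmatizer would do for the verb, while a suffix-stripping stemmer would leave it unchanged.

```python
# Hand-written base forms standing in for a real stemmer or lemmatizer.
BASE_FORMS = {"thinks": "think", "thinking": "think", "thought": "think"}

# Tiny hypothetical document collection.
DOCS = {
    "doc1": "she thinks before speaking",
    "doc2": "thinking takes time",
    "doc3": "a thought worth sharing",
}


def normalize(token):
    """Map a token to its base form, or return it unchanged if unknown."""
    return BASE_FORMS.get(token, token)


def search(query, use_normalization):
    """Return the names of documents containing the (optionally normalized) query."""
    prep = normalize if use_normalization else (lambda t: t)
    target = prep(query.lower())
    return [
        name
        for name, text in DOCS.items()
        if target in {prep(t) for t in text.lower().split()}
    ]


exact = search("thinking", use_normalization=False)
normalized = search("thinking", use_normalization=True)
print(exact)       # ['doc2']
print(normalized)  # ['doc1', 'doc2', 'doc3']

# Vocabulary size before and after normalization.
raw_vocab = {t for text in DOCS.values() for t in text.split()}
norm_vocab = {normalize(t) for t in raw_vocab}
print(len(raw_vocab), len(norm_vocab))  # 11 9
```

Exact matching finds one document, while matching on base forms finds all three, and the vocabulary shrinks because three word forms collapse into one term.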
Using Stemming and Lemmatization
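The following code snippet applies stemming with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's tokenizer data; if it is missing, run
# nltk.download('punkt') once ('punkt_tab' on newer NLTK releases).
stemmer = PorterStemmer()
sentence = "The quick brown foxes are jumping over the lazy dogs"
words = word_tokenize(sentence)

# Print each token next to its stem
for word in words:
    print(f"{word}: {stemmer.stem(word)}")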
Result:
The: the
quick: quick
brown: brown
foxes: fox
are: are
jumping: jump
over: over
the: the
lazy: lazi
dogs: dog
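As you can see, ‘lazy’ is stemmed to ‘lazi’, which is not a real word. The Porter rules replace a trailing ‘y’ with ‘i’ so that related forms such as ‘lazy’ and ‘laziness’ share one stem.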
Now, let’s see what lemmatization can do:
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
text = "The quick brown foxes are jumping over the lazy dogs"
doc = nlp(text)

# token.lemma_ holds each token's lemma as a string
for token in doc:
    print(f"{token.text}: {token.lemma_}")
Result:
The: the
quick: quick
brown: brown
foxes: fox
are: be
jumping: jump
over: over
the: the
lazy: lazy
dogs: dog
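Conclusion
In this article, we explored stemming and lemmatization, two text processing techniques that share one objective: reducing words to a base form. Stemming strips suffixes by rule and may produce non-words such as ‘lazi’, while lemmatization uses dictionaries and morphological analysis to return valid words such as ‘be’ for ‘are’.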
You can get the code snippets here.
I hope you liked it. Stay tuned for the next part.