Removing Stop Words — NLP Basics — Part 3 of 10

Shariq Hameed
3 min read · Mar 9, 2024


Stop words are words that occur very frequently in any document but do not convey significant meaning for many NLP tasks, such as information retrieval.

Hence, removing them often improves both the quality of the analysis and computational efficiency.

What are stop words?

Stop words are the most common words used in any language.

We typically remove these stop words from our text before processing it for many NLP tasks.

If you plot the frequency distribution of all the words in a document, stop words appear the most.
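To see this for yourself, here is a minimal sketch (the sample text and variable names below are made up purely for illustration) that counts word frequencies with NLTK's FreqDist:

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

nltk.download('punkt')  # tokenizer models

# A made-up sample text; any real document shows the same pattern
text = ("The cat sat on the mat and the dog sat on the rug "
        "because the floor of the room was cold.")

freq = FreqDist(word_tokenize(text.lower()))

# Stop words like "the" and "on" dominate the counts
print(freq.most_common(5))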

These are words such as “a”, “an”, “the”, and “in”.

Every language has a list of these words.

You can see the English stop words in NLTK using the following code.

import nltk
from nltk.corpus import stopwords

# Download the stop word list (only needed once)
nltk.download('stopwords')

# Print the English stop words
print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself',
'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to',
'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',
'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y',
'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't",
'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn',
"mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't",
'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

When to remove stop words?

The need to remove stop words depends on the task at hand.

Let’s say we are making a classification model that categorizes text as positive or negative.

The words “a”, “an”, “the” or “in” do not convey any meaningful information for our task, whereas words like “love”, “hate”, “angry” or “like” carry much more meaning.

This is why, to make sure our model focuses on the words that actually matter, we remove these “stop words”.
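As a rough sketch of why this matters (the toy reviews and variable names below are invented just for illustration), you can compare the raw word counts a bag-of-words model would see before and after filtering:

from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the 'stopwords' and 'punkt' resources are already downloaded
stop_words = set(stopwords.words('english'))

# Invented toy reviews for a sentiment model
reviews = ["I love this phone and I love the camera",
           "I hate the battery and the screen"]

tokens = [w for review in reviews for w in word_tokenize(review.lower())]

# Before filtering, stop words like "i" and "the" top the counts
print(Counter(tokens).most_common(3))

# After filtering, sentiment-bearing words like "love" and "hate" remain
filtered = [w for w in tokens if w not in stop_words]
print(Counter(filtered).most_common(3))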

When not to remove stop words?

For tasks such as machine translation, stop word removal is not recommended because every word conveys a specific meaning.

Words like “is” or “was” convey the tense of a sentence, which is required for an accurate translation.
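As a quick sketch (the sentences and the helper name drop_stops are made up for illustration), filtering stop words collapses two sentences that differ only in tense into the same tokens:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the 'stopwords' and 'punkt' resources are already downloaded
stop_words = set(stopwords.words('english'))

def drop_stops(text):
    # Hypothetical helper: keep only the tokens that are not stop words
    return [w for w in word_tokenize(text.lower()) if w not in stop_words]

# Both sentences reduce to the same tokens, so the tense is lost
print(drop_stops("She is reading"))   # ['reading']
print(drop_stops("She was reading"))  # ['reading']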

How to remove stop words?

The following function uses NLTK; it takes in a sentence and returns the sentence without stop words:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the resources needed by the stop word list and the tokenizer
nltk.download('stopwords')
nltk.download('punkt')

def remove_stopwords(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence)

    # Keep only the tokens that are not stop words (case-insensitive)
    filtered_sentence = [word for word in word_tokens if word.lower() not in stop_words]

    return ' '.join(filtered_sentence)

sentence = "This is a sample code that shows how to remove stop words"

sentence_without_stopwords = remove_stopwords(sentence)

print(sentence_without_stopwords)
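Running this should print something like:

sample code shows remove stop words

since “This”, “is”, “a”, “that”, “how” and “to” are all in NLTK’s English stop word list.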

Conclusion:

In this part of our NLP basics series, we explored stop words: what they are, when to remove them, when not to remove them, and how to remove them.

You can get the code here.

Stay tuned for the next part.
