TF-IDF and Word Embeddings in NLP — NLP Basics — Part 7 of 10

Shariq Hameed
3 min read · Oct 25, 2024


Computers don’t understand text the way we do. They don’t see the word itself; they see a number that stands in for it, maybe something like 20.

But is it a good idea to represent a word by simply assigning it a number? If not, what is a better way to represent words?

Let’s see.

Simply assigning a number to each word doesn’t capture the meaning of words or the relationships between them, so similar words aren’t treated similarly: “King” and “Queen” might end up as 5 and 29, numbers that say nothing about how related they are.

Also, once you turn those IDs into something a model can actually use (for example, one-hot vectors), you end up with very large, sparse representations, making it hard for models to learn effectively.

So, what do we do?

We use other smart methods to represent text, two of which we are discussing in this article: TF-IDF and Word Embeddings.

These techniques convert text into numbers that machine learning algorithms can work with.

Let’s start by exploring TF-IDF and then move on to Word Embeddings.

What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency.

It’s a statistical measure used to calculate the importance of a word in a document relative to a collection of documents (the corpus).

It works by combining two things:

1. Term Frequency (TF):

How often a word appears in a document.

2. Inverse Document Frequency (IDF):

How rare that word is across the overall corpus. This down-weights words like “are” and “is” because they occur in almost every document, so they get a low IDF score.

So, TF-IDF becomes:

TF-IDF(t, d) = TF(t, d) × IDF(t), where IDF(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing the term t.

Example:

Let’s say you have the following documents:

  • Doc1: “Cats are cute”
  • Doc2: “I love dogs”
  • Doc3: “Dogs are amazing”

The word “are” appears in both Doc1 and Doc3, so it’s not “special” and gets a lower IDF.

However, the word “Cats” only appears in Doc1, giving it more weight when calculating TF-IDF.
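To make the numbers concrete, here’s a minimal hand-rolled sketch of the calculation (plain TF and a log-based IDF, not the exact smoothed variant scikit-learn uses below); the helper functions are just for illustration:

import math

docs = {
    "Doc1": "Cats are cute",
    "Doc2": "I love dogs",
    "Doc3": "Dogs are amazing",
}

# Lowercase and split each document into tokens
tokenized = {name: text.lower().split() for name, text in docs.items()}

def tf(term, tokens):
    # Term frequency: share of the document's tokens that are this term
    return tokens.count(term) / len(tokens)

def idf(term):
    # Inverse document frequency: log(N / number of documents containing the term)
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(len(tokenized) / df)

# "are" appears in two documents -> low IDF; "cats" appears in only one -> higher IDF
for term in ["are", "cats"]:
    score = tf(term, tokenized["Doc1"]) * idf(term)
    print(f"{term}: IDF = {idf(term):.3f}, TF-IDF in Doc1 = {score:.3f}")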

Python Code:

from sklearn.feature_extraction.text import TfidfVectorizer

# The same three documents as in the example above
documents = ["Cats are cute", "I love dogs", "Dogs are amazing"]

# Learn the vocabulary and compute the TF-IDF matrix in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the vocabulary (columns of the matrix)
print(tfidf_matrix.toarray())              # one row per document, one weight per term
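In the printed matrix, each row corresponds to one document and each column to one vocabulary term (in the order returned by get_feature_names_out()). You should see that a word like “are”, which shows up in two of the three documents, gets a lower weight within its row than words that are unique to that document.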

What are Word Embeddings?

Word Embeddings are dense vector representations of words that capture their semantic meaning.

Unlike TF-IDF, Word Embeddings group similar words together in a continuous vector space.

This allows the model to understand the context in which a word is used.

Why do we need Word Embeddings?

While TF-IDF is effective, it has limitations:

  • Sparsity: TF-IDF creates high-dimensional vectors where most values are zero.
  • Lack of semantic understanding: TF-IDF doesn’t capture word meaning or similarity.

Word Embeddings solve this by learning relationships between words.

Words with similar meanings (e.g., “king” and “queen”) are placed closer in the vector space, allowing models to understand that there is a relationship between these words.

Techniques for Word Embeddings:

Word2Vec is one of the most popular and foundational techniques for generating word embeddings.

It uses neural networks to learn the vector representation of words.

  • CBOW (Continuous Bag of Words): Predicts the current word from its surrounding context.
  • Skip-gram: Predicts the surrounding context from the current word.

Example Code for Word2Vec:

from gensim.models import Word2Vec

# Tokenized, lowercased sentences (one list of tokens per sentence)
sentences = [["i", "love", "cats"], ["i", "love", "dogs"], ["dogs", "are", "amazing"]]

# Train a Word2Vec model (sg=0, the default, uses CBOW; sg=1 switches to skip-gram)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector for the word "dogs"
print(model.wv['dogs'])

# Find words similar to "love"
print(model.wv.most_similar('love'))
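Under the hood, most_similar ranks words by cosine similarity between their vectors. Here’s a minimal NumPy sketch of that comparison, reusing the toy model trained above (with a corpus this tiny, the actual numbers are essentially noise):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1 means more similar direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare the learned vectors for two words from the toy corpus
print(cosine_similarity(model.wv["cats"], model.wv["dogs"]))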

Conclusion

In this article, we discussed two text representation techniques: TF-IDF and Word Embeddings.

Both techniques are widely used in NLP tasks such as information retrieval, sentiment analysis, and more.

Stay tuned for the next article in the series, where we’ll cover Text Classification!
