Tokenization — NLP basic concepts — Part 1 of 10

Shariq Hameed


Welcome to Part 1 of my 10-part series on the basics of Natural Language Processing.

In this part, we explore tokenization.

What is tokenization?

The first step of any ML project is to preprocess your data.

Similarly, in NLP we need to perform text preprocessing, which aims to convert our text into a meaningful and analyzable format.

Tokenization is one of the many steps of text preprocessing.

It is the process of taking raw text and breaking it into pieces such as words, sentences, symbols, terms, or other meaningful and manageable elements called tokens.

For example, the sentence “I love cats!” can be tokenized as:

Word level: ["I", "love", "cats", "!"]
Sentence level: ["I love cats!"]
Character level: ["I", " ", "l", "o", "v", "e", " ", "c", "a", "t", "s", "!"]

Why do we need tokenization?

Computers are fantastic at dealing with numbers, but they struggle with understanding human language.

Tokenization helps bridge this gap by converting text into a format that computers can understand and work with.

Imagine you want to count how many times the word “cat” appears in a large text.

By breaking the text down into individual words (tokens), it becomes easy to identify and count each instance of “cat”.
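
Here is a minimal sketch of that idea in Python (the sample text and the naive whitespace split are just for illustration):

# Sample text, made up for illustration.
text = "The cat sat on the mat. Another cat joined the first cat."

# Naive word-level tokenization: split on whitespace,
# strip surrounding punctuation, and lowercase each token.
tokens = [word.strip(".,!?").lower() for word in text.split()]

print(tokens.count("cat"))
>> 3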

What are the types of tokenization?

Broadly speaking, there are three main types of tokenization:

1. Character-level tokenization.
2. Word-level tokenization.
3. Sentence-level tokenization.

1. Character-level tokenization:

Character-level tokenization breaks text into individual characters, capturing information at the finest level of granularity.

For example, character-level tokenization of the sentence “I love cats!”:

["I", " ", "l", "o", "v", "e", " ", "c", "a", "t", "s", "!"]

The simplest implementation of character-level tokenization in Python is as follows:

def char_tokenizer(text):
    """
    This function performs character-level tokenization on a string.

    Args:
        text: The string to be tokenized.

    Returns:
        A list of the characters in the string.
    """
    return list(text)

This function takes a string and returns a list of the characters that make up the string.
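
For example, calling it on our sample sentence:

print(char_tokenizer("I love cats!"))
>> ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'c', 'a', 't', 's', '!']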

2. Word-level tokenization:

Word-level tokenization breaks text into individual words.

For example, word-level tokenization of the sentence “I love cats”:

["I", "love", "cats"]

NLTK, a popular Python library, offers several implementations of word-level tokenization.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

text = "My name is Shariq and I love NLP"

tokens = word_tokenize(text)
print(tokens)
>> ['My', 'name', 'is', 'Shariq', 'and', 'I', 'love', 'NLP']

There is also a WordPunctTokenizer, which splits punctuation marks into separate tokens.

from nltk.tokenize import WordPunctTokenizer

text = "I can't travel today."

tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
>> ['I', 'can', "'", 't', 'travel', 'today', '.']
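
For comparison, word_tokenize follows Penn Treebank conventions and keeps the contraction's clitic together instead of isolating the apostrophe:

print(word_tokenize("I can't travel today."))
>> ['I', 'ca', "n't", 'travel', 'today', '.']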

3. Sentence-level tokenization:

Sentence-level tokenization breaks a text or paragraph down into individual sentences.

Here is sample code that uses sent_tokenize from NLTK.

from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to our 10-part series on NLP. Today, we are covering tokenization"

tokens = sent_tokenize(text)

print(tokens)
>> ['Hello everyone.', 'Welcome to our 10-part series on NLP.', 'Today, we are covering tokenization']
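
Because sent_tokenize relies on a pre-trained Punkt model, it is usually smart enough not to split on the periods in common abbreviations. For example (my own sample sentence):

print(sent_tokenize("Dr. Smith arrived late. Everyone waited."))
>> ['Dr. Smith arrived late.', 'Everyone waited.']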

Conclusion

In this article, we covered tokenization, a crucial step in text preprocessing for almost all NLP applications.

I also shared some code snippets; you can find a Jupyter notebook in the repository for this series on my GitHub account.

I hope you liked this article. Stay tuned for the next one!
