Dictionary Based Tokenization in NLP

Last Updated : 04 Jun, 2023

Natural Language Processing (NLP) is a subfield of artificial intelligence that aims to enable computers to process, understand, and generate human language. One of the critical tasks in NLP is tokenization, which is the process of splitting text into smaller meaningful units, known as tokens. Dictionary-based tokenization is a common method used in NLP to segment text into tokens based on a pre-defined dictionary.

More specifically, it groups words into tokens using a predefined dictionary of multi-word expressions. This is useful when standard word tokenization is not sufficient for an application, such as sentiment analysis or named entity recognition, where a multi-word expression needs to be treated as a single token.

Here, a dictionary is a list of words, phrases, and names, optionally stored with related information such as definitions or parts of speech. During tokenization, the words in the text are compared with the dictionary entries, and every sequence of words that matches an entry is kept together as a single token. By creating a custom dictionary, we can therefore tokenize names and phrases as single units.
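The matching idea described above can be sketched in plain Python. This is a minimal illustration, not the implementation used by any particular library: the function name `merge_multiword`, its parameters, and the "New York" phrases are hypothetical, and it assumes the text has already been split into words and that the longest matching phrase should win.

```python
def merge_multiword(tokens, phrases, separator=" "):
    """Greedily merge the longest dictionary phrase starting at each position."""
    phrase_set = {tuple(p) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer phrases win over shorter ones.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + size]) in phrase_set:
                merged.append(separator.join(tokens[i:i + size]))
                i += size
                break
        else:
            # No phrase starts here; keep the single token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical example data
phrases = [("New", "York"), ("New", "York", "City")]
tokens = ["I", "love", "New", "York", "City", "."]
print(merge_multiword(tokens, phrases))
# ['I', 'love', 'New York City', '.']
```

Trying longer phrases first is a design choice: it keeps "New York City" together instead of stopping at "New York".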

A token in natural language processing is a sequence of characters that is treated as a single unit of meaning. Words, phrases, numbers, and punctuation marks can all be tokens. Many NLP tasks, including text classification, sentiment analysis, machine translation, and named entity recognition, depend on tokenization.

The initial split into words, which dictionary-based tokenization then refines, can be produced by several methods, including rule-based tokenization, machine-learning-based tokenization, and hybrid tokenization. Rule-based tokenization divides the text into tokens according to surface characteristics such as punctuation, capitalization, and spacing. Machine-learning-based tokenization trains a model on a set of training data to separate text into tokens. Hybrid tokenization combines rule-based and machine-learning-based methods to improve accuracy and efficiency.
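As a small illustration of the rule-based approach, the sketch below splits text using a single regular expression. The function name `rule_based_tokenize` and the pattern are assumptions chosen for this example; real rule-based tokenizers usually handle many more cases, such as contractions and abbreviations.

```python
import re


def rule_based_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a word);
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


print(rule_based_tokenize("He is from Himachal Pradesh."))
# ['He', 'is', 'from', 'Himachal', 'Pradesh', '.']
```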

Steps for implementing dictionary-based tokenization:

Step 1: Collect a dictionary of words and their corresponding parts of speech. The dictionary can be created manually or obtained from a pre-existing source such as WordNet or Wikipedia.
Step 2: Preprocess the text by removing any noise such as punctuation marks, stop words, and HTML tags.
Step 3: Tokenize the text into words using a whitespace tokenizer, optionally after splitting it into sentences with a sentence tokenizer.
Step 4: Identify the parts of speech of each word in the text using a part-of-speech tagger such as the Stanford POS Tagger.
Step 5: Segment the text into tokens by comparing the words in the text with the entries in the dictionary. If a match is found, the matching dictionary entry is used as a token. Otherwise, the word is split into smaller sub-tokens based on its part of speech.
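Steps 2 and 3 above can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the function name `preprocess`, the tiny `STOP_WORDS` set, and the HTML snippet are hypothetical, and the regular expression only handles simple tags.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not a standard resource)
STOP_WORDS = {"is", "an", "of", "the", "a"}


def preprocess(text):
    # Step 2: remove simple HTML tags and ASCII punctuation.
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Step 3: whitespace tokenization.
    words = text.split()
    # Step 2 (continued): drop stop words, compared case-insensitively.
    return [w for w in words if w.lower() not in STOP_WORDS]


print(preprocess("<p>Jammu Kashmir is an integral part of India.</p>"))
# ['Jammu', 'Kashmir', 'integral', 'part', 'India']
```

Note that removing stop words or punctuation before dictionary matching can break phrases that contain them, so the order of these steps is a design choice.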

For example, consider the following sentences:

Jammu Kashmir is an integral part of India.
My name is Pawan Kumar Gunjan.
He is from Himachal Pradesh.

The steps involved in the dictionary-based tokenization of these sentences with NLTK are as follows:

Step 1: Import the necessary libraries

Python3

# word_tokenize relies on NLTK's Punkt tokenizer data (a one-time nltk.download)
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

Step 2: Create a custom dictionary of names and phrases

Collect the multi-word expressions, such as phrases or names, that should be kept together. Each entry is a tuple of the words that make up the expression. Let the dictionary contain the following names and phrases.

Python3

dictionary = [("Jammu", "Kashmir"),  
              ("Pawan", "Kumar", "Gunjan"),  
              ("Himachal", "Pradesh")]
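If the phrases are stored as plain strings, for example one phrase per line in a file, they can be converted to the tuple form shown above. The sketch below assumes the words of each phrase are separated by whitespace; the `phrases` list is a stand-in for whatever source is actually used.

```python
# Phrases as plain strings (stand-in for lines read from a file)
phrases = ["Jammu Kashmir", "Pawan Kumar Gunjan", "Himachal Pradesh"]

# Split each phrase on whitespace and store it as a tuple of words
dictionary = [tuple(phrase.split()) for phrase in phrases]
print(dictionary)
# [('Jammu', 'Kashmir'), ('Pawan', 'Kumar', 'Gunjan'), ('Himachal', 'Pradesh')]
```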

Step 3: Create an instance of MWETokenizer with the dictionary

Python3

Dictionary_tokenizer = MWETokenizer(dictionary, separator=' ')
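Two details of MWETokenizer are worth knowing here: when no separator is given, the merged words are joined with an underscore, and further expressions can be added later with add_mwe. The short sketch below illustrates both; the sentence and the "New Delhi" entry are made up for this example, and the input is split with str.split so that no tokenizer data is needed.

```python
from nltk.tokenize import MWETokenizer

# No separator argument: merged tokens are joined with "_"
tokenizer = MWETokenizer([("Himachal", "Pradesh")])

# Add another multi-word expression after construction
tokenizer.add_mwe(("New", "Delhi"))

tokens = "He moved from Himachal Pradesh to New Delhi".split()
merged = tokenizer.tokenize(tokens)
print(merged)
# ['He', 'moved', 'from', 'Himachal_Pradesh', 'to', 'New_Delhi']
```

Passing separator=' ', as in the step above, keeps the merged token readable as ordinary text.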

Step 4: Create a sample text and tokenize it with word_tokenize

Python3

text = """ 
Jammu Kashmir is an integral part of India. 
My name is Pawan Kumar Gunjan. 
He is from Himachal Pradesh. 
"""
tokens = word_tokenize(text) 
tokens

Output:

['Jammu',
 'Kashmir',
 'is',
 'an',
 'integral',
 'part',
 'of',
 'India',
 '.',
 'My',
 'name',
 'is',
 'Pawan',
 'Kumar',
 'Gunjan',
 '.',
 'He',
 'is',
 'from',
 'Himachal',
 'Pradesh',
 '.']

Step 5: Apply dictionary-based tokenization with Dictionary_tokenizer

Python3

dictionary_based_token = Dictionary_tokenizer.tokenize(tokens)
dictionary_based_token

Output:

['Jammu Kashmir',
 'is',
 'an',
 'integral',
 'part',
 'of',
 'India',
 '.',
 'My',
 'name',
 'is',
 'Pawan Kumar Gunjan',
 '.',
 'He',
 'is',
 'from',
 'Himachal Pradesh',
 '.']

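Comparing the two outputs shows the difference between general word tokenization and dictionary-based tokenization: "Jammu Kashmir", "Pawan Kumar Gunjan", and "Himachal Pradesh" are now single tokens, while all other tokens are unchanged. This is useful when we know which phrases or multi-word names occur in the text and want each of them to be treated as a single token.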
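One practical detail: MWETokenizer compares tokens exactly, so matching is case-sensitive. The sketch below assumes a dictionary entry written in lowercase and shows that lowercasing the tokens first is one simple way to make the match succeed. This normalization step is an illustration and is not part of the example above.

```python
from nltk.tokenize import MWETokenizer

# Dictionary entry written in lowercase
tokenizer = MWETokenizer([("himachal", "pradesh")], separator=" ")

tokens = ["He", "is", "from", "Himachal", "Pradesh", "."]

# Capitalized tokens do not match the lowercase entry, so nothing is merged
unmerged = tokenizer.tokenize(tokens)
print(unmerged)
# ['He', 'is', 'from', 'Himachal', 'Pradesh', '.']

# Lowercasing the tokens first lets the entry match
lowered = [t.lower() for t in tokens]
merged = tokenizer.tokenize(lowered)
print(merged)
# ['he', 'is', 'from', 'himachal pradesh', '.']
```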

Full code implementation

Python3

# import the necessary libraries 
from nltk import word_tokenize 
from nltk.tokenize import MWETokenizer 
  
# custom dictionary of multi-word expressions
dictionary = [("Jammu", "Kashmir"),  
              ("Pawan", "Kumar", "Gunjan"),  
              ("Himachal", "Pradesh")] 
  
# Create an instance of MWETokenizer with the dictionary 
Dictionary_tokenizer = MWETokenizer(dictionary, separator=' ') 
  
# Text 
text = """ 
Jammu Kashmir is an integral part of India. 
My name is Pawan Kumar Gunjan. 
He is from Himachal Pradesh. 
"""
  
tokens = word_tokenize(text)
print('General Word Tokenization \n', tokens)

dictionary_based_token = Dictionary_tokenizer.tokenize(tokens)
print('Dictionary based tokenization \n', dictionary_based_token)

Output:

General Word Tokenization 
 ['Jammu', 'Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan', 'Kumar', 'Gunjan', '.', 'He', 'is', 'from', 'Himachal', 'Pradesh', '.']
Dictionary based tokenization 
 ['Jammu Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan Kumar Gunjan', '.', 'He', 'is', 'from', 'Himachal Pradesh', '.']