Natural Language Processing with Python Cookbook

Krishna Bhavsar
Naresh Kumar
Pratap Dangeti

BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its
dealers and distributors, will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78728-932-1
www.packtpub.com
Credits
First and foremost, I would like to thank my mother for being the biggest motivating force
and a rock-solid support system behind all my endeavors in life. I would like to thank the
management team at Synerzip and all my friends for being supportive of me on this
journey. Last but not least, special thanks to Ram and Dorothy for keeping me on track
during this professionally difficult year.
Pratap Dangeti develops machine learning and deep learning solutions for structured,
image, and text data at TCS, in its research and innovation lab in Bangalore. He has
acquired a lot of experience in both analytics and data science. He received his master's
degree from IIT Bombay in its industrial engineering and operations research program.
Pratap is an artificial intelligence enthusiast. When not working, he likes to read about next-
gen technologies and innovative methodologies. He is also the author of the book Statistics
for Machine Learning by Packt.
I would like to thank my mom, Lakshmi, for her support throughout my career and in
writing this book. I dedicate this book to her. I also thank my family and friends for their
encouragement, without which it would not have been possible to write this book.
About the Reviewer
Juan Tomas Oliva Ramos is an environmental engineer from the University of Guanajuato,
Mexico, with a master's degree in administrative engineering and quality. He has more than
5 years of experience in the management and development of patents, technological
innovation projects, and the development of technological solutions through the statistical
control of processes.
Juan is an Alfaomega reviewer and has worked on the book Wearable Designs for Smart
Watches, Smart TVs and Android Mobile Devices.
Juan has also developed prototypes through programming and automation technologies for
the improvement of operations, which have been registered for patents.
I want to thank God for giving me wisdom and humility to review this book.
I thank Packt for giving me the opportunity to review this amazing book and to collaborate
with a group of committed people.
I want to thank my beautiful wife, Brenda, our two magic princesses (Maria Regina and
Maria Renata) and our next member (Angel Tadeo), all of you, give me the strength,
happiness, and joy to start a new day. Thanks for being my family.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com. Did
you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
[email protected] for more details. At www.PacktPub.com, you can also read a
collection of free technical articles, sign up for a range of free newsletters and receive
exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/178728932X. If you'd like to join our team of regular
reviewers, you can email us at [email protected]. We reward our regular
reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be
relentless in improving our products!
Table of Contents
Preface
Chapter 1: Corpus and WordNet
  Introduction
  Accessing in-built corpora
    How to do it...
  Download an external corpus, load it, and access it
    Getting ready
    How to do it...
    How it works...
  Counting all the wh words in three different genres in the Brown corpus
    Getting ready
    How to do it...
    How it works...
  Explore frequency distribution operations on one of the web and chat text corpus files
    Getting ready
    How to do it...
    How it works...
  Take an ambiguous word and explore all its senses using WordNet
    Getting ready
    How to do it...
    How it works...
  Pick two distinct synsets and explore the concepts of hyponyms and hypernyms using WordNet
    Getting ready
    How to do it...
    How it works...
  Compute the average polysemy of nouns, verbs, adjectives, and adverbs according to WordNet
    Getting ready
    How to do it...
    How it works...
Chapter 2: Raw Text, Sourcing, and Normalization
  Introduction
    Getting ready
    How to do it…
    How it works…
  Stopwords – learning to use the stopwords corpus and seeing the difference it can make
    Getting ready
    How to do it...
    How it works...
  Edit distance – writing your own algorithm to find edit distance between two strings
    Getting ready
    How to do it…
    How it works…
  Processing two short stories and extracting the common vocabulary between two of them
    Getting ready
    How to do it…
    How it works…
Chapter 4: Regular Expressions
  Introduction
  Regular expression – learning to use *, +, and ?
    Getting ready
    How to do it…
    How it works…
  Regular expression – learning to use $ and ^, and the non-start and non-end of a word
    Getting ready
    How to do it…
    How it works…
  Searching multiple literal strings and substring occurrences
    Getting ready
    How to do it…
    How it works...
  Learning to create date regex and a set of characters or ranges of character
    How to do it...
    How it works...
  Find all five-character words and make abbreviations in some sentences
    How to do it…
    How it works...
  Learning to write your own regex tokenizer
    Getting ready
    How to do it...
    How it works...
  Learning to write your own regex stemmer
    Getting ready
    How to do it…
    How it works…
Chapter 5: POS Tagging and Grammars
  Introduction
  Exploring the in-built tagger
    Getting ready
    How to do it...
    How it works...
  Writing your own tagger
    Getting ready
    How to do it...
    How it works...
  Training your own tagger
    Getting ready
    How to do it...
    How it works...
  Learning to write your own grammar
    Getting ready
    How to do it...
    How it works...
  Writing a probabilistic CFG
    Getting ready
    How to do it...
    How it works...
  Writing a recursive CFG
    Getting ready
    How to do it...
    How it works...
Chapter 6: Chunking, Sentence Parse, and Dependencies
  Introduction
Preface
Dear reader, thank you for choosing this book to pursue your interest in natural language
processing. This book will give you a practical viewpoint to understand and implement
NLP solutions from scratch. We will take you on a journey that will start with accessing
inbuilt data sources and creating your own sources. And then you will be writing complex
NLP solutions that will involve text normalization, preprocessing, POS tagging, parsing,
and much more.
In this book, we will also cover the fundamentals necessary for applying deep learning, the
current state of the art, to natural language processing. We will implement these
applications using the Keras library.
Chapter 2, Raw Text, Sourcing, and Normalization, shows how to extract text from various
formats of data sources. We will also learn to extract raw text from web sources, and finally
we will normalize raw text from these heterogeneous sources and organize it into a corpus.
Chapter 4, Regular Expressions, covers one of the most basic and simple, yet most important
and powerful, tools that you will ever learn. In this chapter, you will learn the concept of
pattern matching as a way to do text analysis, and for this, there is no better tool than
regular expressions.
Chapter 5, POS Tagging and Grammars, covers POS tagging, which forms the basis of any further
syntactic analysis, and grammars, which can be formed and deformed using POS tags and chunks.
We will learn to use and write our own POS taggers and grammars.
Chapter 6, Chunking, Sentence Parse, and Dependencies, helps you learn how to use the
inbuilt chunker as well as train and write your own chunker and dependency parser. In this
chapter, you will also learn to evaluate your own trained models.
Chapter 7, Information Extraction and Text Classification, tells you more about named entity
recognition. We will use the inbuilt named entities and also create our own named entities
using dictionaries. We will then learn to use the inbuilt text classification algorithms and
build simple recipes around their application.
Chapter 8, Advanced NLP Recipes, is about combining all your lessons so far and creating
applicable recipes that can be easily plugged into any of your real-life application problems.
We will write recipes such as text similarity, summarization, sentiment analysis, anaphora
resolution, and so on.
Chapter 9, Application of Deep Learning in NLP, presents the various fundamentals necessary
for working on deep learning with applications in NLP problems such as classification of
emails, sentiment classification with CNN and LSTM, and finally visualizing high-dimensional
words in a low-dimensional space.
Chapter 10, Advanced Application of Deep Learning in NLP, describes state-of-the-art problem
solving using deep learning. This consists of automated text generation, question and
answer on episodic data, language modeling to predict the next best word, and finally
chatbot development using generative principles.
This book assumes that you know the basics of Keras and how to install the libraries. We do
not expect readers to already be equipped with knowledge of deep learning or of mathematics
such as linear algebra.
We have used the following versions of software throughout this book, but it should run
fine with any of the more recent ones also:
Anaconda 3 – 4.3.1 (all Python and its relevant packages are included in
Anaconda, Python – 3.6.1, NumPy – 1.12.1, pandas – 0.19.2)
Theano – 0.9.0
Keras – 2.0.2
feedparser – 5.2.1
bs4 – 4.6.0
gensim – 3.0.1
This book is intended both for newcomers with no knowledge of NLP and for experienced
professionals who would like to expand their knowledge from traditional NLP techniques to
state-of-the-art deep learning techniques applied to NLP.
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to
do it…, How it works…, There's more…, and See also). To give clear instructions on how to
complete a recipe, we use these sections as follows.
Getting ready
This section tells you what to expect in the recipe, and describes how to set up any software
or any preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous
section.
There's more…
This section consists of additional information about the recipe in order to make the reader
more knowledgeable about the recipe.
See also
This section provides helpful links to other useful information for the recipe.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Create a
new file named reuters.py and add the following import line in the file." A block of code
is set as follows:
for w in reader.words(fileP):
    print(w + ' ', end='')
    if w == '.':
        print()
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book, what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of. To send us general feedback, simply e-mail
[email protected], and mention the book's title in the subject of your message. If
there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors .
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
get the most from your purchase. You can download the example code files for this book by
following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of an archive extraction tool such as WinRAR or 7-Zip (or a platform equivalent).
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Natural-Language-Processing-with-Python-Cookbook.
We also have other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books, maybe a mistake in the text or the code,
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title. To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the search
field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy. Please
contact us at [email protected] with a link to the suspected pirated material. We
appreciate your help in protecting our authors and our ability to bring you valuable
content.
Questions
If you have a problem with any aspect of this book, you can contact us at
[email protected], and we will do our best to address the problem.
Chapter 1: Corpus and WordNet
In this chapter, we will cover the following recipes:
Accessing in-built corpora
Download an external corpus, load it, and access it
Counting all the wh words in three different genres in the Brown corpus
Explore frequency distribution operations on one of the web and chat text corpus files
Take an ambiguous word and explore all its senses using WordNet
Pick two distinct synsets and explore the concepts of hyponyms and hypernyms using WordNet
Compute the average polysemy of nouns, verbs, adjectives, and adverbs according to WordNet
Introduction
To solve any real-world Natural Language Processing (NLP) problems, you need to work
with huge amounts of data. This data is generally available in the form of a corpus out there
in the open diaspora and as an add-on of the NLTK package. For example, if you want to
create a spell checker, you need a huge corpus of words to match against.
We will try to understand these things from a practical standpoint. We will perform some
exercises that will fulfill all of these goals through our recipes.
Now, our first task/recipe involves learning how to access any one of these corpora. We
have decided to do some tests on the Reuters corpus for the same. We will import the corpus
into our program and try to access it in different ways.
How to do it...
1. Create a new file named reuters.py and add the following import line in the
file. This will specifically allow access to only the reuters corpus in our program
from the entire NLTK data:
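The import line itself is not reproduced in this extract; assuming the standard NLTK corpus module, it would be:
from nltk.corpus import reuters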
2. Now we want to check what exactly is available in this corpus. The simplest way
to do this is to call the fileids() function on the corpus object. Add the
following line in your program:
files = reuters.fileids()
print(files)
3. Now run the program and you shall get an output similar to this:
These are the lists of files and the relative paths of each of them in the reuters
corpus.
4. Now we will access the actual content of any of these files. To do this, we will use
the words() function on the corpus object as follows, and we will access the
test/16097 file:
words16097 = reuters.words(['test/16097'])
print(words16097)
5. Run the program again and an extra new line of output will appear:
As you can see, the list of words in the test/16097 file is shown. This is curtailed
though the entire list of words is loaded in the memory object.
6. Now we want to access a specific number of words (20) from the same file,
test/16097. Yes! We can specify how many words we want to access and store
them in a list for use. Append the following two lines in the code:
words20 = reuters.words(['test/16097'])[:20]
print(words20)
Run this code and another extra line of output will be appended, which will look
like this:
['UGANDA', 'PULLS', 'OUT', 'OF', 'COFFEE', 'MARKET', '-', 'TRADE',
'SOURCES', 'Uganda', "'", 's', 'Coffee', 'Marketing', 'Board', '(',
'CMB', ')', 'has', 'stopped']
7. Moving forward, the reuters corpus is not just a list of files but is also
hierarchically categorized into 90 topics. Each topic has many files associated
with it. What this means is that, when you access any one of the topics, you are
actually accessing the set of all files associated with that topic. Let's first output
the list of topics by adding the following code:
reutersGenres = reuters.categories()
print(reutersGenres)
Run the code and the following line of output will be added to the output console:
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa',
'coconut', 'coconut-oil', ...
8. Finally, we will write four simple lines of code that will not only access two
topics but also print out the words in a loosely sentenced fashion as one sentence
per line. Add the following code to the Python file:
for w in reuters.words(categories=['bop','cocoa']):
    print(w + ' ', end='')
    if w == '.':
        print()
9. To explain briefly, we first selected the categories 'bop' and 'cocoa' and
printed every word from these two categories' files. Every time we encountered a
dot (.), we inserted a new line. Run the code and something similar to the
following will be the output on the console:
Getting ready
First and foremost, you will need to download the dataset from the Internet. Here's the link:
http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_cleaned.zip.
Download the dataset, unzip it, and store the resultant Reviews directory at a secure
location on your computer.
How to do it...
1. Create a new file named external_corpus.py and add the following import
line to it:
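The import line is not shown here; given that the next paragraph refers to CategorizedPlaintextCorpusReader, it is presumably:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader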
Since the corpus that we have downloaded is already categorized, we will use
CategorizedPlaintextCorpusReader to read and load the given corpus. This
way, we can be sure that the categories of the corpus are captured, in this case,
positive and negative.
2. Now we will read the corpus. We need to know the absolute path of the Reviews
folder that we unzipped from the downloaded file from Cornell. Add the
following four lines of code:
reader = CategorizedPlaintextCorpusReader(r'/Volumes/Data/NLP-CookBook/Reviews/txt_sentoken', r'.*\.txt', cat_pattern=r'(\w+)/*')
print(reader.categories())
print(reader.fileids())
The first line is where you are reading the corpus by calling the
CategorizedPlaintextCorpusReader constructor. The three arguments from
left to right are Absolute Path to the txt_sentoken folder on your computer, all
sample document names from the txt_sentoken folder, and the categories in the
given corpus (in our case, 'pos' and 'neg'). If you look closely, you'll see that all
the three arguments are regular expression patterns. The next two lines will
validate whether the corpus is loaded correctly or not, printing the associated
categories and filenames of the corpus. Run the program and you should see
something similar to the following:
['neg', 'pos']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt',
'neg/cv002_17424.txt', ...]
3. Now that we've made sure that the corpus is loaded correctly, let's get on with
accessing any one of the sample documents from both the categories. For that,
let's first create a list, each containing samples of both the categories, 'pos' and
'neg', respectively. Add the following two lines of code:
posFiles = reader.fileids(categories='pos')
negFiles = reader.fileids(categories='neg')
The reader.fileids() method takes the argument category name. As you can
see, what we are trying to do in the preceding two lines of code is straightforward
and intuitive.
4. Now let's select a file randomly from each of the lists of posFiles and
negFiles. To do so, we will need the randint() function from the random
library of Python. Add the following lines of code and we shall elaborate what
exactly we did immediately after:
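The snippet for this step is not reproduced in this extract; based on the explanation that follows (and the fileP and fileN names used later in the recipe), a minimal sketch would be:
from random import randint
fileP = posFiles[randint(0, len(posFiles) - 1)]
fileN = negFiles[randint(0, len(negFiles) - 1)]
print(fileP)
print(fileN)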
The first line imports the randint() function from the random library. The next
two lines select a random file each from the set of positive and negative category
reviews. The last two lines just print the filenames.
5. Now that we have selected the two files, let's access them and print them on the
console sentence by sentence. We will use the same methodology that we used in
the first recipe to print a line-by-line output. Append the following lines of code:
for w in reader.words(fileP):
    print(w + ' ', end='')
    if w == '.':
        print()
for w in reader.words(fileN):
    print(w + ' ', end='')
    if w == '.':
        print()
These for loops read every file one by one and will print on the console line by
line. The output of the complete recipe should look similar to this:
['neg', 'pos']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt',
'neg/cv002_17424.txt', ...]
pos/cv182_7281.txt
neg/cv712_24217.txt
the saint was actually a little better than i expected it to be ,
in some ways .
in this theatrical remake of the television series the saint...
How it works...
The quintessential ingredient of this recipe is the CategorizedPlaintextCorpusReader
class of NLTK. Since we already know that the corpus we have downloaded is categorized,
we only need to provide the appropriate arguments when creating the reader object. The
implementation of the CategorizedPlaintextCorpusReader class internally takes care
of loading the samples in appropriate buckets ('pos' and 'neg' in this case).
Getting ready
The objective of this recipe is to get you to perform a simple counting task on any given
corpus. We will be using nltk library's FreqDist object for this purpose here, but more
elaboration on the power of FreqDist will follow in the next recipe. Here, we will just
concentrate on the application problem.
How to do it...
1. Create a new file named BrownWH.py and add the following import statements
to begin:
import nltk
from nltk.corpus import brown
2. Next up, we will check all the genres in the corpus and will pick any three
categories from them to proceed with our task:
print(brown.categories())
The brown.categories() function call will return the list of all genres in the
Brown corpus. When you run this line, you will see the following output:
['adventure', 'belles_lettres', 'editorial', 'fiction',
'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery',
'news', 'religion', 'reviews', 'romance', 'science_fiction']
3. Now let's pick three genres--fiction, humor and romance--from this list as
well as the whwords that we want to count out from the text of these three
genres:
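The two lists are not shown in this extract; judging from the output later in the recipe, they would look like this:
genres = ['fiction', 'humor', 'romance']
whwords = ['what', 'which', 'how', 'why', 'when', 'where', 'who']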
We have created a list containing the three picked genres and another list
containing the seven whwords.
4. Since we have the genres and the words we want to count in lists, we will be
extensively using the for loop to iterate over them and optimize the number of
lines of code. So first, we write a for iterator on the genres list:
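The loop itself is not shown here; a minimal sketch (the code from steps 5 and 6 is assumed to sit inside this loop) would be:
for genre in genres:
    # load the entire text of the current genre as one flat list of words
    genre_text = brown.words(categories=genre)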
This code starts iterating over the genres list and loads the
entire text of each genre into the genre_text variable as a continuous list of words.
5. Next up is a complex little statement where we will use the nltk library's
FreqDist object. For now, let's understand the syntax and the broad-level output
we will get from it:
fdist = nltk.FreqDist(genre_text)
FreqDist() accepts a list of words and returns an object that contains the map
word and its respective frequency in the input word list. Here, the fdist object
will contain the frequency of each of the unique words in the genre_text word
list.
6. I'm sure you've already guessed what our next step is going to be. We will simply
access the fdist object returned by FreqDist() and get the count of each of the
wh words. Let's do it:
for wh in whwords:
    print(wh + ':', fdist[wh], end=' ')
We are iterating over the whwords word list, accessing the fdist object with each
of the wh words as index, getting back the frequency/count of all of them, and
printing them out.
After running the complete program, you will get this output:
['adventure', 'belles_lettres', 'editorial', 'fiction',
'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery',
'news', 'religion', 'reviews', 'romance', 'science_fiction']
what: 128 which: 123 how: 54 why: 18 when: 133 where: 76 who: 103
what: 121 which: 104 how: 60 why: 34 when: 126 where: 54 who: 89
How it works...
On analyzing the output, you can clearly see that we have the word count of all seven wh
words for the three picked genres on our console. By counting the population of wh words,
you can, to a degree, gauge whether the given text is high on relative clauses or question
sentences. Similarly, you may have a populated ontology list of important words that you
want to get a word count of to understand the relevance of the given text to your ontology.
Counting word populations and analyzing distributions of counts is one of the oldest,
simplest, and most popular tricks of the trade to start any kind of textual analysis.
Getting ready
In keeping with the objective of this recipe, we will run the frequency distribution on the
personal advertising file inside nltk.corpus.webtext. Following that, we will explore the
various functionalities of the nltk.FreqDist object such as the count of distinct words,
10 most common words, maximum-frequency words, frequency distribution plot, and
tabulation.
How to do it...
1. Create a new file named webtext.py and add the following three lines to it:
import nltk
from nltk.corpus import webtext
print(webtext.fileids())
We just imported the required libraries and the webtext corpus; along with that,
we also printed the constituent file's names. Run the program and you shall see
the following output:
['firefox.txt', 'grail.txt', 'overheard.txt', 'pirates.txt',
'singles.txt', 'wine.txt']
2. Now we will select the file that contains personal advertisement data and
run frequency distribution on it. Add the following three lines for it:
fileid = 'singles.txt'
wbt_words = webtext.words(fileid)
fdist = nltk.FreqDist(wbt_words)
singles.txt contains our target data; so, we loaded the words from that file in
wbt_words and ran frequency distribution on it to get the FreqDist object
fdist.
3. Add the following lines, which will show the most commonly appearing word
(with the fdist.max() function) and the count of that word (with the
fdist[fdist.max()] operation):
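The two print statements are not shown in this extract; they would presumably be:
print('Most common word:', fdist.max())
print('Count of the most common word:', fdist[fdist.max()])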
4. The following line will show us the total number of tokens in the bag of our
frequency distribution, using the fdist.N() function (note that fdist.N() counts every
token occurrence; use fdist.B() for the number of distinct words). Add the line to your code:
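The line itself is not shown here; it would presumably be:
print('Total number of tokens in the bag:', fdist.N())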
5. Now let's find out the 10 most common words in the selected corpus bag. The
function fdist.most_common() will do this for us. Add the following two lines
in the code:
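The lines for this step (and for the tabulation mentioned in the Getting ready section, which appears to be the missing step 6) are not shown in this extract; a sketch would be:
print('The 10 most common words:')
print(fdist.most_common(10))
# assumed step 6: tabulate the 10 most common samples of the distribution
fdist.tabulate(10)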
7. Now we will plot the graph of the frequency distribution with cumulative
frequencies using the fdist.plot() function:
fdist.plot(cumulative=True)
Let's run the program and see the output; we will discuss the same in the
following section:
['firefox.txt', 'grail.txt', 'overheard.txt', 'pirates.txt',
'singles.txt', 'wine.txt']
[(',', 539), ('.', 353), ('/', 110), ('for', 99), ('and', 74),
('to',
None
How it works...
Upon analyzing the output, we realize that all of it is very intuitive. But what is peculiar is
that most of it is not making sense. The token with maximum frequency count is ,. And
when you look at the 10 most common tokens, again you can't make out much about the
target dataset. The reason is that there is no preprocessing done on the corpus. In the third
chapter, we will learn one of the most fundamental preprocessing steps called stop words
treatment and will also see the difference it makes.
Now, a bat can also mean a nocturnal mammal that flies at night. The Bat is also Batman's
preferred and most advanced transportation vehicle, according to DC Comics. Those are the
noun variants; let's consider the verb possibilities. Bat can also mean a slight wink (to bat
an eyelid), and it can mean beating someone to a pulp in a fight or a competition.
We believe that's enough of an introduction; with this, let's move on to the actual recipe.
Getting ready
Keeping the objective of the recipe in mind, we have to choose a word whose various
senses, as understood by WordNet, we can explore. And yes, NLTK comes
equipped with WordNet; you need not worry about installing any further libraries. So let's
choose another simple word, CHAIR, as our sample for the purpose of this recipe.
How to do it...
1. Create a new file named ambiguity.py and add the following lines of code to
start with:
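The opening lines are not shown in this extract; based on the explanation below, they would be:
from nltk.corpus import wordnet as wn
chair = 'chair'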
Here we imported the required NLTK corpus reader wordnet as the wn object.
We can import it just like any other corpus reader we have used so far. In
preparation for the next steps, we have created our string variable containing the
word chair.
2. Now is the most important step. Let's add two lines and I will elaborate what we
are doing:
chair_synsets = wn.synsets(chair)
print('Synsets/Senses of Chair :', chair_synsets, '\n\n')
The first line, though it looks simple, is actually the API interface that is accessing
the internal WordNet database and fetching all the senses associated with the
word chair. WordNet calls each of these senses synsets. The next line simply
asks the interpreter to print what it has fetched. Run this much and you should
get an output like:
Synsets/Senses of Chair : [Synset('chair.n.01'),
Synset('professorship.n.01'), Synset('president.n.04'),
Synset('electric_chair.n.01'), Synset('chair.n.05'),
Synset('chair.v.01'), Synset('moderate.v.01')]
As you can see, the list contains seven Synsets, which means seven different
senses of the word Chair exist in the WordNet database.
3. We will add the following for loop, which will iterate over the list of synsets
we have obtained and perform certain operations:
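The loop is not reproduced here; a sketch matching the described output (name, definition, lemmas, and example usage of each sense) would be:
for synset in chair_synsets:
    print(synset, ':')
    print('Definition:', synset.definition())
    print('Lemmas/Synonymous words:', synset.lemma_names())
    print('Example:', synset.examples(), '\n')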
We are iterating over the list of synsets and printing the definition of each sense,
associated lemmas/synonymous words, and example usage of each of the senses
in a sentence. One typical iteration will print something similar to this:
Synset('chair.v.01') :
The first line is the name of Synset, the second line is the definition of this
sense/Synset, the third line contains Lemmas associated with this Synset, and the
fourth line is an example sentence.
Synset('chair.n.01') :
Definition: a seat for one person, with a support for the back
Example: ['he put his coat over the back of the chair and sat down']
Synset('professorship.n.01') :
Synset('president.n.04') :
Lemmas: [... 'chair', 'chairperson']
Synset('electric_chair.n.01') :
Synset('chair.n.05') :
Synset('chair.v.01') :
Synset('moderate.v.01') :
(The definitions, lemmas, and example sentences of the remaining senses are truncated in this extract.)
How it works...
As you can see, definitions, Lemmas, and example sentences of all seven senses of the word
chair are seen in the output. Straightforward API interfaces are available for each of the
operations as elaborated in the preceding code sample. Now, let's talk a little bit about how
WordNet arrives at such conclusions. WordNet is a database of words that stores all
information about them in a hierarchical manner: each sense of a word is stored as a synset,
and synsets are linked upwards to more general synsets and downwards to more specific ones.
The definitions, lemmas, and example sentences we printed are all attached to these synsets
in that hierarchy.
Getting ready
For the purpose of exploring the concepts of hyponym and hypernym, we have decided to
select the synsets bed.n.01 (the first word sense of bed) and woman.n.01 (the first word sense
of woman). Now we will explain the usage and meaning of the hypernym and hyponym
APIs in the actual recipe section.
How to do it...
1. Create a new file named HypoNHypernyms.py and add the following three lines:
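The three lines are not shown in this extract; given the synsets named in the Getting ready section and the output later in the recipe, they would presumably be:
from nltk.corpus import wordnet as wn
woman = wn.synset('woman.n.01')
bed = wn.synset('bed.n.01')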
We've imported the libraries and initialized the two synsets that we will use in
later processing.
2. Add the following two lines of code:
print(woman.hypernyms())
woman_paths = woman.hypernym_paths()
It's a simple call to the hypernyms() API function on the woman Synset; it will
return the set of synsets that are direct parents of the same. However, the
hypernym_paths() function is a little tricky. It will return a list of sets. Each set
contains the path from the root node to the woman Synset. When you run these
two statements, you will see the two direct parents of the Synset woman as
follows in the console:
[Synset('adult.n.01'), Synset('female.n.02')]
Woman belongs to the adult and female categories in the hierarchical structure of
the WordNet database.
3. Now we will try to print the paths from root node to the woman.n.01 node. To
do so, add the following lines of code and nested for loop:
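The nested loop is not shown here; a sketch that matches the printed path format below would be:
for idx, path in enumerate(woman_paths):
    print('\n\nHypernym Path :', idx + 1)
    for synset in path:
        print(synset.name(), ', ', end='')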
As explained, the returned object is a list of sets ordered in such a way that it
follows the path from the root to the woman.n.01 node exactly as stored in the
WordNet hierarchy. When you run, here's an example Path:
Hypernym Path : 1
4. Now let's work with hyponyms. Add the following two lines, which will fetch the
hyponyms for the synset bed.n.01 and print them to the console:
types_of_beds = bed.hyponyms()
print('\n\nTypes of beds(Hyponyms): ', types_of_beds)
As explained, run them and you will see the following 20 synsets as output:
Types of beds(Hyponyms): [Synset('berth.n.03'), Synset('built-
in_bed.n.01'), Synset('bunk.n.03'), Synset('bunk_bed.n.01'),
Synset('cot.n.03'), Synset('couch.n.03'), Synset('deathbed.n.02'),
Synset('double_bed.n.01'), Synset('four-poster.n.01'),
Synset('hammock.n.02'), Synset('marriage_bed.n.01'),
Synset('murphy_bed.n.01'), Synset('plank-bed.n.01'),
Synset('platform_bed.n.01'), Synset('sickbed.n.01'),
Synset('single_bed.n.01'), Synset('sleigh_bed.n.01'),
Synset('trundle_bed.n.01'), Synset('twin_bed.n.01'),
Synset('water_bed.n.01')]
These are Hyponyms or more specific terms for the word sense bed.n.01 within
WordNet.
5. Now let's print the actual words or lemmas that will make more sense to humans.
Add the following line of code:
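The one-liner is not reproduced in this extract; it would presumably be something like:
print(sorted(set(lemma.name() for synset in types_of_beds for lemma in synset.lemmas())))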
This line of code is pretty similar to the nested for loop we wrote in four lines in the
hypernym example, clubbed into a single line here (in other words, we're just showing off
our Python skills). It will print the 26 lemmas, which are very meaningful and specific
words. Now let's look at the final output:
Output: [Synset('adult.n.01'), Synset('female.n.02')]
Hypernym Path : 1
entity.n.01 , physical_entity.n.01 , causal_agent.n.01 ,
person.n.01 , adult.n.01 , woman.n.01 ,
Hypernym Path : 2
entity.n.01 , physical_entity.n.01 , object.n.01 , whole.n.02 ,
living_thing.n.01 , organism.n.01 , person.n.01 , adult.n.01 ,
woman.n.01 ,
Hypernym Path : 3
entity.n.01 , physical_entity.n.01 , causal_agent.n.01 ,
person.n.01 , female.n.02 , woman.n.01 ,
Hypernym Path : 4
entity.n.01 , physical_entity.n.01 , object.n.01 , whole.n.02 ,
living_thing.n.01 , organism.n.01 , person.n.01 , female.n.02 ,
woman.n.01 ,
Synset('double_bed.n.01'), Synset('four-poster.n.01'),
Synset('hammock.n.02'), Synset('marriage_bed.n.01'),
Synset('murphy_bed.n.01'), Synset('plank-bed.n.01'),
Synset('platform_bed.n.01'), Synset('sickbed.n.01'),
Synset('single_bed.n.01'), Synset('sleigh_bed.n.01'),
Synset('trundle_bed.n.01'), Synset('twin_bed.n.01'),
Synset('water_bed.n.01')]
How it works...
As you can see, woman.n.01 has two hypernyms, namely adult and female, but it follows
four different routes in the hierarchy of WordNet database from the root node entity to
woman as shown in the output.
Similarly, the Synset bed.n.01 has 20 hyponyms; they are more specific and less
ambiguous (for nothing is unambiguous in English). Generally, hyponyms correspond
to leaf nodes, or nodes much closer to the leaves, in the hierarchy, as they are the least
ambiguous ones.
Getting ready
I have decided to write the program to compute the polysemy of any one of the POS types
of words and will leave it to you guys to modify the program to do so for the other three. I
mean we shouldn't just spoon-feed everything, right? Not to worry! I will provide enough
hints in the recipe itself to make it easier for you (for those who think it's already not very
intuitive). Let's get on with the actual recipe then; we will compute the average polysemy of
nouns alone.
How to do it...
1. Create a new file named polysemy.py and add these two initialization lines:
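The two initialization lines are not shown here; based on the description and the later call to wn.all_synsets(type), they would be:
from nltk.corpus import wordnet as wn
type = 'n'  # POS type of interest; note this name shadows Python's built-in type()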
We have initialized the POS type of words we are interested in and, of course,
imported the required libraries. To be more descriptive, n corresponds to nouns.
2. Now add the following line of code:
synsets = wn.all_synsets(type)
This API returns all synsets of type n, that is, all nouns present in the WordNet
database, with full coverage. Similarly, if you change the POS type to a verb, adverb, or
adjective, the API will return all words of the corresponding type (hint #1).
3. Now we will consolidate all lemmas in each of the synset into a single mega list
that we can process further. Add the following code to do that:
lemmas = []
for synset in synsets:
    for lemma in synset.lemmas():
        lemmas.append(lemma.name())
This piece of code is pretty intuitive; we have a nested for loop that iterates over
the list of synsets and the lemmas in each synset and adds them up in our
mega list lemmas.
4. Although we have all lemmas in the mega list, there is a problem. There are some
duplicates as it's a list. Let's remove the duplicates and take the count of distinct
lemmas:
lemmas = set(lemmas)
Converting a list into a set will automatically deduplicate (yes, it's a valid English
word, I invented it) the list.
5. Now, the second most important step in the recipe. We count the senses of each
lemma in the WordNet database:
count = 0
for lemma in lemmas:
    count = count + len(wn.synsets(lemma, type))
Most of the code is intuitive; let's focus on the API wn.synsets(lemma,
type). This API takes as input a word/lemma (as the first argument) and the POS
type it belongs to and returns all the senses (synsets) belonging to the lemma
word. Note that depending on what you provide as the POS type, it will return
senses of the word of only the given POS type (hint #2).
6. We have all the counts we need to compute the average polysemy. Let's just do it
and print it on the console:
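The print statements are not shown in this extract; matching the output below, they would presumably be:
print('Total distinct lemmas:', len(lemmas))
print('Total senses :', count)
print('Average Polysemy of', type, ':', count / len(lemmas))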
This prints the total distinct lemmas, the count of senses, and the average
polysemy of POS type n or nouns:
Output: Total distinct lemmas: 119034
Total senses : 152763
Average Polysemy of n : 1.2833560159282222
How it works...
There is nothing much to say in this section, so I will instead give you some more
information on how to go about computing the polysemy of the rest of the types. As you
saw, Noun -> 'n'. Similarly, Verbs -> 'v', Adverbs -> 'r', and Adjective -> 'a' (hint # 3).
Now, I hope I have given you enough hints to get on with writing an NLP program of your
own and not be dependent on the feed of the recipes.
Chapter 2: Raw Text, Sourcing, and Normalization
In this chapter, we will be covering the following topics:
Introduction
In the previous chapter, we looked at NLTK inbuilt corpora. The corpora are very well
organized and normalized for usage, but that will not always be the case when you work on
your industry problems. Let alone normalization and organization, we may not even get the
data we need in a uniform format. The goal of this chapter is to introduce some Python
libraries that will help you extract data from binary formats: PDF and Word DOCX files. We
will also look at libraries that can fetch data from web feeds such as RSS and a library that
will help you parse HTML and extract the raw text out of the documents. We will also learn
to extract raw text from heterogeneous sources, normalize it, and create a user-defined
corpus from it.
In this chapter, you will learn seven different recipes. As the name of the chapter suggests,
we will be learning to source data from PDF files, Word documents, and the Web. PDFs and
Word documents are binary, and over the Web, you will get data in the form of HTML. For
this reason, we will also perform normalization and raw text conversion tasks on this data.
Getting ready…
For this recipe, you will just need the Python interpreter and a text editor, nothing more.
We will see join, split, addition, and multiplication operators and indices.
How to do it…
1. Create a new Python file named StringOps1.py.
2. Define two objects:
namesList = ['Tuffy','Ali','Nysha','Tim' ]
sentence = 'My dog sleeps on sofa'
The first object, namesList, is a list of str objects containing some names, as
implied, and the second object, sentence, is an str object containing a sentence.
3. Let's join the items of namesList into a single string, with ; as the delimiter:
names = ';'.join(namesList)
print(type(names), ':', names)
The join() function can be called on any string object. It accepts a list of str
objects as an argument and concatenates all the str objects into a single str object,
with the calling string object's contents as the joining delimiter. It returns that
object. Run these two lines and your output should look like:
<class 'str'> : Tuffy;Ali;Nysha;Tim
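The snippet for the split step is not reproduced in this extract; given the explanation and output that follow, it would presumably be:
wordList = sentence.split(' ')
print((type(wordList)), ':', wordList)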
The split function called on a string will split its contents into multiple str
objects, create a list of the same, and return that list. The function accepts a single
str argument, which is used as the splitting criterion. Run the code and you will
see the following output:
<class 'list'> : ['My', 'dog', 'sleeps', 'on', 'sofa']
5. The arithmetic operators + and * can also be used with strings. Add the following
lines and see the output:
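The lines themselves are not reproduced here; a sketch that matches the output shown below (including the 'ganehsa' spelling) would be:
print('Text Additions :', 'ganehsa' + 'ganesha' + 'ganesha')
print('Text Multiplication :', 'ganesha' * 2)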
This time we will first see the output and then discuss how it works:
Text Additions: ganehsaganeshaganesha
Text Multiplication: ganeshaganesha
6. Let's look at the indices of the characters in the strings. Add the following lines of
code:
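The lines are not shown in this extract; the string and indices are inferred from the explanation and output (the second character is y, the third-last is L, and index 7 and -4 are both N), so a sketch would be:
str = 'Python NLTK'
print(str[1])
print(str[-3])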
First, we declare a new string object. Then we access the second character (y) in
the string, which just shows that it is straightforward. Now comes the tricky part;
Python allows you to use negative indexes when accessing any list object; -1
means the last member, -2 is the second last, and so on. For example, in the
preceding str object, index 7 and -4 are the same character, N:
Output: <class 'str'> : Tuffy;Ali;Nysha;Tim
<class 'list'> : ['My', 'dog', 'sleeps', 'on', 'sofa']
Text Additions : ganehsaganeshaganesha
Text Multiplication : ganeshaganesha
y
L
How it works…
We created a list of strings from a string and a string from a list of strings using
the split() and join() functions, respectively. Then we saw the use of some arithmetic
operators with strings. Please note that we can't use the "-" (subtraction) and "/" (division)
operators with strings. In the end, we saw how to access individual characters in any string,
in which peculiarly, we can use negative index numbers while accessing strings.
This recipe is pretty simple and straightforward, in that the objective was to introduce some
common and uncommon string operations that Python allows. Up next, we will continue
where we left off and do some more string operations.
How to do it…
1. Create a new Python file named StringOps2.py and define the following string
object str:
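The definition itself is not shown in this extract; inferred from the substrings printed in the later steps ('NLTK', 'Dolly', 'Python'), it would presumably be:
str = 'NLTK Dolly Python'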
2. Let's access the substring that ends at the fourth character from the str object.
As we know the index starts at zero, this will return the substring containing
characters from zero to three. When you run, the output will be:
Substring ends at: NLTK
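The print statement producing this output is not shown; it would presumably be:
print('Substring ends at:', str[:4])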
3. Now we will access the substring that starts at a certain point until the end in
object str:
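The line is not reproduced here; it would presumably be:
print('Substring starts from:', str[11:])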
This tells the interpreter to return a substring of object str from index 11 to the
end. When you run this, the following output will be visible:
Substring starts from: Python
4. Let's access the Dolly substring from the str object. Add the following line:
print('Substring :',str[5:10])
The preceding syntax returns characters from index 5 to 10, excluding the 10th
character. The output is:
Substring : Dolly
5. Now, it's time for a fancy trick. We have already seen how negative indices work
for string operations. Let's try the following and see how it works:
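The line is not shown in this extract; matching the 'Substring fancy' output later, it would presumably be:
print('Substring fancy:', str[-12:-7])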
Exactly similar to the previous step! Go ahead and do the back calculations: -1 as
the last character, -2 as the last but one, and so on and so forth. Thus, you will get the
index values.
if 'NLTK' in str:
print('found NLTK')
Run the preceding code and check the output; it will be:
found NLTK
As elaborate as it looks, the in operator simply checks whether the left-hand side
string is a substring of the right-hand side string.
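The code for the replace step is not shown here; matching the output and the explanation below, it would presumably be:
replaced = str.replace('Dolly', 'Dorothy')
print('Replaced String:', replaced)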
The replace function simply takes two arguments. The first is the substring that
needs to be replaced and the second is the new substring that will come in place of
it. It returns a new string object and doesn't modify the object it was called
upon. Run and see the following output:
Replaced String: NLTK Dorothy Python
8. Last but not least, we will iterate over the replaced object and access every
character:
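The loop is not reproduced in this extract; a sketch matching the output would be:
print('Accessing each character:')
for char in replaced:
    print(char)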
This will print each character from the replaced object on a new line. Let's see the
final output:
Output: Substring ends at: NLTK
Substring starts from: Python
Substring : Dolly
Substring fancy: Dolly
found NLTK
Replaced String: NLTK Dorothy Python
Accessing each character:
N
L
T
K
D
o
r
o
t
h
y
P
y
t
h
o
n
How it works…
A string object is nothing but a list of characters. As we saw in the first step we can access
every character from the string using the for syntax for accessing a list. The character :
inside square brackets for any list denotes that we want a piece of the list; : followed by a
number means we want the sublist starting at zero and ending at the index minus 1.
Similarly, a number followed by a : means we want a sublist from the given number to the
end.
This ends our brief journey of exploring string operations with Python. After this, we will
move on to files, online resources, HTML, and more.
Getting ready
We assume you have pip installed. Then, to install the PyPDF2 library with pip on Python
2 and 3, you only need to run the following command from the command line:
pip install pypdf2
If you successfully install the library, we are ready to go ahead. Along with that, I also
request you to download some test documents that we will be using during this chapter from
this link:
https://www.dropbox.com/sh/bk18dizhsu1p534/AABEuJw4TArUbzJf4Aa8gp5Wa?dl=0.
How to do it…
1. Create a file named pdf.py and add the following import line to it:
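The import line is not shown here; given the PdfFileReader class used later, it would presumably be:
from PyPDF2 import PdfFileReader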
2. Add this Python function in the file that is supposed to read the file and return
the full text from the PDF file:
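The function signature is not shown in this extract; based on the description of its two arguments (the second one optional) and the calls in TestPDFs.py, it would presumably be:
def getTextPDF(pdfFileName, password=''):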
This function accepts two arguments, the path to the PDF file you want to read
and the password (if any) for the PDF file. As you can see, the password
parameter is optional.
3. Now let's define the function. Add the following lines under the function:
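The two lines are not reproduced here; a sketch (these lines sit inside the getTextPDF function, and the 'rb' mode is an assumption based on the explanation of binary mode below) would be:
    pdf_file = open(pdfFileName, 'rb')
    read_pdf = PdfFileReader(pdf_file)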
The first line opens the file in binary read mode ('rb'), which allows the reader to seek
backwards through it; it is essentially the standard Python open() call, used in binary mode
because a PDF is not a plain-text file. The second line passes this opened file to
the PdfFileReader class, which will consume the PDF document.
if password != '':
    read_pdf.decrypt(password)
If a password is provided with the function call, then we will try to decrypt the
file using the same.
text = []
for i in range(read_pdf.getNumPages()):
    text.append(read_pdf.getPage(i).extractText())
We create a list of strings and append text from each page to that list of strings.
return '\n'.join(text)
We return the single string object by joining the contents of all the string objects
inside the list with a new line.
7. Create another file named TestPDFs.py in the same folder as pdf.py, and add
the following import statement:
import pdf
8. Now we'll just print out the text from a couple of documents, one password
protected and one plain:
pdfFile = 'sample-one-line.pdf'
pdfFileEncrypted = 'sample-one-line.protected.pdf'
print('PDF 1: \n',pdf.getTextPDF(pdfFile))
print('PDF 2: \n',pdf.getTextPDF(pdfFileEncrypted,'tuffy'))
Output: The first six steps of the recipe only create a Python function and no
output will be generated on the console. The seventh and eighth steps will output
the following:
PDF 1:
This is a sample PDF document I am using to demonstrate in the tutorial.
PDF 2:
This is a sample PDF document, password protected.
How it works…
PyPDF2 is a pure Python library that we use to extract content from PDFs. The library has
many more functionalities to crop pages, superimpose images for digital signatures, create
new PDF files, and much more. However, your purpose as an NLP engineer or in any text
analytics task would only be to read the contents. In step 2, it's important to open the file in
backwards seek mode since the PyPDF2 module tries to read files from the end when
loading the file contents. Also, if any PDF file is password protected and you do not decrypt
it before accessing its contents, the Python interpreter will throw a PdfReadError.
If you do not have access to Microsoft Word, you can always use open
source alternatives such as LibreOffice and OpenOffice to create and edit .docx
files.
Getting ready…
Assuming you already have pip installed on your machine, we will use pip to install a
module named python-docx. Do not confuse this with another library named docx, which
is a different module altogether. We will be importing the docx object from the python-
docx library. The following command, when fired on your command line, will install the
library:
pip install python-docx
After having successfully installed the library, we are ready to go ahead. We will be using a
test document in this recipe, and if you have already downloaded all the documents from
the link shared in the first recipe in this chapter, you should have the relevant document. If
not, then please download the sample-one-line.docx document from
https://www.dropbox.com/sh/bk18dizhsu1p534/AABEuJw4TArUbzJf4Aa8gp5Wa?dl=0.
How to do it…
1. Create a new Python file named word.py and add the following import line:
import docx
2. Add the following function definition; it takes the path of a Word file and loads it into a docx Document object:
def getTextWord(wordFileName):
    doc = docx.Document(wordFileName)
The doc object is now loaded with the Word file you want to read.
4. We will read the text from the document loaded inside the doc object. Add the
following lines for that:
fullText = []
for para in doc.paragraphs:
    fullText.append(para.text)
First, we initialized a string array, fullText. The for loop reads the text from the
document paragraph by paragraph and goes on appending to the list fullText.
5. Now we will join all the fragments/paras in a single string object and return it as
the final output of the function:
return '\n'.join(fullText)
We joined all the constituents of the fullText array with the \n delimiter and
returned the resultant object. Save the file and exit.
6. Create another file, name it TestDocX.py, and add the following import
statements:
import docx
import word
Simply import the docx library and the word.py that we wrote in the first
five steps.
7. Now we will read a DOCX document and print the full contents using the API
we wrote on word.py. Write down the following two lines:
docName = 'sample-one-line.docx'
print('Document in full :\n',word.getTextWord(docName))
Initialize the document path in the first line, and then, using the API print out the
full document. When you run this part, you should get an output that looks
something similar to:
Document in full :
This is a sample PDF document with some text in bold, some in italic, and some
underlined. We are also embedding a title shown as follows:
This is my TITLE.
This is my third paragraph.
8. Now let's look at a few paragraph-level attributes of the document. Add the following four lines:
doc = docx.Document(docName)
print('Number of paragraphs :',len(doc.paragraphs))
print('Paragraph 2:',doc.paragraphs[1].text)
print('Paragraph 2 style:',doc.paragraphs[1].style)
The second line in the previous snippet gives us the number of paragraphs in the
given document. The third line returns only the second paragraph from the
document and the fourth line will analyze the style of the second paragraph,
which is Title in this case. When you run, the output for these four lines will be:
Number of paragraphs : 3
Paragraph 2: This is my TITLE.
Paragraph 2 style: _ParagraphStyle('Title') id: 4374023248
It is quite self-explanatory.
9. Next, we will see what a run is. Add the following lines:
print('Paragraph 1:',doc.paragraphs[0].text)
print('Number of runs in paragraph 1:',len(doc.paragraphs[0].runs))
for idx, run in enumerate(doc.paragraphs[0].runs):
    print('Run %s : %s' %(idx, run.text))
Here, we first print the text of the first paragraph; next, we print the number
of runs in that paragraph, and then we print out every run.
10. And now to identify the styling of each run, write the following lines of code:
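The three checks are not shown in this extract; matching the output below, they would presumably be:
print('is Run 0 underlined:', doc.paragraphs[0].runs[0].underline)
print('is Run 2 bold:', doc.paragraphs[0].runs[2].bold)
print('is Run 7 italic:', doc.paragraphs[0].runs[7].italic)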
Each line in the previous snippet is checking for underline, bold, and italic styling
respectively. In the following section, we will see the final output:
Output: Document in full :
This is a sample PDF document with some text in BOLD, some in
ITALIC and some underlined. We are also embedding a Title down
below.
This is my TITLE.
This is my third paragraph.
Number of paragraphs : 3
Paragraph 2: This is my TITLE.
Paragraph 2 style: _ParagraphStyle('Title') id: 4374023248
Paragraph 1: This is a sample PDF document with some text in BOLD,
some in ITALIC and some underlined. We're also embedding a Title
down below.
Number of runs in paragraph 1: 8
Run 0 : This is a sample PDF document with
Run 1 : some text in BOLD
Run 2 : ,
Run 3 : some in ITALIC
Run 4 : and
Run 5 : some underlined.
Run 6 : We are also embedding a Title down below
Run 7 : .
is Run 0 underlined: True
is Run 2 bold: True
is Run 7 italic: True
How it works…
First, we wrote a function in the word.py file that will read any given DOCX file and
return to us the full contents in a string object. The preceding output is fairly self-explanatory, though two things deserve elaboration: the Paragraph and Run lines.
The structure of a .docx document is represented by three data types in the python-docx
library. At the highest level is the Document object. Inside each document, we have multiple
paragraphs.
Every time we see a new line or a carriage return, it signifies the start of a new paragraph.
Every paragraph contains multiple runs, each of which denotes a change in word styling. By
styling, we mean the possibilities of different fonts, sizes, colors, and other styling elements
such as bold, italic, underline, and so on. Each time any of these elements vary, a new run is
started.
Getting ready
In terms of getting ready, we are going to use a few files from the Dropbox folder
introduced in the first recipe of this chapter. If you've downloaded all the files from the
folder, you should be good. If not, please download the following files from https://www.
dropbox.com/sh/bk18dizhsu1p534/AABEuJw4TArUbzJf4Aa8gp5Wa?dl=0:
sample_feed.txt
sample-pdf.pdf
sample-one-line.docx
If you haven't followed the order of this chapter, you will have to go back and look at the
first two recipes in this chapter. We are going to reuse two modules we wrote in the
previous two recipes, word.py and pdf.py. This recipe is more about an application of
what we did in the first two recipes and the corpus from the first chapter than introducing a
new concept. Let's get on with the actual code.
How to do it…
1. Create a new Python file named createCorpus.py and add the following
import lines to start off:
import os
import word, pdf
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
We have imported the os library for use with file operations, the word and pdf
modules we wrote in the first two recipes of this chapter, and the
PlaintextCorpusReader, which is our final objective of this recipe.
2. Now let's write a little function that will take as input the path of a plain text file
and return the full text as a string object. Add the following lines:
def getText(txtFileName):
    file = open(txtFileName, 'r')
    return file.read()
The first line defines the function and input parameter. The second line opens the
given file in reading mode (the second parameter of the open function r denotes
read mode). The third line reads the content of the file and returns it as a string object, all at once in a single statement.
3. We will create the new corpus folder now on the disk/filesystem. Add the
following three lines:
newCorpusDir = 'mycorpus/'
if not os.path.isdir(newCorpusDir):
    os.mkdir(newCorpusDir)
The first line is a simple string object with the name of the new folder.
The second line checks whether a directory/folder of the same name already exists
on the disk. The third line instructs the os.mkdir() function to create the
directory on the disk with the specified name. As the outcome, a new directory
with the name mycorpus would be created in the working directory where your
Python file is placed.
4. Now we will read the three files one by one. Starting with the plain text file, add
the following line:
txt1 = getText('sample_feed.txt')
5. Now we will read the PDF file. Add the following line:
txt2 = pdf.getTextPDF('sample-pdf.pdf')
6. Finally, we will read the DOCX file by adding the following line:
txt3 = word.getTextWord('sample-one-line.docx')
7. The next step is to write the contents of these three string objects on the disk, in
files. Write the following lines of code for that:
files = [txt1,txt2,txt3]
for idx, f in enumerate(files):
    with open(newCorpusDir+str(idx)+'.txt', 'w') as fout:
        fout.write(f)
First line: Creates an array from the string objects so as to use it in the upcoming
for loop
Second line: A for loop with index on the files array
Third line: This opens a new file in write mode (the w option in the open function
call)
Fourth line: Writes the contents of the string object in the file
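8. Now we will create a new corpus out of the mycorpus directory we just populated. The following is a minimal sketch, assuming the PlaintextCorpusReader is simply pointed at the new directory and told to pick up every file in it:
# '.*' tells the reader to treat every file in the directory as part of the corpus
newCorpus = PlaintextCorpusReader(newCorpusDir, '.*')
9. Finally, add the following three lines to check the contents of the new corpus: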
print(newCorpus.words())
print(newCorpus.sents(newCorpus.fileids()[1]))
print(newCorpus.paras(newCorpus.fileids()[0]))
The first line will print the array containing all the words in the corpus (curtailed).
The second line will print the sentences in file 1.txt. The third line will print the
paragraphs in file 0.txt:
Output: ['Five', 'months', '.', 'That', "'", 's', 'how', ...]
[['A', 'generic', 'NLP'], ['(', 'Natural', 'Language',
'Processing', ')', 'toolset'], ...]
[[['Five', 'months', '.']], [['That', "'", 's', 'how', 'long',
'it', "'", 's', 'been', 'since', 'Mass', 'Effect', ':',
'Andromeda', 'launched', ',', 'and', 'that', "'", 's', 'how',
'long', 'it', 'took', 'BioWare', 'Montreal', 'to', 'admit', 'that',
'nothing', 'more', 'can', 'be', 'done', 'with', 'the', 'ailing',
'game', "'", 's', 'story', 'mode', '.'], ['Technically', ',', 'it',
'wasn', "'", 't', 'even', 'a', 'full', 'five', 'months', ',', 'as',
'Andromeda', 'launched', 'on', 'March', '21', '.']], ...]
How it works…
The output is fairly straightforward and as explained in the last step of the recipe. What is worth noting is the characteristics of each of the objects on show. The first line is the list of all words in the new corpus; it doesn't have anything to do with higher-level structures such as sentences, paragraphs, or files. The second line is the list of all sentences in the file 1.txt, where each sentence is a list of the words inside it. The third line
is a list of paragraphs, of which each paragraph object is in turn a list of sentences, of which
each sentence is in turn a list of words in that sentence, all from the file 0.txt. As you can
see, a lot of structure is maintained in paragraphs and sentences.
Getting ready
The objective of this recipe is to read such an RSS feed and access content of one of the posts
from that feed. For this purpose, we will be using the RSS feed of Mashable. Mashable is a
digital media website, in short, a tech and social media blog listing. The URL of the website's RSS feed is http://feeds.mashable.com/Mashable.
Also, we need the feedparser library to be able to read an RSS feed. To install this library
on your computer, simply open the terminal and run the following command:
pip install feedparser
Armed with this module and the useful information, we can begin to write our first RSS
feed reader in Python.
How to do it…
1. Create a new file named rssReader.py and add the following import:
import feedparser
2. Now we will load the Mashable feed into our memory. Add the following line:
myFeed = feedparser.parse("http://feeds.mashable.com/Mashable")
The myFeed object contains the first page of the RSS feed of Mashable. The feed
will be downloaded and parsed to fill all the appropriate fields by the
feedparser. Each post will be part of the entries list inside the myFeed object.
3. Let's check the title and count the number of posts in the current feed:
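A sketch of those two lines, with the print labels assumed from the output shown below:
print('Feed Title :', myFeed['feed']['title'])   # title of the parsed feed
print('Number of posts :', len(myFeed.entries))  # entries is the list of posts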
In the first line, we are fetching the feed title from the myFeed object, and in the
second line, we are counting the length of the entries object inside the myFeed
object. The entries object is nothing but a list of all the posts from the parsed
feed as mentioned previously. When you run, the output is something similar to:
Feed Title: Mashable
Number of posts : 30
Title will always be Mashable, and at the time of writing this chapter, the
Mashable folks were putting a maximum of 30 posts in the feed at a time.
4. Now we will fetch the very first post from the entries list and print its title on
the console:
post = myFeed.entries[0]
print('Post Title :',post.title)
In the first line, we are physically accessing the zeroth element in the entries list
and loading it in the post object. The second line prints the title of that post. Upon
running, you should get an output similar to the following:
Post Title: The moon literally blocked the sun on Twitter
I say similar and not exactly the same because the feed keeps updating itself.
5. Now we will access the raw HTML content of the post and print it on the console:
content = post.content[0].value
print('Raw content :\n',content)
First, we access the content object from the post, then its actual value, and finally we print it on the console:
Output: Feed Title: Mashable
Number of posts : 30
Post Title: The moon literally blocked the sun on Twitter
Raw content :
<img alt=""
src="https://i.amz.mshcdn.com/DzkxxIQCjyFHGoIBJoRGoYU3Y8o=/575x323/
filters:quality(90)/https%3A%2F%2F2.gy-118.workers.dev/%3A443%2Fhttps%2Fblueprint-api-
production.s3.amazonaws.com%2Fuploads%2Fcard%2Fimage%2F569570%2F0ca
3e1bf-a4a2-4af4-85f0-1bbc8587014a.jpg" /><div style="float: right;
width: 50px;"><a
href="http://twitter.com/share?via=Mashable&text=The+moon+literally
+blocked+the+sun+on+Twitter&url=https%3A%2F%2F2.gy-118.workers.dev/%3A443%2Fhttp%2Fmashable.com%2F2017%2F
08%2F21%2Fmoon-blocks-sun-eclipse-2017-
twitter%2F%3Futm_campaign%3DMash-Prod-RSS-Feedburner-All-
Partial%26utm_cid%3DMash-Prod-RSS-Feedburner-All-Partial"
style="margin: 10px;">
<p>The national space agency threw shade the best way it knows how:
by blocking the sun. Yep, you read that right. </p>
<div><div><blockquote>
<p>HA HA HA I've blocked the Sun! Make way for the Moon<a
href="https://twitter.com/hashtag/SolarEclipse2017?src=hash">#Solar
Eclipse2017</a> <a
href="https://t.co/nZCoqBlSTe">pic.twitter.com/nZCoqBlSTe</a></p>
<p>— NASA Moon (@NASAMoon) <a
href="https://twitter.com/NASAMoon/status/899681358737539073">Augus
t 21, 2017</a></p>
</blockquote></div></div>
How it works…
Most of the RSS feeds you will get on the Internet will follow a chronological order, with the
latest post on top. Hence, the post we accessed in the recipe will always be the most recent post the feed is offering. The feed itself is ever-changing, so every time you run the program, the format of the output will remain the same, but the content of the post on the
console may differ depending upon how fast the feed updates. Also, here we are directly
displaying the raw HTML on the console and not the clean content. Up next, we are going
to look at parsing HTML and getting only the information we need from a page. Again, a
further addendum to this recipe could be to read any feed of your choice, store all the posts
from the feed on disk, and create a plain text corpus using it. Needless to say, you can take
inspiration from the previous and the next recipes.
Getting ready
The package BeautifulSoup4 will work for Python 2 and Python 3. We will have to
download and install this package on our interpreter before we can start using it. In tune
with what we have been doing throughout, we will use the pip install utility for it. Run the
following command from the command line:
pip install beautifulsoup4
Along with this module, you will also need the sample-html.html file from the chapter's
Dropbox location. In case you haven't downloaded the files already, here's the link again:
https://www.dropbox.com/sh/bk18dizhsu1p534/AABEuJw4TArUbzJf4Aa8gp5Wa?dl=0
How to do it…
1. Assuming you have already installed the required package, start with the
following import statement:
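The only import needed here is the BeautifulSoup class from the bs4 module, as described next:
from bs4 import BeautifulSoup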
We have imported the BeautifulSoup class from the module bs4, which we will
be using to parse the HTML.
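2. Next, load the sample HTML file and create the soup object. A minimal sketch, assuming the file sits in the working directory, consistent with the description that follows:
# read the raw HTML into a string and hand it to the html.parser backend
html_doc = open('sample-html.html', 'r').read()
soup = BeautifulSoup(html_doc, 'html.parser')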
In the first line, we load the sample-html.html file's content into the str object
html_doc. Next we create a BeautifulSoup object, passing to it the contents of
our HTML file as the first argument and html.parser as the second argument.
We instruct it to parse the document using the html parser. This will load the
document into the soup object, parsed and ready to use.
3. The first, simplest, and most useful task on this soup object will be to strip all the
HTML tags and get the text content. Add the following lines of code:
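A sketch of that call, with the print label assumed from the output below:
print('Full text HTML Stripped:\n', soup.get_text())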
The get_text() method called on the soup object will fetch us the HTML
stripped content of the file. If you run the code written so far, you will get this
output:
Full text HTML Stripped:
Sample Web Page
Main heading
This is a very simple HTML document
Improve your image by including an image.
Add a link to your favorite Web site.
This is a new sentence without a paragraph break, in bold italics.
This is purely the contents of our sample HTML document without any
of the HTML tags.
4. Sometimes, it's not enough to have pure HTML stripped content. You may also
need specific tag contents. Let's access one of the tags:
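A sketch of that access, with the print label assumed from the output below:
print('Accessing the <title> tag :', soup.title)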
The soup.title will return the first title tag it encounters in the file. Output of
these lines will look like:
Accessing the <title> tag : <title>Sample Web Page</title>
5. Let us get only the HTML stripped text from a tag now. We will grab the text of
the <h1> tag with the following piece of code:
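A sketch of that line, with the print label assumed from the output below:
print('Accessing the text of <H1> tag :', soup.h1.string)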
The command soup.h1.string will return the text surrounded by the first <h1>
tag encountered. The output of this line will be:
Accessing the text of <H1> tag : Main heading
6. Now we will access attributes of a tag. In this case, we will access the alt
attribute of the img tag; add the following lines of code:
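A sketch of that access; attributes of a tag are read with a dictionary-style lookup, and the print label is assumed from the output below:
print('Accessing property of <img> tag :', soup.img['alt'])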
Look carefully; the syntax to access attributes of a tag is different than accessing
the text. When you run this piece of code, you will get this output:
Accessing property of <img> tag : A Great HTML Resource
7. Finally, there can be multiple occurrences of any type of tag in an HTML file.
Simply using the . syntax will only fetch you the first instance. To fetch all
instances, we use the find_all() functionality, shown as follows:
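A minimal sketch; the tag names passed to find_all() here are assumptions chosen to reproduce the text shown below:
# find_all() returns every matching tag; here we look for headings and paragraphs
for tag in soup.find_all(['h1', 'p']):
    print(tag.string)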
Main heading
This is a very simple HTML document
Improve your image by including an image.
How it works…
BeautifulSoup 4 is a very handy library used to parse any HTML and XML content. It
supports Python's inbuilt HTML parser, but you can also use other third-party parsers with
it, for example, the lxml parser and the pure-Python html5lib parser. In this recipe, we
used the Python inbuilt HTML parser. The output generated is pretty much self-
explanatory, and of course, the assumption is that you do know what HTML is and how to
write simple HTML.
Pre-Processing
3
In this chapter, we will cover the following recipes:
Introduction
In the previous chapter, we learned to read, normalize, and organize raw data coming from
heterogeneous forms and formats into uniformity. In this chapter, we will go a step forward
and prepare the data for consumption in our NLP applications. Preprocessing is the most
important step in any kind of data processing task, or else we fall prey to the age old
computer science cliché of garbage in, garbage out. The aim of this chapter is to introduce
some of the critical preprocessing steps such as tokenization, stemming, lemmatization, and
so on.
In this chapter, we will be seeing six different recipes. We will build up the chapter by
performing each preprocessing task in individual recipes—tokenization, stemming,
lemmatization, stopwords treatment, and edit distance—in that order. In the last recipe, we
will look at an example of how we can combine some of these preprocessing techniques to
find common vocabulary between two free-form texts.
Getting ready
Let's first see what a token is. When you receive a document or a long string that you want
to process or work on, the first thing you'd want to do is break it into words and
punctuation marks. This is what we call the process of tokenization. We will see what types
of tokenizers are available with NLTK and implement them as well.
How to do it…
1. Create a file named tokenizer.py and add the following import lines to it:
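A sketch of those imports, covering the three tokenizer classes and the word_tokenize() method used in this recipe:
from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import word_tokenize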
Import the four different types of tokenizers that we are going to explore in this
recipe.
lTokenizer = LineTokenizer();
print("Line tokenizer output :",lTokenizer.tokenize("My name is
Maximus Decimus Meridius, commander of the Armies of the North,
General of the Felix Legions and loyal servant to the true emperor,
Marcus Aurelius. \nFather to a murdered son, husband to a murdered
wife. \nAnd I will have my vengeance, in this life or the next."))
3. As the name implies, this tokenizer is supposed to divide the input string into
lines (not sentences, mind you!). Let's see the output and what the tokenizer does:
As you can see, it has returned a list of three strings, meaning the given input has
been divided into three lines on the basis of where the newlines are.
LineTokenizer simply divides the given input string into new lines.
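4. Next is SpaceTokenizer, which, as the name suggests, splits the input on the space character. A sketch follows; the rawText sample string is an assumption, not necessarily the exact string used in the original listing:
rawText = "By 11 o'clock on Sunday, the doctor shall open the dispensary."  # assumed sample text
sTokenizer = SpaceTokenizer()
print("Space Tokenizer output :", sTokenizer.tokenize(rawText))
5. The tokenize() method returns the list of tokens obtained by splitting rawText wherever a space occurs.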
6. As expected, the input rawText is split on the space character ' '. On to the next
one! It's the word_tokenize() method. Add the following line:
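A sketch of that line, reusing the rawText string assumed in step 4:
print("Word Tokenizer output :", word_tokenize(rawText))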
7. See the difference here. The other two we have seen so far are classes, whereas
this is a method of the nltk module. This is the method that we will be using
most of the time going forward as it does exactly what we've defined to be
tokenization. It breaks up words and punctuation marks. Let's see the output:
9. Now, on to the last one. There's a special TweetTokenizer that we can use
when dealing with special case strings:
tTokenizer = TweetTokenizer()
print("Tweet Tokenizer output :",tTokenizer.tokenize("This is a
cooool #dummysmiley: :-) :-P <3"))
10. Tweets contain special words, special characters, hashtags, and smileys that we
want to keep intact. Let's see the output of these two lines:
As we see, the Tokenizer kept the hashtag word intact and didn't break it; the
smileys are also kept intact and are not lost. This is one special little class that can
be used when the application demands it.
11. Here's the output of the program in full. We have already seen it in detail, so I
will not be going into it again:
How it works…
We saw three tokenizer classes and a method implemented to do the job in the NLTK
module. It's not very difficult to understand how to do it, but it is worth knowing why we
do it. The smallest unit to process in a language processing task is a token. It is very much like
a divide-and-conquer strategy, where we try to make sense of the smallest units at a
granular level and add them up to understand the semantics of the sentence, paragraph,
document, and the corpus (if any) by moving up the level of detail.
Getting ready
So what is a stem supposed to be? A stem is the base form of a word without any suffixes.
And a stemmer is what removes the suffixes and returns the stem of the word. Let's see
what types of stemmers are available with NLTK.
How to do it…
1. Create a file named stemmers.py and add the following import lines to it:
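A sketch of those imports: the tokenizer plus the two stemmers used in this recipe:
from nltk import word_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer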
Import the tokenizer and the two stemmers that we are going to explore in this recipe.
2. Before we apply any stems, we need to tokenize the input text. Let's quickly get
that done with the following code:
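A sketch of that step; the input text is assumed to be the same quote used in the tokenization recipe:
# assumed input string; any sentence-length raw text will do
raw = "My name is Maximus Decimus Meridius, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will have my vengeance, in this life or the next."
tokens = word_tokenize(raw)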
The token list contains all the tokens generated from the raw input string.
porter = PorterStemmer()
pStems = [porter.stem(t) for t in tokens]
print(pStems)
4. First, we initialize the stemmer object. Then we apply the stemmer to all tokens
of the input text, and finally we print the output. Let's see the output and we
will know more:
As you can see in the output, all the words have been rid of the trailing 's', 'es',
'e', 'ed', 'al', and so on.
5. The next one is LancasterStemmer. This one is supposed to be even more error-prone, as it considers many more suffixes for removal than Porter:
lancaster = LancasterStemmer()
lStems = [lancaster.stem(t) for t in tokens]
print(lStems)
6. The same drill! Just that this time we have LancasterStemmer instead of
PorterStemmer. Let's see the output:
We shall discuss the difference in the output section, but we can make out that the suffixes that are dropped are bigger than with Porter: 'us', 'e', 'th', 'eral', 'ered', and many more!
7. Here's the output of the program in full. We will compare the output of both the
stemmers:
As we compare the output of both the stemmers, we see that lancaster is clearly the
greedier one when dropping suffixes. It tries to remove as many characters from the end as
possible, whereas porter is non-greedy and removes as little as possible.
How it works…
For some language processing tasks, we ignore the form available in the input text and
work with the stems instead. For example, when you search on the Internet for cameras, the
result includes documents containing the word camera as well as cameras, and vice versa. In
hindsight though, both words are the same; the stem is camera.
Having said this, we can clearly see that this method is quite error prone, as the spellings
are quite meddled with after a stemmer is done reducing the words. At times, it might be
okay, but if you really want to understand the semantics, there is a lot of data loss here. For
this reason, we shall next see what is called lemmatization.
Getting ready
A lemma is a lexicon headword or, more simply, the base form of a word. We have already
seen what a stem is, but a lemma is a dictionary-matched base form unlike the stem
obtained by removing/replacing the suffixes. Since it is a dictionary match, lemmatization is
a slower process than stemming.
How to do it…
1. Create a file named lemmatizer.py and add the following import lines to it:
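A sketch of those imports: the tokenizer, the PorterStemmer for comparison, and the WordNetLemmatizer itself:
from nltk import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer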
We will need to tokenize the sentences first, and we shall use the PorterStemmer
to compare the output.
2. Before we apply any stems, we need to tokenize the input text. Let's quickly get
that done with the following code:
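A sketch of that step, assuming the same input quote as in the previous recipes:
# assumed input string; reuse any raw text of your choice
raw = "My name is Maximus Decimus Meridius, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will have my vengeance, in this life or the next."
tokens = word_tokenize(raw)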
The token list contains all the tokens generated from the raw input string.
porter = PorterStemmer()
stems = [porter.stem(t) for t in tokens]
print(stems)
First, we initialize the stemmer object. Then we apply the stemmer on all tokens
of the input text, and finally we print the output. We shall check the output at the
end of the recipe.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)
5. Run, and the output of these three lines will be like this:
As you see, it understands that for nouns it doesn't have to remove the trailing
's'. But for non-nouns, for example, legions and armies, it removes suffixes and
also replaces them. However, what it’s essentially doing is a dictionary match. We
shall discuss the difference in the output section.
6. Here's the output of the program in full. We will compare the output of both the
stemmers:
As we compare the output of the stemmer and the lemmatizer, we see that the stemmer
makes a lot of mistakes and the lemmatizer makes very few mistakes. However, it doesn't
do anything with the word 'murdered', and that is an error. Yet, as an end product,
lemmatizer does a far better job of getting us the base form than the stemmer.
How it works…
WordNetLemmatizer removes affixes only if it can find the resulting word in the
dictionary. This makes the process of lemmatization slower than stemming. Also, it
understands and treats capitalized words as special words; it doesn’t do any processing for
them and returns them as is. To work around this, you may want to convert your input
string to lowercase and then run lemmatization on it.
All said and done, lemmatization is still not perfect and will make mistakes. Check the
input string and the result of this recipe; it couldn't convert 'murdered' to 'murder'.
Similarly, it will handle the word 'women' correctly but can't handle 'men'.
In accordance with the objectives, we will use this corpus to elaborate the usage of
Frequency Distribution of the NLTK module in Python within the context of stopwords. To
give a small synopsis, a stopword is a word that, though it has significant syntactic value in
sentence formation, carries very negligible or minimal semantic value. When you are not
working with the syntax but with a bag-of-words kind of approach (for example, TF/IDF), it
makes sense to get rid of stopwords except the ones that you are specifically interested in.
Getting ready
The nltk.corpus.stopwords is also a corpus as part of the NLTK Data module that we
will use in this recipe, along with nltk.corpus.gutenberg.
How to do it...
1. Create a new file named Gutenberg.py and add the following three lines of
code to it:
import nltk
from nltk.corpus import gutenberg
print(gutenberg.fileids())
2. Here we are importing the required libraries and the Gutenberg corpus in the
first two lines. The second line is used to check if the corpus was loaded
successfully. Run the file on the Python interpreter and you should get an output
that looks similar to:
As you can see, the names of all 18 Gutenberg texts are printed on the console.
3. Add the following two lines of code, where we are doing a little preprocessing
step on the list of all words from the corpus:
gb_words = gutenberg.words('bible-kjv.txt')
words_filtered = [e for e in gb_words if len(e) >= 3]
The first line simply copies the list of all words in the corpus from the sample
bible-kjv.txt into the gb_words variable. The second, and more interesting, step is
where we are iterating over the entire list of words from Gutenberg, discarding all
the words/tokens whose length is two characters or less.
stopwords = nltk.corpus.stopwords.words('english')
words = [w for w in words_filtered if w.lower() not in stopwords]
The first line simply loads words from the stopwords corpus into the stopwords
variable for the English language. The second line is where we are filtering out
all stopwords from the filtered word list we had developed in the previous
example.
5. Now we will simply apply nltk.FreqDist to the list of preprocessed words and
the plain list of words. Add these lines to do the same:
fdistPlain = nltk.FreqDist(words)
fdist = nltk.FreqDist(gb_words)
Create the FreqDist object by passing as argument the words list that we
formulated in steps 2 and 3.
6. Now we want to see some of the characteristics of the frequency distribution that
we just made. Add the following four lines in the code and we will see what each
does:
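A sketch of those four lines, with the print labels assumed from the output shown below (remember from step 5 that fdist was built on the full gb_words list and fdistPlain on the filtered list):
print('Following are the most common 10 words in the bag')
print(fdist.most_common(10))
print('Following are the most common 10 words in the bag minus the stopwords')
print(fdistPlain.most_common(10))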
The most_common(10) function will return the 10 most common words in the word bag being processed by frequency distribution. What it
outputs is what we will discuss and elaborate now.
7. After you run this program, you should get something similar to the following:
Following are the most common 10 words in the bag minus the
stopwords
How it works...
If you look carefully at the output, the most common 10 words in the unprocessed or plain
list of words won't make much sense. Whereas from the preprocessed bag of words, the
most common 10 words such as god, lord, and man give us a quick understanding that we
are dealing with a text related to faith or religion.
The foremost objective of this recipe is to introduce you to the concept of stopwords
treatment for text preprocessing techniques that you would most likely have to do before
running any complex analysis on your data. The NLTK stopwords corpus contains stop-
words for 11 languages. When you are trying to analyze the importance of keywords in any
text analytics application, treating the stopwords properly will take you a long way.
Frequency distribution will help you get the importance of words. Statistically speaking,
this distribution would ideally look like a bell curve if you plot it on a two-dimensional
plane of frequency and importance of words.
Getting ready
You may want to look up a little more on the Levenshtein distance part for mathematical
equations. We will look at the algorithm implementation in Python and why we do it, but it
may not be feasible to cover the complete mathematics behind it. Here’s a link on
Wikipedia: https://en.wikipedia.org/wiki/Levenshtein_distance.
How to do it…
1. Create a file named edit_distance_calculator.py and add the following
import lines to it:
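The single import needed, as described next:
from nltk.metrics.distance import edit_distance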
We just imported the inbuilt nltk library's edit_distance function from the
nltk.metrics.distance module.
2. Let's define our method to accept two strings and calculate the edit distance
between the two. str1 and str2 are two strings that the function accepts, and
we will return an integer distance value:
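A sketch of the signature; the name my_edit_distance is a placeholder used throughout this recipe:
def my_edit_distance(str1, str2):  # placeholder name for our own implementation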
3. The next step is to get the length of the two input strings. We will be using the
length to create an m x n table where m and n are the lengths of the two strings s1
and s2 respectively:
    m = len(str1)+1
    n = len(str2)+1
4. Now we will create table and initialize the first row and first column:
    table = {}
    for i in range(m): table[i,0]=i
    for j in range(n): table[0,j]=j
5. This will initialize the two-dimensional array and the contents will look like the
following table in memory:
Please note that this is inside a function and I'm using the example strings we are
going to pass to the function to elaborate the algorithm.
6. Now comes the tricky part. We are going to fill up the matrix using the formula:
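A sketch of the nested loop that performs this fill, assuming it sits inside the function defined earlier:
    for i in range(1, m):
        for j in range(1, n):
            # cost is 0 when the two characters match, 1 otherwise
            cost = 0 if str1[i-1] == str2[j-1] else 1
            table[i,j] = min(table[i,j-1] + 1,       # insertion
                             table[i-1,j] + 1,       # deletion
                             table[i-1,j-1] + cost)  # substitution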
The cost is calculated based on whether the two characters in contention are the same or not. Each cell of the matrix is then filled with the minimum of three values: the cell to the left plus one (an insertion), the cell above plus one (a deletion), and the diagonal cell plus the cost (a substitution). In other words, the first two options take care of insertion and deletion, and the third one takes care of substitution.
7. At the end, we return the value of the bottom-right cell of the table (which is where i and j point after the loops finish) as the final edit distance value:
    return table[i,j]
8. Now we will call our function and the nltk library's edit_distance()
function on two strings and check the output:
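A sketch of those calls, using the placeholder function name from step 2:
print("Our Algorithm :", my_edit_distance("hand", "and"))
print("NLTK Algorithm :", edit_distance("hand", "and"))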
9. Our words are hand and and. Only a single delete operation on the first string or
a single insertion operation on the second string will give us a match. Hence, the
expected Levenshtein score is 1.
10. Here's the output of the program:
Our Algorithm : 1
NLTK Algorithm : 1
As expected, the NLTK edit_distance() returns 1 and so does our algorithm. It is fair to say that our algorithm is doing what is expected, but I would urge you to test it further by running it against some more examples.
How it works…
I've already given you a brief on the algorithm; now let’s see how the matrix table gets
populated with the algorithm. See the attached table here:
You've already seen how we initialized the matrix. Then we filled up the matrix using the
formula in the algorithm. The yellow trail you see is the significant numbers. After the first
iteration, you can see that the distance is moving in the direction of 1 consistently and the
final value that we return is denoted by the green background cell.
Now, the applications of the edit distance algorithm are multifold. First and foremost, it is
used in spell checkers and auto-suggestions in text editors, search engines, and many such
text-based applications. Since the cost of comparisons is equivalent to the product of the
length of the strings to be compared, it is sometimes impractical to apply it to compare large
texts.
Getting ready
We will be removing all special characters, splitting words, doing case folds, and some set
and list operations in this recipe. We won’t be using any special libraries, just Python
programming tricks.
How to do it…
1. Create a file named lemmatizer.py and create a couple of long strings with
short stories or any news articles:
story1 = """In a far away kingdom, there was a river. This river
was home to many golden swans. The swans spent most of their time
on the banks of the river. Every six months, the swans would leave
a golden feather as a fee for using the lake. The soldiers of the
kingdom would collect the feathers and deposit them in the royal
treasury.
One day, a homeless bird saw the river. "The water in this river
seems so cool and soothing. I will make my home here," thought the
bird.
As soon as the bird settled down near the river, the golden swans
noticed her. They came shouting. "This river belongs to us. We pay
a golden feather to the King to use this river. You can not live
here."
"I am homeless, brothers. I too will pay the rent. Please give me
shelter," the bird pleaded. "How will you pay the rent? You do not
have golden feathers," said the swans laughing. They further added,
"Stop dreaming and leave once." The humble bird pleaded many times.
But the arrogant swans drove the bird away.
"I will teach them a lesson!" decided the humiliated bird.
She went to the King and said, "O King! The swans in your river are
impolite and unkind. I begged for shelter but they said that they
had purchased the river with golden feathers."
The King was angry with the arrogant swans for having insulted the
homeless bird. He ordered his soldiers to bring the arrogant swans
to his court. In no time, all the golden swans were brought to the
King’s court.
"Do you think the royal treasury depends upon your golden feathers?
You can not decide who lives by the river. Leave the river at once
or you all will be beheaded!" shouted the King.
The swans shivered with fear on hearing the King. They flew away
never to return. The bird built her home near the river and lived
there happily forever. The bird gave shelter to all other birds in
the river. """
story2 = """Long time ago, there lived a King. He was lazy and
liked all the comforts of life. He never carried out his duties as
a King. “Our King does not take care of our needs. He also ignores
the affairs of his kingdom." The people complained.
One day, the King went into the forest to hunt. After having
wandered for quite sometime, he became thirsty. To his relief, he
spotted a lake. As he was drinking water, he suddenly saw a golden
swan come out of the lake and perch on a stone. “Oh! A golden swan.
I must capture it," thought the King.
But as soon as he held his bow up, the swan disappeared. And the
King heard a voice, “I am the Golden Swan. If you want to capture
me, you must come to heaven."
Surprised, the King said, “Please show me the way to heaven." “Do
good deeds, serve your people and the messenger from heaven would
come to fetch you to heaven," replied the voice.
The selfish King, eager to capture the Swan, tried doing some good
deeds in his Kingdom. “Now, I suppose a messenger will come to take
me to heaven," he thought. But, no messenger came.
The King then disguised himself and went out into the street. There
he tried helping an old man. But the old man became angry and said,
“You need not try to help. I am in this miserable state because of
out selfish King. He has done nothing for his people."
Suddenly, the King heard the golden swan’s voice, “Do good deeds and
you will come to heaven." It dawned on the King that by doing
selfish acts, he will not go to heaven.
He realized that his people needed him and carrying out his duties
was the only way to heaven. After that day he became a responsible
King.
"""
2. First, we will remove some of the special characters from the texts. We are
removing all newlines ('\n'), commas, full stops, exclamations, question marks,
and so on. At the end, we convert the entire string to lowercase with
the casefold() function:
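A minimal sketch of this cleanup, assuming a small, fixed set of characters to strip:
# the exact set of characters removed here is an assumption
for ch in '\n,.!?"':
    story1 = story1.replace(ch, '')
    story2 = story2.replace(ch, '')
story1 = story1.casefold()
story2 = story2.casefold()
3. Next, split the cleaned strings into words, as described in the following step:
story1_words = story1.split(' ')
print("First Story words :", story1_words)
story2_words = story2.split(' ')
print("Second Story words :", story2_words)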
4. Using split on the ' ' (space) character, we split and get the list of words from story1
and story2. Let's see the output after this step:
As you can see, all the special characters are gone and a list of words is created.
5. Now let's create a vocabulary out of this list of words. A vocabulary is a set of
words. No repeats!
story1_vocab = set(story1_words)
print("First Story vocabulary :",story1_vocab)
story2_vocab = set(story2_words)
6. Calling the Python internal set() function on the list will deduplicate the list
and convert it into a set:
Here are the deduplicated sets, the vocabularies of both the stories.
7. Now, the final step. Produce the common vocabulary between these two stories:
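A sketch of that step; the variable and label names are assumptions:
common_vocab = story1_vocab & story2_vocab
print("Common Vocabulary :", common_vocab)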
8. Python allows the set operation & (AND), which we are using to find the set of
common entries between these two vocabulary sets. Let's see the output of the
final step:
How it works…
So here, we saw how we can go from a couple of narratives to the common vocabulary
between them. We didn’t use any fancy libraries, nor did we perform any complex
operations. Yet we built a base from which we can take this bag-of-words forward and do
many things with it.
From here on, we can think of many different applications, such as text similarity, search
engine tagging, text summarization, and many more.
Regular Expressions
4
In this chapter, we will cover the following recipes:
Introduction
In the previous chapter, we saw what preprocessing tasks you would want to perform on
your raw data. This chapter, immediately after, provides an excellent opportunity to
introduce regular expressions. Regular expressions are one of the simplest and most basic, yet
most important and powerful, tools that you will learn. More commonly known as regex,
they are used to match patterns in text. We will learn exactly how powerful this is in this
chapter.
We do not claim that you will be an expert in writing regular expressions after this chapter
and that is perhaps not the goal of this book or this chapter. The aim of this chapter is to
introduce you to the concept of pattern matching as a way to do text analysis and for this,
there is no better tool to start with than regex. By the time you finish the recipes, you shall
feel fairly confident of performing any text match, text split, text search, or text extraction
operation.
Getting ready
The regular expressions library is a part of the Python package and no additional packages
need to be installed.
How to do it…
1. Create a file named regex1.py and add the following import line to it:
import re
2. Add the following Python function in the file that is supposed to apply the given
patterns for matching:
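The function's signature is as follows; its body is filled in over the next step:
def text_match(text, patterns):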
This function accepts two arguments: text is the input text on which the pattern will be applied, and patterns is the regular expression to match.
3. Now, let's define the function. Add the following lines under the function:
if re.search(patterns, text):
return 'Found a match!'
else:
return('Not matched!')
The re.search() method applies the given pattern to the text object and returns a match object if the pattern is found anywhere in the text, and None otherwise; our function turns that result into one of the two messages. That is the end of our function.
4. Let's apply the wild card patterns one by one. We start with zero or one:
print(text_match("ac", "ab?"))
print(text_match("abc", "ab?"))
print(text_match("abbc", "ab?"))
5. Let's look at this pattern ab?. What this means is a followed by zero or one b.
Let's see what the output will be when we execute these three lines:
Found a match!
Found a match!
Found a match!
Now, all of them found a match. These patterns are trying to match a part of the
input and not the entire input; hence, they find a match with all three inputs.
6. On to the next one, zero or more! Add the following three lines:
print(text_match("ac", "ab*"))
print(text_match("abc", "ab*"))
print(text_match("abbc", "ab*"))
7. The same set of inputs but a different string. The pattern says, a followed by zero
or more b. Let's see the output of these three lines:
Found a match!
Found a match!
Found a match!
As you can see, all the texts find a match. As a rule of thumb, whatever matches the zero-or-one wildcard will also match zero or more; the ? wildcard is a subset of *.
8. Now, the one or more wild card. Add the following lines:
print(text_match("ac", "ab+"))
print(text_match("abc", "ab+"))
print(text_match("abbc", "ab+"))
9. The same input! Just that the pattern contains the + one or more wild card. Let's
see the output:
Not matched!
Found a match!
Found a match!
As you can see, the first input string couldn't find the match. The rest did as
expected.
10. Now, being more specific in the number of repetitions, add the following line:
print(text_match("abbc", "ab{2}"))
The pattern says a followed by exactly two b. Needless to say, the pattern will
find a match in the input text.
print(text_match("aabbbbc", "ab{3,5}?"))
The output of the program won't really make much sense in full. We have already
ana.ysed the output of each and every step; hence, we won't be printing it down here
again.
How it works…
The re.search() function only applies the given pattern as a test; it returns a match object when the pattern is found and None otherwise, which we have used as a true/false result. It won't return the matching value itself. For that, there are other re functions that we shall learn in later recipes.
Getting ready
We could have reused the text_match() function from the previous recipe, but instead of
importing an external file, we shall rewrite it. Let's look at the recipe implementation.
How to do it…
1. Create a file named regex2.py and add the following import line to it:
import re
2. Add this Python function in the file that is supposed to apply the given patterns
for matching:
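As a reminder, here is the function from the previous recipe:
def text_match(text, patterns):
    if re.search(patterns, text):
        return 'Found a match!'
    else:
        return('Not matched!')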
This function accepts two arguments; text is the input text on which patterns
will be applied for matching and will return whether the match was found or not.
The function is exactly what we wrote in the previous recipe.
3. Let's apply the following pattern. We start with a simple starts-with-and-ends-with pattern:
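A sketch of those lines; the sample input string is an assumption, and the label matches the full output shown at the end of this recipe:
print("Pattern to test starts and ends with")
inputText = "abbc"  # assumed sample input: starts with a, ends with c
print(text_match(inputText, "^a.*c$"))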
4. Let's look at this pattern, ^a.*c$. This means: start with a, followed by zero or
more of any characters, and end with c. Let's see the output when we execute
these three lines:
Found a match!
It found a match for the input text, of course. What we introduced here is a new
. wildcard. The dot matches any character except a newline in default mode; that
is, when you say .*, it means zero or more occurrences of any character.
5. On to the next one, to find a pattern that looks for an input text that begins with a
word. Add the following two lines:
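A sketch of those two lines; the inputText sentence is an assumption, reused in the following steps:
inputText = "Tuffy eats pie, Loki eats peas!"  # assumed sample sentence
print(text_match(inputText, r"^\w+"))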
6. \w stands for any alphanumeric character and underscore. The pattern says: start
with (^) any alphanumeric character (\w) and one or more occurrences of it (+).
The output:
Found a match!
7. Next, we check for an ends with a word and optional punctuation. Add the
following lines:
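A sketch of that check, reusing the assumed inputText from step 5:
print(text_match(inputText, r"\w+\S*$"))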
8. The pattern means one or more occurrences of \w, followed by zero or more
occurrences of \S, and that should be falling towards the end of the input text. To
understand \S (capital S), we must first understand \s, which is all whitespace
characters. \S is the reverse, or the anti-set, of \s; when it follows \w, it translates to looking for a trailing punctuation mark:
Found a match!
We found the match with peas! at the end of the input text.
9. Next, find a word that contains a specific character. Add the following lines:
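A sketch of those lines; the character checked for (a u inside a word, via the non-boundary \B) is an assumption, while the label matches the program output shown below:
print("Finding a word which contains character, not start or end of the word")
print(text_match(inputText, r"\Bu\B"))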
Found a match!
Here's the output of the program in full. We have already seen it in detail, so I will not go
into it again:
Pattern to test starts and ends with
Found a match!
Found a match!
Found a match!
Finding a word which contains character, not start or end of the word
Found a match!
How it works…
Along with starts with and ends with, we also learned the wild card character . and some
other special sequences such as, \w, \s, \b, and so on.
Getting ready
Open your PyCharm editor or any other Python editor that you use, and you are ready to
go.
How to do it…
1. Create a file named regex3.py and add the following import line to it:
import re
2. Add the following two Python lines to declare and define our patterns and the
input text:
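A sketch of those declarations and of the loop described in the next step; the patterns and the input text are taken from the output shown at the end of the recipe:
patterns = ['Tuffy', 'Pie', 'Loki']
text = 'Tuffy eats pie, Loki eats peas!'
for pattern in patterns:
    print('Searching for "%s" in "%s" ->' % (pattern, text))
    if re.search(pattern, text):
        print('Found!')
    else:
        print('Not Found!')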
This is a simple for loop, iterating on the list of patterns one by one and calling the
search function of re. Run this piece and you shall find a match for two of the
three words in the input string. Also, do note that these patterns are case sensitive; the capitalized Tuffy is found, but Pie is not. We will discuss the output in the output section.
4. On to the next one, to search a substring and find its location too. Let's define the
pattern and the input text first:
The preceding two lines define the input text and the pattern to search for
respectively.
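A sketch of those two lines; the input text is an assumption chosen to be consistent with the 12:20 and 42:50 positions reported below:
text = 'Diwali is a festival of lights, Holi is a festival of colors!'  # assumed sample text
pattern = 'festival'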
5. Now, the for loop that will iterate over the input text and fetch all occurrences of
the given pattern:
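A sketch of that loop, using the start() and end() methods of each match object:
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))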
6. The finditer function takes as input the pattern and the input text on which to
apply that pattern. We iterate over the matches it returns. For every match object, we will
call the start and end methods to know the exact location where we found a
match for the pattern. We will discuss the output of this block here. The output of
this little block will look like:
There are two lines of output, which suggests that we found the pattern at two places in the input: the first at position 12:20 and the second at 42:50, as displayed in the output text lines.
Here's the output of the program in full. We have already seen some parts in detail but we
will go through it again:
Searching for "Tuffy" in "Tuffy eats pie, Loki eats peas!" ->
Found!
Searching for "Pie" in "Tuffy eats pie, Loki eats peas!" ->
Not Found!
Searching for "Loki" in "Tuffy eats pie, Loki eats peas!" ->
Found!
The output is quite intuitive, or at least the first six lines are. We searched for the word
Tuffy and it was found. The word Pie wasn't found (the re.search() function is case
sensitive) and then the word Loki was found. The last two lines we've already discussed, in
the sixth step. We didn't just search the string but also pointed out the index where we
found them in the given input.
How it works...
Let's discuss some more things about the re.search() function we have used quite
heavily so far. As you can see in the preceding output, the word pie is part of the input text
but we search for the capitalized word Pie and we can't seem to locate it. If you add a flag
in the search function call re.IGNORECASE, only then will it be a case-insensitive search.
The syntax will be re.search(pattern, string, flags=re.IGNORECASE).
How to do it...
1. Create a file named regex4.py and add the following import line to it:
import re
2. Let's declare a url object and write a simple date finder regular expression to
start:
url=
"http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix
-2017-time-does-start-tv-channel-odds-lewis1/"
date_regex = '/(\d{4})/(\d{1,2})/(\d{1,2})/'
The url is a simple string object. The date_regex is also a simple string object
but it contains a regex that will match a date with format YYYY/DD/MM or
YYYY/MM/DD type of dates. \d denotes digits starting from 0 to 9. We've already
learned the notation {}.
3. Let's apply date_regex to url and see the output. Add the following line:
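A sketch of that line, with the print label taken from the output shown at the end of the recipe:
print('Date found in the URL :', re.findall(date_regex, url))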
So, we've found the date 28 October 2017 in the given input string object.
5. Now comes the next part, where we will learn about the set of characters notation
[]. Add the following function in the code:
def is_allowed_specific_char(string):
    charRe = re.compile(r'[^a-zA-Z0-9.]')
    string = charRe.search(string)
    return not bool(string)
The purpose here is to check whether the input string contains only a specific set of allowed characters or anything else. Here, we are going with a slightly different approach; first,
we re.compile the pattern, which returns a RegexObject. Then, we call the
search method of RegexObject on the already compiled pattern. If a match is
found, the search method returns a MatchObject, and None otherwise. Now,
turning our attention to the set notation []. The pattern enclosed inside the
squared brackets means: not (^) the range of characters a-z, A-Z, 0-9, or the . character.
Effectively, this is an OR operation of all things enclosed by the squared brackets.
6. Now the test for the pattern. Let's call the function on two different types of
inputs, one that matches and one that doesn't:
print(is_allowed_specific_char("ABCDEFabcdef123450."))
print(is_allowed_specific_char("*&%@#!}{"))
7. The first set of characters contains all of the allowed list of characters, whereas the
second set contains all of the disallowed set of characters. As expected, the output
of these two lines will be:
True
False
The pattern will iterate through each and every character of the input string and
see if there is any disallowed character, and it will flag it out. You can try adding
any of the disallowed set of characters in the first call of
is_allowed_specific_char() and check for yourself.
Here's the output of the program in full. We have already seen it in detail, so we shall not
go through it again:
Date found in the URL : [('2017', '10', '28')]
True
False
How it works...
Let's first discuss what a group is. A group in any regular expression is what is enclosed
inside the brackets () inside the pattern declaration. If you see the output of the date match,
you will see a list containing a tuple of three string objects: [('2017', '10',
'28')]. Now, look at the pattern declared carefully, /(\d{4})/(\d{1,2})/(\d{1,2})/.
All the three components of the date are marked inside the group notation (), and hence all
three are identified separately.
Now, the re.findall() method will find all the matches in the given input. This means
that if there were more dates inside the given input text, the output would've looked like
[('2017', '10', '28'), ('2015', '05', '12')].
The [] notation that is set essentially means: match either of the characters enclosed inside
the set notation. If any single match is found, the pattern is true.
How to do it…
1. Create a file named regex_assignment1.py and add the following import line
to it:
import re
2. Add the following two Python lines to define the input string and apply the
substitution pattern for abbreviation:
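A sketch of those lines; the street value is taken from the output shown below:
street = '21 Ramkrishna Road'
print(re.sub('Road', 'Rd', street))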
3. First, we are going to do the abbreviation, for which we use the re.sub()
method. The pattern to look for is Road, the string to replace it with is Rd, and the
input is the string object street. Let's look at the output:
21 Ramkrishna Rd
4. Now, let us find all five-character words inside any given sentence. Add these
two lines of code for that:
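A sketch of those lines; the exact sentence is an assumption, chosen so that light and color are its only five-character words:
text = 'Diwali is a festival of light and color!'  # assumed sample sentence
print(re.findall(r'\b\w{5}\b', text))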
5. Declare a string object text and put the sentence inside it. Next, create a pattern
and apply it using the re.findall() function. We are using the \b boundary set
to identify the boundary between words and the {} notation to make sure we are
only shortlisting five-character words. Run this and you shall see the list of words
matched as expected:
['light', 'color']
Here's the output of the program in full. We have already seen it in detail, so we will not go
through it again:
21 Ramkrishna Rd
['light', 'color']
How it works...
By now, I assume you have a good understanding of the regular expression notations and
syntax. Hence, the explanations given when we wrote the recipe are quite enough. Instead,
let us look at something more interesting. Look at the findall() method; you will see a
notation like r'<pattern>'. This is called raw string notation; it helps keep the
regular expression sane looking. If you don't do it, you will have to provide an escape
sequence to all the backslashes in your regular expression. For example, patterns
r"\b\w{5}\b" and "\\b\\w{5}\\b" do the exact same job functionality wise.
Getting ready
If you have your Python interpreter and editor ready, you are as ready as you can ever be.
How to do it...
1. Create a file named regex_tokenizer.py and add the following import line to
it:
import re
2. Let's define our raw sentence to tokenize and the first pattern:
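A sketch of those lines; the sentence is taken from the output below, and the first pattern simply splits on one or more spaces:
raw = "I am big! It's the pictures that got small."
print(re.split(r' +', raw))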
3. This pattern will perform the same as the space tokenizer we saw in the previous
chapter. Let's look at the output:
4. Now, this is not enough, is it? We want to split the tokens on anything non-word
and not the ' ' characters alone. Let's try the following pattern:
print(re.split(r'\W+', raw))
5. We are splitting on all non-word characters, that is, \W. Let's see the output:
We did split out on all the non-word characters (' ', ,, !, and so on), but we
seem to have removed them from the result altogether. Looks like we need to do
something more and different.
6. Split doesn't seem to be doing the job; let's try a different re function,
re.findall(). Add the following line:
print(re.findall(r'\w+|\S\w*', raw))
Here's the output of the program in full. We have already discussed it; let's print it out:
['I', 'am', 'big!', "It's", 'the', 'pictures', 'that', 'got', 'small.']
['I', 'am', 'big', 'It', 's', 'the', 'pictures', 'that', 'got', 'small',
'']
['I', 'am', 'big', '!', 'It', "'s", 'the', 'pictures', 'that', 'got',
'small', '.']
As you can see, we have gradually improved upon our pattern and approach to achieve the
best possible outcome in the end.
How it works...
We started with a simple re.split on space characters and improved it using the non-
word character. Finally, we changed our approach; instead of trying to split, we went about
matching what we wanted by using re.findall, which did the job.
Getting ready
As we did in previous stemmer and lemmatizer recipes, we will need to tokenize the text
before we apply the stemmer. That's exactly what we are going to do. We will reuse the
final tokenizer pattern from the last recipe. If you haven't checked out the previous recipe,
please do so, and then you are all set to start this one.
How to do it…
1. Create a file named regex_tokenizer.py and add the following import line to
it:
import re
2. We will write a function that will do the job of stemming for us. Let's first declare
the syntax of the function in this step and we will define it in the next step:
def stem(word):
This function shall accept a string object as parameter and is supposed to return a
string object as the outcome. Word in, stem out!
    splits = re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$',
                        word)
    stem = splits[0][0]
    return stem
We are applying the re.findall() function to the input word to return two
groups as output. The first is the stem and the second is any possible suffix. We return the
first group as our result from the function call.
4. Let's define our input sentence and tokenize it. Add the following lines:
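A sketch of those lines; the sentence is inferred from the output below, and the tokenizer pattern is the one from the previous recipe:
raw = "Keep your friends close, but your enemies closer."
tokens = re.findall(r'\w+|\S\w*', raw)
print(tokens)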
6. Let's apply our stem() method to the list of tokens we just generated. Add the
following for loop:
for t in tokens:
print("'"+stem(t)+"'")
We are just looping over all tokens and printing the returned stem one by one. We
will see the output in the upcoming output section and discuss it there.
'Keep'
'your'
'friend'
'close'
','
'but'
'your'
'enem'
'closer'
'.'
Our stemmer seems to be doing a pretty decent job. However, I reckon I have passed an
easy-looking sentence for the stemmer.
How it works…
Again, we are using the re.findall() function to get the desired output, though you
might want to look closely at the first group's regex pattern. We are using a non-greedy
wildcard match (.*?); otherwise, it will greedily gobble up the entire word and there will
be no suffixes identified. Also, the start and end of the input are mandatory to match the
entire input word and split it.
POS Tagging and Grammars
5
In this chapter, we will cover the following recipes:
Introduction
This chapter primarily focuses on learning the following subjects using Python NLTK:
Taggers
CFG
Tagging is the process of classifying the words in a given sentence using parts of speech
(POS). Software that helps achieve this is called tagger. NLTK has support for a variety of
taggers. We will go through the following taggers as part of this chapter:
In-built tagger
Default tagger
Regular expression tagger
Lookup tagger
CFG describes a set of rules that can be applied to text in a formal language specification to
generate newer sets of text.
POS Tagging and Grammars Chapter 5
Starting token
A set of tokens that are terminals (ending symbols)
A set of tokens that are non-terminals (non-ending symbols)
Rules or productions that define rewrite rules that help transform non-terminals
to either terminals or non-terminals
We will make use of the following technologies from the Python NLTK library:
The datasets for these taggers can be downloaded from your NLTK distribution by
invoking nltk.download() from the Python prompt.
Getting ready
You should have a working Python (Python 3.6 is preferred) installed in your system along
with the NLTK library and all its collections for optimal experience.
How to do it...
1. Open atom editor (or favorite programming editor).
2. Create a new file called Exploring.py.
[ 95 ]
POS Tagging and Grammars Chapter 5
How it works...
Now, let's go through the program that we have just written and dig into the details:
import nltk
This is the first instruction in our program, which instructs the Python interpreter to load
the module from disk to memory and make the NLTK library available for use in the
program:
simpleSentence = "Bangalore is the capital of Karnataka."
In this instruction, we are creating a variable called simpleSentence and assigning a hard
coded string to it:
wordsInSentence = nltk.word_tokenize(simpleSentence)
[ 96 ]
POS Tagging and Grammars Chapter 5
print(wordsInSentence)
In this instruction, we are calling the Python built-in print() function, which displays the
given data structure on the screen. In our case, we are displaying the list of all words that
are tokenized. See the output carefully; we are displaying a Python list data structure on
screen, which consists of all the strings separated by commas, and all the list elements are
enclosed in square brackets:
partsOfSpeechTags = nltk.pos_tag(wordsInSentence)
In this instruction we are invoking the NLTK built-in tagger pos_tag(), which takes a list
of words in the wordsInSentence variable and identifies the POS. Once the identification
is complete, a list of tuples. Each tuple has the tokenized word and the POS identifier:
print(partsOfSpeechTags)
In this instruction, we are invoking the Python built-in print() function, which prints the
given parameter to the screen. In our case, we can see a list of tuples, where each tuple
consists of the original word and POS identifier.
Default tagger
Regular expression tagger
Lookup tagger
Getting ready
You should have a working Python (Python 3.6 is preferred) installed in your system along
with the NLTK library and all its collections for optimal experience.
[ 97 ]
POS Tagging and Grammars Chapter 5
How to do it...
1. Open your atom editor (or favorite programming editor).
2. Create a new file called OwnTagger.py.
3. Type the following source code:
[ 98 ]
POS Tagging and Grammars Chapter 5
How it works...
Now, let's go through the program that we have just written to understand more:
import nltk
This is the first instruction in our program; it instructs the Python interpreter to load the
module from disk to memory and make the NLTK library available for use in the program:
def learnDefaultTagger(simpleSentence):
wordsInSentence = nltk.word_tokenize(simpleSentence)
tagger = nltk.DefaultTagger("NN")
posEnabledTags = tagger.tag(wordsInSentence)
print(posEnabledTags)
All of these instructions are defining a new Python function that takes a string as input and
prints the words in this sentence along with the default tag on screen. Let's further
understand this function to see what it's trying to do:
def learnDefaultTagger(simpleSentence):
[ 99 ]
POS Tagging and Grammars Chapter 5
In this instruction, we are calling the word_tokenize function from the NLTK library. We
are passing simpleSentence as the first parameter to this function. Once the data is
computed by this function, the return value is stored in the wordsInSentence variable.
Which are list of words:
tagger = nltk.DefaultTagger("NN")
In this instruction, we are creating an object of the DefaultTagger() class from the Python
nltk library with NN as the argument passed to it. This will initialize the tagger and assign
the instance to the tagger variable:
posEnabledTags = tagger.tag(wordsInSentence)
In this instruction, we are calling the tag() function of the tagger object, which takes the
tokenized words from the wordsInSentence variable and returns the list of tagged words.
This is saved in posEnabledTags. Remember that all the words in the sentence will be
tagged as NN as that's what the tagger is supposed to do. This is like a very basic level of
tagging without knowing anything about POS:
print(posEnabledTags)
Here we are calling Python's built-in print() function to inspect the contents of the
posEnabledTags variable. We can see that all the words in the sentence will be tagged
with NN:
def learnRETagger(simpleSentence):
customPatterns = [
(r'.*ing$', 'ADJECTIVE'),
(r'.*ly$', 'ADVERB'),
(r'.*ion$', 'NOUN'),
(r'(.*ate|.*en|is)$', 'VERB'),
(r'^an$', 'INDEFINITE-ARTICLE'),
(r'^(with|on|at)$', 'PREPOSITION'),
(r'^\-?[0-9]+(\.[0-9]+)$', 'NUMBER'),
(r'.*$', None),
]
tagger = nltk.RegexpTagger(customPatterns)
wordsInSentence = nltk.word_tokenize(simpleSentence)
posEnabledTags = tagger.tag(wordsInSentence)
print(posEnabledTags)
These are the instructions to create a new function called learnRETagger(), which takes a
string as input and prints the list of all tokens in the string with properly identified tags
using the regular expression tagger as output.
[ 100 ]
POS Tagging and Grammars Chapter 5
We are defining a new Python function named learnRETagger to take a parameter called
simpleSentence.
In order to understand the next instruction, we should learn more about Python lists,
tuples, and regular expressions:
Python regular expressions are strings that begin with the letter r and follow the
standard PCRE notation:
customPatterns = [
(r'.*ing$', 'ADJECTIVE'),
(r'.*ly$', 'ADVERB'),
(r'.*ion$', 'NOUN'),
(r'(.*ate|.*en|is)$', 'VERB'),
(r'^an$', 'INDEFINITE-ARTICLE'),
(r'^(with|on|at)$', 'PREPOSITION'),
(r'^\-?[0-9]+(\.[0-9]+)$', 'NUMBER'),
(r'.*$', None),
]
Even though this looks big, this is a single instruction that does many things:
Now, translating the preceding instruction into a human-readable form, we have added
eight regular expressions to tag the words in a sentence to be any of ADJECTIVE, ADVERB,
NOUN, VERB, INDEFINITE-ARTICLE, PREPOSITION, NUMBER, or None type.
[ 101 ]
POS Tagging and Grammars Chapter 5
In the preceding example, these are the clues we are using to tag the POS of English words:
Words that end with ing can be called ADJECTIVE, for example, running
Words that end with ly can be called ADVERB, for example, willingly
Words that end with ion can be called NOUN, for example, intimation
Words that end with ate or en can be called VERB, for example, terminate,
darken, or lighten
Words that end with an can be called INDEFINITE-ARTICLE
Words such as with, on, or at are PREPOSITION
Words that are like, -123.0, 984 can be called NUMBER
We are tagging everything else as None, which is a built-in Python datatype used
to represent nothing
tagger = nltk.RegexpTagger(customPatterns)
In this instruction, we are creating an instance of the NLTK built-in regular expression
tagger RegexpTagger. We are passing the list of tuples in the customPatterns variable as
the first parameter to the class to initialize the object. This object can be referenced in future
with the variable named tagger:
wordsInSentence = nltk.word_tokenize(simpleSentence)
Following the general process, we first try to tokenize the string in simpleSentence using
the NLTK built-in word_tokenize() function and store the list of tokens in the
wordsInSentence variable:
posEnabledTags = tagger.tag(wordsInSentence)
Now we are invoking the regular expression tagger's tag() function to tag all the words
that are in the wordsInSentence variable. The result of this tagging process is stored in the
posEnabledTags variable:
print(posEnabledTags)
We are calling the Python built-in print() function to display the contents of the
posEnabledTags data structure on screen:
def learnLookupTagger(simpleSentence):
mapping = {
'.': '.', 'place': 'NN', 'on': 'IN',
'earth': 'NN', 'Mysore' : 'NNP', 'is': 'VBZ',
[ 102 ]
POS Tagging and Grammars Chapter 5
In this instruction, we are calling UnigramTagger from the nltk library. This is a lookup
tagger that takes the Python dictionary we have created and assigned to
the mapping variable. Once the object is created, it's available in the tagger variable for
future use:
wordsInSentence = nltk.word_tokenize(simpleSentence)
Here, we are tokenizing the sentence using the NLTK built-in word_tokenize() function
and capturing the result in the wordsInSentence variable:
posEnabledTags = tagger.tag(wordsInSentence)
Once the sentence is tokenized, we call the tag() function of the tagger by passing the list
of tokens in the wordsInSentence variable. The result of this computation is assigned to
the posEnabledTags variable:
print(posEnabledTags)
In this instruction, we are printing the data structure in posEnabledTags on the screen for
further inspection:
testSentence = "Mysore is an amazing place on earth. I have visited Mysore
10 times."
We are creating a variable called testSentence and assigning a simple English sentence to
it:
learnDefaultTagger(testSentence)
[ 103 ]
POS Tagging and Grammars Chapter 5
In this expression, we are invoking the learnRETagger() function with the same test
sentence in the testSentence variable. The output from this function is a list of tags that
are tagged as per the regular expressions that we have defined ourselves:
learnLookupTagger(testSentence)
The output from this function learnLookupTagger is list of all tags from the sentence
testSentence that are tagged using the lookup dictionary that we have created.
Getting ready
You should have a working Python (Python 3.6 is preferred) installed in your system, along
with the NLTK library and all its collections for optimal experience.
How to do it...
1. Open your atom editor (or favorite programming editor).
2. Create a new file called Train3.py.
[ 104 ]
POS Tagging and Grammars Chapter 5
[ 105 ]
POS Tagging and Grammars Chapter 5
How it works...
Let's understand how the program works:
import nltk
import pickle
In these two instructions, we are loading the nltk and pickle modules into the program.
The pickle module implements powerful serialization and de-serialization algorithms to
handle very complex Python objects:
def sampleData():
return [
"Bangalore is the capital of Karnataka.",
"Steve Jobs was the CEO of Apple.",
"iPhone was Invented by Apple.",
"Books can be purchased in Market.",
]
In these instructions, we are defining a function called sampleData() that returns a Python
list. Basically, we are returning four sample strings:
def buildDictionary():
dictionary = {}
for sent in sampleData():
partsOfSpeechTags = nltk.pos_tag(nltk.word_tokenize(sent))
for tag in partsOfSpeechTags:
value = tag[0]
pos = tag[1]
dictionary[value] = pos
return dictionary
[ 106 ]
POS Tagging and Grammars Chapter 5
We now define a function called buildDictionary(); it reads one string at a time from
the list generated by the sampleData() function. Each string is tokenized using
the nltk.word_tokenize() function. The resultant tokens are added to a Python
dictionary, where the dictionary key is the word in the sentence and the value is POS. Once
a dictionary is computed, it's returned to the caller:
def saveMyTagger(tagger, fileName):
fileHandle = open(fileName, "wb")
pickle.dump(tagger, fileHandle)
fileHandle.close()
In these instructions, we are defining a function called saveMyTagger() that takes two
parameters:
We first open the file in write binary (wb) mode. Then, using pickle module's dump()
method, we store the entire tagger in the file and call the close() function on fileHandle:
def saveMyTraining(fileName):
tagger = nltk.UnigramTagger(model=buildDictionary())
saveMyTagger(tagger, fileName)
Here, we are defining a new function, loadMyTagger(), which takes fileName as a single
argument. This function reads the file from disk and passes it to the pickle.load()
function which unserializes the tagger from disk and returns a reference to it:
sentence = 'Iphone is purchased by Steve Jobs in Bangalore Market'
fileName = "myTagger.pickle"
[ 107 ]
POS Tagging and Grammars Chapter 5
In these two instructions, we are defining two variables, sentence and fileName, which
contain a sample string that we want to analyze and the file path at which we want to store
the POS tagger respectively:
saveMyTraining(fileName)
This is the instruction that actually calls the function saveMyTraining() with
myTagger.pickle as argument. So, we are basically storing the trained tagger in this file:
myTagger = loadMyTagger(fileName)
In this instruction, we are calling the tag() function of the tagger that we have just loaded
from disk. We use it to tokenize the sample string that we have created.
The symbol/token can be anything that is specific to the language that we consider.
For example:
Generally, rules (or productions) are written in Backus-Naur form (BNF) notation.
[ 108 ]
POS Tagging and Grammars Chapter 5
Getting ready
You should have a working Python (Python 3.6 is preferred) installed on your system,
along with the NLTK library.
How to do it...
1. Open your atom editor (or your favorite programming editor).
2. Create a new file called Grammar.py.
3. Type the following source code:
[ 109 ]
POS Tagging and Grammars Chapter 5
How it works...
Now, let's go through the program that we have just written and dig into the details:
import nltk
[ 110 ]
POS Tagging and Grammars Chapter 5
This instruction imports the string module into the current program:
from nltk.parse.generate import generate
This instruction imports the generate function from the nltk.parse.generate module,
which helps in generating strings from the CFG that we are going to create:
productions = [
"ROOT -> WORD",
"WORD -> ' '",
"WORD -> NUMBER LETTER",
"WORD -> LETTER NUMBER",
]
We are defining a new grammar here. The grammar can contain the following production
rules:
[ 111 ]
POS Tagging and Grammars Chapter 5
Let's try to understand what this grammar is for. This grammar represents the language
wherein there are words such as 0a, 1a, 2a, a1, a3, and so on.
All the production rules that we have stored so far in the list variable
called productions are converted to a string:
grammarString = "\n".join(productions)
We are creating a new grammar object using the nltk.CFG.fromstring() method, which
takes the grammarString variable that we have just created:
grammar = nltk.CFG.fromstring(grammarString)
These instructions print the first five auto generated words that are present in this language,
which is defined with the grammar:
for sentence in generate(grammar, n=5, depth=5):
palindrome = "".join(sentence).replace(" ", "")
print("Generated Word: {}, Size : {}".format(palindrome,
len(palindrome)))
Getting ready
You should have a working Python (Python 3.6 is preferred) installed on your system,
along with the NLTK library.
[ 112 ]
POS Tagging and Grammars Chapter 5
How to do it...
1. Open your atom editor (or your favorite programming editor).
2. Create a new file called PCFG.py.
3. Type the following source code:
[ 113 ]
POS Tagging and Grammars Chapter 5
How it works...
Now, let's go through the program that we have just written and dig into the details:
import nltk
[ 114 ]
POS Tagging and Grammars Chapter 5
This instruction imports the generate function from the nltk.parse.genearate module:
productions = [
"ROOT -> WORD [1.0]",
"WORD -> P1 [0.25]",
"WORD -> P1 P2 [0.25]",
"WORD -> P1 P2 P3 [0.25]",
"WORD -> P1 P2 P3 P4 [0.25]",
"P1 -> 'A' [1.0]",
"P2 -> 'B' [0.5]",
"P2 -> 'C' [0.5]",
"P3 -> 'D' [0.3]",
"P3 -> 'E' [0.3]",
"P3 -> 'F' [0.4]",
"P4 -> 'G' [0.9]",
"P4 -> 'H' [0.1]",
]
Here, we are defining the grammar for our language, which goes like this:
Description Content
Starting symbol ROOT
Non-terminals WORD, P1, P2, P3, P4
Terminals 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'
Once we have identified the tokens in the grammar, let's see what the production rules look
like:
There is a ROOT symbol, which is the starting symbol for this grammar
There is a WORD symbol that has a probability of 1.0
There is a WORD symbol that can produce P1 with a probability of 0.25
There is a WORD symbol that can produce P1 P2 with a probability of 0.25
There is a WORD symbol that can produce P1 P2 P3 with a probability of 0.25
There is a WORD symbol that can produce P1 P2 P3 P4 with a probability of
0.25
The P1 symbol can produce symbol 'A' with a 1.0 probability
The P2 symbol can produce symbol 'B' with a 0.5 probability
The P2 symbol can produce symbol 'C' with a 0.5 probability
The P3 symbol can produce symbol 'D' with a 0.3 probability
[ 115 ]
POS Tagging and Grammars Chapter 5
If you observe carefully, the sum of all the probabilities of the non-terminal symbols is
equal to 1.0. This is a mandatory requirement for the PCFG.
We are joining the list of all the production rules into a string called
the grammarString variable:
grammarString = "\n".join(productions)
This instruction creates a grammar object using the nltk.PCFG.fromstring method and
taking the grammarString as input:
grammar = nltk.PCFG.fromstring(grammarString)
This instruction uses the Python built-in print() function to display the contents of
the grammar object on screen. This will summarize the total number of tokens and
production rules we have in the grammar that we have just created:
print(grammar)
We are printing 10 strings from this grammar using the NLTK built-in function generate
and then displaying them on screen:
for sentence in generate(grammar, n=10, depth=5):
palindrome = "".join(sentence).replace(" ", "")
print("String : {}, Size : {}".format(palindrome, len(palindrome)))
Palindromes are the best examples of recursive CFG. We can always write a recursive CFG
for palindromes in a given language.
[ 116 ]
POS Tagging and Grammars Chapter 5
11
1001
010010
No matter in whatever direction we read these alphabets (left to right or right to left), we
always get the same value. This is the special feature of palindromes.
In this recipe, we will write grammar to represent these palindromes and generate a few
palindromes using the NLTK built-in string generation libraries.
Getting ready
You should have a working Python (Python 3.6 is preferred) installed on your system,
along with the NLTK library.
How to do it...
1. Open your atom editor (or your favorite programming editor).
2. Create a new file called RecursiveCFG.py.
3. Type the following source code:
[ 117 ]
POS Tagging and Grammars Chapter 5
[ 118 ]
POS Tagging and Grammars Chapter 5
How it works...
Now, let's go through the program that we have just written and dig into the details. We are
importing the nltk library into our program for future use:
import nltk
We are also importing the string library into our program for future use:
import string
We have created a new list data structure called productions, where there are two
elements. Both the elements are strings that represent the two productions in our CFG:
productions = [
"ROOT -> WORD",
"WORD -> ' '"
]
We are retrieving the list of decimal digits as a list in the alphabets variable:
alphabets = list(string.digits)
Using the digits 0 to 9, we add more productions to our list. These are the production rules
that define palindromes:
for alphabet in alphabets:
productions.append("WORD -> '{w}' WORD '{w}'".format(w=alphabet))
In this instruction, we are creating a new grammar object by passing the newly
constructed grammarString to the NLTK built-in nltk.CFG.fromstring function:
grammar = nltk.CFG.fromstring(grammarString)
[ 119 ]
POS Tagging and Grammars Chapter 5
In this instruction, we print the grammar that we have just created by calling the Python
built-in print() function:
print(grammar)
We are generating five palindromes using the generate function of the NLTK library and
printing the same on the screen:
for sentence in generate(grammar, n=5, depth=5):
palindrome = "".join(sentence).replace(" ", "")
print("Palindrome : {}, Size : {}".format(palindrome, len(palindrome)))
[ 120 ]
Chunking, Sentence Parse, and
6
Dependencies
In this chapter, we will perform the following recipes:
Introduction
We have learned so far that the Python NLTK can be used to do part-of-speech (POS)
recognition in a given piece of text. But sometimes we are interested in finding more details
about the text that we are dealing with. For example, I might be interested in finding the
names of some famous personalities, places, and so on in a given text. We can maintain a
very big dictionary of all these names. But in the simplest form, we can use a POS analysis
to identify these patterns very easily.
Chunking is the process of extracting short phrases from text. We will leverage POS tagging
algorithms to do chunking. Remember that the tokens (words) produced by chunking do
not overlap.
Chunking, Sentence Parse, and Dependencies Chapter 6
Getting ready
You should have Python installed along with the nltk library. Prior understanding of POS
tagging as explained in Chapter 5, POS Tagging and Grammars is good to have.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called Chunker.py.
3. Type the following source code:
[ 122 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
How it works...
Let's try to understand how the program works. This instruction imports the nltk module
into the program:
import nltk
This is the data that we are going to analyze as part of this recipe. We are adding this string
to a variable called text:
text = "Lalbagh Botanical Gardens is a well known botanical garden in
Bengaluru, India."
This instruction is going to break the given text into multiple sentences. The result is a list of
sentences stored in the sentences variable:
sentences = nltk.sent_tokenize(text)
In this instruction, we are looping through all the sentences that we have extracted. Each
sentence is stored in the sentence variable:
for sentence in sentences:
[ 123 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
This instruction breaks the sentence into non-overlapping words. The result is stored in a
variable called words:
words = nltk.word_tokenize(sentence)
In this instruction, we do POS analysis using the default tagger that is available with NLTK.
Once the identification is done, the result is stored in a variable called tags:
tags = nltk.pos_tag(words)
In this instruction, we call the nltk.ne_chunk() function, which does the chunking part
for us. The result is stored in a variable called chunks. The result is actually tree-structured
data that contains the paths of the tree:
chunks = nltk.ne_chunk(tags)
This prints the chunks that are identified in the given input string. Chunks are grouped in
brackets, '(' and ')', to easily distinguish them from other words that are in the input text.
print(chunks)
We already understand that by using NLTK, we can identify the POS in their short form
(tags such as V, NN, NNP, and so on). Can we write regular expressions using these POS?
The answer is yes. You have guessed it correctly. We can leverage POS-based regular
expression writing. Since we are using POS tags to write these regular expressions, they are
called tag patterns.
Just like the way we write the native alphabets (a-z) of a given natural language to match
various patterns, we can also leverage POS to match words (any combinations from
dictionary) according to the NLTK matched POS.
[ 124 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
These tag patterns are one of the most powerful features of NLTK because they give us the
flexibility to match the words in a sentence just by POS-based regular expressions.
Let's pay close attention to the preceding POS output. We can make the following
observations:
By using these three simple observations, let's write a regular expression using POS, which
is called as tag phrase in the BNF form:
NP -> <PRP>
NP -> <DT>*<NNP>
NP -> <JJ>*<NN>
NP -> <NNP>+
We are interested in extracting the following chunks from the input text:
Ravi
the CEO
a company
powerful public speaker
Let's write a simple Python program that gets the job done.
[ 125 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
Getting ready
You should have Python installed, along with the nltk library. A fair understanding of
regular expressions is good to have.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called SimpleChunker.py.
3. Type the following source code:
[ 126 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
How it works...
Now, let's understand how the program works:
This instruction imports the nltk library into the current program:
import nltk
We are declaring the text variable with the sentences that we want to process:
text = "Ravi is the CEO of a Company. He is very powerful public speaker
also."
In this instruction, we are writing regular expressions, which are written using POS; so they
are specially called tag patterns. These tag patterns are not a randomly created ones. They
are carefully crafted from the preceding example.
grammar = '\n'.join([
'NP: {<DT>*<NNP>}',
'NP: {<JJ>*<NN>}',
'NP: {<NNP>+}',
])
[ 127 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
The more text we process, the more rules like this we can discover. These are specific to the
language we process. So, this is a practice we should do in order to become more powerful
at information extraction:
sentences = nltk.sent_tokenize(text)
First we break the input text into sentences by using the nltk.sent_tokenize() function:
for sentence in sentences:
This instruction iterates through a list of all sentences and assigns one sentence to
the sentence variable:
words = nltk.word_tokenize(sentence)
This instruction breaks the sentence into tokens using the nltk.word_tokenize()
function and puts the result into the words variable:
tags = nltk.pos_tag(words)
This instruction does the POS identification on the words variable (which has a list of
words) and puts the result in the tags variable (which has each word correctly tagged with
its respective POS tag):
chunkparser = nltk.RegexpParser(grammar)
This instruction invokes the nltk.RegexpParser on the grammar that we have created
before. The object is available in the chunkparser variable:
result = chunkparser.parse(tags)
We parse the tags using the object and the result is stored in the result variable:
print(result)
Now, we display the identified chunks on screen using the print() function. The output is
a tree structure with words and their associated POS.
[ 128 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
Training a chunker
In this recipe, will learn the training process, training our own chunker, and evaluating it.
Before we go into training, we need to understand the type of data we are dealing with.
Once we have a fair understanding of the data, we must train it according to the pieces of
information we need to extract. One particular way of training the data is to use IOB
tagging for the chunks that we extract from the given text.
Naturally, we find different words in a sentence. From these words, we can find POS. Later,
when chunking the text, we need to further tag the words according to where they are
present in the text.
Once we've done POS tagging and hunking of the data, we will see an output similar to this
one:
Bill NNP B-PERSON
Gates NNP I-PERSON
announces NNS O
Satya NNP B-PERSON
Nadella NNP I-PERSON
as IN O
new JJ O
CEO NNP B-ROLE
of IN O
Microsoft NNP B-COMPANY
This is called the IOB format, where each line consists of three tokens separated by spaces.
Column Description
First column in IOB The actual word in the input sentence
Second column in IOB The POS for the word
Third column in IOB Chunk identifier with I (inside chunk), O (outside chunk), B (beginning word of the chunk), and the
appropriate suffix to indicate the category of the word
[ 129 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
Once we have the training data in IOB format, we can further use it to extend the reach of
our chunker by applying it to other datasets. Training is very expensive if we want to do it
from scratch or want to identify new types of keywords from the text.
Let's try to write a simple chunker using the regexparser and see what types of results it
gives.
Getting ready
You should have Python installed, along with the nltk library.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called TrainingChunker.py.
3. Type the following source code:
[ 130 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
How it works...
This instruction imports the nltk module into the current program:
import nltk
This instruction imports the conll2000 corpus into the current program:
from nltk.corpus import conll2000
This instruction imports the treebank corpus into the current program:
from nltk.corpus import treebank_chunk
[ 131 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
We are defining a new function, mySimpleChunker(). We are also defining a simple tag
pattern that extracts all the words that have POS of NNP (proper nouns). This grammar is
used for our chunker to extract the named entities:
def mySimpleChunker():
grammar = 'NP: {<NNP>+}'
return nltk.RegexpParser(grammar)
This is a simple chunker; it doesn't extract anything from the given text. Useful to see if the
algorithm works correctly:
def test_nothing(data):
cp = nltk.RegexpParser("")
print(cp.evaluate(data))
This function uses mySimpleChunker() on the test data and evaluates the accuracy of the
data with respect to already tagged input data:
def test_mysimplechunker(data):
schunker = mySimpleChunker()
print(schunker.evaluate(data))
We create a list of two datasets, one from conll2000 and another from treebank:
datasets = [
conll2000.chunked_sents('test.txt', chunk_types=['NP']),
treebank_chunk.chunked_sents()
]
[ 132 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
In this recipe, we will explore how we can use the RD parser that comes with the
NLTK library.
Getting ready
You should have Python installed, along with the nltk library.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called ParsingRD.py.
3. Type the following source code:
[ 133 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
This graph is the output of the second sentence in the input as parsed by the RD parser:
[ 134 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
How it works...
Let's see how the program works. In this instruction, we are importing the nltk library:
import nltk
In these instructions, we are iterating over the list of sentences in the textlist variable.
Each text item is tokenized using the nltk.word_tokenize() function and then the
resultant words are passed to the parser.parse() function. Once the parse is complete,
we display the result on the screen and also show the parse tree:
for text in textlist:
sentence = nltk.word_tokenize(text)
for tree in parser.parse(sentence):
print(tree)
tree.draw()
These are the two sample sentences we use to understand the parser:
text = [
"Tajmahal is in Agra",
"Bangalore is the capital of Karnataka",
]
[ 135 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
We call RDParserExample using the grammar object and the list of sample sentences.
RDParserExample(grammar, text)
Parsing shift-reduce
In this recipe, we will learn to use and understand shift-reduce parsing.
Shift-reduce parsers are special types of parsers that parse the input text from left to right
on a single line sentences and top to bottom on multiline sentences.
For every alphabet/token in the input text, this is how parsing happens:
Read the first token from the input text and push it to the stack (shift operation)
Read the complete parse tree on the stack and see which production rule can be
applied, by reading the production rule from right to left (reduce operation)
This process is repeated until we run out of production rules, when we accept
that parsing has failed
This process is repeated until all of the input is consumed; we say parsing has
succeeded
In the following examples, we see that only one input text is going to be parsed successfully
and the other cannot be parsed.
Getting ready
You should have Python installed, along with the nltk library. An understanding of
writing grammars is needed.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called ParsingSR.py.
[ 136 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
[ 137 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
How it works...
Let's see how the program works. In this instruction we are importing the nltk library:
import nltk
[ 138 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
In these instructions, we are iterating over the list of sentences in the textlist variable.
Each text item is tokenized using the nltk.word_tokenize() function and then the
resultant words are passed to the parser.parse() function. Once the parse is complete,
we display the result on the screen and also show the parse tree:
for text in textlist:
sentence = nltk.word_tokenize(text)
for tree in parser.parse(sentence):
print(tree)
tree.draw()
These are the two sample sentences we are using to understand the shift-reduce parser:
text = [
"Tajmahal is in Agra",
"Bangalore is the capital of Karnataka",
]
We call the SRParserExample using the grammar object and the list of sample sentences.
SRParserExample(grammar, text)
Dependency grammars are based on the concept that sometimes there are direct
relationships between words that form a sentence. The example in this recipe shows this
clearly.
[ 139 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
Getting ready
You should have Python installed, along with the nltk library.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called ParsingDG.py.
3. Type the following source code:
[ 140 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
How it works...
Let's see how the program works. This instruction imports the nltk library into the
program:
import nltk
[ 141 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
This is the sample sentence on which we are going to run the parser. It is stored in a
variable called sentence:
sentence = 'small savings yield large gains'
Parsing a chart
Chart parsers are special types of parsers which are suitable for natural languages as they
have ambiguous grammars. They use dynamic programming to generate the desired
results.
The good thing about dynamic programming is that, it breaks the given problem into
subproblems and stores the result in a shared location, which can be further used by
algorithm wherever similar subproblem occurs elsewhere. This greatly reduces the need to
re-compute the same thing over and over again.
[ 142 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
In this recipe, we will learn the chart parsing features that are provided by the NLTK
library.
Getting ready
You should have Python installed, along with the nltk library. An understanding of
grammars is good to have.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called ParsingChart.py.
3. Type the following source code:
[ 143 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
How it works...
Let's see how the program works. This instruction imports the CFG module into the
program:
from nltk.grammar import CFG
This instruction imports the ChartParser and BU_LC_STRATEGY features into the
program:
from nltk.parse.chart import ChartParser, BU_LC_STRATEGY
[ 144 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
We are creating a sample grammar for the example that we are going to use. All the
producers are expressed in the BNF form:
grammar = CFG.fromstring("""
S -> T1 T4
T1 -> NNP VBZ
T2 -> DT NN
T3 -> IN NNP
T4 -> T3 | T2 T3
NNP -> 'Tajmahal' | 'Agra' | 'Bangalore' | 'Karnataka'
VBZ -> 'is'
IN -> 'in' | 'of'
DT -> 'the'
NN -> 'capital'
""")
A new chart parser object is created using the grammar object BU_LC_STRATEGY, and we
have set trace to True so that we can see how the parsing happens on the screen:
cp = ChartParser(grammar, BU_LC_STRATEGY, trace=True)
We are going to process this sample string in this program; it is stored in a variable
called sentence:
sentence = "Bangalore is the capital of Karnataka"
This instruction takes the list of words as input and then starts the parsing. The result of the
parsing is made available in the chart object:
chart = cp.chart_parse(tokens)
[ 145 ]
Chunking, Sentence Parse, and Dependencies Chapter 6
We are acquiring all the parse trees that are available in the chart into the parses variable:
parses = list(chart.parses(grammar.start()))
This instruction prints the total number of edges in the current chart object:
print("Total Edges :", len(chart.edges()))
This instruction shows a nice tree view of the chart on a GUI widget.
tree.draw()
[ 146 ]
Information Extraction and Text
7
Classification
In this chapter, we will cover the following recipes:
Introduction
Information retrieval is a vast area and has many challenges. In previous chapters, we
understood regular expressions, grammars, Parts-of-Speech (POS) tagging, and chunking.
The natural step after this process is to identify the Interested Entities in a given piece of
text. To be clear, when we are processing large amounts of data, we are really interested in
finding out whether any famous personalities, places, products, and so on are mentioned.
These things are called named entities in NLP. We will understand more about these with
examples in the following recipes. Also, we will see how we can leverage the clues that are
present in the input text to categorize large amounts of text, and many more examples will
be explained. Stay tuned!
Information Extraction and Text Classification Chapter 7
Named entities help us understand more about what is being referred to in a given text so
that we can further classify the data. Since named entities comprise more than one word, it
is sometimes difficult to find these from the text.
Let's take up the following examples to understand what named entities are:
1. Even though South Bank refers to a direction, it does not qualify as a named entity
because we cannot uniquely identify the object from that.
2. Even though Fashion is a noun, we cannot completely qualify it as named entity.
3. Skyscraper is a noun, but there can be many possibilities for Skyscrapers.
4. CEO is a role here; there are many possible persons who can hold this title. So,
this also cannot be a named entity.
To further understand, let's just look at these NEs from a categories perspective:
[ 148 ]
Information Extraction and Text Classification Chapter 7
TIME 10:10 AM
PERSON Satya Nadella, Jeff Weiner, Bill Gates
The next logical step would be to further extend the algorithms to find out the named
entities as a sixth step. So, we will basically be using data that is preprocessed until step 5 as
part of this example.
We will be using treebank data to understand the NER process. Remember, the data is
already pre-tagged in IOB format. Without the training process, none of the algorithms that
we are seeing here are going to work. (So, there is no magic!)
In order to understand the importance of the training process, let's take up an example. Say,
there is a need for the Archaeological department to figure out which of the famous places
in India are being tweeted and mentioned in social networking websites in the Kannada
Language.
Assuming that they have already got the data somewhere and it is in terabytes or even in
petabytes, how do they find out all these names? This is where we need to take a sample
dataset from the original input and do the training process to further use this trained data
set to extract the named entities in Kannada.
[ 149 ]
Information Extraction and Text Classification Chapter 7
Getting ready
You should have Python installed, along with the nltk library.
How to do it...
1. Open Atom editor (or you favorite programming editor).
2. Create a new file called NER.py.
3. Type the following source code:
[ 150 ]
Information Extraction and Text Classification Chapter 7
[ 151 ]
Information Extraction and Text Classification Chapter 7
How it works...
The code looks so simple, right? However, all the algorithms are implemented in the nltk
library. So, let's dig into how this simple program gives what we are looking for. This
instruction imports the nltk library into the program:
import nltk
These three instructions define a new function called sampleNE(). We are importing the
first tagged sentence from the treebank corpus and then passing it to
the nltk.ne_chunk() function to extract the named entities. The output from this program
includes all the named entities with their proper category:
def sampleNE():
sent = nltk.corpus.treebank.tagged_sents()[0]
print(nltk.ne_chunk(sent))
These three instructions define a new function called sampleNE2(). We are importing the
first tagged sentence from the treebank corpus and then passing it to
the nltk.ne_chunk() function to extract the named entities. The output from this program
includes all the named entities without any proper category. This is helpful if the training
dataset is not accurate enough to tag the named entities with the proper category such as
person, organization, location, and so on:
def sampleNE2():
sent = nltk.corpus.treebank.tagged_sents()[0]
print(nltk.ne_chunk(sent, binary=True))
These three instructions will call the two sample functions that we have defined before and
print the results on the screen.
if __name__ == '__main__':
sampleNE()
sampleNE2()
[ 152 ]
Information Extraction and Text Classification Chapter 7
We can use POS identification on the preceding sentence. But if someone were to ask what
POS flights is in this sentence, we should have an efficient way to look for this word. This
is where dictionaries come into play. They can be thought of as one-to-one mappings
between data of interest. Again this one-to-one is at the highest level of abstraction of the
data unit that we are talking about. If you are an expert programmer in Python, you know
how to do many-to-many also. In this simple example, we need something like this:
flights -> Noun
Weather -> Noun
Now let's answer a different question. Is it possible to print the list of all the words in the
sentence that are nouns? Yes, for this too, we will learn how to use a Python dictionary.
Getting ready
You should have Python installed, along with the nltk library, in order to run this
example.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called Dictionary.py.
[ 153 ]
Information Extraction and Text Classification Chapter 7
[ 154 ]
Information Extraction and Text Classification Chapter 7
How it works...
Now, let's understand more about dictionaries by going through the instructions we have
written so far. We are importing the nltk library into the program:
import nltk
[ 155 ]
Information Extraction and Text Classification Chapter 7
This instruction identifies the POS for words and saves the result in the class member
tagged:
self.tagged = nltk.pos_tag(self.words)
This instruction invokes the buildDictionary() function that is defined in the class:
self.buildDictionary()
This instruction initializes a empty dictionary variable in the class. These two instructions
iterate over all the tagged pos list elements and then assign each word to
the dictionary as key and the POS as value of the key:
self.dictionary = {}
for (word, pos) in self.tagged:
self.dictionary[word] = pos
[ 156 ]
Information Extraction and Text Classification Chapter 7
This instruction iterates over all the dictionary keys and puts the key of dictionary into
a local variable called key:
for key in self.dictionary.keys():
This instruction extracts the value (POS) of the given key (word) and stores it in a local
variable called value:
value = self.dictionary[key]
These four instructions check whether a given key (word) is already in the reverse
dictionary variable (rdictionary). If it is, then we append the currently found word to the
list. If the word is not found, then we create a new list of size one with the current word as
the member:
if value not in self.rdictionary:
self.rdictionary[value] = [key]
else:
self.rdictionary[value].append(key)
This function returns the POS for the given word by looking into dictionary. If the value
is not found, a special value of None is returned:
def getPOSForWord(self, word):
return self.dictionary[word] if word in self.dictionary else None
[ 157 ]
Information Extraction and Text Classification Chapter 7
These two instructions define a function that returns all the words in the sentence with a
given POS by looking into rdictionary (reverse dictionary). If the POS is not found, a
special value of None is returned:
def getWordsForPOS(self, pos):
return self.rdictionary[pos] if pos in self.rdictionary else None
We define a variable called sentence, which stores the string that we are interested in
parsing:
sentence = "All the flights got delayed due to bad weather"
Initialize the LearningDictionary() class with sentence as a parameter. Once the class
object is created, it is assigned to the learning variable:
learning = LearningDictionary(sentence)
We create a list of words that we are interested in knowing the POS of. If you see carefully,
we have included a few words that are not in the sentence:
words = ["chair", "flights", "delayed", "pencil", "weather"]
We create a list of pos for which we are interested in seeing the words that belong to these
POS classifications:
pos = ["NN", "VBS", "NNS"]
These instructions iterate over all the words, take one word at a time, check whether
the word is in the dictionary by calling the isWordPresent() function of the object, and
then print its status. If the word is present in the dictionary, then we print the POS for the
word:
for word in words:
status = learning.isWordPresent(word)
print("Is '{}' present in dictionary ? : '{}'".format(word, status))
if status is True:
print("\tPOS For '{}' is '{}'".format(word,
learning.getPOSForWord(word)))
[ 158 ]
Information Extraction and Text Classification Chapter 7
In these instructions, we iterate over all the pos. We take one word at a time and then print
the words that are in this POS using the getWordsForPOS() function.
for pword in pos:
print("POS '{}' has '{}' words".format(pword,
learning.getWordsForPOS(pword)))
Let's try to learn how the vehicle numbers give some clues about what they mean:
Getting ready
You should have Python installed, along with the nltk library.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called Features.py.
[ 159 ]
Information Extraction and Text Classification Chapter 7
[ 160 ]
Information Extraction and Text Classification Chapter 7
How it works...
Now, let's see what our program does. These two instructions import
the nltk and random libraries into the current program:
import nltk
import random
We are defining a list of Python tuples, where the first element in the tuple is the vehicle
number and the second element is the predefined label that is applied to the number.
These instructions define that all the numbers are classified into three labels—rtc, gov,
and oth:
sampledata = [
('KA-01-F 1034 A', 'rtc'),
('KA-02-F 1030 B', 'rtc'),
('KA-03-FA 1200 C', 'rtc'),
('KA-01-G 0001 A', 'gov'),
[ 161 ]
Information Extraction and Text Classification Chapter 7
This instruction shuffles all of the data in the sampledata list to make sure that the
algorithm is not biased by the order of elements in the input sequence:
random.shuffle(sampledata)
These are the test vehicle numbers for which we are interested in finding the category:
testdata = [
'KA-01-G 0109',
'KA-02-F 9020 AC',
'KA-02-FA 0801',
'KA-01 9129'
]
This instruction creates a list of feature tuples, where the first member in the tuple is feature
dictionary and the second member in tuple is the label of the data. After this instruction, the
input vehicle numbers in sampledata are no longer visible. This is one of the key things to
remember:
featuresets = [(vehicleNumberFeature(vn), cls) for (vn, cls) in sampledata]
This instruction trains NaiveBayesClassifier with the feature dictionary and the labels
that are applied to featuresets. The result is available in the classifier object, which we
will use further:
classifier = nltk.NaiveBayesClassifier.train(featuresets)
[ 162 ]
Information Extraction and Text Classification Chapter 7
These instructions iterate over the test data and then print the label of the data from the
classification done using vehicleNumberFeature. Observe the output carefully. You will
see that the feature extraction function that we have written does not perform well in
labeling the numbers correctly:
for num in testdata:
feature = vehicleNumberFeature(num)
print("(simple) %s is of type %s" %(num, classifier.classify(feature)))
These instructions define a new function called vehicleNumberFeature that returns the
feature dictionary with two keys. One key, vehicle_class, returns the character at
position 6 in the string, and vehicle_prev has the character at position 5. These kinds of
clues are very important to make sure we eliminate bad labeling of data:
def vehicleNumberFeature(vnumber):
return {
'vehicle_class': vnumber[6],
'vehicle_prev': vnumber[5]
}
This instruction creates a list of featuresets and input labels by iterating over of all the
input trained data. As before, the original input vehicle numbers are no longer present here:
featuresets = [(vehicleNumberFeature(vn), cls) for (vn, cls) in sampledata]
These instructions loop through the testdata and print the classification of the input
vehicle number based on the trained dataset. Here, if you observe carefully, the false-
positive is not there:
for num in testdata:
feature = vehicleNumberFeature(num)
print("(dual) %s is of type %s" %(num, classifier.classify(feature)))
[ 163 ]
Information Extraction and Text Classification Chapter 7
Invoke both the functions and print the results on the screen.
learnSimpleFeatures()
learnFeatures()
If we observe carefully, we realize that the first function's results have one false positive,
where it cannot identify the gov vehicle. This is where the second function performs well,
as it has more features that improve accuracy.
In order to solve this problem, let's find out the features (or clues) that we
can leverage to come up with a classifier and then use the classifier to
extract sentences in large text.
If we encounter a punctuation mark like . then it ends a sentence If we
encounter a punctuation mark like . and the next word's first letter is a
capital letter, then it ends a sentence.
Let's try to write a simple classifier using these two features to mark
sentences.
Getting ready
You should have Python installed along with nltk library.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new File called Segmentation.py.
[ 164 ]
Information Extraction and Text Classification Chapter 7
[ 165 ]
Information Extraction and Text Classification Chapter 7
How it works...
This instruction imports the nltk library into the program:
import nltk
This function defines a modified feature extractor that returns a tuple containing the
dictionary of the features and True or False to tell whether this feature indicates a
sentence boundary or not:
def featureExtractor(words, i):
return ({'current-word': words[i], 'next-is-upper':
words[i+1][0].isupper()}, words[i+1][0].isupper())
[ 166 ]
Information Extraction and Text Classification Chapter 7
This function takes a sentence as input and returns a list of featuresets that is a list of
tuples, with the feature dictionary and True or False:
def getFeaturesets(sentence):
words = nltk.word_tokenize(sentence)
featuresets = [featureExtractor(words, i) for i in range(1, len(words) -
1) if words[i] == '.']
return featuresets
This function takes the input text, breaks it into words, and then traverses through each
word in the list. Once it encounters a full stop, it calls classifier to conclude whether it
has encountered a sentence end. If the classifier returns True, then the sentence is
found and we move on to the next word in the input. The process is repeated for all words
in the input:
def segmentTextAndPrintSentences(data):
words = nltk.word_tokenize(data)
for i in range(0, len(words) - 1):
if words[i] == '.':
if classifier.classify(featureExtractor(words, i)[0]) == True:
print(".")
else:
print(words[i], end='')
else:
print("{} ".format(words[i]), end='')
print(words[-1])
These instructions define a few variables for training and evaluation of our classifier:
# copied the text from https://en.wikipedia.org/wiki/India
traindata = "India, officially the Republic of India (Bhārat Gaṇarājya),[e]
is a country in South Asia. it is the seventh-largest country by area, the
second-most populous country (with over 1.2 billion people), and the most
populous democracy in the world. It is bounded by the Indian Ocean on the
south, the Arabian Sea on the southwest, and the Bay of Bengal on the
southeast. It shares land borders with Pakistan to the west;[f] China,
Nepal, and Bhutan to the northeast; and Myanmar (Burma) and Bangladesh to
the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and
the Maldives. India's Andaman and Nicobar Islands share a maritime border
with Thailand and Indonesia."
testdata = "The Indian subcontinent was home to the urban Indus Valley
Civilisation of the 3rd millennium BCE. In the following millennium, the
oldest scriptures associated with Hinduism began to be composed. Social
stratification, based on caste, emerged in the first millennium BCE, and
Buddhism and Jainism arose. Early political consolidations took place under
the Maurya and Gupta empires; the later peninsular Middle Kingdoms
[ 167 ]
Information Extraction and Text Classification Chapter 7
Extract all the features from the traindata variable and store it in traindataset:
traindataset = getFeaturesets(traindata)
Invoke the function on testdata and print all the found sentences as output on the screen:
segmentTextAndPrintSentences(testdata)
Classifying documents
In this recipe, we will learn how to write a classifier that can be used to classify documents.
In our case, we will classify rich site summary (RSS) feeds. The list of categories is known
ahead of time, which is important for the classification task.
In this information age, there are vast amounts of text available. Its humanly impossible for
us to properly categorize all information for further consumption. This is where
categorization algorithms help us to properly categorize the newer sets of documents that
are being produced based on the training given on sample data.
Getting ready
You should have Python installed, along with the nltk and feedparser libraries.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called DocumentClassify.py.
[ 168 ]
Information Extraction and Text Classification Chapter 7
[ 169 ]
Information Extraction and Text Classification Chapter 7
How it works...
Let's see how this document classification works. Importing three libraries into the
program:
import nltk
import random
import feedparser
This instruction defines a new dictionary with two RSS feeds pointing to Yahoo! sports.
They are pre-categorized. The reason we have selected these RSS feeds is that data is readily
available and categorized for our example:
urls = {
'mlb': 'https://sports.yahoo.com/mlb/rss.xml',
'nfl': 'https://sports.yahoo.com/nfl/rss.xml',
}
[ 170 ]
Information Extraction and Text Classification Chapter 7
Initializing the empty dictionary variable feedmap to keep the list of RSS feeds in memory
until the program terminates:
feedmap = {}
Getting the list of stopwords in English and storing it in the stopwords variable:
stopwords = nltk.corpus.stopwords.words('english')
This function, featureExtractor(), takes a list of words and adds each word that is not a
stop word to a dictionary, where the key is the word and the value is True. The returned
dictionary constitutes the features for the given input words:
def featureExtractor(words):
features = {}
for word in words:
if word not in stopwords:
features["word({})".format(word)] = True
return features
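The full listing in the book's screenshot also initializes an empty list that the loop further
down appends to; a minimal sketch of that line (the name sentences matches its later use,
but the exact placement in the original listing is an assumption) would be:
# hypothetical initialization from the full listing: collects (category, words) tuples
sentences = []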
Iterate over all the keys() of the dictionary called urls and store the key in a variable
called category:
for category in urls.keys():
Download one feed and store the result in the feedmap[category] variable using
the parse() function from the feedparser module:
feedmap[category] = feedparser.parse(urls[category])
Display the URL that is being downloaded on the screen, using Python's built-in print
function:
print("downloading {}".format(urls[category]))
Iterate over all the RSS entries and store the current entry in a variable called entry:
for entry in feedmap[category]['entries']:
Take the summary (news text) of the RSS feed item into the data variable:
data = entry['summary']
We break the summary into words on whitespace so that we can pass them to nltk for
feature extraction:
words = data.split()
Store all the words in the current RSS feed item, along with the category it belongs to, in a tuple:
sentences.append((category, words))
Extract all the features of sentences and store them in the variable featuresets. Later,
do shuffle() on this array so that all the elements in the list are randomized for the
algorithm:
featuresets = [(featureExtractor(words), category) for category, words in
sentences]
random.shuffle(featuresets)
Create two datasets, one trainset and the other testset, for our analysis:
total = len(featuresets)
off = int(total/2)
trainset = featuresets[off:]
testset = featuresets[:off]
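The training step itself appears only in the book's screenshot; a minimal sketch, assuming
NLTK's Naive Bayes classifier is trained on trainset and checked against testset, would
be:
# hypothetical training step: build the classifier used in the calls below
classifier = nltk.NaiveBayesClassifier.train(trainset)
print("accuracy = {}".format(nltk.classify.util.accuracy(classifier, testset)))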
Print the most informative features of the trained classifier using its built-in function:
classifier.show_most_informative_features(5)
Take four sample entries from the nfl RSS feed. Try to classify each document based on its
title (remember, the classifier was trained on summaries):
for (i, entry) in enumerate(feedmap['nfl']['entries']):
if i < 4:
features = featureExtractor(entry['title'].split())
category = classifier.classify(features)
print('{} -> {}'.format(category, entry['title']))
Let's try to write a program that leverages the feature extraction concept to find the POS of
the words in the sentence.
Getting ready
You should have Python installed, along with nltk.
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called ContextTagger.py.
How it works...
Let's see how the current program works. This instruction imports the nltk library into
the program:
import nltk
Here are some sample strings that illustrate the dual behavior of the words address and laugh:
sentences = [
"What is your address when you're in Bangalore?",
"the president's address on the state of the economy.",
"He addressed his remarks to the lawyers in the audience.",
"In order to address an assembly, we should be ready",
"He laughed inwardly at the scene.",
"After all the advance publicity, the prizefight turned out to be a
laugh.",
"We can learn to laugh a little at even our most serious foibles."
]
This function takes sentence strings and returns a list of lists, where the inner lists contain
the words along with their POS tags:
def getSentenceWords():
sentwords = []
for sentence in sentences:
words = nltk.pos_tag(nltk.word_tokenize(sentence))
sentwords.append(words)
return sentwords
In order to set up a baseline and see how bad the tagging can be, this function shows how
UnigramTagger can be used to print the POS of words by looking only at the current word.
We feed it the sample text as training data. This tagger performs very badly compared to the
built-in tagger that nltk ships with, but it is useful for our understanding:
def noContextTagger():
tagger = nltk.UnigramTagger(getSentenceWords())
print(tagger.tag('the little remarks towards assembly are laughable'.split()))
This function does feature extraction on a given list of words and returns a dictionary
containing the last one, two, and three characters of the current word, along with the
previous word:
def wordFeatures(words, wordPosInSentence):
# extract all the ing forms etc
endFeatures = {
'last(1)': words[wordPosInSentence][-1],
'last(2)': words[wordPosInSentence][-2:],
'last(3)': words[wordPosInSentence][-3:],
}
# use previous word to determine if the current word is verb or noun
if wordPosInSentence > 1:
endFeatures['prev'] = words[wordPosInSentence - 1]
else:
endFeatures['prev'] = '|NONE|'
return endFeatures
We are building a featureddata list. It contains tuples of feature dictionaries and tags,
which we will use to classify with NaiveBayesClassifier:
allsentences = getSentenceWords()
featureddata = []
for sentence in allsentences:
untaggedSentence = nltk.tag.untag(sentence)
featuredsentence = [(wordFeatures(untaggedSentence, index), tag) for
index, (word, tag) in enumerate(sentence)]
featureddata.extend(featuredsentence)
We take 50% of the feature-extracted data for training and the remaining 50% to test our classifier:
breakup = int(len(featureddata) * 0.5)
traindata = featureddata[breakup:]
testdata = featureddata[:breakup]
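The withContextTagger() function itself is shown only in the book's screenshot; a minimal
sketch consistent with the surrounding description, training NLTK's NaiveBayesClassifier
on traindata and measuring its accuracy on testdata, might look like this:
def withContextTagger():
    # hypothetical sketch: classify using the context features built by wordFeatures()
    classifier = nltk.NaiveBayesClassifier.train(traindata)
    print("Accuracy of the classifier : {}".format(nltk.classify.util.accuracy(classifier, testdata)))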
These two calls run the two taggers defined above and print the results of their computations:
noContextTagger()
withContextTagger()
Advanced NLP Recipes
8
In this chapter, we will go through the following recipes:
Introduction
So far, we have seen how to process input text, identify parts of speech, and extract
important information (named entities). We've learned a few computer science concepts
also, such as grammars, parsers, and so on. In this chapter, we will dig deeper into
advanced topics in natural language processing (NLP), which need several techniques to
properly understand and solve them.
The input format of the data that is flowing through each of the components
The output format of the data that is coming out of each of the components
Making sure that data flow is controlled between components by adjusting the
velocity of data inflow and outflow
For example, if you are familiar with Unix/Linux systems and have some exposure to
working on a shell, you'd have seen the | operator, which is the shell's abstraction of a data
pipe. We can leverage the | operator to build pipelines in the Unix shell.
Let's take an example in Unix (for a quick understanding): how do we find the number of files
in a given directory?
We need a component (or a command in the Unix context) that reads the
directory and lists all the files in it
We need another component (or a command in the Unix context) that reads the
lines and prints the count of lines
The ls command
The wc command
If we can build a pipeline that takes the output of ls and feeds it to wc (in the shell, ls | wc -l), we are done.
With this knowledge, let's get back to the NLP pipeline requirements:
In this recipe, let's try to build the simplest possible pipeline; it acquires data from a remote
RSS feed and then prints the identified named entities in each document.
Getting ready
You should have Python installed, along with the nltk and feedparser libraries (queue,
threading, and uuid are part of the Python standard library).
How to do it...
1. Open Atom editor (or your favorite programming editor).
2. Create a new file called PipelineQ.py.
How it works...
Let's see how to build this pipeline:
import nltk
import threading
import queue
import feedparser
import uuid
These five instructions import five Python libraries into the current program. Next, we
create a new empty list to keep track of all the threads in the program:
threads = []
We also create two queues that connect the stages of the pipeline and hold the data flowing
between them until the program terminates:
queues = [queue.Queue(), queue.Queue()]
This instruction defines a new function, extractWords(), which reads a sample RSS feed
from the internet and stores the words, along with a unique identifier for this text:
def extractWords():
We are defining a sample URL (entertainment news) from the India Times website:
url = 'https://timesofindia.indiatimes.com/rssfeeds/1081479906.cms'
We are taking the first five entries from the RSS feed and storing the current item in a
variable called entry:
for entry in feed['entries'][:5]:
The title of the current RSS feed item is stored in a variable called text:
text = entry['title']
This instruction skips the titles that contain sensitive words. Since we are reading the data
from the internet, we have to make sure that the data is properly sanitized:
if 'ex' in text:
continue
Break the input text into words using the word_tokenize() function and store the result
into a variable called words:
words = nltk.word_tokenize(text)
Create a dictionary called data with two key-value pairs, where we are storing the UUID
and input words under the UUID and input keys respectively:
data = {'uuid': uuid.uuid4(), 'input': words}
This instruction stores the dictionary in the first queue, queues[0]. The second argument is
set to True, which means that if the queue is full, the call blocks (pausing the thread) until
space becomes available:
queues[0].put(data, True)
A well-designed pipeline understands that it should control the inflow and outflow of the
data according to the component's computation capacity. If not, the entire pipeline
collapses. This instruction prints the current RSS item that we are processing along with its
unique ID:
print(">> {} : {}".format(data['uuid'], text))
This instruction defines a new function called extractPOS(), which reads from the first
queue, processes the data, and saves the POS of the words in the second queue:
def extractPOS():
These instructions check whether the first queue is empty. When the queue is empty, we
stop processing:
if queues[0].empty():
break
In order to make this program more robust, you could propagate feedback through the first
queue; this is left as an exercise for the reader. The else branch handles the case where
there is some data in the first queue:
else:
Take the first item from the queue (in FIFO order):
data = queues[0].get()
Update the first queue, mentioning that we are done with processing the item that is just
extracted by this thread:
queues[0].task_done()
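The POS tagging step itself appears only in the screenshot; a sketch of what presumably
happens between taking the item and queuing the result would be:
# hypothetical step from the full listing: tag the tokenized words with their parts of speech
words = data['input']
postags = nltk.pos_tag(words)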
Store the POS-tagged word list in the second queue so that the next phase in the pipeline
can process it. Here too, we pass True as the second parameter, which makes the thread
wait if there is no free space in the queue:
queues[1].put({'uuid': data['uuid'], 'input': postags}, True)
This instruction defines a new function, extractNE(), which reads from the second queue,
processes the POS tagged words, and prints the named entities on screen:
def extractNE():
This instruction picks an element from the second queue and stores it in a data variable:
else:
data = queues[1].get()
These instructions take the POS-tagged words out of the element and mark the completion
of processing for the item that was just picked from the second queue:
postags = data['input']
queues[1].task_done()
This instruction extracts the named entities from the postags variable and stores it in a
variable called chunks:
chunks = nltk.ne_chunk(postags, binary=False)
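The code that walks the chunk tree and prints the entities appears only in the screenshot; a
minimal sketch (preceding the lone print() call below, which just emits a blank line
between documents) might be:
# hypothetical printing step: named-entity subtrees carry a label, plain tokens are tuples
print("  << {} : ".format(data['uuid']), end='')
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(" ".join(word for word, tag in chunk.leaves()), end=', ')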
print()
This instruction defines a new function, runProgram, which does the pipeline setup using
threads:
def runProgram():
These three instructions create a new thread with extractWords() as the function, start
the thread and add the thread object (e) to the list called threads:
e = threading.Thread(target=extractWords())
e.start()
threads.append(e)
These instructions create a new thread with extractPOS() as the function, start the thread,
and add the thread object (p) to the list variable threads:
p = threading.Thread(target=extractPOS())
p.start()
threads.append(p)
These instructions create a new thread using extractNE() for the code, start the thread,
and add the thread object (n) to the list threads:
n = threading.Thread(target=extractNE())
n.start()
threads.append(n)
These two instructions block until every item that was placed on the queues has been
processed and marked as done:
queues[0].join()
queues[1].join()
These two instructions iterate over the threads list, store the current thread object in a
variable, t, and call the join() function to wait for the thread to finish and release the
resources allocated to it:
for t in threads:
t.join()
This is the section of the code that is invoked when the program is run with the main
thread. The runProgram() is called, which simulates the entire pipeline:
if __name__ == '__main__':
runProgram()
Sentiment/emotion dimension
Sense dimension
Mere presence of certain words
There are many algorithms available for this; all of them vary in the degree of complexity,
the resources needed, and the volume of data we are dealing with.
In this recipe, we will use the TF-IDF algorithm to solve the similarity problem. So first, let's
understand the basics:
Term frequency (TF): This technique tries to find the relative importance (or
frequency) of the word in a given document
Inverse document frequency (IDF): This technique makes sure that words that
are used frequently (a, the, and so on) are given lower weight compared to
words that are used rarely.
Since both TF and IDF values are plain numbers (fractions), we multiply the two values for
each term in every document and build M vectors of N dimensions (where N is the total
number of documents and M is the number of unique words across all the documents).
Once we have these vectors, we need to find the cosine similarity using the following
formula on these vectors:
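The formula itself appears as an image in the book; the standard definition of cosine
similarity between two vectors A and B is the dot product of the vectors divided by the
product of their magnitudes:
cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)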
Getting ready
You should have Python installed, along with the nltk and scikit-learn libraries. Having
some understanding of the mathematics is helpful.
How to do it...
1. Open atom editor (or your favorite programming editor).
2. Create a new file called Similarity.py.
How it works...
Let's see how we are solving the text similarity problem. These four instructions import the
necessary libraries that are used in the program:
import nltk
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
This instruction defines sample sentences on which we want to find the similarity.
self.statements = [
'ruled india',
'Chalukyas ruled Badami',
'So many kingdoms ruled India',
'Lalbagh is a botanical garden in India'
]
The TF() method (shown in the book's listing, and sketched below) does the following:
Converts the sentence to lowercase and extracts all the words
Finds the frequency distribution of these words using the nltk FreqDist function
Iterates over all the dictionary keys, builds the normalized floating-point values, and
stores them in a dictionary
Returns the dictionary that contains the normalized score for each word in the
sentence
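The TF() method itself appears only in the book's screenshot; a minimal sketch consistent
with the description above (normalizing nltk FreqDist counts by the number of tokens in
the sentence) might be:
def TF(self, sentence):
    # hypothetical sketch of the term-frequency computation described above
    words = nltk.word_tokenize(sentence.lower())
    freq = nltk.FreqDist(words)
    dictionary = {}
    for key in freq.keys():
        dictionary[key] = freq[key] / float(len(words))   # normalized term frequency
    return dictionary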
We define an IDF() method that finds the IDF value for every word across all the documents:
def IDF(self):
def idf(TotalNumberOfDocuments, NumberOfDocumentsWithThisWord):
return 1.0 + math.log(TotalNumberOfDocuments/NumberOfDocumentsWithThisWord)
numDocuments = len(self.statements)
uniqueWords = {}
idfValues = {}
for sentence in self.statements:
for word in nltk.word_tokenize(sentence.lower()):
if word not in uniqueWords:
uniqueWords[word] = 1
else:
uniqueWords[word] += 1
for word in uniqueWords:
idfValues[word] = idf(numDocuments, uniqueWords[word])
return idfValues
We define a local function called idf(), which is the formula to find the IDF of a
given word
We iterate over all the statements and convert them to lowercase
Find how many times each word is present across all the documents
Build the IDF value for all words and return the dictionary containing these IDF
values
We now define a TF_IDF() method (TF multiplied by IDF) that builds the vectors for all the
documents against a given search string:
def TF_IDF(self, query):
words = nltk.word_tokenize(query.lower())
idf = self.IDF()
vectors = {}
for sentence in self.statements:
tf = self.TF(sentence)
for word in words:
tfv = tf[word] if word in tf else 0.0
idfv = idf[word] if word in idf else 0.0
mul = tfv * idfv
if word not in vectors:
vectors[word] = []
vectors[word].append(mul)
return vectors
Now, in order to find the similarity, as we discussed initially, we need to compute the cosine
similarity on all the input vectors. We could do all the math ourselves, but this time let's use
scikit-learn to do the computation for us:
def cosineSimilarity(self):
vec = TfidfVectorizer()
matrix = vec.fit_transform(self.statements)
for j in range(1, 5):
i = j - 1
print("\tsimilarity of document {} with others".format(i))
similarity = cosine_similarity(matrix[i:j], matrix)
print(similarity)
In the previous functions, we learned how to build TF and IDF values and finally get the TF
x IDF values for all the documents.
This is the demo() function and it runs all the other functions we have defined before:
def demo(self):
inputQuery = self.statements[0]
vectors = self.TF_IDF(inputQuery)
self.displayVectors(vectors)
self.cosineSimilarity()
We are creating a new object for the TextSimilarityExample() class and then invoking
the demo() function.
similarity = TextSimilarityExample()
similarity.demo()
Identifying topics
In the previous chapter, we learned how to do document classification. Beginners might
think document classification and topic identification are the same, but there is a slight
difference.
Topic identification is the process of discovering topics that are present in the input
document set. These topics can be multiple words that occur uniquely in a given text.
Let's take an example. When we read an arbitrary text that mentions Sachin Tendulkar,
score, and win, we can understand that the sentence is describing cricket. But we may be
wrong as well.
In order to find all these types of topics in a given input text, we use the Latent Dirichlet
allocation algorithm (we could use TF-IDF as well, but since we have already explored it in
a previous recipe, let's see how LDA works in identifying the topic).
Getting ready
You should have Python installed, along with the nltk, gensim, and feedparser libraries.
How to do it...
1. Open atom editor (or your favorite programming editor).
2. Create a new file called IdentifyingTopic.py.
3. Type the following source code:
How it works...
Let's see how the topic identification program works. These five instructions import the
necessary libraries into the current program.
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim import corpora, models
import nltk
import feedparser
Download all the documents mentioned in the URL and store the list of dictionaries into a
variable called feed:
url = 'https://sports.yahoo.com/mlb/rss.xml'
feed = feedparser.parse(url)
Initialize an empty list to keep track of all the documents that we are going to analyze further:
self.documents = []
Take the top five documents from the feed variable and store the current news item into a
variable called entry:
for entry in feed['entries'][:5]:
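The line that actually stores each entry appears only in the screenshot; presumably the text
of the entry (its title, or possibly its summary) is appended to the class member, for example:
# hypothetical: the book's listing may use entry['summary'] instead of the title
text = entry['title']
self.documents.append(text)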
Display an informational message to the user that we have collected N documents from the
given url:
print("INFO: Fetching documents from {} completed".format(url))
We are interested in extracting only words made of English letters. So, this tokenizer is
defined to break the text into tokens, where each token consists of the letters a to z or A to
Z. This way, we can be sure that punctuation and other junk characters don't make it into
the processing:
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
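The stop word list used further down (en_stop) is defined in the book's screenshot;
presumably it is built from the stopwords corpus imported at the top of the program:
# hypothetical: English stop words to filter out during cleaning
en_stop = stopwords.words('english')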
Define an empty list called cleaned, which is used to store all the cleaned and tokenized
documents:
self.cleaned = []
Iterate over all the documents we have collected using the getDocuments() function:
for doc in self.documents:
Convert the document to lowercase so that the same word is not treated differently because
of its case:
lowercase_doc = doc.lower()
Break the sentence into words. The output is a list of words stored in a variable
called words:
words = tokenizer.tokenize(lowercase_doc)
Ignore all the words from the sentence if they belong to the English stop word category and
store all of them in the non_stopped_words variable:
non_stopped_words = [i for i in words if not i in en_stop]
Store the sentence that is tokenized and cleaned in a variable called self.cleaned (class
member).
self.cleaned.append(non_stopped_words)
Show a diagnostic message to the user that we have finished cleaning the documents:
print("INFO: Cleaning {} documents
completed".format(len(self.documents)))
This instruction defines a new function, doLDA, which runs the LDA analysis on the cleaned
documents:
def doLDA(self):
Before we directly process the cleaned documents, we create a dictionary from these
documents:
dictionary = corpora.Dictionary(self.cleaned)
The input corpus is defined as a bag of words for each cleaned sentence:
corpus = [dictionary.doc2bow(cleandoc) for cleandoc in self.cleaned]
Create a model on the corpus with the number of topics defined as 2 and set the vocabulary
size/mapping using the id2word parameter:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=2, id2word =
dictionary)
Print two topics, where each topic should contain four words on the screen:
print(ldamodel.print_topics(num_topics=2, num_words=4))
When the current program is invoked as the main program, create a new object
called topicExample from the IdentifyingTopicExample() class and invoke
the run() function on the object.
if __name__ == '__main__':
topicExample = IdentifyingTopicExample()
topicExample.run()
Summarizing text
In this information overload era, there is so much information available in print/text form
that it is humanly impossible for us to consume all of it. To make the consumption of this
data easier, algorithms have been invented that can reduce a large text to a summary (or
gist) that we can easily digest.
By doing this, we will save time and also reduce the load on the network.
In this recipe, we will use the gensim library, which has built-in support for this
summarization using the TextRank algorithm
(https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf).
Getting ready
You should have Python installed, along with the bs4 and gensim libraries.
How to do it...
1. Open atom editor (or your favorite programming editor).
2. Create a new file called Summarize.py.
How it works...
Let's see how our summarization program works.
from gensim.summarization import summarize
from bs4 import BeautifulSoup
import requests
These three instructions import the necessary libraries into the current program.
We define a dictionary called urls, whose keys are the (auto-generated) titles of the papers
and whose values are the URLs of the papers:
urls = {
'Daff: Unproven Unification of Suffix Trees and Redundancy':
'http://scigen.csail.mit.edu/scicache/610/scimakelatex.21945.none.html',
'CausticIslet: Exploration of Rasterization':
'http://scigen.csail.mit.edu/scicache/790/scimakelatex.1499.none.html'
}
Download the content of the url using the requests library's get() method and store the
response object into a variable, r:
r = requests.get(url)
Use BeautifulSoup() to parse the text from the r object using the HTML parser and store
the return object in a variable called soup:
soup = BeautifulSoup(r.text, 'html.parser')
Strip out all the HTML tags and extract only the text from the document into the
variable data:
data = soup.get_text()
Find the position of the text 1 Introduction and skip past it, to mark the starting offset
from which we want to extract the substring:
pos1 = data.find("1 Introduction") + len("1 Introduction")
Find the second position in the document, exactly at the beginning of the related
work section:
pos2 = data.find("2 Related Work")
Now, extract the introduction of the paper, which is between these two offsets:
text = data[pos1:pos2].strip()
Display the URL and the title of the paper on the screen:
print("PAPER URL: {}".format(url))
print("TITLE: {}".format(key))
Call the summarize() function on the text, which returns the shortened text as per the
TextRank algorithm:
print("GENERATED SUMMARY: {}".format(summarize(text)))
Resolving anaphora
In many natural languages, while forming sentences, we avoid the repeated use of certain
nouns with pronouns to simplify the sentence construction.
For example:
Ravi is a boy. He often donates money to the poor.
In this example, there are two statements:
Ravi is a boy.
He often donates money to the poor.
When we start analyzing the second statement, we cannot make a decision about who is
donating the money without knowing about the first statement. So, we should associate He
with Ravi to get the complete sentence meaning. All this reference resolution happens
naturally in our mind.
If we observe the previous example carefully, first the subject is present; then the pronoun
comes up. So the direction of the flow is from left to right. Based on this flow, we can call
these types of sentences anaphora.
There is another class of examples where the direction of expression is reversed (first the
pronoun and then the noun); there too, He would be associated with Ravi. These types of
sentences are called cataphora.
The earliest available algorithm for anaphora resolution dates back to the 1970s, when
Hobbs presented a paper on it. An online version of this paper is available at
https://www.isi.edu/~hobbs/pronoun-papers.html.
In this recipe, we will try to write a very simple Anaphora resolution algorithm using what
we have learned just now.
Getting ready
You should have Python installed, along with the nltk library and the names (gender) corpus.
How to do it...
1. Open atom editor (or your favorite programming editor).
2. Create a new file called Anaphora.py.
How it works...
Let's see how our simple anaphora resolution algorithm works.
import nltk
from nltk.chunk import tree2conlltags
from nltk.corpus import names
import random
These four instructions import the necessary modules and functions that are used in the
program. We are defining a new class called AnaphoraExample:
class AnaphoraExample:
We are defining a new constructor for this class, which doesn't take any parameters:
def __init__(self):
These two instructions load all the male and female names from the nltk names corpus
and tag them as male/female before storing them in the two lists males and females.
males = [(name, 'male') for name in names.words('male.txt')]
females = [(name, 'female') for name in names.words('female.txt')]
These instructions combine the male and female lists into one; random.shuffle() ensures
that all of the data in the list is randomized:
combined = males + females
random.shuffle(combined)
This instruction invokes the feature() function on each name and stores the resulting
(features, gender) pairs in a variable called training:
training = [(self.feature(name), gender) for (name, gender) in
combined]
We are creating a NaiveBayesClassifier object called _classifier using the males and
females features that are stored in a variable called training:
self._classifier = nltk.NaiveBayesClassifier.train(training)
This function defines the simplest possible feature, which categorizes the given name as
male or female just by looking at the last letter of the name:
def feature(self, word):
return {'last(1)' : word[-1]}
This function takes a word as an argument and tries to detect the gender as male or female
using the classifier we have built:
def gender(self, word):
return self._classifier.classify(self.feature(word))
This is the main function that is of interest to us, as we are going to detect anaphora on the
sample sentences:
def learnAnaphora(self):
These are four examples with mixed complexity expressed in anaphora form:
sentences = [
"John is a man. He walks",
"John and Mary are married. They have two kids",
"In order for Ravi to be successful, he should follow John",
"John met Mary in Barista. She asked him to order a Pizza"
]
This instruction iterates over all the sentences by taking one sentence at a time to a local
variable called sent:
for sent in sentences:
This instruction tokenizes, assigns parts of speech, extracts chunks (named entities), and
returns the chunk tree to a variable called chunks:
chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)),
binary=False)
This variable is used to store all the names and pronouns that help us resolve anaphora:
stack = []
This instruction shows the current sentence that is being processed on the user's screen:
print(sent)
This instruction flattens the tree chunks to a list of items expressed in IOB format:
items = tree2conlltags(chunks)
We traverse all the chunked tokens, each of which is an IOB-format tuple with three
elements:
for item in items:
If the POS of the word is NNP and IOB letter for this word is B-PERSON or O, then we mark
this word as a Name:
if item[1] == 'NNP' and (item[2] == 'B-PERSON' or item[2]
== 'O'):
stack.append((item[0], self.gender(item[0])))
If the POS of the word is CC, then also we will add this to the stack variable:
elif item[1] == 'CC':
stack.append(item[0])
If the POS of the word is PRP, then we will add this to the stack variable:
elif item[1] == 'PRP':
stack.append(item[0])
We are creating a new object called anaphora from AnaphoraExample() and invoking
the learnAnaphora() function on the anaphora object. Once this function execution
completes, we see the list of words for every sentence.
anaphora = AnaphoraExample()
anaphora.learnAnaphora()
Sentence: She is my date
Description: Here the sense of the word date is not the calendar date; it expresses a human
relationship.

Sentence: You have taken too many leaves to skip cleaning leaves in the garden
Description: Here the word leaves has multiple senses:
• The first word leaves means taking a break (a leave)
• The second one actually refers to tree leaves
Like this, there are many combinations of senses possible in sentences.
One of the challenges in sense identification is finding a proper nomenclature to describe
these senses. There are many English dictionaries available that describe the behavior of
words and all their possible senses. Of them all, WordNet is the most structured, preferred,
and widely accepted source of sense usage.
In this recipe, we will see examples of senses from the WordNet library and use the built-in
nltk library to find out the sense of words.
Lesk is the oldest algorithm that was devised to tackle sense detection. You will see,
however, that it too is inaccurate in some cases.
Getting ready
You should have Python installed, along with the nltk library.
How to do it...
1. Open atom editor (or your favorite programming editor).
2. Create a new file called WordSense.py.
3. Type the following source code:
How it works...
Let's see how our program works. This instruction imports the nltk library into the
program:
import nltk
These are the three words with different senses of expression. They are stored as a list in a
variable called words.
words = ['wind', 'date', 'left']
The understandWordSenseExamples() function (shown in the book's listing, and sketched
after this list) does the following:
Iterates over all the words in the list, storing the current word in a variable called word
Invokes the synsets() function from the wordnet module and stores the result in the
syns variable
Takes the first three synsets from the list, iterates through them, and keeps the current
one in a variable called syn
Invokes the examples() function on the syn object and takes the first two examples as
the iterator; the current value of the iterator is available in the variable example
Finally, prints the word, the synset's name, and the example sentence
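The function body itself appears only in the book's screenshot; a minimal sketch consistent
with the description in this list, using the WordNet corpus bundled with nltk, might be:
from nltk.corpus import wordnet as wn

def understandWordSenseExamples():
    # hypothetical sketch: show a few example sentences for each sense of each word
    words = ['wind', 'date', 'left']
    for word in words:
        syns = wn.synsets(word)
        for syn in syns[:3]:
            for example in syn.examples()[:2]:
                print("{} -> {} -> {}".format(word, syn.name(), example))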
In these two instructions, we traverse the maps variable (a list of tuples defined in the
book's listing), take the current tuple into the variable m, invoke the nltk.wsd.lesk()
function, and display the formatted results on screen.
for m in maps:
print("Sense '{}' for '{}' -> '{}'".format(m[0], m[1],
nltk.wsd.lesk(m[0], m[1], m[2])))
When the program is run, call the two functions that show the results on the user's screen.
if __name__ == '__main__':
understandWordSenseExamples()
understandBuiltinWSD()
Sentence: I am very happy
Description: Indicates a happy emotion

Sentence: She is so :(
Description: We know there is an iconic sadness expression here
With the increased use of text, icons, and emojis in written natural language
communication, it's becoming increasingly difficult for computer programs to understand
the emotional meaning of a sentence.
Let's try to write a program to understand the facilities nltk provides to build our own
algorithm.
Getting ready
You should have Python installed, along with the nltk library.
How to do it...
1. Open atom editor (or your favorite programming editor).
2. Create a new file called Sentiment.py.
How it works...
Let's see how our sentiment analysis program works. These instructions import
the nltk module and sentiment_analyzer module respectively.
import nltk
import nltk.sentiment.sentiment_analyzer
We are defining a new function, wordBasedSentiment(), which we will use to learn how
to do sentiment analysis based on the words that we already know and which mean
something important to us.
def wordBasedSentiment():
We are defining a list of three words that are special to us as they represent some form of
happiness. These words are stored in the positive_words variable.
positive_words = ['love', 'hope', 'joy']
This is the sample text that we are going to analyze; the text is stored in a variable
called text.
text = 'Rainfall this year brings lot of hope and joy to Farmers.'.split()
We are calling the extract_unigram_feats() function on the text using the words that
we have defined. The result is a dictionary of input words that indicate whether the given
words are present in the text or not.
analysis = nltk.sentiment.util.extract_unigram_feats(text,
positive_words)
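The screenshot then prints the analysis; assuming nltk's contains(word) feature-key
format, the result looks roughly like this:
print(analysis)
# expected output (approximately):
# {'contains(love)': False, 'contains(hope)': True, 'contains(joy)': True}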
This instruction defines a new function that we will use to understand whether some pairs
of words occur in a sentence.
def multiWordBasedSentiment():
This instruction defines a list of two-word tuples. We are interested in finding if these pairs
of words occur together in a sentence.
word_sets = [('heavy', 'rains'), ('flood', 'bengaluru')]
This is the sentence we are interested in processing and finding the features of.
text = 'heavy rains cause flash flooding in bengaluru'.split()
We are calling the extract_bigram_feats() on the input sentence against the sets of
words in the word_sets variable. The result is a dictionary that tells whether these pairs of
words are present in the sentence or not.
analysis = nltk.sentiment.util.extract_bigram_feats(text, word_sets)
Next is the sentence on which we want to run the negativity analysis. It's stored in a
variable, text.
text = 'Rainfall last year did not bring joy to Farmers'.split()
We are calling the mark_negation() function on the text. This returns a list of all the
words in the sentence along with a special suffix _NEG for all the words that come under the
negative sense. The result is stored in the negation variable.
negation = nltk.sentiment.util.mark_negation(text)
When the program is run, these functions are called and we see the output of three
functions in the order they are executed (top-down).
if __name__ == '__main__':
wordBasedSentiment()
multiWordBasedSentiment()
markNegativity()
In this recipe, we will write our own sentiment analysis program based on what we learned
in the previous recipe. We will also explore the built-in VADER sentiment analysis
algorithm, which helps in finding the sentiment of complex sentences.
Getting ready
You should have Python installed, along with the nltk library.
How to do it...
1. Open atom editor (or your favorite programming editor).
2. Create a new file called AdvSentiment.py.
How it works...
Now, let's see how our sentiment analysis program works. These four instructions import
the necessary modules that we are going to use as part of this program.
import nltk
import nltk.sentiment.util
import nltk.sentiment.sentiment_analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
Since we are just experimenting, we define the three words with which we are going to find
the sentiment. In real-world use cases, these would come from a much larger dictionary or
corpus.
positive_words = ['love', 'genuine', 'liked']
This instruction breaks the input sentence into words. The list of words is fed to the
mark_negation() function to identify any negativity in the sentence. We join the result
from mark_negation() back into a string and check whether the _NEG suffix is present; if
so, we set the score to -1:
if '_NEG' in ' '.join(nltk.sentiment.util.mark_negation(text.split())):
score = -1
The score is set to 1 if any of the positive words is present in the input text; otherwise it is 0:
if True in analysis.values():
score = 1
else:
score = 0
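Putting the fragments together, a consolidated sketch of the score_feedback() function
(the full definition appears only in the book's screenshot, so the exact control flow is an
assumption) might look like:
def score_feedback(text):
    # hypothetical consolidated version of the fragments discussed above
    score = 0
    words = text.split()
    if '_NEG' in ' '.join(nltk.sentiment.util.mark_negation(words)):
        score = -1
    else:
        analysis = nltk.sentiment.util.extract_unigram_feats(words, positive_words)
        if True in analysis.values():
            score = 1
    return score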
These are the four reviews that we are interested in processing using our algorithm to print
the score.
feedback = """I love the items in this shop, very genuine and quality
is well maintained.
I have visited this shop and had samosa, my friends liked it very much.
ok average food in this shop.
Fridays are very busy in this shop, do not place orders during this
day."""
These instructions extract the sentences from the variable feedback by splitting on newline
(\n) and calling the score_feedback() function on this text.
print(' -- custom scorer --')
for text in feedback.split("\n"):
print("score = {} for >> {}".format(score_feedback(text), text))
The result will be the score and sentence on the screen. This instruction defines
the advancedSentimentAnalyzer() function, which will be used to understand the built-
in features of NLTK sentiment analysis.
def advancedSentimentAnalyzer():
We are defining five sentences to analyze. You'll note that we are also using emoticons
(icons) to see how the algorithm works.
sentences = [
':)',
':(',
'She is so :(',
'I love the way cricket is played by the champions',
'She neither likes coffee nor tea',
]
The processing loop (shown in the book's listing, and sketched after this list) does the
following:
Iterates over all the sentences and stores the current one in the variable sentence
Displays the currently processed sentence on screen
Invokes the polarity_scores() function on this sentence and stores the result in a
variable called kvp
Traverses the dictionary kvp and prints the key (negative, neutral, positive, or
compound) and the score computed for that key
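The loop itself appears only in the screenshot; a minimal sketch using the imported
SentimentIntensityAnalyzer would be:
# hypothetical sketch of the VADER scoring loop described above
senti = SentimentIntensityAnalyzer()
for sentence in sentences:
    print(sentence)
    kvp = senti.polarity_scores(sentence)
    for k in kvp:
        print('  {} = {}'.format(k, kvp[k]))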
When the current program is invoked, call these two functions to display the results on
screen.
if __name__ == '__main__':
advancedSentimentAnalyzer()
mySentimentAnalyzer()
In order to successfully build a conversational engine, it should take care of the following
things:
NLTK has a module, nltk.chat, which simplifies building these engines by providing a
generic framework.
Engines Modules
Eliza nltk.chat.eliza Python module
Iesha nltk.chat.iesha Python module
Rude nltk.chat.rude Python module
Suntsu nltk.chat.suntsu Python module
Zen nltk.chat.zen Python module
In order to interact with these engines we can just load these modules in our Python
program and invoke the demo() function.
This recipe will show us how to use built-in engines and also write our own simple
conversational engine using the framework provided by the nltk.chat module.
Getting ready
You should have Python installed, along with the nltk library. Having an understanding of
regular expressions also helps.
How to do it...
1. Open atom editor (or your favorite programming editor).
2. Create a new file called Conversational.py.
How it works...
Let's try to understand what we are trying to achieve here. This instruction imports
the nltk library into the current program.
import nltk
These if, elif, else instructions are typical branching instructions that decide which chat
engine's demo() function is to be invoked depending on the argument that is present in
the whichOne variable. When the user passes an unknown engine name, it displays a
message to the user that it's not aware of this engine.
if whichOne == 'eliza':
nltk.chat.eliza.demo()
elif whichOne == 'iesha':
nltk.chat.iesha.demo()
elif whichOne == 'rude':
nltk.chat.rude.demo()
elif whichOne == 'suntsu':
nltk.chat.suntsu.demo()
elif whichOne == 'zen':
nltk.chat.zen.demo()
else:
print("unknown built-in chat engine {}".format(whichOne))
It's good practice to handle both the known and the unknown cases; it makes our programs
more robust when handling unexpected input.
This instruction defines a new function called myEngine(); this function does not take any
parameters.
def myEngine():
This is a single instruction in which we define a nested tuple data structure and assign it to
the chatpairs variable.
chatpairs = (
(r"(.*?)Stock price(.*)",
("Today stock price is 100",
"I am unable to find out the stock price.")),
(r"(.*?)not well(.*)",
("Oh, take care. May be you should visit a doctor",
"Did you take some medicine ?")),
(r"(.*?)raining(.*)",
("Its monsoon season, what more do you expect ?",
"Yes, its good for farmers")),
(r"How(.*?)health(.*)",
("I am always healthy.",
"I am a program, super healthy!")),
(r".*",
("I am good. How are you today ?",
"What brings you here ?"))
)
We are defining a subfunction called chat() inside the myEngine() function; this is
permitted in Python. The chat() function displays some information to the user on the
screen and instantiates nltk's built-in nltk.chat.util.Chat() class with the chatpairs
variable, passing nltk.chat.util.reflections as the second argument. Finally, we call
the converse() function on the object created from the Chat() class.
def chat():
print("!"*80)
print(" >> my Engine << ")
print("Talk to the program using normal english")
print("="*80)
print("Enter 'quit' when done")
chatbot = nltk.chat.util.Chat(chatpairs,
nltk.chat.util.reflections)
chatbot.converse()
This instruction calls the chat() function, which shows a prompt on the screen and accepts
the user's requests. It shows responses according to the regular expressions that we have
built before:
chat()
These instructions will be called when the program is invoked as a standalone program (not
using import).
if __name__ == '__main__':
for engine in ['eliza', 'iesha', 'rude', 'suntsu', 'zen']:
print("=== demo of {} ===".format(engine))
builtinEngines(engine)
print()
myEngine()
Invoke the built-in engines one after another (so that we can experience them)
Once all five built-in engines have exited, call our myEngine(), where our custom
engine comes into play
Applications of Deep Learning in NLP
9
In this chapter, we will cover the following recipes:
Introduction
In recent times, deep learning has become very prominent for processing text, voice, and
image data, obtaining state-of-the-art results that are primarily used to create applications
in the field of artificial intelligence. These models have turned out to produce such results
across all of these fields of application. In this chapter, we will be covering various
applications in NLP/text processing.
Convolutional neural networks and recurrent neural networks are central themes in deep
learning that you will keep meeting across the domain.
Compared with fully connected layers of a fixed size, convolutional layers greatly reduce
the number of neurons (weights) and hence the computational power required of the
machine.
Only a small set of filter weights slides across the input matrix, rather than every pixel
being connected to the next layer, so the input image is summarized more compactly into
the next layers.
During backpropagation, only the weights of the filter need to be updated based on the
backpropagated errors, hence the higher efficiency.
Local connectivity (neural connections exist only between spatially local regions)
An optional progressive decrease in spatial resolution (as the number of features
is gradually increased)
After convolution, the convolved feature/activation map needs to be reduced to its most
important features, as this reduces the number of points and improves computational
efficiency. Pooling is the operation typically performed to remove unnecessary
representations. Brief details about these operations are given as follows:
Due to the convolution operation, it is natural that the size of the input/feature maps
shrinks over the stages. But in some cases, we would really like to maintain the size
across operations. A simple way to achieve this is to pad the input of a layer with zeros
accordingly.
Padding: Without padding, the width and breadth of the feature maps shrink
consecutively layer after layer; this is undesirable in deep networks, and padding keeps
the size of the picture constant, or at least controllable, throughout the network.
A simple equation calculates the activation map size from the given input width, filter
size, padding, and stride; it also gives an idea of how much computational power is
needed, and so on.
Calculation of activation map size: the width of the activation map obtained from a
convolutional layer is
(W - F + 2P) / S + 1
where W is the width of the original image, F is the filter size, P is the padding size (1 for
a single layer of padding, 2 for a double layer of padding, and so on), and S is the stride
length.
For example, consider an input image of size 224 x 224 x 3 (3 indicates Red, Green,
and Blue channels), with a filter size of 11 x 11 and number of filters as 96. The
stride length is 4 and there is no padding. What is the activation map size
generated out from these filters?
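As a quick check (this worked arithmetic is not part of the original text): plugging in W =
224, F = 11, P = 0, and S = 4 gives (224 - 11 + 0)/4 + 1 = 54.25, which is not a whole
number; the commonly quoted AlexNet first-layer output of 55 x 55 per filter (55 x 55 x 96
in total) corresponds to an effective input width of 227, since (227 - 11)/4 + 1 = 55.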
In AlexNet, all of these techniques, such as convolution, pooling, and padding, are used,
and the network finally ends in fully connected layers.
Applications of CNNs
CNNs are used in various applications, a few of them are as follows:
Vanishing and exploding gradient problems are far less severe in LSTMs because the cell
state is updated additively, whereas a plain RNN applies repeated multiplicative updates.
Language modeling: Given a sequence of words, the task is to predict the next
probable word
Text generation: To generate text from the writings of some authors
Machine translation: To convert one language into other language (English to
Chinese and so on.)
Chat bot: This application is very much like machine translation; however
question and answer pairs are used to train the model
Generating an image description: By training together with CNNs, RNNs can be
used to generate a caption/description of the image
Getting ready
The 20 newsgroups dataset from scikit-learn has been utilized to illustrate the concept. The
number of observations/emails considered for analysis is 18,846 (11,314 train observations
and 7,532 test observations), spread across 20 classes/categories, as shown in the following
code:
>>> from sklearn.datasets import fetch_20newsgroups
>>> newsgroups_train = fetch_20newsgroups(subset='train')
>>> newsgroups_test = fetch_20newsgroups(subset='test')
>>> x_train = newsgroups_train.data
>>> x_test = newsgroups_test.data
>>> y_train = newsgroups_train.target
>>> y_test = newsgroups_test.target
>>> print ("List of all 20 categories:")
>>> print (newsgroups_train.target_names)
>>> print ("\n")
>>> print ("Sample Email:")
>>> print (x_train[0])
>>> print ("Sample Target Category:")
>>> print (y_train[0])
>>> print (newsgroups_train.target_names[y_train[0]])
In the following screenshot, a sample first data observation and its target class category are
shown. From the first observation, or email, we can infer that the email is talking about a
two-door sports car, which we can manually classify into the autos category, the eighth class:
The target value is 7 (because the indexing starts from 0), which validates our
understanding against the actual target class, 7.
How to do it...
Using NLP techniques, we pre-process the data to obtain finalized word vectors that are
mapped to the final outcome, one of the 20 newsgroup categories. The major steps involved are:
1. Pre-processing.
2. Removal of punctuations.
3. Word tokenization.
4. Converting words into lowercase.
5. Stop word removal.
How it works...
The NLTK package has been utilized for all the pre-processing steps, as it consists of all the
necessary NLP functionality under one single roof:
# Used for pre-processing data
>>> import nltk
>>> from nltk.corpus import stopwords
>>> from nltk.stem import WordNetLemmatizer
>>> import string
>>> import pandas as pd
>>> from nltk import pos_tag
>>> from nltk.stem import PorterStemmer
The preprocessing() function written below brings together all the steps for convenience.
However, we will be explaining each of the steps in its own section:
>>> def preprocessing(text):
The following line of code checks each character of the text and replaces any standard
punctuation with a space, leaving all the other characters unchanged:
... text2 = " ".join("".join([" " if ch in string.punctuation else ch
for ch in text]).split())
The following code splits the text into sentences, tokenizes each sentence into words, and
collects all the words into a single list for the further steps:
... tokens = [word for sent in nltk.sent_tokenize(text2) for word in
nltk.word_tokenize(sent)]
Converting all the words (upper, lower, and proper case) into lowercase reduces duplicates
in the corpus:
... tokens = [word.lower() for word in tokens]
As mentioned earlier, stop words are words that do not carry much weight in understanding
the sentence; they are used for connecting words, and so on. We remove them with the
following lines of code:
... stopwds = stopwords.words('english')
... tokens = [token for token in tokens if token not in stopwds]
We keep only the words with a length of at least three characters in the following code,
removing short words that hardly carry much meaning:
... tokens = [word for word in tokens if len(word)>=3]
Stemming is applied to the words using the Porter stemmer, which strips the extra suffixes
from the words:
... stemmer = PorterStemmer()
... tokens = [stemmer.stem(word) for word in tokens]
POS tagging is a prerequisite for lemmatization: based on whether a word is a noun, a verb,
and so on, lemmatization reduces it to its root word:
... tagged_corpus = pos_tag(tokens)
The pos_tag function returns the part of speech in four tags for nouns and six tags for
verbs: NN (noun, common, singular), NNP (noun, proper, singular), NNPS (noun, proper,
plural), NNS (noun, common, plural), VB (verb, base form), VBD (verb, past tense), VBG
(verb, present participle), VBN (verb, past participle), VBP (verb, present tense, not third
person singular), and VBZ (verb, present tense, third person singular):
... Noun_tags = ['NN','NNP','NNPS','NNS']
... Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
... lemmatizer = WordNetLemmatizer()
The following function, prat_lemmatize, has been created to bridge the mismatch between
the tags produced by the pos_tag function and the tags expected by the lemmatize
function. If the tag of a word falls into the respective noun or verb tag categories, 'n' or 'v'
is passed accordingly to the lemmatize function:
... def prat_lemmatize(token,tag):
... if tag in Noun_tags:
... return lemmatizer.lemmatize(token,'n')
... elif tag in Verb_tags:
... return lemmatizer.lemmatize(token,'v')
... else:
... return lemmatizer.lemmatize(token,'n')
After tokenization and applying all the various operations, we need to join the tokens back
into a string; the following lines perform this:
... pre_proc_text = " ".join([prat_lemmatize(token,tag) for token,tag
in tagged_corpus])
... return pre_proc_text
After the pre-processing step has been completed, the processed text is converted into
TF-IDF vectors, which are then fed to the deep learning model. The required deep learning
modules are imported first:
# Deep Learning modules
>>> import numpy as np
>>> from keras.models import Sequential
>>> from keras.layers.core import Dense, Dropout, Activation
>>> from keras.optimizers import Adadelta,Adam,RMSprop
>>> from keras.utils import np_utils
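The preprocessing and TF-IDF vectorization steps themselves appear only as screenshots in
the book; a minimal sketch consistent with the 10,000-dimensional input used by the model
below (the max_features value and variable names are assumptions) might be:
# Hypothetical sketch of the TF-IDF step; the book's exact parameters may differ
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> x_train_preprocessed = [preprocessing(doc) for doc in x_train]
>>> x_test_preprocessed = [preprocessing(doc) for doc in x_test]
>>> vectorizer = TfidfVectorizer(max_features=10000)
>>> x_train_2 = vectorizer.fit_transform(x_train_preprocessed).toarray()
>>> x_test_2 = vectorizer.transform(x_test_preprocessed).toarray()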
The following screenshot shows the output produced after running the preceding Keras
code. Keras has been installed on top of Theano, which in turn runs on Python. A GPU with
6 GB of memory was used, with additional libraries (cuDNN and CNMeM) installed for
four to five times faster execution; CNMeM reserves about 20% of the GPU memory, so only
80% of the 6 GB is available:
The following code explains the central part of the deep learning model. The code is self-
explanatory, with the number of classes considered 20, batch size 64, and number of epochs
to train, 20:
# Definition hyper parameters
>>> np.random.seed(1337)
>>> nb_classes = 20
>>> batch_size = 64
>>> nb_epochs = 20
The following code converts the 20 categories into one-hot encoding vectors in which 20
columns are created and the values against the respective classes are given as 1. All other
classes are given as 0:
>>> Y_train = np_utils.to_categorical(y_train, nb_classes)
In the following building blocks of Keras code, three hidden layers (1,000, 500, and 50
neurons respectively) are used, with a dropout of 50% after each layer and Adam as the
optimizer:
#Deep Layer Model building in Keras
#del model
>>> model = Sequential()
>>> model.add(Dense(1000,input_shape= (10000,)))
>>> model.add(Activation('relu'))
>>> model.add(Dropout(0.5))
>>> model.add(Dense(500))
>>> model.add(Activation('relu'))
>>> model.add(Dropout(0.5))
>>> model.add(Dense(50))
>>> model.add(Activation('relu'))
>>> model.add(Dropout(0.5))
>>> model.add(Dense(nb_classes))
>>> model.add(Activation('softmax'))
>>> model.compile(loss='categorical_crossentropy', optimizer='adam')
>>> print (model.summary())
The architecture is shown as follows and describes the flow of the data: it starts from a
10,000-dimensional input and then passes through layers of 1,000, 500, 50, and finally 20
neurons to classify the given email into one of the 20 categories:
The model has been fitted over 20 epochs, each of which took about 2 seconds. The loss
decreased from 1.9281 to 0.0241. On CPU hardware, the time required to train each epoch
would increase, as a GPU massively parallelizes the computation with thousands of
threads/cores:
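The fit call itself appears only in the training-log screenshot; a sketch consistent with the
hyperparameters defined earlier would be:
# Hypothetical fit call; older Keras releases use nb_epoch instead of epochs
>>> model.fit(x_train_2, Y_train, batch_size=batch_size, epochs=nb_epochs, verbose=1)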
Finally, predictions are made on the train and test datasets to determine the accuracy,
precision, and recall values:
#Model Prediction
>>> y_train_predclass = model.predict_classes(x_train_2,batch_size=batch_size)
>>> y_test_predclass = model.predict_classes(x_test_2,batch_size=batch_size)
>>> from sklearn.metrics import accuracy_score,classification_report
>>> print ("\n\nDeep Neural Network - Train
accuracy:"),(round(accuracy_score( y_train, y_train_predclass),3))
>>> print ("\nDeep Neural Network - Test accuracy:"),(round(accuracy_score(
y_test,y_test_predclass),3))
>>> print ("\nDeep Neural Network - Train Classification Report")
>>> print (classification_report(y_train,y_train_predclass))
>>> print ("\nDeep Neural Network - Test Classification Report")
>>> print (classification_report(y_test,y_test_predclass))
It appears that the classifier is giving a good 99.9% accuracy on the train dataset and 80.7%
on the test dataset.
Getting ready
The IMDB dataset from Keras consists of movie reviews, encoded as sequences of word
indices, and their respective sentiment labels. The following is the pre-processing of the data:
>>> import pandas as pd
>>> from keras.preprocessing import sequence
>>> from keras.models import Sequential
>>> from keras.layers import Dense, Dropout, Activation
>>> from keras.layers import Embedding
>>> from keras.layers import Conv1D, GlobalMaxPooling1D
>>> from keras.datasets import imdb
>>> from sklearn.metrics import accuracy_score,classification_report
In this set of parameters, the maximum number of features (words) to be extracted is set to
6,000, and the maximum length of an individual review is capped at 400 words:
# set parameters:
>>> max_features = 6000
>>> max_length = 400
>>> (x_train, y_train), (x_test, y_test) =
imdb.load_data(num_words=max_features)
>>> print(len(x_train), 'train observations')
>>> print(len(x_test), 'test observations')
The dataset has an equal number of train and test observations: we will build the model on
25,000 training observations and test it on 25,000 test observations. A sample of the data
can be seen in this screenshot:
The following code is used to create the dictionary mapping of a word and its respective
integer index value:
# Creating numbers to word mapping
>>> wind = imdb.get_word_index()
>>> revind = dict((v,k) for k,v in wind.items())
>>> print (x_train[0])
>>> print (y_train[0])
We see that the first observation is a sequence of numbers rather than English words,
because the model can only understand and work with numbers rather than characters,
words, and so on:
The following screenshot shows the stage after converting the number mapping back into
textual format. Here, the dictionary is used to reverse the map from integer format to text
format:
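The exact snippet used to produce that screenshot is not shown; a minimal sketch using the revind dictionary built earlier (note that Keras reserves a few low indices for special tokens, so the mapping may be offset) could be:
# Reconstructing the first review as text (illustrative sketch)
>>> print(" ".join(revind.get(idx, "?") for idx in x_train[0]))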
How to do it...
The major steps involved are described as follows:
How it works...
The following code performs the padding operation, which pads (or truncates) each review to
the maximum length of 400 words. This makes all observations the same length, which
makes computation with neural networks easier:
#Pad sequences for computational efficiency
>>> x_train = sequence.pad_sequences(x_train, maxlen=max_length)
>>> x_test = sequence.pad_sequences(x_test, maxlen=max_length)
>>> print('x_train shape:', x_train.shape)
>>> print('x_test shape:', x_test.shape)
The following Keras code creates a CNN 1D model:
# Deep Learning architecture parameters
>>> batch_size = 32
>>> embedding_dims = 60
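The remaining architecture parameters and the CNN 1D model definition are not reproduced here; the following is a minimal sketch of how such a model could be assembled with the modules imported earlier (the filter count, kernel size, hidden dimension, and epoch count are assumptions rather than the book's exact values):
# Hypothetical CNN 1D architecture (assumed values)
>>> filters = 250
>>> kernel_size = 3
>>> hidden_dims = 250
>>> epochs = 3
>>> model = Sequential()
>>> model.add(Embedding(max_features, embedding_dims, input_length=max_length))
>>> model.add(Dropout(0.2))
>>> model.add(Conv1D(filters, kernel_size, padding='valid', activation='relu', strides=1))
>>> model.add(GlobalMaxPooling1D())
>>> model.add(Dense(hidden_dims))
>>> model.add(Activation('relu'))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(1))
>>> model.add(Activation('sigmoid'))
>>> model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
>>> print(model.summary())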
In the following screenshot, the entire model summary is displayed, indicating the
dimensions and the number of neurons used in each layer. These directly determine the
number of parameters to be learned in mapping the input data to the final target variable,
which is either 0 or 1; hence, a dense layer is used as the last layer of the network:
The following code fits the model on the training data, feeding the X and Y variables batch
by batch:
>>> model.fit(x_train, y_train,batch_size=batch_size,epochs=epochs,
validation_split=0.2)
The model has been trained for three epochs, each taking about 5 seconds on the GPU.
Observing the following iterations, even though the training accuracy keeps rising, the
validation accuracy starts to decrease. This phenomenon is model overfitting, and it
indicates that we need to try other ways to improve the model rather than simply increasing
the number of epochs, for example adjusting the architecture size or adding regularization.
Readers are encouraged to experiment with various combinations.
The following code is used for prediction of classes for both train and test data:
#Model Prediction
>>> y_train_predclass =
model.predict_classes(x_train,batch_size=batch_size)
>>> y_test_predclass = model.predict_classes(x_test,batch_size=batch_size)
>>> y_train_predclass.shape = y_train.shape
>>> y_test_predclass.shape = y_test.shape
The following screenshot shows various metrics for judging the model performance. The
train accuracy is significantly high at 96%; however, the test accuracy is somewhat lower at
88.2%. This could be due to model overfitting:
Getting ready
The IMDB dataset from Keras has a set of words and their respective sentiment labels. Here is
the pre-processing of the data:
>>> from __future__ import print_function
>>> import numpy as np
>>> import pandas as pd
>>> from keras.preprocessing import sequence
>>> from keras.models import Sequential
>>> from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
>>> from keras.datasets import imdb
>>> from sklearn.metrics import accuracy_score,classification_report
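# NOTE: max_features and max_len are used below but are not defined in this
# listing; the values here are assumptions for illustration only
>>> max_features = 15000
>>> max_len = 300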
# Loading data
>>> (x_train, y_train), (x_test, y_test) =
imdb.load_data(num_words=max_features)
>>> print(len(x_train), 'train observations')
>>> print(len(x_test), 'test observations')
How to do it...
The major steps involved are described as follows:
How it works...
# Pad sequences for computational efficiently
>>> x_train_2 = sequence.pad_sequences(x_train, maxlen=max_len)
>>> x_test_2 = sequence.pad_sequences(x_test, maxlen=max_len)
>>> print('x_train shape:', x_train_2.shape)
>>> print('x_test shape:', x_test_2.shape)
>>> y_train = np.array(y_train)
>>> y_test = np.array(y_test)
The following Keras code creates a bidirectional LSTM model. Bidirectional LSTMs process
the sequence in both the forward and backward directions, which enables each position to
be informed by both its left-hand and right-hand context:
# Model Building
>>> model = Sequential()
>>> model.add(Embedding(max_features, 128, input_length=max_len))
>>> model.add(Bidirectional(LSTM(64)))
>>> model.add(Dropout(0.5))
>>> model.add(Dense(1, activation='sigmoid'))
>>> model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
# Print model architecture
>>> print (model.summary())
Here is the architecture of the model. The embedding layer has been used to reduce the
dimensions to 128, followed by bidirectional LSTM, ending up with a dense layer for
modeling sentiment either zero or one:
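The training call itself is not reproduced in the listing; a minimal sketch, with the batch size and epoch count assumed rather than taken from the book, would be:
# Model training (sketch; batch size and epoch count are assumptions)
>>> model.fit(x_train_2, y_train, batch_size=64, epochs=4, validation_split=0.2)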
LSTM models take longer to train than CNNs because LSTMs are not easily parallelized on a
GPU (roughly a 4x to 5x speedup), whereas CNNs are massively parallelizable (on the order
of 100x). One important observation: even as the training accuracy increased, the validation
accuracy was decreasing. This situation indicates overfitting.
The following code has been used for predicting the class for both train and test data:
#Model Prediction
>>> y_train_predclass = model.predict_classes(x_train_2,batch_size=1000)
>>> y_test_predclass = model.predict_classes(x_test_2,batch_size=1000)
>>> y_train_predclass.shape = y_train.shape
>>> y_test_predclass.shape = y_test.shape
It appears that the LSTM gives slightly lower test accuracy than the CNN; however, with
careful tuning of the model parameters, RNNs can achieve better accuracy than CNNs.
Getting ready
The Alice in Wonderland text is used to extract words and create a two-dimensional
visualization using a dense network arranged like an encoder-decoder architecture:
>>> from __future__ import print_function
>>> import os
""" First change the following directory link to where all input files do
exist """
>>> os.chdir("C:\\Users\\prata\\Documents\\book_codes\\NLP_DL")
>>> import nltk
>>> from nltk.corpus import stopwords
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag
>>> from nltk.stem import PorterStemmer
>>> import string
>>> import numpy as np
>>> import pandas as pd
>>> import random
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing import OneHotEncoder
>>> import matplotlib.pyplot as plt
>>> def preprocessing(text):
... text2 = " ".join("".join([" " if ch in string.punctuation else ch for
ch in text]).split())
... tokens = [word for sent in nltk.sent_tokenize(text2) for word in
nltk.word_tokenize(sent)]
... tokens = [word.lower() for word in tokens]
... stopwds = stopwords.words('english')
... tokens = [token for token in tokens if token not in stopwds]
... tokens = [word for word in tokens if len(word)>=3]
... stemmer = PorterStemmer()
... tokens = [stemmer.stem(word) for word in tokens]
... tagged_corpus = pos_tag(tokens)
... Noun_tags = ['NN','NNP','NNPS','NNS']
... Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
... lemmatizer = WordNetLemmatizer()
... def prat_lemmatize(token,tag):
... if tag in Noun_tags:
... return lemmatizer.lemmatize(token,'n')
... elif tag in Verb_tags:
... return lemmatizer.lemmatize(token,'v')
... else:
... return lemmatizer.lemmatize(token,'n')
... pre_proc_text = " ".join([prat_lemmatize(token,tag) for token,tag in
tagged_corpus])
... return pre_proc_text
>>> lines = []
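The loop that reads the corpus and fills lines is cut off by the page break; a plausible reconstruction (the input filename is an assumption) is:
# Hypothetical reconstruction: read the raw text and collect pre-processed,
# non-empty lines (filename assumed)
>>> fin = open("alice_in_wonderland.txt", "r")
>>> for line in fin:
...     line = line.strip()
...     if len(line) == 0:
...         continue
...     lines.append(preprocessing(line))
>>> fin.close()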
How to do it...
The major steps involved are described here:
How it works...
The following code creates two dictionaries: a word-to-index mapping and its reverse, an
index-to-word mapping. As we know, models do not work directly on character or word
input, so words are converted into numeric equivalents (integer indices); once the
computation has been performed by the neural network model, the reverse mapping (index
to word) is applied to visualize the results. The Counter class from the collections library
is used for efficient creation of these dictionaries:
>>> import collections
>>> counter = collections.Counter()
>>> for line in lines:
... for word in nltk.word_tokenize(line):
... counter[word.lower()]+=1
>>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())}
>>> idx2word = {v:k for k,v in word2idx.items()}
The following code applies the word-to-integer mapping and extracts trigrams from the
resulting sequences. This follows the skip-gram idea, in which the central word is paired
with both its left and right neighbors to create the training examples:
>>> xs = []
>>> ys = []
>>> for line in lines:
... embedding = [word2idx[w.lower()] for w in nltk.word_tokenize(line)]
... triples = list(nltk.trigrams(embedding))
... w_lefts = [x[0] for x in triples]
... w_centers = [x[1] for x in triples]
... w_rights = [x[2] for x in triples]
... xs.extend(w_centers)
... ys.extend(w_lefts)
... xs.extend(w_centers)
... ys.extend(w_rights)
In the following code, the length of the dictionary is taken as the vocabulary size. Based on
user specification, any custom vocabulary size could be chosen; here, though, we are
considering all the words:
>>> print (len(word2idx))
>>> vocab_size = len(word2idx)+1
Based on the vocabulary size, all independent and dependent variables are transformed into
vector representations with the following code, in which the number of rows is the number
of word pairs and the number of columns is the vocabulary size. The neural network model
basically maps the input and output variables over this vector space:
>>> ohe = OneHotEncoder(n_values=vocab_size)
>>> X = ohe.fit_transform(np.array(xs).reshape(-1, 1)).todense()
>>> Y = ohe.fit_transform(np.array(ys).reshape(-1, 1)).todense()
>>> Xtrain, Xtest, Ytrain, Ytest,xstr,xsts = train_test_split(X, Y,xs,
test_size=0.3, random_state=42)
>>> print(Xtrain.shape, Xtest.shape, Ytrain.shape, Ytest.shape)
Out of a total of 13,868 observations, the train and test sets are split 70% and 30%, giving
9,707 and 4,161 observations respectively:
The central part of the model is described in the following few lines of Keras code. It has a
convergent-divergent structure, in which the dimensions of the input words are initially
squeezed down before being expanded back to the output format.
While doing so, the dimensions are reduced to 2D in the second layer. After training the
model, we will extract the output up to the second layer to obtain predictions on the test
data. This works very much like a conventional encoder-decoder architecture:
>>> from keras.layers import Input,Dense,Dropout
>>> from keras.models import Model
>>> np.random.seed(42)
>>> BATCH_SIZE = 128
>>> NUM_EPOCHS = 20
>>> input_layer = Input(shape = (Xtrain.shape[1],),name="input")
>>> first_layer = Dense(300,activation='relu',name = "first")(input_layer)
>>> first_dropout = Dropout(0.5,name="firstdout")(first_layer)
>>> second_layer = Dense(2,activation='relu',name="second") (first_dropout)
>>> third_layer = Dense(300,activation='relu',name="third") (second_layer)
>>> third_dropout = Dropout(0.5,name="thirdout")(third_layer)
>>> fourth_layer = Dense(Ytrain.shape[1],activation='softmax',name =
"fourth")(third_dropout)
>>> history = Model(input_layer,fourth_layer)
>>> history.compile(optimizer = "rmsprop",loss= "categorical_crossentropy",
metrics=["accuracy"])
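The training call is not shown above; a minimal sketch (the validation split is an assumption) would be:
# Model training (sketch)
>>> history.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, verbose=1, validation_split=0.2)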
By carefully observing the accuracy on both the training and validation datasets, we find
that the best accuracy values do not even cross 6%. This is due to the limited data and the
small architecture of the deep learning model. To make this really work, we would need at
least gigabytes of data, much larger architectures, and much longer training. Due to
practical constraints, and for illustration purposes, we have trained for only 20 epochs.
However, readers are encouraged to try various combinations to improve the accuracy.
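The listing that extracts the two-dimensional codes plotted below is not reproduced here; a minimal sketch, assuming we build an encoder sub-model up to the layer named second and take a sample of test words, could be:
# Hypothetical extraction of 2-D codes from the bottleneck ("second") layer
>>> encoder = Model(history.input, history.get_layer("second").output)
>>> reduced_X = encoder.predict(Xtest)
>>> n_words = 100  # number of sample words to plot (assumed)
>>> xvals = [reduced_X[i][0] for i in range(n_words)]
>>> yvals = [reduced_X[i][1] for i in range(n_words)]
>>> labels = [idx2word[xsts[i]] for i in range(n_words)]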
#in inches
>>> plt.figure(figsize=(8, 8))
>>> for i, label in enumerate(labels):
... x = xvals[i]
... y = yvals[i]
... plt.scatter(x, y)
... plt.annotate(label,xy=(x, y),xytext=(5, 2),textcoords='offset points',
ha='right',va='bottom')
>>> plt.xlabel("Dimension 1")
>>> plt.ylabel("Dimension 2")
>>> plt.show()
The following image shows the visualization of the words in two-dimensional space. Some
words lie closer to each other than others, which indicates closeness and relationships
between nearby words. For example, words such as never, ever, and ask are very close to
each other.
10
Advanced Applications of Deep Learning in NLP
In this chapter, we will cover the following advanced recipes:
Automated text generation from Shakespeare's writings using LSTM
Questions and answers on episodic data using memory networks
Language modeling to predict the next best word using recurrent neural networks
Generative chatbot development using recurrent neural networks
Introduction
Deep learning techniques are being used to solve a number of open-ended problems. This
chapter discusses these types of problems, for which a simple yes-or-no answer is difficult to
give. We hope you will enjoy going through these recipes, get a view of the cutting-edge
work going on in the industry at the moment, and learn some of its basic building blocks
through the relevant code snippets.
Advanced Applications of Deep Learning in NLP Chapter 10
Getting ready...
The Project Gutenberg eBook of the complete works of William Shakespeare is used to train
the network for automated text generation. The raw file used for training can be
downloaded from http://www.gutenberg.org/:
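The imports for this recipe are not reproduced above; a minimal set that the later code relies on would be:
# Assumed imports (not shown in the original listing)
>>> import numpy as np
>>> import random
>>> import sys
>>> from keras.models import Sequential
>>> from keras.layers import Dense, Activation, LSTM
>>> from keras.optimizers import RMSprop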
The following code is used to create dictionaries mapping characters to indices and vice
versa, which we will use to convert text into indices at later stages. This is because deep
learning models cannot understand English, and everything needs to be mapped into
indices to train these models:
>>> path = 'C:\\Users\\prata\\Documents\\book_codes\\ NLP_DL\\
shakespeare_final.txt'
>>> text = open(path).read().lower()
>>> characters = sorted(list(set(text)))
>>> print('corpus length:', len(text))
>>> print('total chars:', len(characters))
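The dictionary-creation lines themselves are not visible here; a standard construction (assumed rather than copied from the book) would be:
# Character-to-index and index-to-character mappings (assumed construction)
>>> char2indices = dict((c, i) for i, c in enumerate(characters))
>>> indices2char = dict((i, c) for i, c in enumerate(characters))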
How to do it...
Before training the model, various preprocessing steps are involved to make it work. The
following are the major steps involved:
1. Preprocessing: Prepare the X and Y data from the given story text file and
convert them into an indexed, vectorized format.
2. Deep learning model training and validation: Train and validate the deep
learning model.
3. Text generation: Generate the text with the trained model.
How it works...
The following lines of code describe the entire modeling process for generating text from
Shakespeare's writings. Here we have chosen a character window of length 40 to determine
the next best single character, which seems a fair choice. Also, the extraction process jumps
three characters at a time to reduce the overlap between two consecutive extractions and
create a less redundant dataset:
# cut the text in semi-redundant sequences of maxlen characters
>>> maxlen = 40
>>> step = 3
>>> sentences = []
>>> next_chars = []
>>> for i in range(0, len(text) - maxlen, step):
... sentences.append(text[i: i + maxlen])
... next_chars.append(text[i + maxlen])
>>> print('nb sequences:', len(sentences))
The following screenshot depicts the total number of sentences considered, 193798, which
is enough data for text generation:
The next code block is used to convert the data into a vectorized format for feeding into the
deep learning model, as the model cannot understand anything about text, words, sentences,
and so on. Initially, NumPy arrays of the required dimensions are created with all zeros and
are then filled at the relevant places using the dictionary mappings:
# Converting indices into vectorized format
>>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=np.bool)
>>> y = np.zeros((len(sentences), len(characters)), dtype=np.bool)
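The loop that fills these arrays is cut off by the page break; its standard form, as in the canonical Keras text-generation example, is:
# Fill the one-hot arrays using the character dictionary (reconstruction)
>>> for i, sentence in enumerate(sentences):
...     for t, char in enumerate(sentence):
...         X[i, t, char2indices[char]] = 1
...     y[i, char2indices[next_chars[i]]] = 1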
The deep learning model is an RNN, more specifically a Long Short-Term Memory network
with 128 hidden neurons, and its output has the dimensionality of the character set (the
number of columns in the array is the number of characters). Finally, a softmax activation is
used with the RMSprop optimizer. We encourage readers to try various other parameters to
check how the results vary:
#Model Building
>>> model = Sequential()
>>> model.add(LSTM(128, input_shape=(maxlen, len(characters))))
>>> model.add(Dense(len(characters)))
>>> model.add(Activation('softmax'))
>>> model.compile(loss='categorical_crossentropy',
optimizer=RMSprop(lr=0.01))
>>> print (model.summary())
As mentioned earlier, the deep learning model trains on number indices to map input to
output (given a window of 40 characters, the model predicts the next best character). The
following function converts the predicted probabilities back into a character index by
sampling, with a diversity (temperature) parameter controlling how adventurous the sampling is:
# Function to convert prediction into index
>>> def pred_indices(preds, metric=1.0):
...     preds = np.asarray(preds).astype('float64')
...     preds = np.log(preds) / metric
...     exp_preds = np.exp(preds)
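...     # NOTE: the rest of this function is a reconstruction of the standard
...     # sampling step (the original listing is cut off by the page break)
...     preds = exp_preds / np.sum(exp_preds)
...     probs = np.random.multinomial(1, preds, 1)
...     return np.argmax(probs)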
The model will be trained over 30 iterations with a batch size of 128. The diversity value is
also varied to see its impact on the predictions:
# Train and Evaluate the Model
>>> for iteration in range(1, 30):
...     print('-' * 40)
...     print('Iteration', iteration)
...     model.fit(X, y, batch_size=128, epochs=1)
...     start_index = random.randint(0, len(text) - maxlen - 1)
...     for diversity in [0.2, 0.7, 1.2]:
...         print('\n----- diversity:', diversity)
...         generated = ''
...         sentence = text[start_index: start_index + maxlen]
...         generated += sentence
...         print('----- Generating with seed: "' + sentence + '"')
...         sys.stdout.write(generated)
...         for i in range(400):
...             x = np.zeros((1, maxlen, len(characters)))
...             for t, char in enumerate(sentence):
...                 x[0, t, char2indices[char]] = 1.
...             preds = model.predict(x, verbose=0)[0]
...             next_index = pred_indices(preds, diversity)
...             pred_char = indices2char[next_index]
...             generated += pred_char
...             sentence = sentence[1:] + pred_char
...             sys.stdout.write(pred_char)
...             sys.stdout.flush()
...         print("\nOne combination completed \n")
The results are shown in the next screenshot to compare the first iteration (Iteration 1)
and final iteration (Iteration 29). It is apparent that with enough training, the text
generation seems to be much better than with Iteration 1:
Though text generation may seem almost magical, we have generated text using
Shakespeare's writings, showing that with the right training and handling, we can imitate a
writer's style of writing.
Getting ready...
Facebook's bAbI data has been used for this example; it can be downloaded
from http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz. It
consists of 20 types of tasks, among which we have taken the first one, a single-supporting-fact
question-and-answer task.
After unzipping the file, go to the en-10k folder and use the files starting with
qa1_single-supporting-fact for both the train and test sets. The following code is used to
extract stories, questions, and answers, in this particular order, to create the data required
for training:
>>> from __future__ import division, print_function
>>> import collections
>>> import itertools
>>> import nltk
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import os
>>> import random
>>> def get_data(infile):
...     stories, questions, answers = [], [], []
...     story_text = []
...     fin = open(infile, "rb")
...     for line in fin:
...         line = line.decode("utf-8").strip()
...         lno, text = line.split(" ", 1)
...         if "\t" in text:
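...             # NOTE: the rest of this function is a reconstruction based on
...             # the bAbI line format (question \t answer \t supporting-fact id)
...             question, answer, _ = text.split("\t")
...             stories.append(story_text)
...             questions.append(question)
...             answers.append(answer)
...             story_text = []
...         else:
...             story_text.append(text)
...     fin.close()
...     return stories, questions, answers
# Assumed file names and the calls that build the train and test datasets
>>> Train_File = "qa1_single-supporting-fact_train.txt"
>>> Test_File = "qa1_single-supporting-fact_test.txt"
>>> data_train = get_data(Train_File)
>>> data_test = get_data(Test_File)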
After extraction, it seems that about 10k observations were created in the data for both train
and test datasets:
How to do it...
After extraction of basic datasets, we need to follow these steps:
1. Preprocessing: Build a vocabulary dictionary and map the stories, questions, and
answers onto it so they can be converted into vector format.
2. Model development and validation: Train the deep learning models and test on
the validation data sample.
3. Predicting outcomes based on the trained model: Trained models are utilized
for predicting outcomes on test data.
How it works...
After creating the train and test data, the remaining methodology is described as follows.
First, we create a vocabulary dictionary in which every word from the story, question, and
answer data is mapped to an integer. These mappings are used to convert words into
integer numbers and subsequently into vector space:
# Building Vocab dictionary from Train and Test data
>>> dictnry = collections.Counter()
>>> for stories,questions,answers in [data_train,data_test]:
...     for story in stories:
...         for sent in story:
...             for word in nltk.word_tokenize(sent):
...                 dictnry[word.lower()] += 1
...     for question in questions:
...         for word in nltk.word_tokenize(question):
...             dictnry[word.lower()] += 1
...     for answer in answers:
...         for word in nltk.word_tokenize(answer):
...             dictnry[word.lower()] += 1
>>> word2indx = {w:(i+1) for i,(w,_) in enumerate(dictnry.most_common() )}
>>> word2indx["PAD"] = 0
>>> indx2word = {v:k for k,v in word2indx.items()}
>>> vocab_size = len(word2indx)
>>> print("vocabulary size:",len(word2indx))
The following screenshot depicts all the words in the vocabulary. It has only 22 words,
including the PAD word, which has been created to fill blank positions with zeros:
The following code determines the maximum length (in words) of the stories and questions.
Knowing this, we can create vectors of a fixed maximum size that can accommodate
sequences of all lengths:
# compute max sequence length for each entity
>>> story_maxlen = 0
>>> question_maxlen = 0
>>> for stories, questions, answers in [data_train,data_test]:
...     for story in stories:
...         story_len = 0
...         for sent in story:
...             swords = nltk.word_tokenize(sent)
...             story_len += len(swords)
...         if story_len > story_maxlen:
...             story_maxlen = story_len
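...     # NOTE: the matching computation for questions falls on the next page;
...     # a straightforward reconstruction is shown here
...     for question in questions:
...         question_wds = nltk.word_tokenize(question)
...         if len(question_wds) > question_maxlen:
...             question_maxlen = len(question_wds)
>>> print("Story maximum length:", story_maxlen, "Question maximum length:", question_maxlen)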
The maximum story length is 14 words, and the maximum question length is 4 words. For
some of the stories and questions, the length will be less than the maximum; those positions
will be filled with 0 (the PAD word). Why pad? Padding with extra blanks makes all the
observations the same length. This is for computational efficiency; otherwise, it would be
difficult to map sequences of different lengths, and parallelizing the computation on a GPU
would be impossible:
The following snippet imports the various functions and classes that we will be using in the
following section:
>>> from keras.layers import Input
>>> from keras.layers.core import Activation, Dense, Dropout, Permute
>>> from keras.layers.embeddings import Embedding
>>> from keras.layers.merge import add, concatenate, dot
>>> from keras.layers.recurrent import LSTM
>>> from keras.models import Model
>>> from keras.preprocessing.sequence import pad_sequences
>>> from keras.utils import np_utils
Word-to-vector mapping is performed in the following code, taking into account the
maximum lengths for the story and question as well as the vocabulary size, all of which we
computed in the preceding segment of code:
# Converting data into Vectorized form
>>> def data_vectorization(data, word2indx, story_maxlen, question_maxlen):
...     Xs, Xq, Y = [], [], []
...     stories, questions, answers = data
...     for story, question, answer in zip(stories, questions, answers):
...         xs = [[word2indx[w.lower()] for w in nltk.word_tokenize(s)] for s in story]
...         xs = list(itertools.chain.from_iterable(xs))
...         xq = [word2indx[w.lower()] for w in nltk.word_tokenize(question)]
...         Xs.append(xs)
...         Xq.append(xq)
...         Y.append(word2indx[answer.lower()])
...     return (pad_sequences(Xs, maxlen=story_maxlen),
...             pad_sequences(Xq, maxlen=question_maxlen),
...             np_utils.to_categorical(Y, num_classes=len(word2indx)))
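The calls that apply this function are not shown; presumably they look like the following, with the variable names inferred from their later use:
# Vectorize train and test data (sketch; names inferred from later usage)
>>> Xstrain, Xqtrain, Ytrain = data_vectorization(data_train, word2indx, story_maxlen, question_maxlen)
>>> Xstest, Xqtest, Ytest = data_vectorization(data_test, word2indx, story_maxlen, question_maxlen)
>>> print("Train story", Xstrain.shape, "Train question", Xqtrain.shape, "Train answer", Ytrain.shape)
>>> print("Test story", Xstest.shape, "Test question", Xqtest.shape, "Test answer", Ytest.shape)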
The following image describes the dimensions of train and test data for story, question, and
answer segments accordingly:
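The model-definition listing itself is not reproduced in this excerpt. The following is a sketch of an end-to-end memory-network-style model that is consistent with the imports above; the embedding sizes, dropout rates, and batch size are assumptions, while the epoch count of 40 matches the training description below. The fit call is assigned to history because the accuracy curves are plotted from history.history later:
# Sketch of a memory-network-style model (sizes and rates assumed)
>>> EMBEDDING_SIZE = 64
>>> LATENT_SIZE = 32
>>> BATCH_SIZE = 64
>>> NUM_EPOCHS = 40
>>> story_input = Input(shape=(story_maxlen,))
>>> question_input = Input(shape=(question_maxlen,))
# Story and question encoders
>>> story_encoder = Embedding(input_dim=vocab_size, output_dim=EMBEDDING_SIZE, input_length=story_maxlen)(story_input)
>>> story_encoder = Dropout(0.2)(story_encoder)
>>> question_encoder = Embedding(input_dim=vocab_size, output_dim=EMBEDDING_SIZE, input_length=question_maxlen)(question_input)
>>> question_encoder = Dropout(0.3)(question_encoder)
# Match the story and question in embedding space
>>> match = dot([story_encoder, question_encoder], axes=[2, 2])
# Second story encoding, into the question's length
>>> story_encoder_c = Embedding(input_dim=vocab_size, output_dim=question_maxlen, input_length=story_maxlen)(story_input)
>>> story_encoder_c = Dropout(0.3)(story_encoder_c)
# Combine the match with the second story encoding
>>> response = add([match, story_encoder_c])
>>> response = Permute((2, 1))(response)
# Concatenate with the question encoding and decode to the answer vocabulary
>>> answer = concatenate([response, question_encoder], axis=-1)
>>> answer = LSTM(LATENT_SIZE)(answer)
>>> answer = Dropout(0.2)(answer)
>>> answer = Dense(vocab_size)(answer)
>>> output = Activation("softmax")(answer)
>>> model = Model(inputs=[story_input, question_input], outputs=output)
>>> model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
# Train, keeping the History object for the accuracy plot below
>>> history = model.fit([Xstrain, Xqtrain], [Ytrain], batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, validation_data=([Xstest, Xqtest], [Ytest]))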
By reading the model summary in the following image, you can see how the blocks are
connected and the total number of parameters that need to be trained to tune the model:
The model accuracy has improved significantly from the first epoch (train accuracy =
19.35% and validation accuracy = 28.98%) to the 40th (train accuracy = 82.22% and validation
accuracy = 84.51%), as shown in the following image:
The following code plots how both the training and validation accuracy change from epoch
to epoch:
# plot accuracy and loss plot
>>> plt.title("Accuracy")
>>> plt.plot(history.history["acc"], color="g", label="train")
>>> plt.plot(history.history["val_acc"], color="r", label="validation")
>>> plt.legend(loc="best")
>>> plt.show()
The change in accuracy with the number of iterations is shown in the following image. It
seems that the accuracy has improved marginally rather than drastically after 10 iterations:
In the following code, the results are predicted by computing the probability of each class
and applying the argmax function to find the class with the maximum probability:
# get predictions of labels
>>> ytest = np.argmax(Ytest, axis=1)
>>> Ytest_ = model.predict([Xstest, Xqtest])
>>> ytest_ = np.argmax(Ytest_, axis=1)
# Select Random questions and predict answers
>>> NUM_DISPLAY = 10
>>> for i in random.sample(range(Xstest.shape[0]),NUM_DISPLAY):
... story = " ".join([indx2word[x] for x in Xstest[i].tolist() if x != 0])
... question = " ".join([indx2word[x] for x in Xqtest[i].tolist()])
... label = indx2word[ytest[i]]
... prediction = indx2word[ytest_[i]]
... print(story, question, label, prediction)
After training the model sufficiently and achieving a good validation accuracy of 84.51%, it
is time to verify against actual test data to see how well the predicted answers line up with
the actual answers.
Out of ten randomly drawn questions, the model failed to predict the correct answer only
once (for the sixth question, the actual answer is bedroom and the predicted answer is
office). This means we got 90% accuracy on this sample. Though we cannot generalize from
this single sample, it gives the reader some idea of the prediction ability of the model:
Getting ready...
The Alice in Wonderland text has been used for this purpose, and the data can be
downloaded from http://www.umich.edu/~umfandsf/other/ebooks/alice30.txt. In the
initial data preparation stage, we extract N-grams from the continuous text of the story file:
>>> from __future__ import print_function
>>> import os
""" First change the following directory link to where all input files do
exist """
>>> os.chdir("C:\\Users\\prata\\Documents\\book_codes\\NLP_DL")
>>> from sklearn.model_selection import train_test_split
>>> import nltk
>>> import numpy as np
>>> import string
# File reading
>>> with open('alice_in_wonderland.txt', 'r') as content_file:
... content = content_file.read()
>>> content2 = " ".join("".join([" " if ch in string.punctuation else ch
for ch in content]).split())
>>> tokens = nltk.word_tokenize(content2)
>>> tokens = [word.lower() for word in tokens if len(word)>=2]
The N-grams are selected with the following value of N. In this code, we have chosen N as 3,
which means each piece of data consists of three consecutive words. Of these, the two
preceding words (a bigram) are used to predict the next word in each observation. Readers
are encouraged to change the value of N and see how the model's predictions change:
# Select value of N for N grams among which N-1 are used to predict last
Nth word
>>> N = 3
>>> quads = list(nltk.ngrams(tokens,N))
>>> newl_app = []
>>> for ln in quads:
... newl = " ".join(ln)
... newl_app.append(newl)
How to do it...
After extracting basic data observations, we need to perform the following operations:
How it works...
The given words (the X and Y words) are vectorized into the vocabulary space using
CountVectorizer from scikit-learn:
# Vectorizing the words
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> x_trigm = []
>>> y_trigm = []
>>> for l in newl_app:
... x_str = " ".join(l.split()[0:N-1])
... y_str = l.split()[N-1]
... x_trigm.append(x_str)
... y_trigm.append(y_str)
>>> x_trigm_check = vectorizer.fit_transform(x_trigm).todense()
>>> y_trigm_check = vectorizer.fit_transform(y_trigm).todense()
# Dictionaries from word to integer and integer to word
>>> dictnry = vectorizer.vocabulary_
>>> rev_dictnry = {v:k for k,v in dictnry.items()}
>>> X = np.array(x_trigm_check)
>>> Y = np.array(y_trigm_check)
>>> Xtrain, Xtest, Ytrain, Ytest,xtrain_tg,xtest_tg = train_test_split(X,
Y,x_trigm, test_size=0.3,random_state=42)
>>> print("X Train shape",Xtrain.shape, "Y Train shape" , Ytrain.shape)
>>> print("X Test shape",Xtest.shape, "Y Test shape" , Ytest.shape)
After converting the data into vectorized form, we can see that the number of columns stays
the same: the vocabulary length (2,559 possible words):
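The top of this model definition falls on the previous page and is not reproduced; a plausible reconstruction of the missing lines is shown below (the imports, layer sizes, and batch size are assumptions, while the epoch count of 100 is taken from the training description that follows):
# Hypothetical head of the model definition (assumed values)
>>> from keras.layers import Input, Dense, Dropout
>>> from keras.models import Model
>>> BATCH_SIZE = 128
>>> NUM_EPOCHS = 100
>>> input_layer = Input(shape=(Xtrain.shape[1],), name="input")
>>> first_layer = Dense(1000, activation='relu', name="first")(input_layer)
>>> first_dropout = Dropout(0.5, name="firstdout")(first_layer)
>>> second_layer = Dense(500, activation='relu', name="second")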
(first_dropout)
>>> third_layer = Dense(1000,activation='relu',name="third") (second_layer)
>>> third_dropout = Dropout(0.5,name="thirdout")(third_layer)
>>> fourth_layer = Dense(Ytrain.shape[1],activation='softmax',name =
"fourth")(third_dropout)
>>> history = Model(input_layer,fourth_layer)
>>> history.compile(optimizer = "adam",loss="categorical_crossentropy",
metrics=["accuracy"])
>>> print (history.summary())
This screenshot depicts the complete architecture of the model, consisting of a convergent-
divergent structure:
# Model Training
>>> history.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,epochs=NUM_EPOCHS,
verbose=1,validation_split = 0.2)
The model is trained for 100 epochs. Even after a significant improvement in the train
accuracy (from 5.46% to 63.18%), there is little improvement in the validation accuracy
(6.63% to 10.53%). However, readers are encouraged to try various settings to improve the
validation accuracy further:
# Model Prediction
>>> Y_pred = history.predict(Xtest)
# Sample check on Test data
>>> print ("Prior bigram words", "|Actual", "|Predicted","\n")
>>> for i in range(10):
... print (i,xtest_tg[i],"|",rev_dictnry[np.argmax(Ytest[i])],
"|",rev_dictnry[np.argmax(Y_pred[i])])
The low validation accuracy hints that the model may not predict the next word very well.
The reason could be the very high dimensionality of working at the word level rather than
the character level (there are only 26 characters, far fewer than the 2,559 words). In the
following screenshot, the model predicts correctly about two times out of ten. However, this
judgment is rather subjective: sometimes the predicted word is close to, but not exactly the
same as, the actual word:
Getting ready...
The A.L.I.C.E. Artificial Intelligence Foundation dataset bot.aiml, written in Artificial
Intelligence Markup Language (AIML), an XML-like customized syntax, has been used to
train the model. In this file, questions and answers are mapped: for each question, there is a
particular answer. The complete set of .aiml files is available as aiml-en-us-foundation-alice.v1-9 from
https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip
the archive to find the bot.aiml file, open it with a text editor, and save it as bot.txt to
read it in Python:
>>> import os
""" First change the following directory link to where all input files do
exist """
>>> os.chdir("C:\\Users\\prata\\Documents\\book_codes\\NLP_DL")
>>> import numpy as np
>>> import pandas as pd
# File reading
>>> with open('bot.txt', 'r') as content_file:
... botdata = content_file.read()
>>> Questions = []
>>> Answers = []
AIML files have a unique syntax, similar to XML. The pattern tag represents the question
and the template tag the answer; hence, we extract each accordingly:
>>> for line in botdata.split("</pattern>"):
... if "<pattern>" in line:
... Quesn = line[line.find("<pattern>")+len("<pattern>"):]
... Questions.append(Quesn.lower())
>>> for line in botdata.split("</template>"):
... if "<template>" in line:
... Ans = line[line.find("<template>")+len("<template>"):]
... Ans = Ans.lower()
... Answers.append(Ans.lower())
The questions and answers are joined to extract the total vocabulary used in the modeling,
as we need to convert all words and characters into a numeric representation. The reason is
the same as mentioned before: deep learning models can't read English, so everything must
be numbers for the model.
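The construction of the combined question-and-answer data frame used below is not shown; a plausible version (the column layout is an assumption based on how QnAdata is indexed later) is:
# Hypothetical construction of the combined Q&A data frame (column order assumed
# so that QnAdata.iloc[i][2] below refers to the combined text)
>>> QnAdata = pd.DataFrame({"Question": Questions, "Answer": Answers}, columns=["Question", "Answer"])
>>> QnAdata["QnAcomb"] = QnAdata["Question"] + " " + QnAdata["Answer"]
>>> print(QnAdata.shape)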
How to do it...
After extracting the question-and-answer pairs, the following steps are needed to process
the data and produce the results:
How it works...
The questions and answers are used to create a word-to-index vocabulary mapping, which
will be used to convert words into vector mappings:
# Creating Vocabulary
>>> import nltk
>>> import collections
>>> counter = collections.Counter()
>>> for i in range(len(QnAdata)):
... for word in nltk.word_tokenize(QnAdata.iloc[i][2]):
... counter[word]+=1
>>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())}
>>> idx2word = {v:k for k,v in word2idx.items()}
>>> idx2word[0] = "PAD"
>>> vocab_size = len(word2idx)+1
>>> print (vocab_size)
Encoding and decoding functions are used to convert text to indices and indices to text
respectively. As we know, deep learning models work on numeric values rather than text
or character data:
>>> def encode(sentence, maxlen, vocab_size):
...     indices = np.zeros((maxlen, vocab_size))
...     for i, w in enumerate(nltk.word_tokenize(sentence)):
...         if i == maxlen: break
...         indices[i, word2idx[w]] = 1
...     return indices
>>> def decode(indices, calc_argmax=True):
...     if calc_argmax:
...         indices = np.argmax(indices, axis=-1)
...     return ' '.join(idx2word[x] for x in indices)
The following code vectorizes the questions and answers using the given maximum lengths
for each; the two lengths can differ. In some pieces of data, the question is longer than the
answer, and in a few cases it is shorter. Ideally, longer questions help in catching the right
answers. Unfortunately, in this case the question length is much shorter than the answer
length, which makes it a very difficult example on which to develop generative models:
>>> question_maxlen = 10
>>> answer_maxlen = 20
>>> def create_questions(question_maxlen,vocab_size):
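...     # NOTE: the body of this function, and the matching create_answers
...     # helper, is a reconstruction; the original listing is cut off here
...     question_idx = np.zeros(shape=(len(Questions), question_maxlen, vocab_size))
...     for q in range(len(Questions)):
...         question_idx[q] = encode(Questions[q], question_maxlen, vocab_size)
...     return question_idx
>>> def create_answers(answer_maxlen, vocab_size):
...     answer_idx = np.zeros(shape=(len(Answers), answer_maxlen, vocab_size))
...     for a in range(len(Answers)):
...         answer_idx[a] = encode(Answers[a], answer_maxlen, vocab_size)
...     return answer_idx
# Build the training arrays used later when fitting the model (assumed calls)
>>> quesns_train = create_questions(question_maxlen, vocab_size)
>>> answs_train = create_answers(answer_maxlen, vocab_size)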
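The layer imports for the model below are not shown in the recipe; presumably they come from the standard Keras modules, for example:
# Assumed imports for the chatbot model
>>> from keras.layers import Input, Activation, Dense, RepeatVector, TimeDistributed, ActivityRegularization
>>> from keras.layers.recurrent import LSTM
>>> from keras.models import Model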
The following code is an important part of the chatbot. Here we use a recurrent network, a
repeat vector, and a time-distributed layer. The repeat vector is used to match the input
dimensions to the output length, whereas the time-distributed layer maps each output
position onto the vocabulary size:
>>> n_hidden = 128
>>> question_layer = Input(shape=(question_maxlen,vocab_size))
>>> encoder_rnn = LSTM(n_hidden,dropout=0.2,recurrent_dropout=0.2)
(question_layer)
>>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn)
>>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode)
>>> regularized_layer = ActivityRegularization(l2=1)(dense_layer)
>>> softmax_layer = Activation('softmax')(regularized_layer)
>>> model = Model([question_layer],[softmax_layer])
>>> model.compile(loss='categorical_crossentropy', optimizer='adam',
metrics=['accuracy'])
>>> print (model.summary())
The following model summary shows how the tensor dimensions change as data flows
through the model. The input layer matches the question's dimensions and the output
matches the answer's dimensions:
# Model Training
>>> quesns_train_2 = quesns_train.astype('float32')
>>> answs_train_2 = answs_train.astype('float32')
>>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30,
validation_split=0.05)
The results in the following screenshot are a bit tricky to interpret, even though the accuracy
is significantly high. The chatbot model can still produce complete nonsense, as most of the
words here are padding. The reason? The number of real words in this data is small:
# Model prediction
>>> ans_pred = model.predict(quesns_train_2[0:3])
>>> print (decode(ans_pred[0]))
>>> print (decode(ans_pred[1]))
The following screenshot depicts a sample output on test data. The output does not seem to
make much sense, which is a known issue with generative models:
Our model did not work well in this case, but there are still some areas of improvement to
pursue with generative chatbot models. Readers can give the following a try:
Use a dataset with lengthy questions and answers, so the model can catch the signal well
Create a larger deep learning architecture and train over longer iterations
Make the question-and-answer pairs more generic rather than factoid-based, such as
retrieving knowledge and so on, where generative models otherwise fail miserably