NLP Lab Manual LP 6

Laboratory Practice-VI Class: BE(Computer Engineering)
G. S. MOZE COLLEGE OF ENGINEERING
BALEWADI, PUNE
DEPARTMENT OF COMPUTER ENGINEERING
SEMESTER-II
[A.Y. : 2023 - 2024]
Laboratory Practice-VI (410256)

Natural Language Processing 410252(A)
LABORATORY MANUAL
Department of Computer Engineering , Page 1

Savitribai Phule Pune University

Fourth Year of Computer Engineering (2019 course)
410256: Laboratory Practice-VI
Teaching Scheme: PR: 02 Hours/Week
Credit: 01
Examination Scheme: TW: 50 Marks
Course Objectives:
• To understand the fundamental concepts and techniques of natural language processing (NLP)
Course Outcomes:
On completion of the course, students will be able to:
CO1: Apply basic principles of elective subjects to problem solving and modelling.
CO2: Use tools and techniques in the area of software development to build mini projects.
CO3: Design and develop applications on subjects of their choice.
CO4: Generate and manage deployment, administration & security.

List of Assignments
Sr. No.  Title
Group 1
1  Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using the NLTK library.
   Use the Porter stemmer and the Snowball stemmer for stemming. Use any technique for lemmatization.
   Input / Dataset: use any sample sentence
2  Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF
   on data. Create embeddings using Word2Vec.
   Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset
3  Perform text cleaning, perform lemmatization (any method), remove stop words (any
   method), label encoding. Create representations using TF-IDF. Save outputs.
   Dataset: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle
4  Create a transformer from scratch using the PyTorch library
5  Morphology is the study of the way words are built up from smaller meaning-bearing units. Study
   and understand the concepts of morphology by the use of an Add-Delete table

Group 2
6  Mini Project (Fine-tune transformers on your preferred task)
   Fine-tune a pretrained transformer for any of the following tasks on any relevant dataset of your
   choice:
   • Neural Machine Translation
   • Classification
   • Summarization
7  Mini Project - POS Taggers for Indian Languages
8  Mini Project - Feature Extraction using seven moment variants
9  Mini Project - Feature Extraction using Zernike Moments


Group 1
Assignment No 1
Problem Statement:
Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using the NLTK library. Use
the Porter stemmer and the Snowball stemmer for stemming. Use any technique for lemmatization.
Input / Dataset: use any sample sentence
Objective:
To understand the fundamental concepts and techniques of natural language processing (NLP).
CO Relevance: CO1

Contents for Theory:
Introduction to NLP (Natural Language Processing)
Computers operate in their own language, binary, which limits how they can interact with humans.
Extending them to understand human language is crucial to removing that limitation.
NLP is an abbreviation for natural language processing, which encompasses a set of tools, routines, and
techniques that computers can use to process and understand human communication. NLP should not be
confused with speech recognition: NLP deals with understanding the meaning of words, rather than with
converting audio signals into those words.
If you think NLP is just a futuristic idea, you may be surprised to learn that we are likely to interact
with NLP every day: when we run queries in Google, when we use online translators, and when we talk with
Google Assistant or Siri. NLP is everywhere, and implementing it in your own projects is now very
achievable thanks to libraries such as NLTK, which abstract away much of the complexity.
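As a minimal sketch of what this assignment asks for, the following Python snippet runs the five
tokenizers and the two stemmers named in the problem statement. The sample sentence, the word list, and
the multi-word expression ("New", "York") are illustrative assumptions, not part of the manual.
Lemmatization is left out of this sketch because WordNet (for NLTK) and spaCy models need a separate
data download.

```python
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tokenize import (
    MWETokenizer,
    TreebankWordTokenizer,
    TweetTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

# Illustrative sample sentence (an assumption; any sentence works).
sentence = "I can't wait to visit New York!"

# Whitespace: split on runs of whitespace only.
whitespace_tokens = WhitespaceTokenizer().tokenize(sentence)
# Punctuation-based: separate alphanumeric runs from punctuation runs.
punct_tokens = WordPunctTokenizer().tokenize(sentence)
# Treebank: Penn Treebank conventions (splits contractions).
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)
# Tweet: tokenizer designed for social-media text.
tweet_tokens = TweetTokenizer().tokenize(sentence)
# MWE: merge a known multi-word expression in an already tokenized list.
mwe_tokens = MWETokenizer([("New", "York")], separator="_").tokenize(treebank_tokens)

# Stemming with the two stemmers named in the problem statement.
porter = PorterStemmer()
snowball = SnowballStemmer("english")
words = ["running", "flies", "studies"]  # illustrative word list
porter_stems = [porter.stem(w) for w in words]
snowball_stems = [snowball.stem(w) for w in words]

for name, tokens in [
    ("Whitespace", whitespace_tokens),
    ("Punctuation", punct_tokens),
    ("Treebank", treebank_tokens),
    ("Tweet", tweet_tokens),
    ("MWE", mwe_tokens),
]:
    print(f"{name:12s} {tokens}")
print("Porter      ", porter_stems)
print("Snowball    ", snowball_stems)
```

Comparing the printed lists shows how the tokenizers differ: the whitespace tokenizer keeps "York!" as
one token, while the Treebank tokenizer splits "can't" into "ca" and "n't".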
Conclusion: In this way, we performed tokenization and Porter and Snowball stemming using NLTK, and
lemmatization using the spaCy library.
Viva Questions
1. What is the difference between the Porter and Snowball stemmers?
2. What is lemmatization?
3. Differentiate between lemmatization and stemming.
4. Which Python libraries are used for lemmatization?
5. Why do we need tokenization?
Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:

Group 1
Assignment No:2
Title of the Assignment:
Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF on data.
Create embeddings using Word2Vec.
Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset
Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).
CO Relevance: CO1

Theory:
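The three representations named in the title (count occurrence, normalized count occurrence, and
TF-IDF) can be sketched in plain Python as follows. The three short documents are made-up examples, not
rows from the car dataset, and the TF-IDF formula used here is the plain textbook form tf × log(N / df),
which is an assumption: libraries such as scikit-learn apply a smoothed variant, so their numbers will
differ. Word2Vec is not shown because it needs a third-party library such as gensim.

```python
import math
from collections import Counter

# Made-up toy corpus (an assumption; replace with text from the real dataset).
docs = [
    "the car is fast",
    "the car is red",
    "the truck is slow",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})


def count_vector(tokens):
    """Bag-of-words: raw count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def normalized_vector(tokens):
    """Normalized count: raw count divided by the document length."""
    return [count / len(tokens) for count in count_vector(tokens)]


def idf(word):
    """Inverse document frequency: log(N / number of documents containing word)."""
    doc_freq = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / doc_freq)


def tfidf_vector(tokens):
    """TF-IDF: normalized term frequency weighted by IDF."""
    return [tf * idf(word) for tf, word in zip(normalized_vector(tokens), vocab)]


print(vocab)
for tokens in tokenized:
    print(count_vector(tokens), [round(v, 3) for v in tfidf_vector(tokens)])
```

Words that occur in every document ("the", "is") receive a TF-IDF weight of zero, while words unique to
one document ("fast", "red") receive the highest weights.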







Conclusion: In this way, we implemented the Bag-of-Words (BoW) and TF-IDF approaches using Python
libraries, and implemented a Word2Vec model using the gensim library.
Viva Questions:
1. What is TF-IDF?
2. Differentiate between continuous bag-of-words (CBOW) and skip-gram.
3. What is meant by word embedding, and what are the techniques of word embedding?
4. Why is word embedding required?
5. Which libraries are used for word embedding?
Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:

Group 1
Assignment No 3
Title of the Assignment:
Perform text cleaning, perform lemmatization (any method), remove stop words (any method), label
encoding. Create representations using TF-IDF. Save outputs.
Dataset: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle
Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).
CO Relevance: CO1
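A minimal sketch of the cleaning, stop-word removal, label-encoding, and saving steps is given below,
using only the Python standard library. The stop-word set, the sample sentence, and the category labels
are small hypothetical examples, not taken from the news dataset; in practice a fuller stop-word list
(for example from NLTK or spaCy) and a lemmatizer would be used.

```python
import pickle
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}


def clean_text(text):
    """Lowercase, replace non-letters with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(text):
    """Split cleaned text on spaces and drop stop words."""
    return [word for word in text.split() if word not in STOP_WORDS]


def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


cleaned = clean_text("The Budget is UP 5%, say analysts!")
tokens = remove_stop_words(cleaned)
encoded, mapping = encode_labels(["sport", "business", "tech", "sport"])

# "Save outputs": serialize with pickle (shown in memory; use pickle.dump to write a file).
restored = pickle.loads(pickle.dumps({"tokens": tokens, "labels": encoded}))

print(cleaned)
print(tokens)
print(encoded, mapping)
```

Sorting the labels before numbering them keeps the encoding stable across runs, which matters when the
saved outputs are reloaded later.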




Conclusion:
Hence, we performed text cleaning, lemmatization, stop word removal, and label encoding, and created
representations using TF-IDF.
Viva Questions:
1. What is lemmatization?
2. Differentiate between lemmatization and stemming.
3. How is TF-IDF calculated?
4. What is the pickle library?
5. How is text cleaning performed?
Date:
Marks obtained:

Group 1
Assignment No 4
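Title of the Assignment:
Create a transformer from scratch using the PyTorch library.
Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).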
CO Relevance: CO1
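The core building block of a transformer is scaled dot-product self-attention. The sketch below shows a
single attention head in PyTorch; it is only one component, not a full transformer (no multi-head
splitting, positional encoding, feed-forward layer, or masking). The class name, the model width of 16,
and the random input are assumptions chosen for illustration.

```python
import math

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""

    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        # Learned projections that turn the input into queries, keys, and values.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x has shape (batch, sequence length, d_model).
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Similarity of every position with every other, scaled by sqrt(d_model).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        # Each row becomes a probability distribution over positions.
        weights = torch.softmax(scores, dim=-1)
        # Output is the attention-weighted sum of the value vectors.
        return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # 2 sequences, 5 tokens each, 16 features per token
attention = SelfAttention(d_model=16)
out, weights = attention(x)
print(out.shape, weights.shape)
```

The output keeps the input shape, and each row of the attention weights sums to 1, so every token's new
representation is a weighted average of all value vectors in its sequence.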










Conclusion:
Hence, we implemented a transformer from scratch using the PyTorch library.
Viva Questions:
• What are the types of embedding in a transformer?
• What are the applications of transformers?
• What is the purpose of transformers in NLP?
• What is a transformer?
• What is self-attention?
Date:
Marks obtained:
Name of course Coordinator :

Group 1
Assignment No 5
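Title of the Assignment:
Morphology is the study of the way words are built up from smaller meaning-bearing units. Study and
understand the concepts of morphology by the use of an Add-Delete table.
Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).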
CO Relevance: CO1

Morphology:
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e., morphemes.
A morpheme is the smallest meaningful linguistic unit. For example:
• बच्चों (bachchoM) consists of two morphemes: बच्चा (bachchaa) carries the information of the root
noun "बच्चा" (bachchaa), and ओं (oM) carries the information of plural number and oblique case.
• played has two morphemes, play and -ed, carrying the information verb "play" and "past tense", so the
word is the past tense form of the verb "play".
Words can be analysed morphologically if we know all variants of a given root word. We can use an
'Add-Delete' table for this analysis.
Morph Analyser
Definition:
Morphemes are considered the smallest meaningful units of a language. A morpheme can be either a root
word (play) or an affix (-ed). The combination of these morphemes is called a morphological process. So
the word "played" is made of two morphemes, "play" and "-ed". Finding all the parts of a word (its
morphemes) and thereby describing the properties of the word is called "Morphological Analysis". For
example, "played" carries the information verb "play" and "past tense", so the word is the past tense
form of the verb "play".
Analysis of a word:
बच्चों (bachchoM) = बच्चा (bachchaa) (root) + ओं (oM) (suffix) (ओं = 3 plural oblique)

A linguistic paradigm is the complete set of variants of a given lexeme. These variants can be classified
according to shared inflectional categories (e.g., number, case) and arranged into tables.

Paradigm for बच्चा
Case/Number   Singular            Plural
Direct        बच्चा (bachchaa)     बच्चे (bachche)
Oblique       बच्चे (bachche)      बच्चों (bachchoM)
Algorithm to get बच्चों (bachchoM) from बच्चा (bachchaa)
1. Take the root बच्च (bachch) + आ (aa)
2. Delete आ (aa)
3. Output बच्च (bachch)
4. Add ओं (oM) to the output
5. Return बच्चों (bachchoM)
Therefore, आ is deleted and ओं is added to get बच्चों.
Add-Delete table for बच्चा
(sing = singular, plu = plural, dr = direct, ob = oblique)
Delete    Add       Number   Case   Variant
आ (aa)    आ (aa)    sing     dr     बच्चा (bachchaa)
आ (aa)    ए (e)     plu      dr     बच्चे (bachche)
आ (aa)    ए (e)     sing     ob     बच्चे (bachche)
आ (aa)    ओं (oM)   plu      ob     बच्चों (bachchoM)
Paradigm Class
Words in the same paradigm class behave similarly. For example, लड़क is in the same paradigm class as
बच्च, so लड़का behaves like बच्चा.
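The Add-Delete procedure above can be sketched in Python as follows. To keep the example free of
Unicode combining-character issues, it works on the romanized forms used in this section (bachchaa, oM,
and so on). The function names are hypothetical, and the romanization "laDakaa" for लड़का is an
assumption made for the paradigm-class example.

```python
# Rows of the Add-Delete table: (delete, add, number, case).
ADD_DELETE_TABLE = [
    ("aa", "aa", "singular", "direct"),
    ("aa", "e", "plural", "direct"),
    ("aa", "e", "singular", "oblique"),
    ("aa", "oM", "plural", "oblique"),
]


def generate(root, delete, add):
    """Delete a suffix from the root word, then add the new suffix."""
    if not root.endswith(delete):
        raise ValueError(f"{root!r} does not end with {delete!r}")
    return root[: len(root) - len(delete)] + add


def paradigm(root):
    """Build every variant of the root word from the Add-Delete table."""
    return {
        (number, case): generate(root, delete, add)
        for delete, add, number, case in ADD_DELETE_TABLE
    }


def analyse(word, root):
    """Return every (number, case) reading whose generated form equals the word."""
    return [features for features, form in paradigm(root).items() if form == word]


print(paradigm("bachchaa"))
print(analyse("bachchoM", "bachchaa"))
print(analyse("bachche", "bachchaa"))
# A word in the same paradigm class reuses the same table (romanization assumed).
print(paradigm("laDakaa"))
```

Note that "bachche" is ambiguous: the table generates it for both plural direct and singular oblique, so
the analysis returns two readings.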


Conclusion: We successfully studied the morphology of a word using an Add-Delete table.
Viva Questions:
• What is morphology?
• What are the types of morphology?
• Why do we need to do morphological analysis?
• What is the application of morphology in linguistics?
• What is the difference between inflectional and derivational morphology?
Date:
Marks obtained:
Name of course Coordinator :

Group 2
Mini-Project

1. Mini Project (Fine tune transformers on your preferred task)
Finetune a pretrained transformer for any of the following tasks on any relevant dataset of your choice:
• Neural Machine Translation
• Classification
• Summarization
2. Mini Project - POS Taggers For Indian Languages
3. Mini Project -Feature Extraction using seven moment variants
4. Mini Project -Feature Extraction using Zernike Moments
Objective of the Assignment: To understand the concept of a mini-project.
Outcome: Students will be able to learn and understand the concept of a project.

Theory:
Abstract:
Introduction:
Software Requirement Specification:
Graphical User Interface:
Source Code:
Testing document:
Conclusion:
Date:
Marks obtained:
