NLP Lab Manual LP 6
DEPARTMENT OF COMPUTER ENGINEERING
SEMESTER-II
[A.Y. : 2023 - 2024]
Course Objectives:
• To understand the fundamental concepts and techniques of natural language processing (NLP)
Course Outcomes:
CO1: Apply basic principles of elective subjects to problem solving and modelling.
CO2: Use tools and techniques in the area of software development to build mini projects.
List of Assignments
Sr. No.  Title
Group 1
2 Perform bag-of-words approach (count occurrence, normalized count occurrence) and TF-IDF
on data. Create embeddings using Word2Vec.
Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset
3 Perform text cleaning, perform lemmatization (any method), remove stop words (any
method), and perform label encoding. Create representations using TF-IDF. Save outputs.
Dataset: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle
5 Morphology is the study of the way words are built up from smaller meaning-bearing units. Study
and understand the concepts of morphology by using an Add-Delete table.
Group 2
6 Mini Project (Fine-tune transformers on your preferred task)
Fine-tune a pretrained transformer for any of the following tasks on any relevant dataset of your
choice:
• Neural Machine Translation
• Classification
• Summarization
Group 1
Assignment No 1
Problem Statement:
Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using NLTK library. Use
porter stemmer and snowball stemmer for stemming. Use any technique for lemmatization.
Input / Dataset: Use any sample sentence.
Objective:
To understand the fundamental concepts and techniques of natural language processing (NLP).
CO Relevance: CO1
Computers work in their own language, binary, which limits how they can interact with humans.
Extending what they can process to include human language is crucial to moving past that limit.
NLP is an abbreviation for natural language processing, which encompasses a set of tools, routines, and
techniques that computers can use to process and understand human communication. NLP should not be
confused with speech recognition: NLP deals with understanding the meaning of words, whereas speech
recognition converts audio signals into those words.
NLP is not just a futuristic idea. We are likely to interact with it every day: when we run queries in
Google, when we use online translators, and when we talk with Google Assistant or Siri. Implementing
NLP in your own projects is now within reach thanks to libraries such as NLTK, which abstract away
much of the complexity.
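The tokenizers and stemmers named in the problem statement can be tried out with a minimal sketch like the one below. It assumes NLTK is installed; the sample sentences, the word list, and the multi-word expression ("New", "York") are illustrative choices, not part of the assignment. Lemmatization is left out here because it needs extra downloaded resources (for example WordNet data for NLTK or a language model for spaCy).

```python
from nltk.tokenize import (
    WhitespaceTokenizer,
    WordPunctTokenizer,
    TreebankWordTokenizer,
    TweetTokenizer,
    MWETokenizer,
)
from nltk.stem import PorterStemmer, SnowballStemmer

# Illustrative sample inputs (any sentence can be used)
sentence = "Good muffins cost $3.88 in New York."
tweet = "Loving #NLP with @nltk_org :-)"

# Whitespace: split on spaces, tabs and newlines only
whitespace_tokens = WhitespaceTokenizer().tokenize(sentence)

# Punctuation-based: runs of letters/digits and runs of punctuation become separate tokens
punct_tokens = WordPunctTokenizer().tokenize(sentence)

# Treebank: Penn Treebank conventions (e.g. "$" is split off, "3.88" is kept whole)
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

# Tweet: aware of hashtags, @-handles and emoticons
tweet_tokens = TweetTokenizer().tokenize(tweet)

# MWE: re-joins listed multi-word expressions in an already tokenized sentence
mwe_tokens = MWETokenizer([("New", "York")], separator="_").tokenize(treebank_tokens)

# Stemming with the Porter and Snowball (English) stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")
words = ["running", "studies", "played"]
porter_stems = [porter.stem(w) for w in words]
snowball_stems = [snowball.stem(w) for w in words]

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("punctuation", punct_tokens),
    ("treebank", treebank_tokens),
    ("tweet", tweet_tokens),
    ("mwe", mwe_tokens),
    ("porter", porter_stems),
    ("snowball", snowball_stems),
]:
    print(f"{name:12s} {tokens}")
```

Comparing the printed lists side by side shows how each tokenizer treats punctuation, currency amounts, and multi-word names differently.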
Conclusion: In this way, we performed tokenization and Porter and Snowball stemming using NLTK, and
lemmatization using the spaCy library.
Viva Questions
1. What is the difference between the Porter and Snowball stemmers?
2. What is lemmatization?
3. Differentiate between lemmatization and stemming.
4. Which Python libraries can be used for lemmatization?
5. Why do we need tokenization?
Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:
Group 1
Assignment No 2
Problem Statement:
Perform bag-of-words approach (count occurrence, normalized count occurrence) and TF-IDF on data.
Create embeddings using Word2Vec.
Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset
Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).
CO Relevance: CO1
Theory:
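The three bag-of-words representations named in the problem statement can be illustrated with a minimal pure-Python sketch. The toy corpus below is hypothetical and only stands in for text taken from the dataset. The TF-IDF weighting shown (term frequency as normalized count, inverse document frequency as log(N / df)) is one common textbook variant; libraries such as scikit-learn use smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for text from the dataset
docs = [
    "the car is fast",
    "the car is red",
    "the truck is slow",
]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct term, in a fixed (sorted) order
vocab = sorted({term for doc in tokenized for term in doc})


def count_vector(tokens):
    """Bag-of-words: raw count of each vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def normalized_vector(tokens):
    """Normalized count: raw count divided by the document length."""
    return [c / len(tokens) for c in count_vector(tokens)]


def idf(term):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)


def tfidf_vector(tokens):
    """TF-IDF: normalized count weighted by the inverse document frequency."""
    return [tf * idf(term) for tf, term in zip(normalized_vector(tokens), vocab)]


print(vocab)
for tokens in tokenized:
    print(count_vector(tokens), [round(x, 3) for x in tfidf_vector(tokens)])
```

Terms that occur in every document, such as "the" and "is" here, get a TF-IDF weight of zero, while terms unique to one document get the highest weight.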
Conclusion: In this way, we have implemented the bag-of-words (BoW) and TF-IDF approaches using
Python libraries, and successfully implemented a Word2Vec model using the gensim library.
Viva Questions:
1. What is TF-IDF?
2. Differentiate between continuous bag-of-words (CBOW) and skip-gram.
3. What is meant by word embedding, and what are the techniques for word embedding?
4. Why is word embedding required?
5. Which libraries are used for word embedding?
Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:
Group 1
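Assignment No 3
Problem Statement:
Perform text cleaning, perform lemmatization (any method), remove stop words (any method), and
perform label encoding. Create representations using TF-IDF. Save outputs.
Dataset: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle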
Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).
CO Relevance: CO1
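The text cleaning, stop-word removal, and label encoding steps of this assignment can be sketched in pure Python as follows. The sample records, the category names, and the short stop-word list are hypothetical; the structure of the actual dataset is not assumed here. Label encoding follows the usual convention of numbering the sorted class names from 0. Lemmatization is omitted because it needs an external resource such as WordNet or a spaCy model.

```python
import re

# Hypothetical mini stop-word list; a real run would use a fuller list
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}

# Hypothetical (text, category) records
articles = [
    ("The Prime Minister spoke in London, 2005!", "politics"),
    ("Shares of the company rose 5%.", "business"),
    ("An election is due.", "politics"),
]


def clean_text(text):
    """Lowercase, replace everything except letters and spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(text):
    """Drop every word that appears in the stop-word list."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def encode_labels(labels):
    """Map each distinct label to an integer, numbering sorted labels from 0."""
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


cleaned = [remove_stop_words(clean_text(text)) for text, _ in articles]
codes, mapping = encode_labels([category for _, category in articles])

for text, code in zip(cleaned, codes):
    print(code, text)
print(mapping)
```

The cleaned strings can then be passed to a TF-IDF vectorizer, and the integer codes serve as the target labels.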
Conclusion:
Hence, we performed text cleaning, lemmatization, stop-word removal, and label encoding, and created
representations using TF-IDF.
Viva Question:
1. What is lemmatization?
Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:
Group 1
Assignment No 4
Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).
CO Relevance: CO1
Conclusion:
Viva Question:
• What are the types of embeddings in a Transformer?
• What are the applications of Transformers?
• What is the purpose of Transformers in NLP?
• What is a Transformer?
• What is self-attention?
Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator :
Group 1
Assignment No 5
Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).
CO Relevance: CO1
Morphology:
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e., morphemes.
A morpheme is the smallest meaningful linguistic unit. For example:
• बच्चों (bachchoM) consists of two morphemes: बच्चा (bachchaa) carries the information of the root
noun "बच्चा" (bachchaa), and ओं (oM) carries the information of plural number and oblique case.
• played has two morphemes, play and -ed, carrying the information verb "play" and "past tense", so the
word is the past-tense form of the verb "play".
Words can be analysed morphologically if we know all variants of a given root word. We can use an
'Add-Delete' table for this analysis.
Morph Analyser
Definition:
Morphemes are the smallest meaningful units of a language. A morpheme can be either a root word (play)
or an affix (-ed). Combining morphemes is called a morphological process; the word "played", for
example, is made of the two morphemes "play" and "-ed". Finding all the parts (morphemes) of a word,
and thereby describing its properties, is called morphological analysis. For "played", the analysis
yields the verb "play" and the feature "past tense", so the word is the past-tense form of the verb "play".
Analysis of a word:
1. Take the root word बच्चा (bachchaa)
2. Delete आ (aa)
3. Output बच्च (bachch)
Paradigm Class
Words in the same paradigm class behave similarly. For example, लड़क is in the same paradigm class as
बच्च, so लड़का would behave similarly to बच्चा, as they share the same paradigm class.
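The Add-Delete idea and the notion of a paradigm class can be sketched in Python using the romanized forms given above (bachchaa, aa, oM). This is a minimal sketch under stated assumptions: only the plural oblique row (delete "aa", add "oM") comes from the text; the other rows of the table and the romanization "laDakaa" for लड़का are illustrative.

```python
# Add-Delete table for one paradigm class, in romanized notation.
# Each row: (suffix to delete from the root, suffix to add, number, case).
# Only the plural-oblique row is taken from the text; the other rows are
# illustrative assumptions about this noun class.
PARADIGM_AA = [
    ("", "", "singular", "direct"),
    ("aa", "e", "plural", "direct"),
    ("aa", "e", "singular", "oblique"),
    ("aa", "oM", "plural", "oblique"),
]


def generate(root, table):
    """Apply every Add-Delete row to a root and return (form, number, case) triples."""
    forms = []
    for delete, add, number, case in table:
        if not root.endswith(delete):
            continue  # this row does not apply to the root
        stem = root[: len(root) - len(delete)]
        forms.append((stem + add, number, case))
    return forms


def analyse(word, roots, table):
    """Return every (root, number, case) whose generated form equals the word."""
    return [
        (root, number, case)
        for root in roots
        for form, number, case in generate(root, table)
        if form == word
    ]


# Two roots assumed to share the same paradigm class
roots = ["bachchaa", "laDakaa"]

print(generate("bachchaa", PARADIGM_AA))
print(analyse("bachchoM", roots, PARADIGM_AA))
print(analyse("laDakoM", roots, PARADIGM_AA))
```

Because both roots use the same table, adding a new word to the analyser only requires assigning it to a paradigm class, not listing all of its forms.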
Conclusion: We have successfully studied the morphology of a word using an Add-Delete table.
Viva Questions:
• What is morphology?
Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator :
Group 2
Mini-Project
Theory:
Abstract:
Introduction:
Source Code:
Testing document:
Conclusion:
Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator: