NLP Lab Manual LP 6


Laboratory Practice-VI Class: BE(Computer Engineering)

G. S. MOZE COLLEGE OF ENGINEERING


BALEWADI, PUNE

DEPARTMENT OF COMPUTER
ENGINEERING
SEMESTER-II
[A.Y. : 2023 - 2024]

Laboratory Practice-VI (410256)


Natural Language Processing 410252(A)
LABORATORY MANUAL

Department of Computer Engineering , Page 1



Savitribai Phule Pune University


Fourth Year of Computer Engineering (2019 course)
410256: Laboratory Practice-VI
Teaching Scheme: PR: 02 Hours/Week
Credit: 01
Examination Scheme: TW: 50 Marks

Course Objectives:

• To understand the fundamental concepts and techniques of natural language processing (NLP)

Course Outcomes:

On completion of the course, students will be able to:

CO1: Apply basic principles of elective subjects to problem solving and modelling.

CO2: Use tools and techniques in the area of software development to build mini projects.

CO3: Design and develop applications on subjects of their choice.

CO4: Generate and manage deployment, administration & security.


List of Assignments

Sr. No.  Title

Group 1

1  Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using the NLTK library. Use Porter stemmer and Snowball stemmer for stemming. Use any technique for lemmatization.
   Input / Dataset: use any sample sentence

2  Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF on data. Create embeddings using Word2Vec.
   Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset

3  Perform text cleaning, perform lemmatization (any method), remove stop words (any method), label encoding. Create representations using TF-IDF. Save outputs.
   Dataset: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle

4  Create a transformer from scratch using the PyTorch library.

5  Morphology is the study of the way words are built up from smaller meaning-bearing units. Study and understand the concepts of morphology by the use of an Add-Delete table.

Group 2

6  Mini Project (Fine-tune transformers on your preferred task)
   Fine-tune a pretrained transformer for any of the following tasks on any relevant dataset of your choice:
   • Neural Machine Translation
   • Classification
   • Summarization

7  Mini Project - POS Taggers for Indian Languages

8  Mini Project - Feature Extraction using seven moment variants

9  Mini Project - Feature Extraction using Zernike Moments


Group 1
Assignment No 1

Problem Statement:
Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using the NLTK library. Use the
Porter stemmer and the Snowball stemmer for stemming. Use any technique for lemmatization.
Input / Dataset: use any sample sentence

Objective:
To understand the fundamental concepts and techniques of natural language processing (NLP).

CO Relevance: CO1


Contents for Theory:

Introduction to NLP (Natural Language Processing)

Computers speak their own language, the binary language. Thus, they are limited in how they can
interact with us humans; expanding their language and understanding our own is crucial to set them free
from their boundaries.

NLP is an abbreviation for natural language processing, which encompasses a set of tools, routines, and
techniques computers can use to process and understand human communication. NLP should not be confused
with speech recognition: NLP deals with understanding the meaning of words, whereas speech recognition
converts audio signals into those words.

If you think NLP is just a futuristic idea, you may be surprised to learn that we interact with NLP almost
every day: when we run a query on Google, when we use an online translator, and when we talk to
Google Assistant or Siri. NLP is everywhere, and using it in your own projects is now well within reach
thanks to libraries such as NLTK, which hide much of the underlying complexity.
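
The tokenizers and stemmers named in the problem statement can be tried out with a short sketch like the one below. It assumes NLTK is installed; the sample sentence, the sample tweet, and the multi-word expression ("New", "York") are arbitrary choices made for illustration. Only classes that need no extra corpus download are used. Lemmatization is left out for that reason: NLTK's WordNetLemmatizer needs the WordNet corpus, and spaCy needs a language model.

```python
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tokenize import (
    MWETokenizer,
    TreebankWordTokenizer,
    TweetTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

# Arbitrary sample inputs (assumptions, not part of the assignment text).
sentence = "Good muffins cost $3.88 in New York."
tweet = "@nlp_lab tokenizing tweets is fun :) #nlp"

# Whitespace: split on spaces, tabs and newlines only.
whitespace_tokens = WhitespaceTokenizer().tokenize(sentence)

# Punctuation-based: runs of letters/digits and runs of punctuation become separate tokens.
punct_tokens = WordPunctTokenizer().tokenize(sentence)

# Treebank: Penn Treebank conventions (keeps "3.88" together, splits "$").
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

# Tweet: keeps handles, hashtags and emoticons as single tokens.
tweet_tokens = TweetTokenizer().tokenize(tweet)

# MWE: merges known multi-word expressions in an already tokenized list.
mwe_tokenizer = MWETokenizer([("New", "York")], separator="_")
mwe_tokens = mwe_tokenizer.tokenize(treebank_tokens)

# Stemming with the Porter and Snowball (English) stemmers.
porter = PorterStemmer()
snowball = SnowballStemmer("english")
words = ["running", "studies", "generously"]
porter_stems = [porter.stem(w) for w in words]
snowball_stems = [snowball.stem(w) for w in words]

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("punctuation", punct_tokens),
    ("treebank", treebank_tokens),
    ("tweet", tweet_tokens),
    ("mwe", mwe_tokens),
]:
    print(f"{name:12s} {tokens}")
print("porter      ", porter_stems)
print("snowball    ", snowball_stems)
```

Comparing the printed lists side by side shows where the tokenizers disagree, for example on how "$3.88" and the final period are split.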


Conclusion: In this way, we performed tokenization and Porter and Snowball stemming using NLTK, and
performed lemmatization using the spaCy library.

Viva Questions
1. What is the difference between the Porter and Snowball stemmers?
2. What is lemmatization?
3. Differentiate between lemmatization and stemming.
4. Which Python libraries can be used for lemmatization?
5. Why do we need tokenization?

Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:


Group 1
Assignment No:2

Title of the Assignment:

Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF on data.
Create embeddings using Word2Vec.
Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset

Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).

CO Relevance: CO1


Theory:
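
The three count-based representations named in the title can be illustrated with a small pure-Python sketch. The three documents below are hypothetical stand-ins for a text column of the dataset, and the TF-IDF variant used here (term frequency times log(N / document frequency), without smoothing) is one common textbook definition. Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration only).
docs = [
    "the car is fast",
    "the car is red",
    "the truck is slow and red",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})


def count_vector(tokens):
    """Bag-of-words: raw occurrence count of every vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def normalized_vector(tokens):
    """Normalized count: each count divided by the document length."""
    return [count / len(tokens) for count in count_vector(tokens)]


def idf(word):
    """Inverse document frequency: log(N / number of documents containing word)."""
    doc_freq = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / doc_freq)


def tfidf_vector(tokens):
    """TF-IDF: normalized term frequency weighted by IDF."""
    return [tf * idf(word) for tf, word in zip(normalized_vector(tokens), vocab)]


bow = [count_vector(doc) for doc in tokenized]
bow_normalized = [normalized_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]

print("vocabulary:", vocab)
for i, doc in enumerate(docs):
    print(doc)
    print("  counts    :", bow[i])
    print("  normalized:", [round(v, 2) for v in bow_normalized[i]])
    print("  tf-idf    :", [round(v, 2) for v in tfidf[i]])
```

Words that occur in every document, such as "the" and "is", receive a TF-IDF weight of zero under this definition, while words unique to one document receive the highest weights.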


Conclusion: In this way, we have implemented the Bag-of-Words (BoW) and TF-IDF approaches using a
Python library, and successfully implemented a Word2Vec model using the gensim library.
Viva Questions:

1. What is TF-IDF?
2. Differentiate between continuous bag-of-words (CBOW) and skip-gram.
3. What is meant by word embedding, and what are the techniques for word embedding?
4. Why is word embedding required?
5. Which libraries are used for word embedding?

Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:


Group 1
Assignment No 3

Title of the Assignment:


Perform text cleaning, perform lemmatization (any method), remove stop words (any method), label
encoding. Create representations using TF-IDF. Save outputs.

Dataset: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle

Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).

CO Relevance: CO1
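
A minimal sketch of the cleaning, stop-word removal, label-encoding, and saving steps is given below, using only the Python standard library. The rows, the column layout, and the tiny stop-word list are hypothetical, since the structure of the news dataset is not shown in this manual. Lemmatization is left out because it needs an external language resource. In practice the cleaned text would be passed to a lemmatizer and a TF-IDF vectorizer before saving.

```python
import pickle
import re

# Hypothetical (text, category) rows standing in for the news dataset.
rows = [
    ("Stocks RALLY as markets rebound!!", "business"),
    ("The team won the final, 3-1.", "sport"),
    ("New phone launched with a faster chip", "tech"),
    ("Shares fall after the earnings report", "business"),
]

# Tiny illustrative stop-word list; real lists (e.g. from NLTK or spaCy) are much longer.
STOP_WORDS = {"a", "after", "as", "the", "with"}


def clean(text):
    """Lowercase, drop everything except letters and spaces, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


cleaned = [clean(text) for text, _ in rows]

# Label encoding: map each category name to an integer id (alphabetical order).
classes = sorted({label for _, label in rows})
label_to_id = {label: index for index, label in enumerate(classes)}
encoded = [label_to_id[label] for _, label in rows]

# Save outputs: serialized to bytes here; pickle.dump(obj, file) writes to disk instead.
payload = pickle.dumps({"texts": cleaned, "labels": encoded, "classes": classes})
restored = pickle.loads(payload)

print(cleaned)
print(label_to_id, encoded)
```

Keeping the `classes` list alongside the encoded labels makes it possible to map predictions back to category names after reloading.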


Conclusion:

Hence, we performed text cleaning, lemmatization, stop-word removal, and label encoding, and created
representations using TF-IDF.

Viva Questions:
1. What is lemmatization?

2. Differentiate between lemmatization and stemming.

3. How is TF-IDF calculated?

4. What is the pickle library?

5. How is text cleaning performed?

Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:


Group 1

Assignment No 4

Title of the Assignment:


Create a transformer from scratch using the PyTorch library.

Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).

CO Relevance: CO1
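
As a starting point, the core of a transformer, multi-head scaled dot-product self-attention wrapped in a single encoder block, can be sketched as follows. This is a minimal illustration assuming PyTorch is installed, not a complete transformer: token embeddings, positional encodings, masking, dropout, and the decoder are omitted. The class names and the sizes (d_model=32, num_heads=4, d_ff=64) are arbitrary choices for this example.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention over several heads."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # joint projection for Q, K, V
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # Attention scores: similarity of every query with every key, scaled by sqrt(d_head).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        # Weighted sum of values, then merge the heads back together.
        context = (weights @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out(context), weights


class EncoderBlock(nn.Module):
    """Self-attention and feed-forward sublayers, each with residual + LayerNorm."""

    def __init__(self, d_model=32, num_heads=4, d_ff=64):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, weights = self.attn(x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x, weights


torch.manual_seed(0)
x = torch.randn(2, 5, 32)  # (batch, sequence length, d_model) of random "embeddings"
block = EncoderBlock()
y, attn_weights = block(x)
print(y.shape, attn_weights.shape)
```

The block keeps the input shape unchanged, which is what allows several such blocks to be stacked into a full encoder.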



Conclusion:

Thus, we implemented a transformer from scratch using the PyTorch library.

Viva Questions:
• What are the types of embedding in a transformer?
• What are the applications of transformers?
• What is the purpose of transformers in NLP?
• What is a transformer?
• What is self-attention?

Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator :


Group 1
Assignment No 5

Title of the Assignment:


Morphology is the study of the way words are built up from smaller meaning-bearing units. Study and understand
the concepts of morphology by the use of an Add-Delete table.

Objective of the Assignment: To understand the fundamental concepts and techniques of natural language
processing (NLP).

CO Relevance: CO1


Morphology:
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e., morphemes.
A morpheme is the smallest meaningful linguistic unit. For example:

• बच्चों (bachchoM) consists of two morphemes: बच्चा (bachchaa), which carries the information of the root
noun "बच्चा" (bachchaa), and ओं (oM), which carries the information of plural number and oblique case.
• "played" has two morphemes, "play" and "-ed", which carry the information verb "play" and "past tense",
so the word is the past tense form of the verb "play".

Words can be analysed morphologically if we know all variants of a given root word. We can use an 'Add-
Delete' table for this analysis.

Morph Analyser

Definition:

Morphemes are the smallest meaningful units of a language. A morpheme can be either a root word (play) or
an affix (-ed). Combining morphemes is called a morphological process, so the word "played" is made of
two morphemes, "play" and "-ed". Finding all the parts (morphemes) of a word, and thereby describing the
properties of the word, is called "Morphological Analysis". For example, "played" carries the
information verb "play" and "past tense", so the word is the past tense form of the verb "play".

Analysis of a word:

बच्चों (bachchoM) = बच्चा (bachchaa) (root) + ओं (oM) (suffix) (ओं = 3 plural oblique)

A linguistic paradigm is the complete set of variants of a given lexeme. These variants can be classified
according to shared inflectional categories (e.g., number, case) and arranged into tables.


Paradigm for बच्चा

Case/Number   Singular             Plural
Direct        बच्चा (bachchaa)     बच्चे (bachche)
Oblique       बच्चे (bachche)      बच्चों (bachchoM)

Algorithm to get बच्चों (bachchoM) from बच्चा (bachchaa)

1. Take the root बच्च (bachch) + आ (aa)

2. Delete आ (aa)

3. Output बच्च (bachch)

4. Add ओं (oM) to the output

5. Return बच्चों (bachchoM)

Therefore, आ is deleted and ओं is added to get बच्चों.

Add-Delete table for बच्चा

Delete     Add        Number   Case   Variant
आ (aa)     आ (aa)     sing     dr     बच्चा (bachchaa)
आ (aa)     ए (e)      plu      dr     बच्चे (bachche)
आ (aa)     ए (e)      sing     ob     बच्चे (bachche)
आ (aa)     ओं (oM)    plu      ob     बच्चों (bachchoM)

Paradigm Class

Words in the same paradigm class behave similarly. For example, लड़क is in the same paradigm class as
बच्च, so लड़का takes the same set of variants as बच्चा.
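
The Add-Delete procedure can be sketched in a few lines of Python. To avoid Unicode vowel-sign handling, the sketch works on the romanized forms used in the text (bachchaa, bachche, bachchoM). The function names are hypothetical, and "laDakaa" is an assumed romanization of लड़का chosen for this example.

```python
# Add-Delete table for the paradigm class of "bachchaa", in romanized form.
# Each row: (suffix to delete, suffix to add, number, case)
ADD_DELETE_TABLE = [
    ("aa", "aa", "sing", "dr"),
    ("aa", "e", "plu", "dr"),
    ("aa", "e", "sing", "ob"),
    ("aa", "oM", "plu", "ob"),
]


def generate_forms(root_word, table=ADD_DELETE_TABLE):
    """Apply every table row to the root word: delete one suffix, add another."""
    forms = {}
    for delete, add, number, case in table:
        if not root_word.endswith(delete):
            raise ValueError(f"{root_word!r} does not end with {delete!r}")
        stem = root_word[: len(root_word) - len(delete)]
        forms[(number, case)] = stem + add
    return forms


def analyse(word, root_word, table=ADD_DELETE_TABLE):
    """Return every (number, case) reading under which root_word yields word."""
    forms = generate_forms(root_word, table)
    return [features for features, form in forms.items() if form == word]


bachchaa_forms = generate_forms("bachchaa")
# Assumed romanization of a word from the same paradigm class.
ladakaa_forms = generate_forms("laDakaa")

print(bachchaa_forms)
print(ladakaa_forms)
print(analyse("bachche", "bachchaa"))
```

Because both words belong to the same paradigm class, one table serves both. The analysis of "bachche" returns two readings (direct plural and oblique singular), which shows that a surface form can be morphologically ambiguous.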


Conclusion: Thus, we have understood the morphology of a word by using an Add-Delete table.

Viva Questions:

• What is morphology?

• What are the types of morphology?

• Why do we need to do morphological analysis?

• What is the application of morphology in linguistics?

• What is the difference between inflectional and derivational morphology?

Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator :


Group 2

Mini-Project

Title of the Assignment:


1. Mini Project (Fine-tune transformers on your preferred task)
Fine-tune a pretrained transformer for any of the following tasks on any relevant dataset of your choice:
• Neural Machine Translation
• Classification
• Summarization

2. Mini Project - POS Taggers for Indian Languages

3. Mini Project - Feature Extraction using seven moment variants

4. Mini Project - Feature Extraction using Zernike Moments

Objective of the Assignment: To understand the concept of a mini project.

Outcome: Students will be able to learn and understand the concept of a project.


Theory:

Abstract:

Introduction:

Software Requirement Specification:

Graphical User Interface:

Source Code:

Testing document:

Conclusion:

Date:
Marks obtained:
Sign of course coordinator:
Name of course Coordinator:
