LabManual Data Mining Even 2023
Data Mining
(3160714)
Enrolment No 200170107049
Name Patel Tanmay Anilkumar
Branch Computer Engineering
Academic Term 2022-2023
Institute Name VGEC
The main motto of any laboratory/practical/field work is to enhance the required skills and to create in students the ability to solve real-time problems by developing relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for engineering degree programs in which sufficient weightage is given to practical work. This underlines the importance of skill enhancement among students, and it encourages students, instructors and faculty members to utilize every second of the time allotted for practicals to achieve the relevant outcomes by performing the experiments, rather than having merely study-type experiments. For effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical is carefully designed to serve as a tool to develop and enhance, in every student, the relevant competencies required by industry. These psychomotor skills are very difficult to develop through the traditional chalk-and-board content delivery method in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than the old practice of conducting practicals merely to prove a concept or theory.
By using this lab manual, students can go through the relevant theory and procedure in advance of the actual session, which creates interest and gives them a basic idea prior to performance. This, in turn, enhances the achievement of the pre-determined outcomes. Each experiment in this manual begins with the competency, industry-relevant skills, course outcomes and practical outcomes (objectives). Students will also learn the safety measures and necessary precautions to be taken while performing the practicals.
This manual also provides guidelines to faculty members to facilitate student-centric lab activities through each experiment, by arranging and managing the necessary resources so that students follow the procedures with the required safety and necessary precautions to achieve the outcomes. It also indicates, by providing rubrics, how students will be assessed.
Data mining is key to sentiment analysis, price optimization, database marketing, credit risk
management, training and support, fraud detection, healthcare and medical diagnoses, risk
assessment, recommendation systems and much more. It can be an effective tool in just about
any industry, including retail, wholesale distribution, service industries, telecom,
communications, insurance, education, manufacturing, healthcare, banking, science,
engineering, and online marketing or social media.
Utmost care has been taken while preparing this lab manual; however, there is always scope for improvement. We therefore welcome constructive suggestions for improvement and notification of any errors.
Vishwakarma Government Engineering College
Department of Computer Engineering
CERTIFICATE
Department of this Institute (GTU Code: 017) has satisfactorily completed the Practical /
Tutorial work for the subject Data Mining (3160714) for the academic year 2022-23.
Place: ___________
Date: ___________
Data Mining (3160714) 200170107049
DTE’s Vision
Institute’s Vision
Institute’s Mission
Department’s Vision
Department’s Mission
Sr. No. Objective(s) of Experiment (mapped COs shown as √ against CO1–CO5)
1. Identify how data mining is an interdisciplinary field by an application. √
2. Write programs to perform the following tasks of preprocessing (any language): 2.1 Noisy data handling (Equal Width Binning; Equal Frequency/Depth Binning); 2.2 Normalization techniques (min-max normalization; z-score normalization; decimal scaling); 2.3 Implement the data dispersion measure Five Number Summary and generate a box plot using Python libraries. √
3. To perform hands-on experiments of data preprocessing with sample data on the Orange tool. √ √
4. Implement the Apriori algorithm of the association rule data mining technique in any programming language. √
5. Apply the association rule data mining technique on sample data sets using the XLMiner Analysis Tool. √ √
6. Apply the classification data mining technique on sample data sets in Weka. √ √
7. 7.1 Implement a classification technique with quality measures in any programming language. 7.2 Implement a regression technique in any programming language. √
8. Apply the K-means clustering algorithm in any programming language. √ √
9. Perform a hands-on experiment on any advanced mining technique using an appropriate tool. √
10. Solve a real-world problem using data mining techniques with the Python programming language. √
1. Teachers should provide guidelines, with a demonstration of the practical, to the students with all features.
2. Teachers shall explain the basic concepts/theory related to the experiment to the students before the start of each practical.
3. Involve all students in the performance of each experiment.
4. Teachers are expected to share the skills and competencies to be developed in the students and ensure that the respective skills and competencies are developed after completion of the experimentation.
5. Teachers should give students the opportunity for hands-on experience after the demonstration.
6. Teachers may provide additional knowledge and skills to the students, even if not covered in the manual, as expected from the students by the concerned industry.
7. Give practical assignments and assess the performance of students based on the tasks assigned, to check whether they are as per the instructions or not.
8. Teachers are expected to refer to the complete curriculum of the course and follow the guidelines for implementation.
1. Students are expected to listen carefully to all the theory classes delivered by the faculty members and understand the COs, the content of the course, the teaching and examination scheme, the skill set to be developed, etc.
2. Students will have to perform the experiments as per the practical list given.
3. Students have to show the output of each program in their practical file.
4. Students are instructed to submit the practical list as per the given sample list shown on the next page.
5. Students should develop a habit of submitting the experimentation work as per the schedule, and they should be well prepared for the same.
Index
(Progressive Assessment Sheet)
Sr. No. | Objective(s) of Experiment | Page No. | Date of performance | Date of submission | Assessment Marks | Sign. of Teacher with date | Remarks
1 Identify how data mining is an interdisciplinary
field by an Application.
2 Write programs to perform the following tasks
of preprocessing (any language).
2.1 Noisy data handling
Equal Width Binning
Equal Frequency/Depth Binning
2.2 Normalization Techniques
Min max normalization
Z score normalization
Decimal scaling
2.3. Implement data dispersion measure Five
Number Summary generate box plot using
python libraries
3 To perform hand on experiments of data
preprocessing with sample data on Orange tool.
4 Implement Apriori algorithm of association rule
data mining technique in any Programming
language.
5 Apply association rule data mining technique on
sample data sets using XL Miner Analysis Tool.
6 Apply Classification data mining technique on
sample data sets in Weka.
7 7.1. Implement Classification technique with
quality Measures in any Programming language.
7.2. Implement Regression technique in any
Programming language.
8 Apply K-means Clustering Algorithm any
Programming language.
9 Perform hands on experiment on any advance
mining Techniques Using Appropriate Tool.
10 Solve Real world problem using Data Mining
Techniques using Python Programming
Language.
Total
Experiment No - 1
Aim: Identify how data mining is an interdisciplinary field by an Application.
Data mining is an interdisciplinary field that involves computer science, statistics, mathematics, and domain-specific knowledge. One application that showcases the interdisciplinary nature of data mining is a car price prediction system.
Date:
Theory:
Car price prediction systems are used to predict the prices of cars based on various factors such as brand, specifications, features, and market trends. These systems are valuable for both consumers and sellers, allowing them to make informed decisions about purchasing or selling cars. The process of creating a car price prediction system involves the following steps:
Dataset: A car price prediction system requires a dataset that contains information about cars and their attributes. Here are some examples of datasets:
Car Prices: A dataset of car prices collected from various sources such as online marketplaces and retailers. It contains information about the brand, model, specifications, and price of cars.
Car Specifications: A dataset of car specifications collected from various sources such as manufacturer websites and online retailers. It contains information about the company, segment, engine, display and other features of cars.
Preprocessing: This involves cleaning and transforming the data to make it suitable for analysis. Here are some preprocessing techniques commonly used in car price prediction systems:
Data Cleaning: This involves removing missing or irrelevant data, correcting errors, and removing duplicates. For example, if a car has missing information such as its top speed, it may be removed from the dataset or the information may be imputed.
Data Normalization: This involves scaling the data to a common range or standard deviation. For example, prices from different retailers may be normalized to a common currency or a common range of values.
Data Transformation: This involves transforming the data into a format suitable for analysis. For example, car brands may be encoded as binary variables to enable analysis using machine learning algorithms.
Feature Generation: This involves creating new features from the existing data that may be useful for analysis. For example, the age of a car may be calculated from its release date.
Data Reduction: This involves reducing the dimensionality of the data to improve processing efficiency and reduce noise. For example, principal component analysis (PCA) may be used to identify the most important features in the dataset.
These preprocessing techniques help to ensure that the data is clean, normalized, and transformed in a way that enables accurate analysis and prediction of car prices.
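As a small illustrative sketch of the encoding step described above (the brand names and prices below are hypothetical, not taken from any dataset used in this manual), car brands can be one-hot encoded in plain Python:

```python
# One-hot (binary) encoding of a categorical "brand" attribute.
# The brand values and prices below are hypothetical examples.
cars = [
    {"brand": "Maruti", "price": 550000},
    {"brand": "Hyundai", "price": 720000},
    {"brand": "Maruti", "price": 610000},
]

brands = sorted({c["brand"] for c in cars})
encoded = []
for c in cars:
    # one binary column per distinct brand, 1 where the brand matches
    row = {"brand_" + b: int(c["brand"] == b) for b in brands}
    row["price"] = c["price"]
    encoded.append(row)

print(encoded[0])
```

Each categorical value becomes its own 0/1 column, so distance-based and linear models can consume the data directly.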
Data Mining Techniques: Association rule mining, clustering, and classification are all data mining techniques that can be applied to a car price prediction system. Here is a brief overview of how each of these techniques can be used:
Association Rule Mining: Association rule mining is a data mining technique used to find associations or relationships among variables in large datasets. In the context of car price prediction, association rule mining can be used to identify patterns and relationships between different features that might affect the price of a car. For example, the technique can be used to find out whether the brand, features, or segment type are related to the car price. These associations can then be used to make predictions about the price of a car with similar features.
Clustering: Clustering is a data mining technique used to group similar data points or objects together based on their similarities or differences. In the context of car price prediction, clustering can be used to group cars with similar features together, such as cars with a similar colour, engine, or segment type. Clustering can help in identifying the different price ranges for cars with similar features, which can be useful in predicting the price of a car based on its features.
Classification: Classification is a data mining technique used to categorize data points or objects into pre-defined classes based on their characteristics or features. In the context of car price prediction, classification can be used to classify cars into different price ranges based on their features, such as engine, segment type, and other attributes. This technique can also be used to predict the price range of a car based on its features, which can be useful in making pricing decisions.
Observations:
In a Car price prediction system, data mining techniques are used to analyze large amounts of data
about cars and generate accurate price predictions. These systems can be used by consumers to
make informed decisions about purchasing cars, and by sellers to set prices that are competitive and
profitable.
Conclusion: Car price prediction systems provide a compelling example of how data mining can be
used to analyze and predict trends in the market. By using these systems, consumers and sellers can
make informed decisions that are based on accurate and up-to-date information.
Quiz:
(1) What are the different preprocessing techniques that can be applied to a dataset?
(2) What is the use of data mining techniques in a particular system?
Suggested References:
1. Han, J., & Kamber, M. (2011). Data mining: concepts and techniques.
2. https://www.kaggle.com/code/rounakbanik/movie-recommender-systems
Rubrics: Knowledge (2) | Problem Recognition (2) | Team Work (2) | Completeness and accuracy (2) | Ethics (2) | Total
Each rubric is graded Good (2) or Avg. (1).
Marks
Experiment No - 2
Aim: Write programs to perform the following tasks of preprocessing (any language).
2.1 Noisy data handling
Equal Width Binning
Equal Frequency/Depth Binning
2.2 Normalization Techniques
Min max normalization
Z score normalization
Decimal scaling
2.3 Implement the data dispersion measure Five Number Summary and generate a box plot using Python libraries.
Date:
Theory:
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they perform local smoothing.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by median:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
In equal-width binning, the bins have equal width; the bin boundaries are defined as [min + w], [min + 2w], …, [min + Nw], where w = (max − min) / N.
Example:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
W = (215 − 5) / 3 = 70
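The equal-width example above can be verified with a short Python sketch (N = 3 bins, as in the example):

```python
# Equal-width binning of the example data into N = 3 bins of width w = 70
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
N = 3
w = (max(data) - min(data)) / N          # (215 - 5) / 3 = 70
bins = [[] for _ in range(N)]
for x in data:
    # index of the bin this value falls into; the maximum goes in the last bin
    idx = min(int((x - min(data)) // w), N - 1)
    bins[idx].append(x)
print(bins)  # -> [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
```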
Normalization techniques are used in data preprocessing to scale numerical data to a common
range. Here are three commonly used normalization techniques:
The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. In general, expressing an attribute in smaller units leads to a larger range for that attribute, and thus tends to give such an attribute greater effect or “weight.” To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range such as [−1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing, although in statistics the latter term also has other connotations.) Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements, such as nearest-neighbor classification and clustering, and for the neural-network backpropagation algorithm, where normalizing the inputs speeds up the learning phase. There are many methods for data normalization; we focus on min-max normalization, z-score normalization, and normalization by decimal scaling.
Min-Max Normalization: This technique scales the data to a range of 0 to 1. The formula for min-
max normalization is:
X_norm = (X - X_min) / (X_max - X_min)
where X is the original data, X_min is the minimum value in the dataset, and X_max is the
maximum value in the dataset.
Z-Score Normalization: This technique scales the data to have a mean of 0 and a standard
deviation of 1. The formula for z-score normalization is:
X_norm = (X - X_mean) / X_std
where X is the original data, X_mean is the mean of the dataset, and X_std is the standard deviation
of the dataset.
Decimal Scaling: This technique scales the data by moving the decimal point. The formula for decimal scaling is:
X_norm = X / 10^j
where X is the original data and j is the smallest integer such that max(|X_norm|) < 1.
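The three formulas can be tried together on a small illustrative sample (the values below are chosen only for demonstration):

```python
from statistics import mean, stdev

X = [200, 300, 400, 600, 1000]

# Min-max normalization to [0, 1]
x_min, x_max = min(X), max(X)
min_max = [(x - x_min) / (x_max - x_min) for x in X]

# Z-score normalization (sample standard deviation)
mu, sigma = mean(X), stdev(X)
z_score = [(x - mu) / sigma for x in X]

# Decimal scaling: j is the smallest integer with max(|x| / 10**j) < 1
j = 0
while max(abs(x) for x in X) / 10 ** j >= 1:
    j += 1
decimal = [x / 10 ** j for x in X]

print(min_max)   # first value 0.0, last value 1.0
print(decimal)   # here j = 4, so [0.02, 0.03, 0.04, 0.06, 0.1]
```

Note how the three results live on different scales: min-max is bounded to [0, 1], z-scores are centered at 0, and decimal scaling keeps the relative magnitudes of the raw values.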
2.3. Implement data dispersion measure Five Number Summary generate box plot using
python libraries
Let’s understand this with the help of an example. Suppose we have the data:
11, 23, 32, 26, 16, 19, 30, 14, 16, 10
First of all, we arrange the data points in ascending order and then calculate the summary:
10, 11, 14, 16, 16, 19, 23, 26, 30, 32
Minimum value: 10
25th percentile (Q1): 14
Calculation of 25th percentile: (25/100) × (n+1) = (25/100) × 11 = 2.75, i.e. the 3rd value of the data
50th percentile (median, Q2): 17.5
Calculation of 50th percentile: (16 + 19) / 2 = 17.5
75th percentile (Q3): 26
Calculation of 75th percentile: (75/100) × (n+1) = (75/100) × 11 = 8.25, i.e. the 8th value of the data
Maximum value: 32
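The calculation above can be reproduced with a short sketch; it assumes the same rank convention used above, i.e. rank = (p/100) × (n + 1) rounded to the nearest data position:

```python
# Five Number Summary using the rank formula (p/100) * (n + 1),
# rounded to the nearest position in the sorted data
data = sorted([11, 23, 32, 26, 16, 19, 30, 14, 16, 10])
n = len(data)

def percentile(p):
    rank = round((p / 100) * (n + 1))   # 2.75 -> 3rd value, 8.25 -> 8th value
    return data[rank - 1]

q2 = (data[n // 2 - 1] + data[n // 2]) / 2   # median of an even-length list
five_num = [min(data), percentile(25), q2, percentile(75), max(data)]
print(five_num)  # -> [10, 14, 17.5, 26, 32]
```

Library functions such as numpy's percentile use interpolation between ranks by default, so their results can differ slightly from this rank-rounding convention.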
Box plots
Boxplots are a graphical representation of the distribution of the data using the Five Number Summary values. They are one of the most efficient ways to detect outliers in a dataset.
In statistics, an outlier is a data point that differs significantly from other observations. An
outlier may be due to variability in the measurement or it may indicate experimental error;
the latter are sometimes excluded from the dataset. An outlier can cause serious problems in
statistical analyses.
Program
Code:
#2.1 Noisy Data Handling
# Generate random numbers
import random
from statistics import mean, median

bins = 3  # number of bins (value assumed; the original listing did not show it)

data = random.sample(range(10, 100), 20)
data = sorted(data)
print("Random data sample: ", data)

# --- Equal-width binning ---
equal_width = []
min_val = min(data)
max_val = max(data)
diff_val = (max_val - min_val) // bins

def range_val(j, limit):
    # collect values from position j that fall at or below the bin's upper limit
    d = []
    while j < len(data) and data[j] <= limit:
        d.append(data[j])
        j = j + 1
    return j, d

j = 0
for i in range(1, bins + 1):
    # the last bin's limit is max_val so the maximum value is not left out
    limit = max_val if i == bins else min_val + (i * diff_val)
    j, val = range_val(j, limit)
    equal_width.append(val)
print("Equal-width bins: ", equal_width)

# --- Equal-frequency (depth) binning: the last bin takes any remainder ---
freq = len(data) // bins
equal_freq = [data[i * freq:(i + 1) * freq] if i < bins - 1 else data[i * freq:]
              for i in range(bins)]
print("Equal-frequency bins: ", equal_freq)

# --- Smoothing ---
def smooth_mean(data):
    smooth_data = []
    for i in range(bins):
        mean_data = mean(data[i])
        smooth_data.append([mean_data for j in range(len(data[i]))])
    return smooth_data

def smooth_median(data):
    smooth_data = []
    for i in range(bins):
        median_data = median(data[i])
        smooth_data.append([median_data for j in range(len(data[i]))])
    return smooth_data

def smooth_bound(data):
    # replace each interior value by the nearer bin boundary
    smooth_data = []
    for i in range(bins):
        d = []
        d.append(data[i][0])
        for j in range(1, len(data[i]) - 1):
            min_d = min(data[i])
            max_d = max(data[i])
            if (data[i][j] - min_d) <= (max_d - data[i][j]):
                d.append(min_d)
            else:
                d.append(max_d)
        d.append(data[i][-1])
        smooth_data.append(d)
    return smooth_data

print("Smoothing by mean: ", smooth_mean(equal_freq))
print("Smoothing by median: ", smooth_median(equal_freq))
print("Smoothing by boundaries: ", smooth_bound(equal_freq))
Code:
# 2.2 Normalization Techniques
# Min max normalization
# Z score normalization
# Decimal scaling
from statistics import mean, stdev
import random

data = sorted(random.sample(range(10, 100), 20))
min_data = min(data)
max_data = max(data)

# 1. min-max normalization to the new range [new_min, new_max]
new_min = 0.0
new_max = 1.0

def min_max(x, new_min, new_max):
    return (x - min_data) / (max_data - min_data) * (new_max - new_min) + new_min

min_max_norm = [min_max(i, new_min, new_max) for i in data]
print('Min-max norm: ')
print(min_max_norm)

# 2. Z score normalization
data_mean = mean(data)
data_std = stdev(data)
z_score_norm = [(i - data_mean) / data_std for i in data]
print('Z-score norm: ')
print(z_score_norm)

# 3. Decimal scaling: divide by 10^j, with j the smallest integer
# such that all scaled absolute values are below 1
j = len(str(max(abs(min_data), abs(max_data))))
decimal_norm = [i / 10 ** j for i in data]
print('Decimal scaling norm: ')
print(decimal_norm)
# 2.3 Implement data dispersion measure Five Number Summary and generate box plot using python libraries
from statistics import median
import seaborn as sns
import matplotlib.pyplot as plt

data = sorted(data)  # reuse the sample generated above
n = len(data)
ind = n // 2

# lower/upper halves for the quartiles; exclude the middle element when n is odd
if n % 2 == 0:
    Q1 = median(data[:ind])
    Q3 = median(data[ind:])
else:
    Q1 = median(data[:ind])
    Q3 = median(data[ind + 1:])
Q2 = median(data)
print('Five Number Summary:', min(data), Q1, Q2, Q3, max(data))

sns.boxplot(data)
plt.show()
Observations:
Conclusion:
Binning, normalization techniques and the Five Number Summary are all important tools in data preprocessing that help prepare data for data mining tasks.
Quiz:
(1) What is the Five Number Summary? How do you generate a box plot using Python libraries?
(2) What are normalization techniques?
(3) What are the different smoothing techniques?
Suggested Reference:
https://www.geeksforgeeks.org/binning-in-data-mining/
https://www.geeksforgeeks.org/data-normalization-in-data-mining/
https://stackoverflow.com/questions/53388096/generate-box-plot-from-5-number-summary-
min-max-quantiles
Rubrics: Knowledge (2) | Problem Recognition (2) | Logic Building (2) | Completeness and accuracy (2) | Ethics (2) | Total
Each rubric is graded Good (2) or Avg. (1).
Marks
Experiment No - 3
Aim: To perform hands-on experiments of data preprocessing with sample data on the Orange tool.
Date:
Demonstration of Tool:
The Preprocess widget preprocesses data with selected methods.
Inputs: Data (input dataset).
Outputs: Preprocessor (preprocessing method); Preprocessed Data (data preprocessed with the selected methods).
Preprocessing is crucial for achieving better-quality analysis results. The Preprocess widget offers several preprocessing methods that can be combined in a single preprocessing pipeline. Some methods are available as separate widgets, which offer advanced techniques and greater parameter tuning.
1. List of preprocessors. Double click the preprocessors you wish to use and shuffle their
order by dragging them up or down. You can also add preprocessors by dragging them
from the left menu to the right.
2. Preprocessing pipeline.
3. When the box is ticked (Send Automatically), the widget will communicate changes
automatically. Alternatively, click Send.
⮚ Preprocessing Techniques:
Conclusion: Orange is a powerful open-source data analysis and visualization tool for machine
learning and data mining tasks. It provides a wide variety of functionalities including data
visualization, data preprocessing, feature selection, classification, regression, clustering, and more.
Its user-friendly interface and drag-and-drop workflow make it easy for non-experts to work with
and understand machine learning concepts.
Quiz:
Suggested Reference:
1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufman
2. https://orangedatamining.com/docs/
Marks
Experiment No - 4
Aim: Implement Apriori algorithm of association rule data mining technique in any Programming
language.
Date:
Objectives: To implement basic logic for association rule mining algorithm with support and
confidence measures.
.
Equipment/Instruments: Personal Computer, open-source software for programming
Program:
Code:
# Define the dataset
transactions = [
    ["I1", "I2", "I5"],
    ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"], ["I1", "I3"],
    ["I2", "I3"], ["I1", "I3"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"]
]
min_support = 2  # minimum support count (value assumed; not shown in the original listing)

# Level 1: candidate 1-itemsets
items = sorted({item for t in transactions for item in t})
candidate_itemsets = [frozenset([item]) for item in items]
frequent_itemsets = {}
k = 1

while candidate_itemsets:
    # Calculate support for candidate itemsets and remove those that don't meet
    # the minimum support threshold
    itemset_counts = {itemset: 0 for itemset in candidate_itemsets}
    for transaction in transactions:
        for itemset in itemset_counts.keys():
            if itemset.issubset(transaction):
                itemset_counts[itemset] += 1
    current = {itemset: count for itemset, count in itemset_counts.items()
               if count >= min_support}
    frequent_itemsets.update(current)
    # Join step: generate candidate (k+1)-itemsets from the frequent k-itemsets
    keys = list(current)
    candidate_itemsets = list({a | b for a in keys for b in keys if len(a | b) == k + 1})
    # Increment k
    k += 1

for itemset, count in frequent_itemsets.items():
    print(sorted(itemset), "support =", count)
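Association rules are then derived from the frequent itemsets: the confidence of a rule X → Y is support(X ∪ Y) / support(X). A self-contained sketch on the same transactions, using the rule {I1, I2} → {I5} as an example:

```python
# Confidence of the rule {I1, I2} -> {I5} on the same transaction data
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]

def support_count(itemset):
    # number of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset.issubset(t))

X = frozenset(["I1", "I2"])
Y = frozenset(["I5"])
confidence = support_count(X | Y) / support_count(X)
print(f"{sorted(X)} -> {sorted(Y)}: confidence = {confidence:.2f}")  # 2/4 = 0.50
```

A rule is reported only when its confidence also clears a minimum confidence threshold, alongside the minimum support.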
Observations:
Output:
Conclusion:
Apriori algorithm is an effective and widely used approach for discovering frequent itemsets and
association rules in large transaction datasets. It has been used in various applications such as
market basket analysis, customer segmentation, and web usage mining.
Quiz:
Suggested Reference:
https://www.geeksforgeeks.org/apriori-algorithm/
Rubric wise marks obtained:
Rubrics: Knowledge (2) | Problem Recognition (2) | Logic Building (2) | Completeness and accuracy (2) | Ethics (2) | Total
Each rubric is graded Good (2) or Average (1).
Marks
Experiment No - 5
Aim: Apply association rule data mining technique on sample data sets using Weka
Analysis Tool.
Date:
Demonstration of Tool:
1. Open the Weka Analysis Tool and load your dataset. For this example, we will use the
"supermarket.arff" dataset which contains information about customers' purchases at a
supermarket.
2. Preprocess the dataset by selecting the "Filter" tab and choosing the "Nominal to Binary"
filter. This will convert the nominal attributes in the dataset to binary ones, which is
necessary for association rule mining.
3. Select the "Associate" tab and choose the "Apriori" algorithm from the list of association
rule algorithms.
4. Set the minimum support and confidence values for the algorithm. For this example, we will
set the minimum support to 0.2 and the minimum confidence to 0.5.
5. Click on the "Start" button to run the algorithm. The results will be displayed in the output
window, showing the generated association rules based on the selected support and
confidence values.
6. Analyze the generated association rules to identify interesting patterns and insights. For
example, you may find that customers who buy bread are more likely to buy milk, or that
customers who buy vegetables are less likely to buy junk food.
7. You can further refine your analysis by adjusting the support and confidence values, or by
using other association rule algorithms such as FP-Growth or Eclat.
Observations: NA
Conclusion:
One of the key strengths of WEKA is its wide range of data mining techniques, including decision
trees, neural networks, clustering, and association rules, among others. These techniques are
accessible through an intuitive graphical user interface (GUI), which allows users to easily build
models and analyze data without needing advanced programming skills.
Another advantage of WEKA is its interoperability with widely used spreadsheet software: data can be prepared and manipulated in a spreadsheet, exported to CSV, and then loaded into WEKA, combining the spreadsheet's built-in data management features with WEKA's advanced analytics capabilities.
Quiz:
Suggested Reference:
Marks
Experiment No - 6
Aim: Apply Classification data mining technique on sample data sets in WEKA.
Date:
Demonstration of Tool:
WEKA:
Now we will perform data mining techniques on a sample data set (in ARFF format) available in WEKA.
We will complete the process in the following steps.
Step-1:
First open the Weka app, and open the “Explorer” tab from the menu bar.
Step-2:
Now we will load our sample data set, Weather-Nominal.arff, from the data directory under the weka folder in our system.
Step-3:
Now we can visualize our sample datasets available in WEKA.
Step-4:
Now we can use the tools available in WEKA to partition our sample data into training data and testing data, and we can print our outcomes.
Conclusion:
Weka is a widely used and highly regarded data mining and machine learning tool that provides a
comprehensive suite of data preprocessing, classification, regression, clustering, and association
rule mining algorithms. It is an open-source software that is available for free and is written in Java,
making it platform-independent and easily accessible.
One of the key strengths of Weka is its extensive set of machine learning algorithms, which can be
easily applied to various types of data and problems. It offers a wide range of algorithms, including
decision trees, support vector machines, neural networks, random forests, and others, which are
supported by a comprehensive set of evaluation metrics and visualization tools.
Quiz:
Suggested Reference:
Rubric wise marks obtained:
Rubrics: Knowledge (2) | Problem Recognition (2) | Tool Usage (2) | Demonstration (2) | Ethics (2) | Total
Each rubric is graded Good (2) or Average (1).
Marks
Experiment No - 7
Aim: 7.1 Implement Classification technique with quality measures in any programming language.
7.2 Implement Regression technique in any programming language.
Date:
Code (7.1) :
Describe our data set:
Scaling our data set:
Logistic Regression:
Random Forest:
Decision Tree:
Naïve Bayes:
Code (7.2) :
Import Libraries:
Scaling:
Linear Regression:
Random Forest:
Ridge Regression:
Lasso Regression:
Decision Tree:
Observations:
The classification models (Logistic Regression, Random Forest, Decision Tree, and Naïve Bayes) achieve accuracy scores ranging from 74% to 79%, with mean absolute errors from 20% to 24%.
The regression models (Linear, Ridge, Lasso, Decision Tree, and Random Forest) give mean squared errors ranging from 8% to 24%.
Conclusion:
Classification models are used to classify data into different categories or classes based on certain
features or attributes. This can be useful in a variety of applications, such as image recognition,
spam filtering, or fraud detection. Commonly used classification models include decision trees,
logistic regression, and naive Bayes classifiers.
Regression models, on the other hand, are used to predict a numerical value based on input features.
For example, a regression model might be used to predict the price of a house based on its size,
location, and other features. Popular regression models include linear regression, polynomial
regression, and decision trees.
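To make the regression idea concrete, here is a standalone sketch (not the notebook code from this experiment) that fits a line y = a + b·x by ordinary least squares on illustrative data:

```python
# Ordinary least squares for y = a + b*x, fitted on illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.1, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope = covariance(x, y) / variance(x); intercept from the means
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

pred = [a + b * x for x in xs]
mse = sum((p - y) ** 2 for p, y in zip(pred, ys)) / n
print(f"y = {a:.2f} + {b:.2f}x, MSE = {mse:.4f}")
```

The same mean-squared-error measure reported in the observations above is what this sketch computes for its fitted line.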
Quiz:
Problem Completeness
Knowledge Logic
Recognition and accuracy Ethics (2)
Rubrics (2) Building (2) Total
(2) (2)
Good Average Good Average Good Average Good Average Good Average
(2) (1) (2) (1) (2) (1) (2) (1) (2) (1)
Marks
Experiment No - 8
Aim: Apply K-means Clustering Algorithm in any programming language.
Date:
Program:
Code:
import matplotlib.pyplot as plt

# Sample 2-D points
data = [[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]]

# Start with the first two points as the initial centroids (k = 2)
k1 = [1]          # 1-based indices of points in cluster 1
k2 = [2]          # 1-based indices of points in cluster 2
c1 = data[0]
c2 = data[1]

# Single pass: assign each remaining point to the nearest centroid
# (Euclidean distance) and move that centroid to the midpoint of
# itself and the newly assigned point
for i in range(2, len(data)):
    E1 = ((data[i][0] - c1[0]) ** 2 + (data[i][1] - c1[1]) ** 2) ** 0.5
    E2 = ((data[i][0] - c2[0]) ** 2 + (data[i][1] - c2[1]) ** 2) ** 0.5
    if E1 < E2:
        k1.append(i + 1)
        c1 = [(c1[0] + data[i][0]) / 2, (c1[1] + data[i][1]) / 2]
    else:
        k2.append(i + 1)
        c2 = [(c2[0] + data[i][0]) / 2, (c2[1] + data[i][1]) / 2]

print("Cluster 1:", k1)
print("Cluster 2:", k2)

plt.scatter([data[i-1][0] for i in k1], [data[i-1][1] for i in k1], marker="*", label='Cluster 1')
plt.scatter([data[i-1][0] for i in k2], [data[i-1][1] for i in k2], label='Cluster 2')
plt.legend()
plt.show()
Observations:
Output:
Conclusion:
One of the key advantages of k-means is its scalability, as it can efficiently handle large datasets
with high-dimensional features. However, it also has some limitations, such as its sensitivity to
initial centroid positions, and its tendency to converge to local optima.
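For comparison with the single-pass program above, the standard iterative k-means (Lloyd's algorithm) can be sketched on the same points; the initial centroids are again the first two data points:

```python
# Standard iterative k-means (Lloyd's algorithm), k = 2, on the same points
data = [[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]]
centroids = [data[0][:], data[1][:]]

def dist2(p, q):
    # squared Euclidean distance (ordering is the same as for the true distance)
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

for _ in range(100):                      # iterate until the centroids stop moving
    clusters = [[], []]
    for p in data:
        nearest = min(range(2), key=lambda c: dist2(p, centroids[c]))
        clusters[nearest].append(p)
    new_centroids = [
        [sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)]
        for c in clusters
    ]
    if new_centroids == centroids:        # converged
        break
    centroids = new_centroids

print("Centroids:", centroids)
print("Clusters:", clusters)
```

Unlike the single-pass version, every point is reassigned on each iteration, and each centroid is recomputed as the mean of its whole cluster, so the result does not depend on the order in which points are visited.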
Quiz:
(1) What are the different distance measures?
(2) What do you mean by centroid in K-means Algorithm?
Suggested Reference:
J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann
References used by the students:
https://www.youtube.com/watch?v=CLKW6uWJtTc&ab_channel=5MinutesEngineering
Rubric wise marks obtained:
Rubrics (each graded Good = 2 or Average = 1):
(1) Knowledge (2)
(2) Problem Recognition (2)
(3) Logic Building (2)
(4) Completeness and accuracy (2)
(5) Ethics (2)
Marks:                Total:
Experiment No - 9
Aim: Perform a hands-on experiment on any advanced mining technique using an appropriate tool.
Date:
Competency and Practical Skills: Exploration and Understanding of Tool
Objectives:
1) Improve understanding of advanced mining techniques like text mining, stream
mining, and web content mining using an appropriate tool
2) Familiarize with the tool
Equipment/Instruments: Octoparse
Demonstration of Tool:
Web Mining:
Web mining is the application of data mining techniques to extract knowledge from
web data. This web data can take a number of forms: web documents, hyperlinks
between documents, and/or usage logs of websites.
Once you have the extracted information, you can analyze it to derive insights as per your
requirement. For instance, you could align your marketing or sales strategy based on
the results that your web mining throws up.
Since you have access to a lot of data, you have your finger on the market pulse. You
can study customer behaviour patterns to know and understand what customers
want. With this sort of analysis, you can discover internal bottlenecks and
troubleshoot them. Overall, you can stay ahead of everyone in anticipating
industry trends and planning accordingly.
Web Mining Tool
A web mining tool is computer software that uses data mining techniques to identify or
discover patterns from large data sets.
There are various web mining tools available; here is a list of some of them.
(1) Download the Octoparse setup, run it, and then set the destination folder.
As we can see here, there are multiple options available on the left-hand side related to project
creation and management. We just need to enter the URL of the site we want to scrape, and
after the processing is done we get all the data that Octoparse has found.
(4) Here we need to go to the Advanced section and insert the link of the site we want to scrape.
Here I have selected the Amazon website.
(5) After we save this project, Octoparse loads the website on its own and then starts to
auto-scrape it.
(6) After the auto-scraping is done, it generates a report/file of everything it has found and
represents it in a tabular format.
This is the list of all the tags, links, and data that Octoparse has found. Here it has identified 20 items
and 11 columns related to each item.
As we can see in the above figure, all the fields that Octoparse has identified are marked with red
squares.
(7) Now we can generate a report of all the data that has been gathered and use it for our purpose.
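Under the hood, a scraper like this walks the page's HTML and pulls out the fields it recognises. A minimal sketch of that idea using only the Python standard library; the HTML string here is a hard-coded stand-in for a fetched page, not real Amazon markup:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# stand-in for HTML downloaded from a target site
page = '<html><body><a href="/item/1">Item 1</a><a href="/item/2">Item 2</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/item/1', '/item/2']
```

Tools like Octoparse automate exactly this kind of extraction (plus pagination, scheduling, and export) without requiring any code.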
Observations: NA
Conclusion:
Octoparse is a powerful web scraping tool that allows users to extract data from websites without
the need for coding skills. It offers a user-friendly interface and a range of features such as
scheduling, data export, and cloud extraction.
The tool is highly customizable, and users can easily create their own scraping workflows with the
built-in point-and-click editor. Octoparse also provides excellent customer support and a helpful
community forum where users can share their experiences and ask for assistance.
Quiz:
Suggested Reference:
Marks
Experiment No - 10
Aim: Solve a real-world problem using data mining techniques in the Python programming
language.
Date:
Car price prediction systems are used to predict the prices of cars based on various factors such as
brand, specifications, features, and market trends. These systems are valuable for both consumers
and sellers, allowing them to make informed decisions about purchasing or selling cars. The process
of creating a car price prediction system involves the following steps:
Dataset: A car price prediction system requires a dataset that contains information about cars and
their attributes. Here are some examples of datasets:
Car Prices: This is a dataset of car prices collected from various sources such as online
marketplaces and retailers. It contains information about the brand, model, specifications,
and price of cars.
Car Specifications: This is a dataset of car specifications collected from various sources such
as manufacturer websites and online retailers. It contains information about the speed,
engine, segment, colour, and other features of cars.
Preprocessing: It involves cleaning and transforming the data to make it suitable for analysis. Here
are some preprocessing techniques commonly used in car price prediction systems:
Data Cleaning: This involves removing missing or irrelevant data, correcting errors, and
removing duplicates. For example, if a car has missing information such as its engine
capacity, it may be removed from the dataset or the information may be imputed.
Data Normalization: This involves scaling the data to a common range or standard deviation.
For example, prices from different retailers may be normalized to a common currency or a
common range of values.
Data Transformation: This involves transforming the data into a format suitable for analysis.
For example, car brands may be encoded as binary variables to enable analysis using
machine learning algorithms.
Feature Generation: This involves creating new features from the existing data that may be
useful for analysis. For example, the age of the car may be calculated from its release
date, or a power-to-weight ratio may be derived from the engine output and weight.
Data Reduction: This involves reducing the dimensionality of the data to improve processing
efficiency and reduce noise. For example, principal component analysis (PCA) may be used
to identify the most important features in the dataset.
These preprocessing techniques help to ensure that the data is clean, normalized, and transformed in
a way that enables accurate analysis and prediction of car prices.
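The normalization and transformation steps described above can be sketched in plain Python; the brand names and price figures below are illustrative only:

```python
def min_max_scale(values):
    """Scale numeric values (e.g. prices) to the common range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode categorical labels (e.g. car brands) as binary vectors."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

prices = [200000, 500000, 800000]
print(min_max_scale(prices))            # [0.0, 0.5, 1.0]
print(one_hot(['BMW', 'Audi', 'BMW']))  # categories sorted as ['Audi', 'BMW']
```

In practice libraries such as scikit-learn (`MinMaxScaler`, `OneHotEncoder`) perform these steps, but the arithmetic is exactly what is shown here.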
Data Mining Techniques: Association rule mining, clustering, and classification are all data
mining techniques that can be applied to car price prediction systems. Here is a brief overview
of how each of these techniques can be used:
Association Rule Mining: Association rule mining is a data mining technique used to find
associations or relationships among variables in large datasets. In the context of car price
prediction, association rule mining can be used to identify patterns and relationships
between different features that might affect the price of a car. For example, the technique
can be used to find out whether the brand, segment type, engine, body type, or colour
is related to the car price. These associations can then be used to make predictions about
the price of a car with similar features.
Clustering: Clustering is a data mining technique used to group similar data points or
objects together based on their similarities or differences. In the context of car price
prediction, clustering can be used to group cars with similar features together, such as cars
with similar brand, engine, or segment type. Clustering can help in identifying the different
price ranges for cars with similar features, which can be useful in predicting the price of a
car based on its features.
Classification: Classification is a data mining technique used to categorize data points or
objects into pre-defined classes based on their characteristics or features. In the context of
car price prediction, classification can be used to classify cars into different price ranges
based on their features, such as brand type, engine, segment type, and colour. This
technique can also be used to predict the price range of a car based on its features, which can
be useful in making pricing decisions.
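The association-rule idea above can be illustrated by computing support and confidence over a toy table of cars (the rows below are invented for illustration, not taken from the project dataset):

```python
cars = [
    {'brand': 'BMW', 'segment': 'luxury', 'price': 'high'},
    {'brand': 'BMW', 'segment': 'sedan', 'price': 'high'},
    {'brand': 'Tata', 'segment': 'hatchback', 'price': 'low'},
    {'brand': 'BMW', 'segment': 'luxury', 'price': 'high'},
]

def support(rows, items):
    """Fraction of rows matching every attribute in `items`."""
    hits = sum(all(row.get(k) == v for k, v in items.items()) for row in rows)
    return hits / len(rows)

def confidence(rows, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support(rows, {**antecedent, **consequent}) / support(rows, antecedent)

print(support(cars, {'brand': 'BMW'}))                        # 0.75
print(confidence(cars, {'brand': 'BMW'}, {'price': 'high'}))  # 1.0
```

A confidence of 1.0 for the rule brand=BMW → price=high means every BMW row in this toy table is high-priced, which is exactly the kind of pattern association rule mining surfaces from a real dataset.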
Program:
import numpy as np
import pandas as pd
import streamlit as st
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import OneHotEncoder

# df (the car dataset) and pipe (the trained model pipeline) are assumed to be
# loaded earlier in the application.

st.title("Car Predictor")

# brand
company = st.selectbox('Brand', df['Company'].unique())
# type of car
car_type = st.selectbox('Type', df['TypeName'].unique())
# engine
engine = st.selectbox('Engine', ['4c', 'v4', 'v6', 'v8', 'v12'])
# weight
weight = st.number_input('Weight of the Car')
# infotainment system
infotainment = st.selectbox('Infotainment', ['No', 'Yes'])
# luxury
luxury = st.selectbox('Luxury', ['No', 'Yes'])
# length size
length_size = st.number_input('Length Size')
# fuel type
fuel = st.selectbox('Fuel', ['EV', 'Petrol', 'Diesel', 'CNG'])
# features
wheel = st.selectbox('Wheel', df['wheel_type'].unique())
gear = st.selectbox('Gear', ['MT', 'AT'])
colour = st.selectbox('Colour', df['col'].unique())

if st.button('Predict Price'):
    # encode the Yes/No inputs as 1/0
    infotainment = 1 if infotainment == 'Yes' else 0
    luxury = 1 if luxury == 'Yes' else 0
    # build the query in the same column order the model was trained on
    query = np.array([company, car_type, engine, weight, infotainment,
                      luxury, length_size, fuel, wheel, gear, colour])
    query = query.reshape(1, 11)
    # the model was trained on log(price), so invert with exp
    st.title("The predicted price of this configuration is "
             + str(int(np.exp(pipe.predict(query)[0]))))
Observations:
Conclusion:
In this project, we have analyzed a car dataset and performed various data cleaning and
preprocessing techniques. We have also derived useful features from the existing ones, such as the
presence of infotainment and luxury, colour, fuel type, engine brand and type, wheelbase, gear,
and length. Finally, we have built a machine learning model using the Random Forest
Regressor algorithm to predict car prices based on these features.
The model has achieved an accuracy score of 89% on the test data, which indicates that the model
can predict car prices with high accuracy. We have also visualized some important features related
to car prices, such as company, car type, engine brand, segment, and body type, which
can help users make better decisions while buying a car. Overall, this project provides useful
insights into the car industry and how machine learning can be used to predict car prices.
Quiz:
1) What are other techniques that can be used to solve your system problem?
Rubrics (each graded Good = 2 or Average = 1):
(1) Knowledge (2)
(2) Teamwork (2)
(3) Logic Building (2)
(4) Completeness and accuracy (2)
(5) Ethics (2)
Marks:                Total: