Unit-4 DWM
Listed below are the various areas of the market where data mining is used −
Customer Profiling − Data mining helps determine what kind of
people buy what kind of products.
Fraud Detection − Data mining is also used in the fields of credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps analyze the destination of the call, the duration of the call, the time of day or week, etc. It also analyzes patterns that deviate from expected norms.
Data Mining - Tasks
Data mining deals with the kind of patterns that can be mined. On the
basis of the kind of data to be mined, there are two categories of functions
involved in Data Mining −
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the
database. Here is the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways −
Data Characterization − This refers to summarizing the data of the class under study. This class under study is called the target class.
Data Discrimination − This refers to comparing the target class with one or a set of comparative (contrasting) classes.
Mining of Association
Associations are used in retail sales to identify items that are frequently purchased together. This function refers to the process of uncovering relationships among data and determining association rules.
For example, a retailer may generate an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two itemsets, to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
Data Warehousing
Data warehousing is the process of constructing and using the data
warehouse. A data warehouse is constructed by integrating the data from
multiple heterogeneous sources. It supports analytical reporting,
structured and/or ad hoc queries, and decision making.
Data warehousing involves data cleaning, data integration, and data consolidation. To integrate heterogeneous databases, we have the following two approaches −
Query Driven Approach
Update Driven Approach
Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases. When a query is issued on the client side, a metadata dictionary translates it into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to the local query processors, and the results from the heterogeneous sites are integrated into a global answer set.
Disadvantages
This approach has the following disadvantages −
Update-Driven Approach
Today's data warehouse systems follow update-driven approach rather
than the traditional approach discussed earlier. In the update-driven
approach, the information from multiple heterogeneous sources is
integrated in advance and stored in a warehouse. This information is
available for direct querying and analysis.
Advantages
This approach has the following advantages −
Data Mining
Knowledge Base
Knowledge Discovery
User interface
The user interface is the module of the data mining system that facilitates communication between users and the data mining system. The user interface allows the following functionalities −
Interact with the system by specifying a data mining query task.
Provide information to help focus the search.
Perform mining based on intermediate data mining results.
Browse database and data warehouse schemas or data structures.
Evaluate mined patterns.
Visualize the patterns in different forms.
Data Integration
Data Cleaning
Data cleaning is a technique that is applied to remove the noisy data and
correct the inconsistencies in data. Data cleaning involves
transformations to correct the wrong data. Data cleaning is performed as a
data preprocessing step while preparing the data for a data warehouse.
Data Selection
Data Selection is the process where data relevant to the analysis task are
retrieved from the database. Sometimes data transformation and
consolidation are performed before the data selection process.
Data Transformation
In this step, data is transformed or consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
Apart from these, a data mining system can also be classified based on
the kind of (a) databases mined, (b) knowledge mined, (c) techniques
utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases
mined. Database system can be classified according to different criteria
such as data models, types of data, etc. And the data mining system can
be classified accordingly.
For example, if we classify a database according to the data model, then
we may have a relational, transactional, object-relational, or data
warehouse mining system.
Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge
mined. It means the data mining system is classified on the basis of
functionalities such as −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques
used. We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications
adapted. These applications are as follows −
Finance
Telecommunications
DNA
Stock Markets
E-mail
Loose Coupling − In this scheme, the data mining system may use some of the functions of a database or data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining results either in a file or in a designated place in a database or a data warehouse.
The Data Mining Query Language (DMQL) was proposed by Han, Fu,
Wang, et al. for the DBMiner data mining system. The Data Mining
Query Language is actually based on the Structured Query Language
(SQL). Data Mining Query Languages can be designed to support ad hoc
and interactive data mining. This DMQL provides commands for
specifying primitives. The DMQL can work with databases and data
warehouses as well. DMQL can be used to define data mining tasks.
Particularly we examine how to define data warehouses and data marts in
DMQL.
Operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
    {age_category(1), ..., age_category(5)}
    := cluster(default, age, 5) < all(age)
Rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
    level_1: low_profit_margin < level_0: all
        if (price - cost) < $50
    level_1: medium_profit_margin < level_0: all
What is classification?
Following are the examples of cases where the data analysis task is
Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
What is prediction?
Following are the examples of cases where the data analysis task is
Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are required to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.
Note − Regression analysis is a statistical methodology that is most often
used for numeric prediction.
With the help of the bank loan application that we have discussed above,
let us understand the working of classification. The Data Classification
process includes two steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −
Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
Note − Data can also be reduced by some other methods such as wavelet
transformation, binning, histogram analysis, and clustering.
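As a rough illustration of min-max normalization, the short sketch below rescales a hypothetical list of attribute values into the range [0, 1]; the values themselves are only assumptions for illustration:
# Minimal sketch of min-max normalization (illustrative values only)
values = [12000, 35000, 58000, 99000]   # hypothetical values of one attribute

min_v, max_v = min(values), max(values)
normalized = [(v - min_v) / (max_v - min_v) for v in values]
print(normalized)   # all values now fall within the range [0, 1]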
if Dj is empty then
    attach a leaf labeled with the majority class in D to node N;
else
    attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
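The lines above are a fragment of the decision tree induction pseudocode. As a hedged illustration of the same idea in Python, the sketch below trains scikit-learn's DecisionTreeClassifier on a tiny made-up loan dataset; the feature values, labels, and parameters are assumptions, not data from this unit:
# Illustrative sketch: building a decision tree classifier on a toy loan dataset
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [income in thousands, number of existing loans]
X = [[25, 2], [60, 0], [32, 3], [80, 1], [45, 0], [28, 4]]
# Hypothetical labels: 1 = risky applicant, 0 = safe applicant
y = [1, 0, 1, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=3)   # learning step: build the model
tree.fit(X, y)
print(tree.predict([[50, 1]]))               # classification step: label a new applicant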
Tree Pruning
Cost Complexity
What is Clustering?
While doing cluster analysis, we first partition the set of data into
groups based on data similarity and then assign the labels to the
groups.
Clustering Methods
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. Here, we start with each object forming a separate group. It keeps merging the objects or groups that are close to one another, and it keeps doing so until all of the groups are merged into one or until the termination condition holds.
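As a hedged sketch of this bottom-up approach, the code below runs scikit-learn's AgglomerativeClustering on a few made-up 2-D points; the points and the choice of two clusters are assumptions for illustration:
# Illustrative sketch of agglomerative (bottom-up) hierarchical clustering
import numpy as nm
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 2-D points forming two loose groups
points = nm.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]])

# Each point starts as its own group; the closest groups are merged step by step
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(points)
print(labels)   # cluster label assigned to each point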
Divisive Approach
This approach is also known as the top-down approach. Here, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of
hierarchical clustering −
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
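A minimal sketch of this density idea, assuming scikit-learn's DBSCAN and made-up points (eps is the neighborhood radius and min_samples the minimum number of points it must contain):
# Illustrative sketch of density-based clustering with DBSCAN
import numpy as nm
from sklearn.cluster import DBSCAN

points = nm.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],   # a dense region
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # another dense region
                   [4.0, 5.0]])                          # an isolated point

db = DBSCAN(eps=0.5, min_samples=3)   # radius and minimum points per neighborhood
labels = db.fit_predict(points)
print(labels)   # -1 marks points treated as noise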
Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time, which depends only on the number of cells in the quantized space and not on the number of data objects.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best
fit of data for a given model. This method locates the clusters by
clustering the density function. It reflects spatial distribution of the data
points.
This method also provides a way to automatically determine the number
of clusters based on standard statistics, taking outlier or noise into
account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user- or application-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide us with an interactive way of communicating with the clustering process. Constraints can be specified by the user or the application requirement.
Association Rule
Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example is market basket analysis.
Market basket analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people frequently buy together.
Given a set of transactions, we can find rules that will predict the
occurrence of an item based on the occurrences of other items in the
transaction.
TID Items
1 Bread, Milk
K-Means Clustering Algorithm
K-means allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The algorithm works in the following steps −
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (These can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster. The assignment and update steps are repeated until no reassignment occurs, and then the model is ready; a sketch of these steps is given below.
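This sketch, in Python with NumPy, assumes random 2-D data and k = 2; the values are illustrative only (a scikit-learn implementation is used later in this unit):
# Illustrative from-scratch sketch of the K-means steps above (k = 2)
import numpy as nm

nm.random.seed(0)
data = nm.random.rand(20, 2)                                       # hypothetical unlabeled points
centroids = data[nm.random.choice(len(data), 2, replace=False)]    # Step-2: random centroids

for _ in range(10):                                                # repeat assignment and update
    # Step-3: assign each point to its closest centroid
    dists = nm.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step-4: recompute each centroid as the mean of its assigned points
    new_centroids = nm.array([data[labels == j].mean(axis=0) for j in range(2)])
    if nm.allclose(new_centroids, centroids):                      # stop when nothing changes
        break
    centroids = new_centroids                                      # Step-5: reassign with new centroids

print(labels)      # final cluster index of each point
print(centroids)   # final centroid positions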
Suppose we have two variables M1 and M2. The x-y axis scatter plot of
these two variables is given below:
o Let's take the number of clusters k = 2, i.e., we will try to group this dataset into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we select two points as the k centroids, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest
K-point or centroid. We will compute it by applying some
mathematics that we have studied to calculate the distance between
two points. So, we will draw a median between both the centroids.
Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each current cluster and place the new centroids there.
o Next, we will reassign each data point to its new closest centroid. For this, we will repeat the same process of finding a median line between the centroids. The median will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to the new centroids.
As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.
o As we have got the new centroids, we will again draw the median line and reassign the data points.
o We can then see that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal
number of clusters. This method uses the concept of WCSS
value. WCSS stands for Within Cluster Sum of Squares, which defines
the total variations within a cluster. The formula to calculate the value of
WCSS (for 3 clusters) is given below:
To measure the distance between data points and centroid, we can use
any method such as Euclidean distance or Manhattan distance.
To find the optimal number of clusters, the elbow method follows the below steps:
o It executes K-means clustering on the given dataset for different values of K (ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, it is known as the elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the given data
points. If we choose the number of clusters equal to the data points,
then the value of WCSS becomes zero, and that will be the endpoint
of the plot.
In the above section, we have discussed the K-means algorithm, now let's
see how it can be implemented using Python.
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters
The first step will be the data pre-processing, as we did in our earlier
topics of Regression and Classification. But for the clustering problem, it
will be different from other models. Let's discuss it:
o Importing Libraries
As we did in previous topics, firstly, we will import the libraries
for our model, which is part of data pre-processing. The code is
given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, numpy has been imported for performing mathematical calculations, matplotlib for plotting the graphs, and pandas for managing the dataset.
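The dataset-loading line itself is not reproduced in these notes; a minimal sketch is given below, assuming a customer file named Mall_Customers_data.csv (the file name is an assumption) with columns such as CustomerID, Genre, Age, Annual Income, and Spending Score:
# Importing the dataset (file name and columns are assumptions for illustration)
dataset = pd.read_csv('Mall_Customers_data.csv')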
By executing the above lines of code, we will get our dataset in the
Spyder IDE. The dataset looks like the below image:
From the above dataset, we need to find some patterns in it.
Here we don't need any dependent variable in the data pre-processing step, as it is a clustering problem and we have no idea what to determine. So we will just add a line of code for the matrix of features, as sketched below.
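A sketch of that single line, assuming the annual income and spending score sit in the 3rd and 4th columns (indices 3 and 4):
# Extracting the matrix of features: only the income and spending columns
x = dataset.iloc[:, [3, 4]].values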
As we can see, we are extracting only the 3rd and 4th features. This is because we need a 2-D plot to visualize the model, and some features, such as customer_id, are not required.
In the second step, we will try to find the optimal number of clusters for
our clustering problem. So, as discussed above, here we are going to use
the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot
by plotting WCSS values on the Y-axis and the number of clusters on the
X-axis. So we are going to calculate the value for WCSS for different k
values ranging from 1 to 10. Below is the code for it:
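The original listing is not reproduced here; a hedged reconstruction that matches the description below is:
# Finding the optimal number of clusters using the elbow method (illustrative sketch)
from sklearn.cluster import KMeans

wcss_list = []                                    # WCSS value for each k
for i in range(1, 11):                            # k from 1 to 10 (range excludes 11)
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)             # inertia_ is the WCSS for this k

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()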
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
After that, we have initialized a for loop to iterate over different values of k ranging from 1 to 10; since Python's range excludes the upper bound, it is written as range(1, 11) so that k = 10 is included.
The rest part of the code is similar as we did in earlier topics, as we have
fitted the model on a matrix of features and then plotted the graph
between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number
of clusters here will be 5.
Step-3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, so we can now train the model on
the dataset.
To train the model, we will use the same two lines of code as we have
used in the above section, but here instead of using i, we will use 5, as we
know there are 5 clusters that need to be formed. The code is given below:
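A hedged sketch of those two lines, consistent with the description that follows:
# Training the K-means model on the dataset with the chosen number of clusters (5)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)   # predicted cluster index for each customer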
The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the dependent
variable y_predict to train the model.
By executing the above lines of code, we will get the y_predict variable.
We can check it under the variable explorer option in the Spyder IDE.
We can now compare the values of y_predict with our original dataset.
Consider the below image:
From the above image, we can now relate that CustomerID 1 belongs to cluster 3 (as the index starts from 0, the value 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one using a scatter plot drawn with the mtp.scatter() function of matplotlib; a sketch of this code is given below. One scatter call is written for each of the five clusters, and a coordinate such as x[y_predict == 0, 0] selects the feature values of the data points whose predicted cluster index is 0.
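The plotting code itself is not reproduced in these notes; a hedged sketch consistent with the description above (colors and labels are arbitrary choices) is:
# Visualizing the five clusters (illustrative sketch)
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')   # final centroid positions
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()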
Output: The output image clearly shows the five different clusters in different colors. The clusters are formed between two parameters of the dataset, the annual income of the customer and the spending score. We can change the colors and labels as per the requirement or choice.
Apriori Algorithm
Frequent itemsets are those itemsets whose support is greater than the threshold value or user-specified minimum support. By the Apriori property, if {A, B} is a frequent itemset, then A and B must individually also be frequent itemsets.
Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support and confidence.
Step-2: Take all the itemsets in the transactions with a support value higher than the minimum or selected support value.
Step-3: Find all the rules from these subsets that have a confidence value higher than the threshold or minimum confidence.
Example: Suppose we have a set of transactions over the items A, B, C, D, and E, with a minimum support count of 2 and a minimum confidence of 50%.
Solution:
o In the first step, we will create a table that contains support count
(The frequency of each itemset individually in the dataset) of each
itemset in the given dataset. This table is called the Candidate set
or C1.
o Now, we will take out all the itemsets that have a support count greater than or equal to the Minimum Support (2). This gives us the table for the frequent itemset L1.
Since all the itemsets except E have a support count greater than or equal to the minimum support, the itemset {E} will be removed.
Step-2: Candidate Generation C2, and L2:
o In this step, we will generate C2 with the help of L1. In C2, we will
create the pair of the itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find the support count
from the main transaction table of datasets, i.e., how many times
these pairs have occurred together in the given dataset. So, we will
get the below table for C2:
o For C3, we will repeat the same two processes, but now we will
form the C3 table with subsets of three itemsets together, and will
calculate the support count from the dataset. It will give the below
table:
o Now we will create the L3 table. As we can see from the above C3
table, there is only one combination of itemset that has support
count equal to the minimum support count. So, the L3 will have
only one combination, i.e., {A, B, C}.
To generate the association rules, first we will create a new table with the possible rules from the frequent combination {A, B, C}. For each rule, we will calculate the confidence using the formula Confidence = support(A ∧ B) / support(A). After calculating the confidence value for all the rules, we will exclude the rules that have a confidence lower than the minimum threshold (50%). A hedged code sketch of this step follows.
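As an illustration of this last step, the sketch below computes the confidence of candidate rules generated from the frequent itemset {A, B, C}; the list of transactions is a made-up assumption, not the dataset of this example:
# Illustrative sketch: confidence of rules generated from the frequent itemset {A, B, C}
transactions = [                  # hypothetical transactions for illustration only
    {'A', 'B', 'C'},
    {'A', 'B', 'C', 'D'},
    {'A', 'B'},
    {'B', 'C'},
    {'A', 'C', 'E'},
]

def support_count(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

min_confidence = 0.5
for antecedent in ({'A'}, {'B'}, {'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}):
    consequent = {'A', 'B', 'C'} - antecedent
    confidence = support_count({'A', 'B', 'C'}) / support_count(antecedent)
    status = 'kept' if confidence >= min_confidence else 'excluded'   # apply the 50% threshold
    print(sorted(antecedent), '->', sorted(consequent), round(confidence, 2), status)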
********************************************