Unit-4 DWM


UNIT-4

INTRODUCTION: ALGORITHMS AND KNOWLEDGE DISCOVERY

There is a huge amount of data available in the Information Industry. This data is of no use until it is converted into useful information. It is necessary to analyze this huge amount of data and extract useful information from it.

Extraction of information is not the only process we need to perform; data mining also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we would be able to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.

What is Data Mining?

Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge extracted in this way can be used for any of the following applications −
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration

Data Mining Applications

Data mining is highly useful in the following domains −


 Market Analysis and Management
 Corporate Analysis & Risk Management
 Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer retention, science exploration, sports, astrology, and Internet Web Surf-Aid.

Market Analysis and Management

Listed below are the various fields of market where data mining is used −
Customer Profiling − Data mining helps determine what kind of
people buy what kind of products.

Identifying Customer Requirements − Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.

Cross Market Analysis − Data mining performs associations/correlations between product sales.

Target Marketing − Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.

Determining Customer Purchasing Patterns − Data mining helps in determining customer purchasing patterns.

Providing Summary Information − Data mining provides us various multidimensional summary reports.

Corporate Analysis and Risk Management

Data mining is used in the following fields of the Corporate Sector −

Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction, contingent claim analysis to evaluate assets.

Resource Planning − It involves summarizing and comparing the resources and spending.

Competition − It involves monitoring competitors and market directions.

Fraud Detection

Data mining is also used in the fields of credit card services and telecommunication to detect fraud. In fraudulent telephone calls, it helps to find the destination of the call, duration of the call, time of the day or week, etc. It also analyzes the patterns that deviate from expected norms.
Data Mining - Tasks
Data mining deals with the kind of patterns that can be mined. On the
basis of the kind of data to be mined, there are two categories of functions
involved in Data Mining −
 Descriptive
 Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the
database. Here is the list of descriptive functions −
 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways −

Data Characterization − This refers to summarizing data of the class under study. This class under study is called the Target Class.

Data Discrimination − It refers to the mapping or classification of a class with some predefined group or class.

Mining of Frequent Patterns


Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of kinds of frequent patterns −

Frequent Item Set − It refers to a set of items that frequently appear together, for example, milk and bread. (A minimal counting sketch follows this list.)

Frequent Subsequence − A sequence of patterns that occur frequently, such as purchasing a camera followed by a memory card.

Frequent Sub-Structure − Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item-sets or subsequences.
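
As an illustration (not from the original text), here is a minimal Python sketch that counts candidate itemsets over a small, assumed list of transactions and keeps those that meet a support-count threshold; the transactions, items, and threshold are made up for the example.

from itertools import combinations
from collections import Counter

# Assumed example transactions (each is a set of items bought together).
transactions = [
    {'milk', 'bread'},
    {'milk', 'bread', 'butter'},
    {'bread', 'beer'},
    {'milk', 'bread', 'beer'},
]
min_support_count = 2  # an itemset is "frequent" if it occurs in at least 2 transactions

# Count every 1-item and 2-item combination that occurs in the transactions.
counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {itemset: c for itemset, c in counts.items() if c >= min_support_count}
print(frequent)  # ('bread', 'milk') occurs 3 times, so {milk, bread} is frequent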

Mining of Association
Associations are used in retail sales to identify patterns that are frequently purchased together. It refers to the process of uncovering the relationship among data and determining association rules.
For example, a retailer generates an association rule that shows that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two itemsets, to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but are highly different from the objects in other clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms −
 Classification (IF-THEN) Rules
 Decision Trees
 Mathematical Formulae
 Neural Networks
The list of functions involved in these processes is as follows −

Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are well known.

Prediction − It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used for identification of distribution trends based on available data.

Outlier Analysis − Outliers may be defined as the data objects that do not comply with the general behavior or model of the data available.

Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or trends for objects whose behavior changes over time.

Data Mining Task Primitives

 We can specify a data mining task in the form of a data mining query.
 This query is input to the system.
 A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive
manner with the data mining system. Here is the list of Data Mining Task
Primitives −
 Set of task relevant data to be mined.
 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This is the portion of database in which the user is interested. This portion
includes the following −
 Database Attributes
 Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis
Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns that are discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These representations may include the following −
 Rules
 Tables
 Charts
 Graphs
 Decision Trees
 Cubes
Data Mining - Issues
Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available at one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. Here in this tutorial, we will discuss the major issues
regarding −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
The following diagram describes the major issues.

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −

Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.

Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.

Handling noisy or incomplete data − Data cleaning methods are required to handle the noise and incomplete objects while mining the data regularities. If data cleaning methods are not available, the accuracy of the discovered patterns will be poor.

Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are not useful.

Performance Issues

There can be performance-related issues such as the following −

Efficiency and scalability of data mining algorithms − In order to effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are further processed in a parallel fashion. Then the results from the partitions are merged. Incremental algorithms update databases without mining the data again from scratch.

Diverse Data Types Issues


Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore mining the knowledge from them adds challenges to data mining.

Data Mining - Evaluation

Data Warehouse

A data warehouse exhibits the following characteristics to support the management's decision-making process −

Subject Oriented − A data warehouse is subject oriented because it provides us information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. The data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision making.

Integrated − A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non-volatile − Non-volatile means the previous data is not removed when new data is added to it. The data warehouse is kept separate from the operational database; therefore frequent changes in the operational database are not reflected in the data warehouse.

Data Warehousing
Data warehousing is the process of constructing and using the data
warehouse. A data warehouse is constructed by integrating the data from
multiple heterogeneous sources. It supports analytical reporting,
structured and/or ad hoc queries, and decision making.
Data warehousing involves data cleaning, data integration, and data consolidation. To integrate heterogeneous databases, we have the following two approaches −
 Query Driven Approach
 Update Driven Approach

Query-Driven Approach

This is the traditional approach to integrating heterogeneous databases. This approach is used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.
Process of Query Driven Approach

 When a query is issued to a client side, a metadata dictionary translates the query into queries appropriate for the individual heterogeneous sites involved.

 Now these queries are mapped and sent to the local query processor.

 The results from heterogeneous sites are integrated into a global answer set.

Disadvantages
This approach has the following disadvantages −

 The Query Driven Approach needs complex integration and filtering processes.

 It is very inefficient and very expensive for frequent queries.

 This approach is expensive for queries that require aggregations.

Update-Driven Approach
Today's data warehouse systems follow update-driven approach rather
than the traditional approach discussed earlier. In the update-driven
approach, the information from multiple heterogeneous sources is
integrated in advance and stored in a warehouse. This information is
available for direct querying and analysis.
Advantages
This approach has the following advantages −

This approach provides high performance.

The data can be copied, processed, integrated, annotated, summarized and restructured in the semantic data store in advance.

Query processing does not require an interface with the processing at local sources.

From Data Warehousing (OLAP) to Data Mining (OLAM)

Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data mining and mining knowledge in multidimensional databases. Here is the diagram that shows the integration of both OLAP and OLAM −

Importance of OLAM

OLAM is important for the following reasons −

High quality of data in data warehouses − Data mining tools are required to work on integrated, consistent, and cleaned data. These steps are very costly in the preprocessing of data. The data warehouses constructed by such preprocessing are valuable sources of high quality data for OLAP and data mining as well.

Available information processing infrastructure surrounding data warehouses − Information processing infrastructure refers to accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web-accessing and service facilities, and reporting and OLAP analysis tools.

OLAP-based exploratory data analysis − Exploratory data analysis is required for effective data mining. OLAM provides the facility for data mining on various subsets of data and at different levels of abstraction.

Online selection of data mining functions − Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

Data Mining - Terminologies

Data Mining

Data mining is defined as extracting information from a huge set of data. In other words, we can say that data mining is mining knowledge from data. This information can be used for any of the following applications −
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration

Data Mining Engine


The data mining engine is essential to the data mining system. It consists of a set of functional modules that perform the following functions −
 Characterization
 Association and Correlation Analysis
 Classification
 Prediction
 Cluster analysis
 Outlier analysis
 Evolution analysis

Knowledge Base

This is the domain knowledge. This knowledge is used to guide the search or evaluate the interestingness of the resulting patterns.

Knowledge Discovery

Some people treat data mining the same as knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −
 Data Cleaning
 Data Integration
 Data Selection
 Data Transformation
 Data Mining
 Pattern Evaluation
 Knowledge Presentation

User interface

The user interface is the module of the data mining system that enables communication between users and the data mining system. The user interface allows the following functionalities −
 Interact with the system by specifying a data mining query task.
 Providing information to help focus the search.
 Mining based on the intermediate data mining results.
 Browse database and data warehouse schemas or data structures.
 Evaluate mined patterns.
 Visualize the patterns in different forms.
Data Integration

Data Integration is a data preprocessing technique that merges the data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.

Data Cleaning

Data cleaning is a technique that is applied to remove the noisy data and
correct the inconsistencies in data. Data cleaning involves
transformations to correct the wrong data. Data cleaning is performed as a
data preprocessing step while preparing the data for a data warehouse.

Data Selection

Data Selection is the process where data relevant to the analysis task are
retrieved from the database. Sometimes data transformation and
consolidation are performed before the data selection process.

Clusters

Cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but are highly different from the objects in other clusters.

Data Transformation

In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

Data Mining - Knowledge Discovery

What is Knowledge Discovery?

Some people don't differentiate data mining from knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −

Data Cleaning − In this step, the noise and inconsistent data are removed.

Data Integration − In this step, multiple data sources are combined.

Data Selection − In this step, data relevant to the analysis task are retrieved from the database.

Data Transformation − In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

Data Mining − In this step, intelligent methods are applied in order to extract data patterns.

Pattern Evaluation − In this step, data patterns are evaluated.

Knowledge Presentation − In this step, knowledge is represented.

The following diagram shows the process of knowledge discovery −
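
To make these steps concrete, the following small pandas sketch (illustrative only; the tables and column names are assumed, not taken from the text) walks through cleaning, integration, selection, and transformation on toy data.

import pandas as pd

# Assumed example tables; the column names are hypothetical.
sales = pd.DataFrame({'cust_id': [1, 2, 2, 3], 'amount': [120.0, None, 80.0, 55.0]})
customers = pd.DataFrame({'cust_id': [1, 2, 3], 'region': ['north', 'south', 'north']})

# Data Cleaning: drop duplicate rows and fill the missing amount.
sales = sales.drop_duplicates()
sales['amount'] = sales['amount'].fillna(sales['amount'].mean())

# Data Integration: combine the two sources on a common key.
data = sales.merge(customers, on='cust_id')

# Data Selection: keep only the rows and columns relevant to the analysis task.
north = data[data['region'] == 'north'][['cust_id', 'amount']]

# Data Transformation: summarize by aggregation, a form suitable for mining.
summary = north.groupby('cust_id')['amount'].sum()
print(summary)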

Data Mining - Systems


There is a large variety of data mining systems available. Data mining
systems may integrate techniques from the following −
 Spatial Data Analysis
 Information Retrieval
 Pattern Recognition
 Image Analysis
 Signal Processing
 Computer Graphics
 Web Technology
 Business
 Bioinformatics

Data Mining System Classification

A data mining system can be classified according to the following criteria −
 Database Technology
 Statistics
 Machine Learning
 Information Science
 Visualization
 Other Disciplines

Apart from these, a data mining system can also be classified based on
the kind of (a) databases mined, (b) knowledge mined, (c) techniques
utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases
mined. Database system can be classified according to different criteria
such as data models, types of data, etc. And the data mining system can
be classified accordingly.
For example, if we classify a database according to the data model, then
we may have a relational, transactional, object-relational, or data
warehouse mining system.
Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge
mined. It means the data mining system is classified on the basis of
functionalities such as −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Outlier Analysis
 Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques
used. We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications
adapted. These applications are as follows −
 Finance
 Telecommunications
 DNA
 Stock Markets
 E-mail

Integrating a Data Mining System with a DB/DW System

If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets.
The list of Integration Schemes is as follows −

No Coupling − In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file.

Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or in a data warehouse.

Semi-tight Coupling − In this scheme, the data mining system is linked with a database or a data warehouse system and, in addition to that, efficient implementations of a few data mining primitives can be provided in the database.

Tight Coupling − In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.

Data Mining - Query Language (DMQL)

The Data Mining Query Language (DMQL) was proposed by Han, Fu,
Wang, et al. for the DBMiner data mining system. The Data Mining
Query Language is actually based on the Structured Query Language
(SQL). Data Mining Query Languages can be designed to support ad hoc
and interactive data mining. This DMQL provides commands for
specifying primitives. The DMQL can work with databases and data
warehouses as well. DMQL can be used to define data mining tasks.
Particularly we examine how to define data warehouses and data marts in
DMQL.

Syntax for Task-Relevant Data Specification

Here is the syntax of DMQL for specifying task-relevant data −


use database database_name
or

use data warehouse data_warehouse_name


in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
Syntax for Specifying the Kind of Knowledge

Here we will discuss the syntax for Characterization, Discrimination, Association, Classification, and Prediction.
Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s)}
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, a description describing customer purchasing habits −
mine characteristics as customerPurchasing
analyze count%
Discrimination
The syntax for Discrimination is −
mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i} where {contrast_condition_i}}
analyze {measure(s)}
For example, a user may define big spenders as customers who purchase
items that cost $100 or more on an average; and budget spenders as
customers who purchase items at less than $100 on an average. The
mining of discriminant descriptions for customers from each of these
categories can be specified in the DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥$100
versus budgetSpenders where avg(I.price)< $100
analyze count
Association
The syntax for Association is−
mine associations [ as {pattern_name} ]
{matching {metapattern} }
For Example −
mine associations as buyingHabits
matching P(X: customer, W) ^ Q(X, Y) ⇒ buys(X, Z)
where X is key of customer relation; P and Q are predicate variables; and
W, Y, and Z are object variables.
Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating and the mined classification is named classifyCustomerCreditRating −
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
The syntax for prediction is −
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}

Syntax for Concept Hierarchy Specification

To specify concept hierarchies, use the following syntax −


use hierarchy <hierarchy> for <attribute_or_dimension>
We use different syntaxes to define different types of hierarchies, such as −

-schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]

-set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level3: {40, ..., 59} < level1: middle_aged
level4: {60, ..., 89} < level1: senior

-operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)

-rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250

Syntax for Interestingness Measures Specification

Interestingness measures and thresholds can be specified by the user with the statement −
with <interest_measure_name> threshold = threshold_value
For Example −
with support threshold = 0.05
with confidence threshold = 0.7

Syntax for Pattern Presentation and Visualization Specification

We have a syntax which allows users to specify the display of discovered patterns in one or more forms.

Data Mining - Classification & Prediction


There are two forms of data analysis that can be used for extracting
models describing important classes or to predict future data trends.
These two forms are as follows −
 Classification
 Prediction
Classification models predict categorical class labels; and prediction
models predict continuous valued functions. For example, we can build a
classification model to categorize bank loan applications as either safe or
risky, or a prediction model to predict the expenditures in dollars of
potential customers on computer equipment given their income and
occupation.

What is classification?

Following are examples of cases where the data analysis task is Classification −

A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.

A marketing manager at a company needs to analyze a customer with a given profile to determine whether the customer will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.

What is prediction?

Following are examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value. Therefore the data analysis task is an example of numeric prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often
used for numeric prediction.

How Does Classification Work?

With the help of the bank loan application that we have discussed above,
let us understand the working of classification. The Data Classification
process includes two steps −
 Building the Classifier or Model
 Using Classifier for Classification
Building the Classifier or Model

This step is the learning step or the learning phase.

In this step, the classification algorithms build the classifier.

The classifier is built from the training set made up of database tuples and their associated class labels.

Each tuple that constitutes the training set is referred to as a category or class. These tuples can also be referred to as samples, objects, or data points.

Using the Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to the new data tuples if the accuracy is considered acceptable.
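
As a hedged illustration of these two steps, the short scikit-learn sketch below builds a classifier from labeled training tuples and then uses held-out test data to estimate its accuracy; the loan-style features and labels are synthetic assumptions, not data from the text.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumed toy data: [income_in_thousands, existing_loans]; labels: 'safe' or 'risky'.
X = [[60, 0], [25, 3], [48, 1], [18, 4], [75, 0], [30, 2], [55, 1], [22, 3]]
y = ['safe', 'risky', 'safe', 'risky', 'safe', 'risky', 'safe', 'risky']

# Step 1: Building the classifier from the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: Using the classifier; the test data estimates the accuracy of its rules.
print(accuracy_score(y_test, clf.predict(X_test)))
print(clf.predict([[40, 2]]))  # predict the class label of a new applicant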

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following activities −

Data Cleaning − Data cleaning involves removing the noise and treatment of missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.

Relevance Analysis − The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.

Data Transformation and Reduction − The data can be transformed by any of the following methods.

Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute in order to make them fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving measurements are used. (A minimal sketch of cleaning and normalization follows the note below.)

Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.

Note − Data can also be reduced by some other methods such as wavelet transformation, binning, histogram analysis, and clustering.
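
Here is the minimal sketch referred to above, showing missing-value replacement with the most commonly occurring value and min-max normalization into a small range; the attribute values are assumed for illustration.

import numpy as np

# Assumed attribute values with a missing entry (np.nan).
age = np.array([23.0, 45.0, np.nan, 31.0, 45.0, 27.0])

# Data Cleaning: replace the missing value with the most commonly occurring value.
values, counts = np.unique(age[~np.isnan(age)], return_counts=True)
most_common = values[np.argmax(counts)]
age = np.where(np.isnan(age), most_common, age)

# Normalization: min-max scaling into the range [0, 1].
age_scaled = (age - age.min()) / (age.max() - age.min())
print(age_scaled)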

Comparison of Classification and Prediction Methods

Here are the criteria for comparing the methods of Classification and Prediction −

Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.

Speed − This refers to the computational cost in generating and using the classifier or predictor.

Robustness − It refers to the ability of the classifier or predictor to make correct predictions from given noisy data.

Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given a large amount of data.

Interpretability − It refers to the extent to which the classifier or predictor can be understood.

Data Mining - Decision Tree Induction


A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that
indicates whether a customer at a company is likely to buy a computer or
not. Each internal node represents a test on an attribute. Each leaf node
represents a class.

The benefits of having a decision tree are as follows −


 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple
and fast.

Decision Tree Induction Algorithm

A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. In this algorithm, there is no backtracking; the trees are constructed in a top-down recursive divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D

Algorithm: Generate_decision_tree

Input:
Data partition D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a splitting subset.

Output:
A Decision Tree

Method:
create a node N;

if tuples in D are all of the same class C then
return N as a leaf node labeled with class C;

if attribute_list is empty then
return N as a leaf node labeled with the majority class in D; // majority voting

apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute

for each outcome j of splitting_criterion
// partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j; // a partition
if Dj is empty then
attach a leaf labeled with the majority class in D to node N;
else
attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
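
For comparison with the pseudocode, the following scikit-learn sketch induces a decision tree for a buy_computer-style concept. The encoded attribute values and labels are assumptions made up for the example, and scikit-learn's tree learner (with entropy-based splitting, close in spirit to ID3's information gain) stands in for ID3/C4.5.

from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed training tuples: [age_group, income_level, student] encoded as integers.
# age_group: 0=youth, 1=middle_aged, 2=senior; income_level: 0=low, 1=medium, 2=high; student: 0/1
X = [[0, 2, 0], [0, 2, 1], [1, 2, 0], [2, 1, 0], [2, 0, 1], [1, 0, 1], [0, 1, 0], [2, 1, 1]]
y = ['no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'yes']

# Entropy-based splitting; the tree is grown top-down without backtracking.
tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(export_text(tree, feature_names=['age_group', 'income_level', 'student']))
print(tree.predict([[0, 0, 1]]))  # classify a new tuple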

Tree Pruning

Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −

Pre-pruning − The tree is pruned by halting its construction early.

Post-pruning − This approach removes a sub-tree from a fully grown tree.

Cost Complexity

The cost complexity is measured by the following two parameters −


 Number of leaves in the tree, and
 Error rate of the tree.
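
As a hedged sketch of post-pruning guided by cost complexity (trading the number of leaves against the error rate), scikit-learn exposes a cost-complexity pruning parameter ccp_alpha; the dataset below is a standard toy dataset chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path trades off tree size (number of leaves) against training error via alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# A larger ccp_alpha prunes more aggressively, giving a smaller, less complex tree.
for alpha in path.ccp_alphas[::5]:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f} leaves={pruned.get_n_leaves()} "
          f"test accuracy={pruned.score(X_test, y_test):.2f}")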

Data Mining - Cluster Analysis


Cluster is a group of objects that belongs to the same class. In other
words, similar objects are grouped in one cluster and dissimilar objects
are grouped in another cluster.

What is Clustering?

Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember

A cluster of data objects can be treated as one group.

While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.

The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.

Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer groups based on purchasing patterns.

In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent to populations.

Clustering also helps in identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.

Clustering also helps in classifying documents on the web for information discovery.

Clustering is also used in outlier detection applications such as detection of credit card fraud.

As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to observe the characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −

Scalability − We need highly scalable clustering algorithms to deal with large databases.

Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data such as interval-based (numerical) data, categorical data, and binary data.

Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. They should not be bounded to only distance measures that tend to find spherical clusters of small sizes.

High dimensionality − The clustering algorithm should not only be able to handle low-dimensional data but also high-dimensional space.

Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.

Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods

Clustering methods can be classified into the following categories −
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning
method constructs ‘k’ partition of data. Each partition will represent a
cluster and k ≤ n. It means that it will classify the data into k groups,
which satisfy the following requirements −

Each group contains at least one object.

Each object must belong to exactly one group.

Points to remember −

For a given number of partitions (say k), the partitioning method will create an initial partitioning.

Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.

Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed. There are two approaches here −
 Agglomerative Approach
 Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another. It keeps on doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In continuous iterations, a cluster is split into smaller clusters. This is done until each object is in one cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical clustering −

Perform careful analysis of object linkages at each hierarchical partitioning.

Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
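
A brief scikit-learn sketch of the agglomerative (bottom-up) approach described above; the 2-D points and the choice of linkage are assumptions for illustration.

from sklearn.cluster import AgglomerativeClustering

# Assumed 2-D points forming two visually separate groups.
points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# Bottom-up merging of the closest groups until 2 clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage='average')
labels = agg.fit_predict(points)
print(labels)  # cluster assignment for each point, e.g. [1 1 1 0 0 0]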

Density-based Method
This method is based on the notion of density. The basic idea is to
continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold, i.e., for each data point within a
given cluster, the radius of a given cluster has to contain at least a
minimum number of points.
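
A minimal density-based sketch using DBSCAN from scikit-learn, which keeps growing a cluster while enough points fall within a given radius; the sample points and the eps/min_samples values are assumed for illustration.

from sklearn.cluster import DBSCAN

# Assumed points: two dense groups plus one isolated point.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [25, 25]]

# eps is the neighborhood radius; min_samples is the minimum number of points required.
db = DBSCAN(eps=2.0, min_samples=2).fit(points)
print(db.labels_)  # -1 marks the isolated point as noise/outlier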
Grid-based Method
In this, the objects together form a grid. The object space is quantized
into finite number of cells that form a grid structure.
Advantages

The major advantage of this method is fast processing time.

It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods
In this method, a model is hypothesized for each cluster to find the best
fit of data for a given model. This method locates the clusters by
clustering the density function. It reflects spatial distribution of the data
points.
This method also provides a way to automatically determine the number
of clusters based on standard statistics, taking outlier or noise into
account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user
expectation or the properties of desired clustering results. Constraints
provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application
requirement.
Association Rule
Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Before we start defining the rule, let us first see the basic definitions.
Support Count (σ) – Frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2
Frequent Itemset – An itemset whose support is greater than or equal to the minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any two itemsets.
Example: {Milk, Diaper} -> {Beer}
Rule Evaluation Metrics –
 Support(s) –
The number of transactions that include items in both the {X} and {Y} parts of the rule, as a fraction of the total number of transactions. It is a measure of how frequently the collection of items occurs together as a percentage of all transactions.
 Support(X -> Y) = σ(X ∪ Y) / |T| –
It is interpreted as the fraction of transactions that contain both X and Y.
 Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
 Conf(X -> Y) = Supp(X ∪ Y) / Supp(X) –
It measures how often items in Y appear in transactions that also contain X.
 Lift(l) –
The lift of the rule X -> Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support of {Y}.
 Lift(X -> Y) = Conf(X -> Y) / Supp(Y) –
A lift value near 1 indicates that X and Y appear together about as often as expected, greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.

Example – From the above table, consider the rule {Milk, Diaper} -> {Beer}

s = σ({Milk, Diaper, Beer}) / |T|
  = 2/5
  = 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper})
  = 2/3
  = 0.67

l = Supp({Milk, Diaper, Beer}) / (Supp({Milk, Diaper}) * Supp({Beer}))
  = 0.4/(0.6*0.6)
  = 1.11
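
The support, confidence, and lift above can be re-derived with a few lines of Python over the transaction table; this sketch simply reproduces the 0.4, 0.67, and 1.11 values.

# Transactions from the table above.
T = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in T) / len(T)

X, Y = {'Milk', 'Diaper'}, {'Beer'}
s = support(X | Y)                 # 2/5  = 0.4
c = support(X | Y) / support(X)    # 0.4/0.6 ≈ 0.67
l = c / support(Y)                 # 0.67/0.6 ≈ 1.11
print(round(s, 2), round(c, 2), round(l, 2))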
Association rules are very useful in analyzing datasets. The data is collected using bar-code scanners in supermarkets. Such databases consist of a large number of transaction records which list all items bought by a customer in a single purchase. So the manager can know if certain groups of items are consistently purchased together and use this data for adjusting store layouts, cross-selling, and promotions based on statistics.
K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science. In this topic, we will learn what the K-means clustering algorithm is, how the algorithm works, along with the Python implementation of K-means clustering.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which
groups the unlabeled dataset into different clusters. Here K defines the
number of pre-defined clusters that need to be created in the process, as if
K=2, there will be two clusters, and for K=3, there will be three clusters,
and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties.

It allows us to cluster the data into different groups and a convenient way
to discover the categories of groups in the unlabeled dataset on its own
without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding clusters. The algorithm takes the unlabeled dataset as input, divides the dataset into k number of clusters, and repeats the process until it does not find the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to a particular k-center create a cluster.

Hence each cluster has data points with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to Step-4, else go to FINISH.

Step-7: The model is ready.
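
To make the steps concrete, here is a small NumPy sketch of the same loop (choose K centroids, assign points to the nearest centroid, recompute centroids, repeat until nothing changes); the sample points and the initial centroid choice are assumed for illustration.

import numpy as np

# Assumed 2-D data points (two variables, M1 and M2).
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
K = 2

# Step-2: pick K points as the initial centroids (chosen deterministically here).
centroids = X[[0, 3]].copy()

while True:
    # Step-3: assign each data point to its closest centroid (Euclidean distance).
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step-4: recompute each centroid as the mean of the points assigned to it.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Steps 5-6: stop once the centroids (and hence the assignments) no longer change.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)      # cluster index for each point
print(centroids)   # final cluster centers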

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of
these two variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset
and to put them into different clusters. It means here we will try to
group these datasets into two different clusters.
o We need to choose some random k points or centroid to form the
cluster. These points can be either the points from the dataset or
any other point. So, here we are selecting the below two points as k
points, which are not the part of our dataset. Consider the below
image:

o Now we will assign each data point of the scatter plot to its closest
K-point or centroid. We will compute it by applying some
mathematics that we have studied to calculate the distance between
two points. So, we will draw a median between both the centroids.
Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.

o As we need to find the closest cluster, so we will repeat the process


by choosing a new centroid. To choose the new centroids, we will
compute the center of gravity of these centroids, and will find new
centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this,
we will repeat the same process of finding a median line. The
median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of


centroids, so the new centroids will be as shown in the below
image:
o As we got the new centroids so again will draw the median line and
reassign the data points. So, the image will be:

o We can see in the above image; there are no dissimilar data points
on either side of the line, which means our model is formed.
Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and
the two final clusters will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends upon the highly efficient clusters that it forms. But choosing the optimal number of clusters is a big task. There are some different ways to find the optimal number of clusters, but here we are discussing the most appropriate method to find the number of clusters or the value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal
number of clusters. This method uses the concept of WCSS
value. WCSS stands for Within Cluster Sum of Squares, which defines
the total variations within a cluster. The formula to calculate the value of
WCSS (for 3 clusters) is given below:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

In the above formula of WCSS,

∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same applies for the other two terms.

To measure the distance between data points and centroid, we can use
any method such as Euclidean distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below
steps:

o It executes the K-means clustering on a given dataset for different


K values (ranges from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of
clusters K.
o The sharp point of bend, where the plot looks like an arm (elbow), is considered as the best value of K.

Since the graph shows the sharp bend, which looks like an elbow, hence
it is known as the elbow method. The graph for the elbow method looks
like the below image:

Note: We can choose the number of clusters equal to the given data
points. If we choose the number of clusters equal to the data points,
then the value of WCSS becomes zero, and that will be the endpoint
of the plot.

Python Implementation of K-means Clustering Algorithm

In the above section, we have discussed the K-means algorithm, now let's
see how it can be implemented using Python.

Before implementation, let's understand what type of problem we will solve here. So, we have a dataset of Mall_Customers, which is the data of customers who visit the mall and spend there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (which is the calculated value of how much a customer has spent in the mall; the higher the value, the more he has spent). From this dataset, we need to calculate some patterns; as it is an unsupervised method, we don't know what to calculate exactly.

The steps to be followed for the implementation are given below:

o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters

Step-1: Data pre-processing Step

The first step will be the data pre-processing, as we did in our earlier
topics of Regression and Classification. But for the clustering problem, it
will be different from other models. Let's discuss it:

o Importing Libraries
As we did in previous topics, firstly, we will import the libraries
for our model, which is part of data pre-processing. The code is
given below:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd

In the above code, numpy is imported for performing mathematical calculations, matplotlib is for plotting the graph, and pandas is for managing the dataset.

o Importing the Dataset:

Next, we will import the dataset that we need to use. So here, we are using the Mall_Customers_data.csv dataset. It can be imported using the below code:

1. # Importing the dataset


2. dataset = pd.read_csv('Mall_Customers_data.csv')

By executing the above lines of code, we will get our dataset in the
Spyder IDE. The dataset looks like the below image:
From the above dataset, we need to find some patterns in it.

o Extracting Independent Variables

Here we don't need any dependent variable for data pre-processing step as
it is a clustering problem, and we have no idea about what to determine.
So we will just add a line of code for the matrix of features.

1. x = dataset.iloc[:, [3, 4]].values

As we can see, we are extracting only the 3rd and 4th features. It is because we need a 2d plot to visualize the model, and some features are not required, such as customer_id.

Step-2: Finding the optimal number of clusters using the elbow


method

In the second step, we will try to find the optimal number of clusters for
our clustering problem. So, as discussed above, here we are going to use
the elbow method for this purpose.

As we know, the elbow method uses the WCSS concept to draw the plot
by plotting WCSS values on the Y-axis and the number of clusters on the
X-axis. So we are going to calculate the value for WCSS for different k
values ranging from 1 to 10. Below is the code for it:

1. #finding optimal number of clusters using the elbow method
2. from sklearn.cluster import KMeans
3. wcss_list= []  #Initializing the list for the values of WCSS
4.
5. #Using for loop for iterations from 1 to 10.
6. for i in range(1, 11):
7.     kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
8.     kmeans.fit(x)
9.     wcss_list.append(kmeans.inertia_)
10. mtp.plot(range(1, 11), wcss_list)
11. mtp.title('The Elbow Method Graph')
12. mtp.xlabel('Number of clusters(k)')
13. mtp.ylabel('wcss_list')
14. mtp.show()

As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable to initialize an empty list, which is used to contain the value of WCSS computed for different values of k ranging from 1 to 10.

After that, we have initialized the for loop for the iteration on a different
value of k ranging from 1 to 10; since for loop in Python, exclude the
outbound limit, so it is taken as 11 to include 10th value.

The rest part of the code is similar as we did in earlier topics, as we have
fitted the model on a matrix of features and then plotted the graph
between the number of clusters and WCSS.

Output: After executing the above code, we will get the below output:

From the above plot, we can see the elbow point is at 5. So the number
of clusters here will be 5.
Step- 3: Training the K-means algorithm on the training dataset

As we have got the number of clusters, so we can now train the model on
the dataset.

To train the model, we will use the same two lines of code as we have
used in the above section, but here instead of using i, we will use 5, as we
know there are 5 clusters that need to be formed. The code is given below:

1. #training the K-means model on a dataset


2. kmeans = KMeans(n_clusters=5,
3. init='k-means++', random_state= 42)
4. y_predict= kmeans.fit_predict(x)

The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the dependent
variable y_predict to train the model.

By executing the above lines of code, we will get the y_predict variable.
We can check it under the variable explorer option in the Spyder IDE.
We can now compare the values of y_predict with our original dataset.
Consider the below image:

From the above image, we can now relate that the CustomerID 1 belongs
to a cluster

3(as index starts from 0, hence 2 will be considered as 3), and 2 belongs
to cluster 4, and so on.

Step-4: Visualizing the Clusters

The last step is to visualize the clusters. As we have 5 clusters for our
model, so we will visualize each cluster one by one.

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-means clustering algorithm is, how the algorithm works, and how to implement k-means clustering in Python.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process; for example, if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are near a particular k-center form a cluster.

Hence, each cluster has data points with some commonalities and is away from the other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They may be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of
these two variables is given below:
o Let's take the number of clusters k, i.e., K=2, to identify the dataset and to put the points into different clusters. It means here we will try to group the dataset into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest
K-point or centroid. We will compute it by applying some
mathematics that we have studied to calculate the distance between
two points. So, we will draw a median between both the centroids.
Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points on the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster and place the new centroids there, as shown below:
o Next, we will reassign each data point to the new centroid. For this, we will repeat the same process of finding a median line. The median line will be like the below image:

From the above image, we can see that one yellow point is on the left side of the line and two blue points are on the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:

o As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
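The loop described in the above steps can also be written down directly in code. Below is a minimal from-scratch sketch in NumPy, assuming a small illustrative set of 2-D points and K=2; the function name and the sample data are only illustrations, and the actual implementation later in this unit uses the sklearn library.

import numpy as np

def kmeans_sketch(points, k=2, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the data as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: place the new centroid of each cluster at the mean of its points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop when the centroids no longer move (no reassignment occurs)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data: two obvious groups of 2-D points
pts = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])
labels, centroids = kmeans_sketch(pts, k=2)
print(labels)     # cluster index assigned to each point
print(centroids)  # final cluster centers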

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, but choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters; here we discuss the most commonly used method, which is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal
number of clusters. This method uses the concept of WCSS
value. WCSS stands for Within Cluster Sum of Squares, which defines
the total variations within a cluster. The formula to calculate the value of
WCSS (for 3 clusters) is given below:

WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²

In the above formula of WCSS,

∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point in Cluster1 and its centroid, and the same holds for the other two terms.

To measure the distance between a data point and a centroid, we can use any method such as Euclidean distance or Manhattan distance.
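For instance, the two distance measures mentioned above can be computed as follows (the point and centroid values are illustrative):

import numpy as np

p = np.array([2.0, 3.0])   # a data point
c = np.array([5.0, 7.0])   # a centroid

euclidean = np.sqrt(np.sum((p - c) ** 2))   # sqrt((2-5)^2 + (3-7)^2) = 5.0
manhattan = np.sum(np.abs(p - c))           # |2-5| + |3-7| = 7.0
print(euclidean, manhattan)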

To find the optimal number of clusters, the elbow method follows the below steps:

o It executes K-means clustering on a given dataset for different K values (usually ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, the method is known as the elbow method. The graph for the elbow method looks like the below image:

Note: We can choose the number of clusters equal to the given data
points. If we choose the number of clusters equal to the data points,
then the value of WCSS becomes zero, and that will be the endpoint
of the plot.
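As a quick check on the definition of WCSS, the value can also be computed by hand and compared with the inertia_ attribute that sklearn's KMeans exposes; the data below is illustrative:

import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]])   # illustrative 2-D data
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42).fit(data)

# WCSS = sum of squared distances from each point to the centroid of its own cluster
wcss = sum(np.sum((data[kmeans.labels_ == j] - center) ** 2)
           for j, center in enumerate(kmeans.cluster_centers_))
print(wcss, kmeans.inertia_)   # the two values should match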

Python Implementation of K-means Clustering Algorithm

In the above section, we have discussed the K-means algorithm; now let's see how it can be implemented using Python.

Before implementation, let's understand what type of problem we will solve here. We have the Mall_Customers dataset, which is the data of customers who visit the mall and spend there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value of how much a customer has spent in the mall: the higher the value, the more the customer has spent). From this dataset, we need to find some patterns; as it is an unsupervised method, we don't know exactly what to compute.

The steps to be followed for the implementation are given below:


o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters

Step-1: Data pre-processing Step

The first step will be the data pre-processing, as we did in our earlier
topics of Regression and Classification. But for the clustering problem, it
will be different from other models. Let's discuss it:

o Importing Libraries
As we did in previous topics, firstly, we will import the libraries
for our model, which is part of data pre-processing. The code is
given below:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

In the above code, numpy is imported for performing mathematical calculations, matplotlib is for plotting the graph, and pandas is for managing the dataset.

o Importing the Dataset:

Next, we will import the dataset that we need to use. So here, we are using the Mall_Customers_data.csv dataset. It can be imported using the below code:

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

By executing the above lines of code, we will get our dataset in the
Spyder IDE. The dataset looks like the below image:
From the above dataset, we need to find some patterns in it.
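If the dataset preview image is not available, the loaded data can also be inspected directly; the column names mentioned in the comments are based on the description above:

print(dataset.head())   # first five rows: Customer_Id, Gender, Age, Annual Income, Spending Score
dataset.info()          # column names, data types, and non-null counts
print(dataset.shape)    # (number of customers, number of columns)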

o Extracting Independent Variables

Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea what to predict. So we will just add a line of code for the matrix of features.

x = dataset.iloc[:, [3, 4]].values

As we can see, we are extracting only the 3rd and 4th columns, i.e., Annual Income and Spending Score. This is because we need a 2-D plot to visualize the model, and some features, such as Customer_Id, are not required.
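Since K-means is distance-based, features on very different scales can dominate the result. The two selected columns here are on comparable scales, so scaling is skipped in this tutorial, but an optional standardization step could look like the following sketch (assuming scikit-learn's StandardScaler):

from sklearn.preprocessing import StandardScaler

# Optional: standardize the selected features to zero mean and unit variance
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)   # use x_scaled in place of x below if scaling is applied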

Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for
our clustering problem. So, as discussed above, here we are going to use
the elbow method for this purpose.

As we know, the elbow method uses the WCSS concept to draw the plot
by plotting WCSS values on the Y-axis and the number of clusters on the
X-axis. So we are going to calculate the value for WCSS for different k
values ranging from 1 to 10. Below is the code for it:

#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  #Initializing the list for the values of WCSS

#Using for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()

As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable to initialize an empty list, which is used to hold the WCSS value computed for different values of k ranging from 1 to 10.

After that, we have initialized the for loop to iterate over the different values of k ranging from 1 to 10; since a for loop in Python excludes the upper bound, the range is taken as (1, 11) to include the 10th value.

The rest of the code is similar to what we did in earlier topics: we have fitted the model on the matrix of features and then plotted the graph between the number of clusters and WCSS.

Output: After executing the above code, we will get the below output:

From the above plot, we can see the elbow point is at 5. So the number
of clusters here will be 5.
Step-3: Training the K-means algorithm on the training dataset

As we have got the number of clusters, we can now train the model on the dataset.

To train the model, we will use the same two lines of code as we have
used in the above section, but here instead of using i, we will use 5, as we
know there are 5 clusters that need to be formed. The code is given below:

#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

The first line is the same as above, creating the object of the KMeans class.

In the second line of code, we have created the variable y_predict by calling fit_predict(), which trains the model and returns the predicted cluster of each customer.

By executing the above lines of code, we will get the y_predict variable. We can check it under the variable explorer option in the Spyder IDE. We can now compare the values of y_predict with our original dataset. Consider the below image:
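If the comparison image is not available, the predicted labels can also be attached to the original dataframe for inspection; the column name Cluster used below is only an assumption for illustration:

# Attach the predicted cluster label of each customer to the original dataframe
dataset['Cluster'] = y_predict
print(dataset.head(10))                     # compare the cluster labels with the original rows
print(dataset['Cluster'].value_counts())    # number of customers in each cluster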

From the above comparison, we can now relate that CustomerID 1 belongs to cluster 3 (as the index starts from 0, the label 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.

Step-4: Visualizing the Clusters

The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one. To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.

#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')    #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')   #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')     #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')    #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

In the above lines of code, we have written one scatter call for each of the 5 clusters. The first argument of mtp.scatter, i.e., x[y_predict == 0, 0], selects the x-values (Annual Income) of the points whose predicted label is 0, and the second argument selects their y-values (Spending Score); the labels in y_predict range from 0 to 4.

Output:
The output image clearly shows the five different clusters in different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below (a quick numerical check of these interpretations is sketched after the list):

o Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
o Cluster2 shows the customers with a high income but low spending, so we can categorize them as careful.
o Cluster3 shows low income and also low spending, so these customers can be categorized as sensible.
o Cluster4 shows the customers with low income but very high spending, so they can be categorized as careless.
o Cluster5 shows the customers with high income and high spending, so they can be categorized as target, and these customers can be the most profitable customers for the mall owner.
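The above interpretations can be verified numerically by averaging each feature per cluster; the column names used below are assumptions based on the dataset description:

# Average annual income and spending score of each predicted cluster
summary = dataset.groupby(y_predict)[['Annual Income (k$)', 'Spending Score (1-100)']].mean()
print(summary)   # high/low income versus high/low spending for each cluster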

Apriori Algorithm

The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. This algorithm uses a breadth-first search and a Hash Tree to calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets from a large dataset.

This algorithm was given by R. Agrawal and R. Srikant in the year 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.

What is Frequent Itemset?

Frequent itemsets are those itemsets whose support is greater than the threshold value, or user-specified minimum support. It also means that if {A, B} is a frequent itemset, then A and B individually must also be frequent itemsets.

Suppose there are two transactions: A = {1,2,3,4,5} and B = {2,3,7}; in these two transactions, 2 and 3 are the frequent items.
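Support counts like these can be computed directly from the transactions; a small sketch for the two transactions above (the variable names are illustrative):

from collections import Counter

transactions = [{1, 2, 3, 4, 5}, {2, 3, 7}]

# Count in how many transactions each item appears (its support count)
support = Counter(item for t in transactions for item in t)
frequent = [item for item, count in support.items() if count >= 2]   # minimum support count = 2
print(support)
print(frequent)   # -> [2, 3]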

Steps for Apriori Algorithm

Below are the steps for the apriori algorithm:

Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support and confidence.

Step-2: Take all the itemsets in the transactional database with a support value higher than the minimum (selected) support value.

Step-3: Find all the rules of these subsets that have a confidence value higher than the threshold or minimum confidence.

Step-4: Sort the rules in decreasing order of lift.
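In practice, these steps are usually carried out with a library rather than by hand. The following is a minimal sketch using the mlxtend package, assuming it is installed; the transactions and the support/confidence thresholds are illustrative:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative transactional data
transactions = [['A', 'B', 'C'], ['A', 'B'], ['A', 'C'], ['B', 'C'], ['A', 'B', 'C']]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step-1/2: frequent itemsets above the minimum support
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)

# Step-3/4: rules above the minimum confidence, sorted in decreasing order of lift
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)
print(rules.sort_values('lift', ascending=False)[['antecedents', 'consequents', 'support', 'confidence', 'lift']])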

Apriori Algorithm Working

We will understand the apriori algorithm using an example and mathematical calculation:

Example: Suppose we have the following dataset that has various transactions, and from this dataset, we need to find the frequent itemsets and generate the association rules using the Apriori algorithm:
Solution:

Step-1: Calculating C1 and L1:

o In the first step, we will create a table that contains the support count (the frequency of each itemset individually in the dataset) of each itemset in the given dataset. This table is called the Candidate set or C1.

o Now, we will take out all the itemsets that have a support count greater than the Minimum Support (2). This will give us the table for the frequent itemset L1.

Since all the itemsets except E have a support count greater than or equal to the minimum support, the itemset {E} will be removed.
Step-2: Candidate Generation C2, and L2:

o In this step, we will generate C2 with the help of L1. In C2, we will create pairs of the itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find the support count from the main transaction table of the dataset, i.e., how many times these pairs have occurred together in the given dataset. So, we will get the below table for C2:
o Again, we need to compare the C2 support count with the minimum support count; after comparing, the itemsets with a lower support count will be eliminated from table C2. This gives us the below table for L2.

Step-3: Candidate generation C3, and L3:

o For C3, we will repeat the same two processes, but now we will
form the C3 table with subsets of three itemsets together, and will
calculate the support count from the dataset. It will give the below
table:

o Now we will create the L3 table. As we can see from the above C3
table, there is only one combination of itemset that has support
count equal to the minimum support count. So, the L3 will have
only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:

To generate the association rules, first, we will create a new table with the possible rules from the obtained combination {A, B, C}. For each rule X → Y, we will calculate the Confidence using the formula sup(X ^ Y)/sup(X). After calculating the confidence value for all the rules, we will exclude the rules that have less confidence than the minimum threshold (50%).

Consider the below table:

Rules        Support    Confidence
A ^B → C     2          Sup{(A ^B) ^C}/sup(A ^B) = 2/4 = 0.5 = 50%
B ^C → A     2          Sup{(B ^C) ^A}/sup(B ^C) = 2/4 = 0.5 = 50%
A ^C → B     2          Sup{(A ^C) ^B}/sup(A ^C) = 2/4 = 0.5 = 50%
C → A ^B     2          Sup{C ^(A ^B)}/sup(C) = 2/5 = 0.4 = 40%
A → B ^C     2          Sup{A ^(B ^C)}/sup(A) = 2/6 = 0.33 = 33.33%
B → A ^C     2          Sup{B ^(A ^C)}/sup(B) = 2/7 = 0.28 = 28%

As the given threshold or minimum confidence is 50%, the first three rules, A ^B → C, B ^C → A, and A ^C → B, can be considered as the strong association rules for the given problem.
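The confidence values in the above table follow directly from the support counts of the worked example; a short sketch of the same calculation:

# Support counts taken from the worked example above
sup = {'A': 6, 'B': 7, 'C': 5, 'AB': 4, 'BC': 4, 'AC': 4, 'ABC': 2}

# Confidence(X -> Y) = sup(X and Y together) / sup(X)
rules = {
    'A^B -> C': sup['ABC'] / sup['AB'],   # 2/4 = 0.50
    'B^C -> A': sup['ABC'] / sup['BC'],   # 2/4 = 0.50
    'A^C -> B': sup['ABC'] / sup['AC'],   # 2/4 = 0.50
    'C -> A^B': sup['ABC'] / sup['C'],    # 2/5 = 0.40
    'A -> B^C': sup['ABC'] / sup['A'],    # 2/6 = 0.33
    'B -> A^C': sup['ABC'] / sup['B'],    # 2/7 = 0.29
}
min_confidence = 0.5
strong_rules = [rule for rule, conf in rules.items() if conf >= min_confidence]
print(strong_rules)   # -> ['A^B -> C', 'B^C -> A', 'A^C -> B']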
Advantages of Apriori Algorithm

o This is an easy-to-understand algorithm.
o The join and prune steps of the algorithm can be easily implemented on large datasets.

Disadvantages of Apriori Algorithm

o The apriori algorithm works slowly compared to other algorithms.
o The overall performance can be reduced as it scans the database multiple times.
o The time complexity and space complexity of the apriori algorithm is O(2^D), which is very high. Here D represents the horizontal width (the number of distinct items) present in the database.

********************************************
