Data Mining Using Evolutionary Algorithm


A Term Paper on

DATA MINING USING

EVOLUTIONARY ALGORITHM

Submitted by:
Name: Apar Parajuli
Group: CE
Roll no.: 37
Level: 4th yr / I sem.

Submitted to:

Mr. Santosh Khanal

KATHMANDU UNIVERSITY
Department of Computer Science and Engineering
Dhulikhel, Kavre
Abstract
With huge amounts of data being generated in the world every day, at a rate far higher than
humans alone can analyze, data mining has become an extremely important means of extracting
as much useful information from this data as possible. Standard data mining techniques are
satisfactory to a certain extent, but they are constrained by certain limitations, and it is in
these cases that evolutionary approaches can be both more capable and more efficient. In this
paper I present the use of nature-inspired evolutionary techniques in data mining, augmented
with human interaction, to handle situations in which concept definitions are abstract and hard
to define, and hence not quantifiable in an absolute sense. Finally, I propose some ideas for
future implementations of these techniques.

CONTENTS

Abstract

1. Introduction

2. Objectives

3. Methodology

4. Discussion

5. Conclusion

6. References and Bibliography

1 Introduction

In recent years, the massive growth in the amount of stored data has increased the demand
for effective data mining methods to discover the hidden knowledge and patterns in these
data sets. Data mining means to “mine”, or extract, relevant information from any available
data of concern to the user. The underlying idea is not new: statistical techniques such as
regression analysis have long been used to discover knowledge from records of various types.
As computers spread into almost every field of human knowledge and occupation, their
advantages were widely advocated, but it soon became clear that the growing amounts of data
that could be generated, stored and analysed called for some way to sift through the data and
pick out what was important. In the earlier days, a person or a group of people would analyse
the data manually using statistical techniques, but data was generated far faster than it
could realistically be processed by hand.
This led to the emergence of the field of data mining, which essentially defines and
formalizes standard techniques for extracting information from large data warehouses. As data
mining evolved, it was observed that the data at hand was almost never clean enough to be fed
directly to data mining engines and needed several steps of pre-processing before it could be
put through “mining”. Typical problems were inconsistent data formats, noisy or incorrect
data, unnecessary data and redundant data. The pre-processing steps clean, integrate,
discretize and select the most relevant attributes before any mining is performed. A whole new
area called intelligent data analysis has emerged, which uses efficient techniques for mining
large data sets while keeping in mind both that the knowledge obtained must be useful and that
mining time is constrained, since the user requires results as soon as possible. Some of the
methods used to mine data include support vector machines, decision trees, nearest neighbour
analysis, Bayesian classification, and latent semantic analysis.

Given the problems associated with conventional data mining techniques, clever new ways to
overcome them were needed, and the application of AI techniques to the field resulted in a
very powerful hybrid. Evolutionary optimization techniques provided a useful and novel
solution to these issues, and once data mining was enhanced with evolutionary computation
(EC), many of the previously mentioned problems were no longer major issues. Some applications
of evolutionary algorithms in data mining that involve human interaction are presented in this
paper. When dealing with concepts that are abstract and hard to define, or with cases where
there is a large or variable number of parameters, we still do not have reliable methods for
finding solutions. In cases where we are unable to quantify what we want to measure, for
instance ‘beauty’ in images or ‘pleasantness’ in music, we almost always require a human to
drive the solutions through his or her choices. In these situations we combine evolutionary
computation with data mining, with a human interacting with the engine to steer the
computation towards the solutions he or she is looking for.

This paper begins by describing some concepts in data mining and general evolutionary
algorithms. The later sections discuss some of the areas where these are implemented, and
lastly give a few ideas of where these techniques may be implemented in the future.

2 Objectives
The main objectives of this report are:

1. To understand the basics of evolutionary algorithms.
2. To understand the basics of data mining.
3. To understand the advantages of using evolutionary algorithms for data mining.

3 Methodology
The most effective way I found to collect information on the chosen subject matter was
through the web. Google turned out to be very helpful for understanding the concepts of data
mining, knowledge discovery and the various kinds of evolutionary algorithm.

3.1 Web Articles


The web was the primary source of information while preparing this paper. Wikipedia was
referenced for the definitions of data mining, the tasks in the process, and evolutionary
algorithms.

3.2 YouTube Videos

While writing the paper I also referred to many YouTube videos about applications of
evolutionary algorithms and classical methods for data mining. These videos helped a lot
in understanding the concepts of these topics.

4 Discussion

4.1 Overview of Data mining and Knowledge Discovery


Knowledge discovery and data mining, as defined by Fayyad et al. (1996), is “the process
of identifying valid, novel, useful, and understandable patterns in data”. Data mining has
emerged particularly in situations where analysing the data manually or by using simple
queries is either impossible or very complicated (Cantú-Paz & Kamath, 2001). It is a
multi-disciplinary field that incorporates knowledge from many disciplines, mainly machine
learning, artificial intelligence, statistics, signal and image processing, mathematical
optimization, and pattern recognition (ibid.).
Knowledge discovery and data mining consist of three main steps that convert a collection
of raw data into valuable knowledge: data pre-processing, knowledge extraction, and data
post-processing (Freitas, 2003). For the data mining process to be considered successful, the
discovered knowledge should be accurate, comprehensible, relevant and interesting for the end
user (Cantú-Paz & Kamath, 2001).

4.2 Data mining Pre-processing


The purpose of data mining pre-processing is to eliminate outliers, inconsistencies and
incompleteness in the data in order to obtain accurate results (Freitas, 2003). The
pre-processing steps are listed below:

Data cleaning: prepares the data for the subsequent steps by removing irrelevant data and
as much noise as possible. It is done to ensure the accuracy and validity of the data.

Data integration: removes redundant and inconsistent data from data that is
collected from different sources.

Discretization: converts continuous attribute values into discrete values, e.g. for the
attribute Age, with a minimum value of 21 and a maximum value of 60, the range can be
divided into a small number of intervals.

Attribute selection: selects, from all the data sets, the data relevant to the analysis.

Data mining: once all the previous steps are complete, data mining algorithms or
techniques can be applied to the data in order to extract the desired knowledge.
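
The pre-processing steps listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the records, the `income` attribute and the choice of four equal-width bins are hypothetical, and only the Age range of 21 to 60 comes from the text.

```python
# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 42000},  # incomplete record
    {"age": 47, "income": 58000},
    {"age": 47, "income": 58000},    # redundant (duplicate) record
    {"age": 130, "income": 51000},   # outlier: outside the valid Age range
    {"age": 60, "income": 75000},
]

AGE_MIN, AGE_MAX = 21, 60  # range taken from the Age example in the text


def clean(rows):
    """Data cleaning: drop incomplete rows and rows with out-of-range ages."""
    return [
        r for r in rows
        if r["age"] is not None and AGE_MIN <= r["age"] <= AGE_MAX
    ]


def integrate(rows):
    """Data integration: remove exact duplicates, keeping the first occurrence."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def discretize_age(age, n_bins=4):
    """Discretization: map an age in [AGE_MIN, AGE_MAX] to an equal-width bin index."""
    width = (AGE_MAX - AGE_MIN) / n_bins
    return min(int((age - AGE_MIN) / width), n_bins - 1)


def select_attributes(rows, keep):
    """Attribute selection: keep only the attributes needed for the analysis."""
    return [{k: r[k] for k in keep} for r in rows]


prepared = integrate(clean(records))
prepared = [dict(r, age_bin=discretize_age(r["age"])) for r in prepared]
prepared = select_attributes(prepared, ["age_bin", "income"])
print(prepared)
```

Only after these steps would a mining algorithm be run on `prepared`. Equal-width binning is just one possible discretization scheme.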

4.3 Data mining tasks
It is very important to define the data mining task that an algorithm should address before
designing it for application to a particular problem. There are several data mining tasks,
and each has a specific purpose in terms of the knowledge to be discovered
(Freitas, 2002).

4.3.1 Models and Patterns


In data mining, a model “is a high level description of the data set” (Hand,
2001). A model can be either descriptive or predictive. As the names imply, a
descriptive model is an unsupervised model that aims to describe the data, while a
predictive model is a supervised model that aims to predict values from the data.
Patterns are used to define the important and interesting features of the data. An unusual
combination of purchased items in a supermarket is an example of a pattern. Models are
used to describe the whole data set, while patterns are used to highlight particular
aspects of the data.

4.3.1.1 Predictive Models

According to Kambert (2001), data analysis can generally take the form of either
classification or prediction. Regression analysis is an example of a prediction task, namely
numeric prediction. The difference between classification and regression is that the
target value (response variable) is quantitative in regression modeling, while it is
qualitative or categorical in classification modeling.
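
The difference between the two kinds of target can be made concrete with a small sketch. The data below are hypothetical, and the two methods used (a least-squares line for regression and a 1-nearest-neighbour rule for classification) are chosen only for brevity; the paper does not prescribe them.

```python
# Hypothetical training data with a single numeric feature (hours studied).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]            # quantitative target: regression
grades = ["fail", "fail", "pass", "pass", "pass"]  # categorical target: classification


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    return mean_y - b * mean_x, b


def nearest_neighbour(xs, labels, query):
    """1-nearest-neighbour classifier: label of the closest training point."""
    best = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return labels[best]


a, b = fit_line(hours, scores)
predicted_score = a + b * 6.0                            # a number
predicted_grade = nearest_neighbour(hours, grades, 2.2)  # a category
print(predicted_score, predicted_grade)
```

The regression model returns a number for an unseen input, while the classifier returns one of a fixed set of labels.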

4.3.1.2 Descriptive Models

Clustering Task
Clustering simply means grouping: placing data instances into different groups, or
clusters, such that instances in the same cluster are similar to one another and easily
distinguished from instances that belong to other clusters (Zaki et al., 2010).
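
As an illustration of grouping similar instances, the sketch below uses k-means, which is only one common clustering method and is not named in the sources cited here. The points and the two starting centroids are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical two-dimensional instances forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

Here the first three points end up in one cluster and the last three in the other, matching the idea that members of a cluster are closer to each other than to members of other clusters.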

Association Analysis Task


Association analysis refers to the process of extracting association rules that describe
interesting relations hidden in a data set. As an illustration, consider the market basket
transactions example, where we have two items A and B and the following rule is extracted
from the data: {A} -> {B}. This rule suggests that there is a strong relation between item A
and item B in terms of how frequently they occur together (Tan et al., 2006). This means
that if item A is in the basket, there is a high probability that item B will be in the
basket as well.
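
The strength of a rule such as {A} -> {B} is usually measured by its support (how often A and B occur together) and its confidence (how often B occurs in baskets that contain A). The sketch below computes both; the five baskets are hypothetical.

```python
# Hypothetical market basket transactions.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


rule_support = support({"A", "B"}, baskets)          # 3 of 5 baskets
rule_confidence = confidence({"A"}, {"B"}, baskets)  # 3 of the 4 baskets with A
print(rule_support, rule_confidence)
```

A confidence of 3/4 is the "high probability that item B will be in the basket" described above, estimated from the data.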

4.4 Applications of data mining using Evolutionary Algorithm


In this section we examine some areas where data mining with interactive evolutionary
algorithm (IEA) techniques has been successfully applied. The approach detailed is very
general, in that it can be used to classify any text-based data and hence is not limited
to any specific discipline. It requires textual data in the form of reports, which
can be plain text files corresponding to the database from which the knowledge is
to be extracted.
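
For readers unfamiliar with the idea, an interactive evolutionary algorithm can be sketched as an ordinary evolutionary loop in which the fitness score comes from a person instead of a formula. The code below is a generic illustration, not the method of any paper cited here: the bit-string genome, the parameter values and the `rate` callback (which stands in for the human rating) are all assumptions.

```python
import random


def interactive_evolution(rate, genome_length=8, population_size=6,
                          generations=20, mutation_rate=0.2, seed=0):
    """Evolve bit strings; rate(genome) stands in for a score given by a human."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Evaluation: in an interactive setting a person would rate each candidate.
        ranked = sorted(population, key=rate, reverse=True)
        # Selection: the better-rated half survives unchanged (elitism).
        parents = ranked[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: splice the two parents at a random cut point.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally flip one randomly chosen bit.
            if rng.random() < mutation_rate:
                i = rng.randrange(genome_length)
                child[i] = 1 - child[i]
            children.append(child)
        population = parents + children
    return max(population, key=rate)


# Stand-in for the human: prefer genomes with more 1s.
best = interactive_evolution(rate=sum)
print(best)
```

In a real interactive system the call to `rate` would present each candidate (an image, a melody, a rule) to the user and record the user's judgement, which is why population sizes and generation counts are typically kept small.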

4.4.1 Extracting Knowledge from a text database


This technique, proposed by Sakurai (2001), details a means of extracting knowledge from
any database with the help of domain-dependent dictionaries. The particular application
in the paper deals with text mining of daily business reports generated by an institution
and the classification of those reports based on knowledge dictionaries. In their
experiment, two kinds of knowledge dictionary were used: the key concept dictionary and the
concept relation dictionary. The daily business reports are decomposed into words using
lexical analysis, and the words are checked for entries in the key concept dictionary. Each
report is then assigned particular concepts according to the words in the report that
represent those concepts in the key concept dictionary. Each report is also checked to see
whether its key concepts appear in the concept relation dictionary. Reports are then
classified according to the set of concept relations, and reports having the same text class
are put into the same group. This helps end users, as they need to read only those reports
that are grouped under topics matching their interests; it also gives them an indication of
the trends of topics in the reports.

The key concept dictionary contains concept classes, which group concepts having common
features; concepts and their related keywords; and expressions and phrases concerned with
the target problem. An example of the key concept dictionary can be seen in Figure 1. The
concept relation dictionary contains relations, each of which describes a condition and a
result; it is a mapping from key concepts to classes. Since creating a dictionary by hand is
time-consuming and prone to errors, the paper describes an automatic way of creating the
concept relation dictionary.
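
The two-dictionary classification described above can be sketched as follows. The dictionary entries are invented for illustration and are not taken from Sakurai's paper, and simple substring matching stands in for the lexical analysis step; only the class names best, missed and other come from the experiment described later in this section.

```python
import re

# Hypothetical key concept dictionary: concept class -> concept -> expressions.
KEY_CONCEPTS = {
    "Sales": {
        "good sales": ["sold out", "sell well"],
        "bad sales": ["remain unsold", "do not sell"],
    },
    "Display": {
        "stressed": ["stacked in prominence", "large display"],
    },
}

# Hypothetical concept relation dictionary: condition on concepts -> text class.
CONCEPT_RELATIONS = [
    ({"Sales": "good sales", "Display": "stressed"}, "best"),
    ({"Sales": "bad sales"}, "missed"),
]


def extract_concepts(report):
    """Map a report to {concept class: concept} using the key concept dictionary."""
    text = re.sub(r"\s+", " ", report.lower())
    found = {}
    for concept_class, concepts in KEY_CONCEPTS.items():
        for concept, expressions in concepts.items():
            if any(expression in text for expression in expressions):
                found[concept_class] = concept
    return found


def classify(report, default="other"):
    """Return the text class of the first relation whose condition is satisfied."""
    concepts = extract_concepts(report)
    for condition, text_class in CONCEPT_RELATIONS:
        if all(concepts.get(k) == v for k, v in condition.items()):
            return text_class
    return default


print(classify("The new snacks were stacked in prominence and sold out by noon."))
print(classify("Umbrellas remain unsold."))
print(classify("Store meeting at 9."))
```

Reports with the same returned class would then be placed in the same group for the end user.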

Figure 1

A relation in the concept relation dictionary is like a rule and can be acquired by inductive
learning if training examples are available. To build them, words are extracted from the
document by lexical analysis and checked to see whether they match an expression in the key
concept dictionary. We then make the following assumptions: concept classes are attributes,
concepts are values, and the text classes given by the reader are the result classes we
want. Together these form a training example. For all attributes that have no value, 0 is
assigned. An overview of this is depicted in Figure 2.

Figure 2
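
The construction of a training example described above can be sketched as follows. The three concept class names are hypothetical; the paper reports 13 concept classes but they are not listed here.

```python
# Hypothetical concept classes; each one becomes an attribute of the example.
CONCEPT_CLASSES = ["Sales", "Display", "Price"]


def make_training_example(concepts, text_class):
    """Build one training example for inductive learning.

    concepts: {concept class: concept} found in a report.
    text_class: the class assigned to the report by a human reader.
    """
    # Attributes without a value in this report are assigned 0.
    example = {c: concepts.get(c, 0) for c in CONCEPT_CLASSES}
    example["class"] = text_class
    return example


example = make_training_example(
    {"Sales": "good sales", "Display": "stressed"}, "best"
)
print(example)
# {'Sales': 'good sales', 'Display': 'stressed', 'Price': 0, 'class': 'best'}
```

A set of such examples is what the inductive learning algorithm receives as input.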
For the inductive learning to work we need a fuzzy algorithm, because reports written by
humans are not strict in their descriptions. The method described for the learning is
therefore the IDF algorithm, which is a fuzzy algorithm. It builds rules from the generated
training examples, and the generated rules are represented as trees.
The whole process can be seen in Figure 3, which shows the inputs and the processes that
lead from the input dictionaries and data to the final outputs.

Figure 3
The algorithm was tested on daily reports from a retail sales business, which were classified
into 3 classes describing a sales opportunity as best, missed or other. The key concept
dictionary was composed of 13 concept classes, each with its own subset of concepts. Reports
that contained contradictory descriptions were regarded as unnecessary, and no training
examples were generated from them. Using 10-fold cross-validation, the results showed that
the authors were able to generate the concept relation dictionary successfully and obtain
better results than IDF on the retail reports.
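
In 10-fold cross-validation the examples are split into 10 parts; each part is used once for testing while the other nine are used for training. A minimal sketch of the index split is given below. The number of examples (25) is hypothetical, since the number of reports is not stated here, and contiguous folds without shuffling are a simplification.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, nearly equal folds."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
for train, test in folds:
    # A learner would be trained on `train` and evaluated on `test` here.
    print(len(train), len(test))
```

The reported accuracy is then the average over the 10 test folds, so every example is used for testing exactly once.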
5 Conclusion
Data mining in today's world has many new uses and has become crucial in many fields. Huge
amounts of data have been accumulated, from which many useful patterns can be found. This
large search space is where evolutionary algorithms come into play, since they can search
these data for patterns very efficiently. Many conventional methods are not as efficient as
this approach. That is why using evolutionary algorithms in data mining can be really
helpful for using and learning from data efficiently.

6 References
• http://Wikipedia.org/evolutionaryalgorithm
• http://Wikipedia.org/datamining
• http://Wikipedia.org/datamining/dataminingtasks
• http://aitopics.org/machinelearning/geneticalgorithm
• Freitas, A. A. (2002). Data Mining and Knowledge Discovery with Evolutionary
Algorithms. Berlin: Springer-Verlag.
