

1.1 Introduction to Data Mining


We live in a world where vast amounts of data are collected daily. Analyzing such data is an
important need.
1.1.1 Moving toward the Information Age
“We are living in the information age” is a popular saying; however, we are actually living in
the data age. Terabytes or petabytes of data pour into our computer networks, the World Wide
Web (WWW), and various data storage devices every day from business, society, science and
engineering, medicine, and almost every other aspect of daily life.
This explosive growth of available data is a result of the computerization of our society
and the rapid development of powerful data collection and storage tools. The need to extract
useful knowledge from these data has led to the birth of data mining. Data mining has made,
and will continue to make, great strides in our journey from the data age toward the coming
information age.
1.1.2 Data Mining as the Evolution of Information Technology
Data mining can be viewed as a result of the natural evolution of information technology. The
database and data management industry evolved through the development of several critical
functionalities:
• Data collection and database creation
• Data management (including data storage and retrieval and database transaction
processing)
• Advanced data analysis (involving data warehousing and data mining).

The early development of data collection and database creation mechanisms served as a
prerequisite for the later development of effective mechanisms for data storage and retrieval,
as well as query and transaction processing. Nowadays numerous database systems offer query
and transaction processing as common practice. Advanced data analysis has naturally become
the next step.
Since the 1960s, database and information technology has evolved systematically from
primitive file processing systems to sophisticated and powerful database systems.
The research and development in database systems since the 1970s progressed from early
hierarchical and network database systems to relational database systems, data modeling tools,
and indexing and accessing methods.
After the establishment of database management systems, database technology moved toward
the development of advanced database systems, data warehousing, and data mining for
advanced data analysis and web-based databases.
Advanced data analysis sprang up from the late 1980s onward.
Huge volumes of data have been accumulated beyond databases and data warehouses. During
the 1990s, the World Wide Web and web-based databases began to appear.
Figure 1: The evolution of database system technology
In summary, the abundance of data, coupled with the need for powerful data analysis tools, has
been described as a data rich but information poor situation.
1.2 What is Data Mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data. Data
mining should have been more appropriately named “knowledge mining from data”. Many
other terms carry a similar or slightly different meaning to data mining, such as knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging. Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply
an essential step in the process of knowledge discovery.
Figure 2: Data Mining as a step in the process of knowledge discovery
Knowledge discovery as a process is depicted in Figure 2 and consists of an iterative sequence
of the following steps:
1. Data cleaning - to remove noise and inconsistent data.
2. Data integration - where multiple data sources may be combined.
3. Data selection - where data relevant to the analysis task are retrieved from the database.
4. Data transformation - where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for instance.
5. Data mining - an essential process where intelligent methods are applied in order to extract
data patterns.
6. Pattern evaluation - to identify the truly interesting patterns representing knowledge based
on some interestingness measures.
7. Knowledge presentation - where visualization and knowledge representation techniques are
used to present the mined knowledge to the user.
Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns
are presented to the user and may be stored as new knowledge in the knowledge base.
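As a minimal sketch of preprocessing steps 1 through 4 (pure Python; the records, field names, and data sources are all hypothetical):

```python
# Minimal sketch of KDD preprocessing steps 1-4 on toy records.
# All field names and values here are hypothetical.

# Two data sources to integrate (step 2), with some noise (None values).
sales = [
    {"item": "computer", "amount": 1200},
    {"item": "software", "amount": None},   # noisy/incomplete record
    {"item": "computer", "amount": 1100},
]
returns = [{"item": "computer", "amount": -1100}]

# Step 1: data cleaning - drop records with missing values.
cleaned = [r for r in sales if r["amount"] is not None]

# Step 2: data integration - combine the two sources.
integrated = cleaned + returns

# Step 3: data selection - keep only records relevant to the task.
selected = [r for r in integrated if r["item"] == "computer"]

# Step 4: data transformation - aggregate into a mining-ready summary.
total_by_item = {}
for r in selected:
    total_by_item[r["item"]] = total_by_item.get(r["item"], 0) + r["amount"]

print(total_by_item)  # {'computer': 1200}
```

Step 5 (data mining proper) would then run on `total_by_item` rather than on the raw, noisy sources.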
We adopt a broad view of data mining functionality: Data mining is the process of discovering
interesting patterns and knowledge from large amounts of data. The data sources can include
databases, data warehouses, the Web, other information repositories, or data that are streamed
into the system dynamically.

Figure 3: Architecture of a typical data mining system.


Based on this view, the architecture of a typical data mining system may have the following
major components:
1. Database, data warehouse, World Wide Web, or other information repository:
This is one or a set of databases, data warehouses, spreadsheets, or other kinds of
information repositories. Data cleaning and data integration techniques may be
performed on the data.
2. Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining request.
3. Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).
4. Data mining engine: This is essential to the data mining system and ideally consists of
a set of functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
5. Pattern evaluation module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search toward
interesting patterns. It may use interestingness thresholds to filter out discovered
patterns. Alternatively, the pattern evaluation module may be integrated with the
mining module, depending on the implementation of the data mining method used. For
efficient data mining, it is highly recommended to push the evaluation of pattern
interestingness as deep as possible into the mining process so as to confine the search
to only the interesting patterns.
6. User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory data mining
based on the intermediate data mining results. In addition, this component allows the
user to browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
Data mining involves an integration of techniques from multiple disciplines such as database
and data warehouse technology, statistics, machine learning, high-performance computing,
pattern recognition, neural networks, data visualization, information retrieval, image and signal
processing, and spatial or temporal data analysis.

1.3 Data Mining Functionalities - What Kinds of Patterns Can Be Mined?


Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories: descriptive and
predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can discover, are described below.
1. Concept/Class Description: Characterization and Discrimination
2. Mining Frequent Patterns, Associations, and Correlations
3. Classification and Regression for Predictive Analysis
4. Cluster Analysis
5. Outlier Analysis
6. Evolution Analysis
Concept/Class Description: Characterization and Discrimination
Data entries can be associated with classes or concepts. For example, in the AllElectronics
store, classes of items for sale include computers and printers, and concepts of customers
include bigSpenders and budgetSpenders. It can be useful to describe individual classes and
concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a
concept are called class/concept descriptions.
These descriptions can be derived using:
(1) Data characterization – Data characterization is a summarization of the general
characteristics or features of a target class of data. The data corresponding to the user-specified
class are typically collected by a database query. For example, to study the characteristics of
software products whose sales increased by 10% in the last year, the data related to such
products can be collected by executing an SQL query.
(2) Data discrimination – Data discrimination is a comparison of the general features of target
class data objects with the general features of objects from one or a set of contrasting classes.
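Data characterization as in (1) can be sketched as follows; the target class (software products whose sales grew by at least 10%) mirrors the example above, but all product records, attributes, and values are made up:

```python
# Sketch of data characterization: summarize the general features of a
# target class (software products whose sales grew >= 10%).
# Records, field names, and values are hypothetical.
products = [
    {"name": "EditorPro", "type": "software", "growth": 0.15, "price": 40},
    {"name": "DrawFast",  "type": "software", "growth": 0.12, "price": 60},
    {"name": "OldTool",   "type": "software", "growth": 0.02, "price": 20},
    {"name": "LaserJet",  "type": "printer",  "growth": 0.15, "price": 300},
]

# Collect the user-specified target class (akin to an SQL query).
target = [p for p in products if p["type"] == "software" and p["growth"] >= 0.10]

# Characterize: summarize a general feature of the class.
avg_price = sum(p["price"] for p in target) / len(target)
print(len(target), avg_price)  # 2 50.0
```

Discrimination (2) would compute the same summaries for a contrasting class (say, low-growth software) and compare them.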
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data.
There are many kinds of frequent patterns, including item-sets, subsequences, and
substructures.
A frequent itemset typically refers to a set of items that frequently appear together in a
transactional data set, such as milk and bread.
A frequently occurring subsequence, such as the pattern that customers tend to purchase first a
PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as graphs, trees, or lattices, which
may be combined with item sets or subsequences.
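A minimal sketch of frequent itemset discovery over a toy transactional data set (the items and support threshold are illustrative; a real system would use an algorithm such as Apriori or FP-growth rather than this brute-force pair counting):

```python
from itertools import combinations
from collections import Counter

# Sketch: count itemsets of size 2 in a toy transactional data set,
# keeping those that meet a minimum support threshold.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
min_support = 0.5  # itemset must appear in at least 50% of transactions

counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counts[pair] += 1

frequent = {pair: n / len(transactions)
            for pair, n in counts.items()
            if n / len(transactions) >= min_support}
print(frequent)
```

Here {milk, bread} is frequent with support 0.75, and {bread, butter} with support 0.5; all other pairs fall below the threshold.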
Association analysis. Suppose, as a marketing manager of AllElectronics, you would like to
determine which items are frequently purchased together within the same transactions. An
example of such a rule, mined from the AllElectronics transactional database, is
buys(X,“computer”)⇒buys(X,“software”) [support=1%,confidence=50%]
where X is a variable representing a customer. A confidence, or certainty, of 50% means that
if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1%
support means that 1% of all of the transactions under analysis showed that computer and
software were purchased together.
This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association
rules that contain a single predicate are referred to as single-dimensional association rules.
Dropping the predicate notation, the above rule can be written simply as
“computer⇒software [1%, 50%]”.
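The support and confidence of such a rule can be computed directly from transaction counts; a minimal sketch over a toy transaction set:

```python
# Sketch: compute support and confidence for the rule
# computer => software over a toy transaction set.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both
confidence = both / antecedent  # P(software | computer)
print(f"support={support:.0%}, confidence={confidence:.0%}")
# support=50%, confidence=67%
```

Support is measured over all transactions, while confidence conditions on those transactions that contain the rule's antecedent.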
Suppose, instead, that we are given the AllElectronics relational database relating to purchases.
A data mining system may find association rules like
age(X, “20...29”)∧income(X, “20K...29K”)⇒buys(X, “CD player”) [support=2%,
confidence=60%]
The rule indicates that, of the AllElectronics customers under study, 2% are 20 to 29 years of
age with an income of $20,000 to $29,000 and have purchased a CD player at AllElectronics.
There is a 60% probability that a customer in this age and income group will purchase a CD
player. This is an association involving more than one attribute or predicate (i.e., age, income,
and buys), which is referred to as a multidimensional association rule.
Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set of
training data.
Regression analysis is a statistical methodology that is most often used for numeric prediction,
although other methods exist as well.
Figure 4: (a) IF-THEN RULES (b) a Decision Tree (c) a Neural network
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural
networks.
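A derived model in IF-THEN rule form can be expressed as ordinary conditional logic; the attributes, thresholds, and class labels below are hypothetical:

```python
# Sketch: a derived classification model expressed as IF-THEN rules.
# Attributes, thresholds, and class labels are all hypothetical.
def classify(age, income):
    if age <= 30 and income == "high":
        return "bigSpender"
    if age <= 30:
        return "budgetSpender"
    if income == "high":
        return "bigSpender"
    return "budgetSpender"

print(classify(25, "high"))  # bigSpender
print(classify(25, "low"))   # budgetSpender
print(classify(45, "high"))  # bigSpender
```

A decision tree encodes the same kind of model as a tree of attribute tests, and a neural network encodes it as learned numeric weights; the rule form is simply the most directly readable.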
Cluster Analysis
Unlike classification and prediction, which analyze class-labeled data objects, clustering
analyzes data objects without consulting a known class label.

Figure 5: Represents 3 Clusters where each cluster center is marked with a “+”
The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. That is, clusters of objects are formed so
that objects within a cluster have high similarity in comparison to one another, but are very
dissimilar to objects in other clusters.
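This grouping principle can be sketched with a minimal k-means loop (toy 2-D points; the initialization and k=2 are illustrative, and a real implementation would also handle empty clusters and check for convergence):

```python
import math

# Minimal k-means sketch: repeatedly assign each object to its nearest
# center, then recompute centers, maximizing intra-cluster similarity.
# The points and k are toy values.
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
          (8.0, 8.0), (8.5, 8.2), (7.8, 9.0)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

centers = [points[0], points[3]]  # naive initialization
for _ in range(10):
    # Assignment step: each point joins its nearest center's cluster.
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        clusters[i].append(p)
    # Update step: each center moves to its cluster's mean.
    centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               for c in clusters]

print(clusters)
```

With these well-separated points, the first three land in one cluster and the last three in the other, matching the "+"-marked cluster centers of Figure 5 in spirit.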
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise or
exceptions. However, in some applications such as fraud detection, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier mining.
Rather than using statistical or distance measures, deviation-based methods identify outliers by
examining differences in the main characteristics of objects in a group.
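A simple statistical outlier test can be sketched as flagging values far from the mean; the transaction amounts and the two-standard-deviation threshold are illustrative:

```python
import statistics

# Sketch of statistical outlier detection: flag values more than two
# standard deviations from the mean. Amounts and threshold are toy values.
amounts = [52, 48, 50, 51, 49, 47, 53, 500]  # 500 looks like fraud

mean = statistics.mean(amounts)
sd = statistics.stdev(amounts)
outliers = [x for x in amounts if abs(x - mean) > 2 * sd]
print(outliers)  # [500]
```

Note how the single extreme value both inflates the mean and widens the standard deviation; distance-based and deviation-based methods exist precisely because such simple statistical tests can be fragile.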
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Although this may include characterization, discrimination, association and
correlation analysis, classification, prediction, or clustering of time related data, distinct
features of such an analysis include time-series data analysis, sequence or periodicity pattern
matching, and similarity-based data analysis.
A data mining study of stock exchange data may identify stock evolution regularities for overall
stocks and for the stocks of particular companies. Such regularities may help predict future
trends in stock market prices, contributing to your decision-making regarding stock
investments.
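One basic time-series regularity, a smoothed trend, can be sketched with a moving average (the prices and window size are made up):

```python
# Sketch of time-series trend analysis: a simple moving average that
# smooths daily closing prices to expose a trend. Prices are made up.
prices = [10, 11, 10, 12, 13, 14, 13, 15, 16, 17]
window = 3

moving_avg = [sum(prices[i:i + window]) / window
              for i in range(len(prices) - window + 1)]
print(moving_avg)
```

The smoothed series rises steadily even though the raw prices dip on individual days, which is the kind of regularity an evolution analysis would report.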

Classification of Data Mining Systems


Data mining is an interdisciplinary field, the confluence of a set of disciplines, including
database systems, statistics, machine learning, visualization and information science.

Figure 6: Data Mining as a confluence of multiple disciplines


Depending on the kinds of data to be mined or on the given data mining application, the data
mining system may also integrate techniques from spatial data analysis, information retrieval,
pattern recognition, image analysis, signal processing, computer graphics, Web technology,
economics, business, bioinformatics, or psychology.
Data mining systems can be categorized according to various criteria, as follows:
1. Classification according to the kinds of databases mined
2. Classification according to the kinds of knowledge mined
3. Classification according to the kinds of techniques utilized and
4. Classification according to the applications adapted.
Classification according to the kinds of databases mined: A data mining system can be
classified according to the kinds of databases mined. Database systems can be classified
according to different criteria (such as data models, other types of data or applications
involved), each of which may require its own data mining technique. Data mining systems can
therefore be classified accordingly: we may have a relational, transactional, object-relational,
or data warehouse mining system. If classifying according to the special types of data handled,
we may have a spatial, time-series, text, stream-data, or multimedia data mining system, or a
World Wide Web mining system.
Classification according to the kinds of knowledge mined: Data mining systems can be
categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis.
Classification according to the kinds of techniques utilized: Data mining systems can be
categorized according to the underlying data mining techniques employed. These techniques
can be described according to the degree of user interaction involved (e.g., autonomous
systems, interactive exploratory systems, query-driven systems) or the methods of data analysis
employed (e.g., database-oriented or data warehouse– oriented techniques, machine learning,
statistics, visualization, pattern recognition, neural networks, and so on).
Classification according to the applications adapted: Data mining systems can also be
categorized according to the applications they adapt. For example, data mining systems may
be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so
on. Different applications often require the integration of application-specific methods.
Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.

Data Mining Task Primitives


A data mining query is defined in terms of data mining task primitives. These primitives allow
the user to interactively communicate with the data mining system during discovery in order
to direct the mining process. The data mining primitives specify the following:
The set of task-relevant data to be mined: This specifies the portions of the database or the set
of data in which the user is interested. This includes the database attributes or data warehouse
dimensions of interest.
The kind of knowledge to be mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the
domain to be mined is useful for guiding the knowledge discovery process and for evaluating
the patterns found. Concept hierarchies are a popular form of background knowledge, which
allow data to be mined at multiple levels of abstraction.
The interestingness measures and thresholds for pattern evaluation: They may be used to guide
the mining process or, after discovery, to evaluate the discovered patterns.
The expected representation for visualizing the discovered patterns: This refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
Figure 7: Primitives for specifying a data mining task
A data mining query language can be designed to incorporate these primitives, allowing users
to flexibly interact with data mining systems.
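One way to picture such a query is a task specification built from the five primitives; the structure and all names below are hypothetical, not an actual query language:

```python
# Sketch: the five data mining task primitives expressed as a task
# specification a query language could be built around.
# All relation, attribute, and option names are hypothetical.
mining_task = {
    # 1. Task-relevant data
    "task_relevant_data": {
        "relation": "purchases",
        "attributes": ["age", "income", "buys"],
    },
    # 2. Kind of knowledge to be mined
    "knowledge_kind": "association",
    # 3. Background knowledge
    "background_knowledge": {
        "concept_hierarchy": {"age": ["young", "middle_aged", "senior"]},
    },
    # 4. Interestingness measures and thresholds
    "interestingness": {"min_support": 0.02, "min_confidence": 0.60},
    # 5. Expected representation of discovered patterns
    "presentation": ["rules", "tables"],
}

# A mining system would validate and dispatch on these primitives, e.g.:
print(mining_task["knowledge_kind"])  # association
```

Each of the five primitives from Figure 7 maps to one top-level entry of the specification.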
Integration of a Data Mining System with a Database or Data Warehouse
System
When a DM system works in an environment that requires it to communicate with other
information system components, such as DB and DW systems, possible integration schemes
include no coupling, loose coupling, semitight coupling, and tight coupling. We examine each
of these schemes, as follows:
No coupling: No coupling means that a DM system will not utilize any function of a DB or
DW system. It may fetch data from a particular source (such as a file system), process data
using some data mining algorithms, and then store the mining results in another file.
Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or
DW system, fetching data from a data repository managed by these systems, performing data
mining, and then storing the mining results either in a file or in a designated place in a database
or data warehouse.
Loose coupling is better than no coupling because it can fetch any portion of the data stored in
databases or data warehouses by using query processing, indexing, and other system facilities,
and thus gains the flexibility, efficiency, and other advantages those systems provide.
Semitight coupling: Semitight coupling means that, besides linking a DM system to a DB/DW
system, efficient implementations of a few essential data mining primitives (identified by the
analysis of frequently encountered data mining functions) can be provided in the DB/DW
system.
These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join,
and precomputation of some essential statistical measures, such as sum, count, max, min,
standard deviation, and so on.
Tight coupling: Tight coupling means that a DM system is smoothly integrated into the
DB/DW system. The data mining subsystem is treated as one functional component of an
information system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB or DW
system. This approach is highly desirable because it facilitates efficient implementations of
data mining functions, high system performance, and an integrated information processing
environment.
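Of these schemes, loose coupling is easy to illustrate with Python's built-in sqlite3 standing in for the DB system: the database's query facilities perform selection and aggregation, and the mining logic runs in the application (the table, columns, and "high value" rule are hypothetical):

```python
import sqlite3

# Sketch of loose coupling: the mining code uses the DB system's query
# facilities to fetch relevant data, then mines outside the DB.
# Table, column names, and the threshold are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (item TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("computer", 1200), ("software", 300), ("computer", 1100)])

# Let the DB system do selection and aggregation (its strength)...
rows = con.execute(
    "SELECT item, SUM(amount) FROM sales GROUP BY item ORDER BY item"
).fetchall()

# ...then apply the mining logic in the application.
high_value = [item for item, total in rows if total > 1000]
print(high_value)  # ['computer']
```

Under no coupling the program would read a flat file instead of issuing SQL; under semitight or tight coupling, primitives such as the aggregation above would be implemented inside and optimized by the DB/DW system itself.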

Major Issues in Data Mining


The major issues in data mining concern mining methodology, user interaction, performance,
and the diversity of data types. These issues are:
1. Mining methodology and user interaction issues
2. Performance issues
3. Issues relating to the diversity of database types
Mining methodology and user interaction issues: These reflect the kinds of knowledge
mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge,
ad hoc mining, and knowledge visualization.
• Mining different kinds of knowledge in databases: Because different users can be
interested in different kinds of knowledge, data mining should cover a wide spectrum
of data analysis and knowledge discovery tasks, including data characterization,
discrimination, association and correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis (which includes trend and similarity
analysis). These tasks may use the same database in different ways and require the
development of numerous data mining techniques.
• Interactive mining of knowledge at multiple levels of abstraction: Because it is difficult
to know exactly what can be discovered within a database, the data mining process
should be interactive. For databases containing a huge amount of data, appropriate
sampling techniques can first be applied to facilitate interactive data exploration.
Interactive mining allows users to focus the search for patterns, providing and refining
data mining requests based on returned results.
• Incorporation of background knowledge: Background knowledge, or information
regarding the domain under study, may be used to guide the discovery process and
allow discovered patterns to be expressed in concise terms and at different levels of
abstraction. Domain knowledge related to databases, such as integrity constraints and
deduction rules, can help focus and speed up a data mining process, or judge the
interestingness of discovered patterns.
• Data mining query languages and ad hoc data mining: Relational query languages (such
as SQL) allow users to pose ad hoc queries for data retrieval. In a similar vein, high-
level data mining query languages need to be developed to allow users to describe ad
hoc data mining tasks by facilitating the specification of the relevant sets of data for
analysis, the domain knowledge, the kinds of knowledge to be mined, and the
conditions and constraints to be enforced on the discovered patterns. Such a language
should be integrated with a database or data warehouse query language and optimized
for efficient and flexible data mining.
• Presentation and visualization of datamining results: Discovered knowledge should be
expressed in high-level languages, visual representations, or other expressive forms so
that the knowledge can be easily understood and directly usable by humans. This is
especially crucial if the data mining system is to be interactive. This requires the system
to adopt expressive knowledge representation techniques, such as trees, tables, rules,
graphs, charts, crosstabs, matrices, or curves.
• Handling noisy or incomplete data: The data stored in a database may reflect noise,
exceptional cases, or incomplete data objects. When mining data regularities, these
objects may confuse the process, causing the knowledge model constructed to overfit
the data. As a result, the accuracy of the discovered patterns can be poor. Data cleaning
methods and data analysis methods that can handle noise are required, as well as outlier
mining methods for the discovery and analysis of exceptional cases.
• Pattern evaluation—the interestingness problem: A data mining system can uncover
thousands of patterns. Many of the patterns discovered may be uninteresting to the
given user, either because they represent common knowledge or lack novelty. The use
of interestingness measures or user-specified constraints to guide the discovery process
and reduce the search space is another active area of research.
Performance issues:
These include efficiency, scalability, and parallelization of data mining algorithms.
• Efficiency and scalability of data mining algorithms: To effectively extract information
from a huge amount of data in databases, data mining algorithms must be efficient and
scalable. In other words, the running time of a data mining algorithm must be
predictable and acceptable in large databases. From a database perspective on
knowledge discovery, efficiency and scalability are key issues in the implementation of
data mining systems.
Issues relating to the diversity of database types:
• Handling of relational and complex types of data: Because relational databases and data
warehouses are widely used, the development of efficient and effective data mining
systems for such data is important. However, other databases may contain complex data
objects, hyper-text and multimedia data, spatial data, temporal data, or transaction data.
It is unrealistic to expect one system to mine all kinds of data. Specific data mining
systems should be constructed for mining specific kinds of data. Therefore, one may
expect to have different data mining systems for different kinds of data.
• Mining information from heterogeneous databases and global information systems:
Local- and wide-area computer networks (such as the Internet) connect many sources
of data, forming huge, distributed, and heterogeneous databases. The discovery of
knowledge from different sources of structured, semi-structured, or unstructured data
with diverse data semantics poses great challenges to data mining. Data mining may
help disclose high-level data regularities in multiple heterogeneous databases that are
unlikely to be discovered by simple query systems and may improve information
exchange and interoperability in heterogeneous databases. Web mining, which
uncovers interesting knowledge about Web contents, Web structures, Web usage, and
Web dynamics, has become a challenging and fast-evolving field in data mining.
The above issues are considered major requirements and challenges for the further evolution
of data mining technology.
