Data Mining Group 6
UE : BIG DATA
DATA MINING
Sous la supervision de
Dr MOSKOLAI
Rédigé par
GROUPE 6
1. JOSHUA MotasayObassy
2. SOBGOU TSOBFACK Dieudonné
3. MBUYA JOSIAM Anyam Agwu
Année académique 2020 / 2021
Contents
1. INTRODUCTION
3. References
World Wide Web: The number of documents on the indexed Web is now on the order of billions.
User accesses to such documents create Web access logs at servers and customer behavior profiles at
commercial sites. User access logs can be mined to determine frequent patterns of access or unusual
patterns of possibly unwarranted behavior.
Financial interactions: Most common transactions of everyday life, such as using an automated
teller machine (ATM) card or a credit card, can create data in an automated way. Such transactions
can be mined for many useful insights such as fraud or other unusual activity.
User interactions: Many forms of user interactions create large volumes of data. For example, the
use of a telephone typically creates a record at the telecommunication company with details about
the duration and destination of the call. Many phone companies routinely analyze such data to
determine relevant patterns of behavior that can be used to make decisions about network capacity,
promotions, pricing, or customer targeting.
Sensor technologies and the Internet of Things: A recent trend is the development of low-cost
wearable sensors, smartphones, and other smart devices that can communicate with one another. By
one estimate, the number of such devices exceeded the number of people on the planet in 2008 [1].
Implicit dependencies: In this case, the dependencies between data items are not explicitly
specified but are known to “typically” exist in that domain. For example, consecutive temperature
values collected by a sensor are likely to be extremely similar to one another.
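As a small illustration of such an implicit dependency, the sketch below (the readings and the threshold are assumed values) flags a reading that breaks the expected similarity between consecutive sensor values:

```python
# Hypothetical temperature readings from a single sensor (degrees Celsius).
readings = [21.0, 21.1, 21.1, 21.3, 21.2, 21.4, 25.0, 21.5]

# Implicit dependency: consecutive values are expected to stay close, so a
# large jump (here 21.4 -> 25.0 and back) stands out as a possible anomaly.
diffs = [abs(b - a) for a, b in zip(readings, readings[1:])]
threshold = 1.0  # assumed tolerance for "similar" consecutive readings
anomalies = [i + 1 for i, d in enumerate(diffs) if d > threshold]
print(anomalies)  # [6, 7]: the jumps into and out of the 25.0 reading
```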
Explicit dependencies: This typically refers to graph or network data in which edges are used to
specify explicit relationships.
Text data: In its raw form, a text document corresponds to a string, i.e., a
sequence of characters. It can be used, for example, to analyze the frequency of words in a
document.
Time-series data: This contains values that are generated by continuous measurement over time.
For example, an environmental sensor will measure the temperature continuously, whereas an
electrocardiogram (ECG) will measure the parameters of a subject's heart rhythm.
Image data: In its most primitive form, image data are represented as pixels. At a slightly higher
level, color histograms can be used to represent the features in different segments of an image.
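A minimal sketch of the color-histogram idea, using a hypothetical toy image whose pixels have already been quantized into named color buckets:

```python
from collections import Counter

# A tiny hypothetical 2x3 image; each pixel is quantized to a color bucket.
pixels = [["red", "red", "blue"],
          ["blue", "green", "red"]]

# A color histogram summarizes an image (or image segment) by counting
# how many pixels fall into each color bucket.
histogram = Counter(color for row in pixels for color in row)
print(histogram)  # Counter({'red': 3, 'blue': 2, 'green': 1})
```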
Web logs: Web logs are typically represented as text strings in a prespecified format. It is relatively
easy to convert Web access logs into a multidimensional representation of (the relevant) categorical
and numeric attributes.
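For example, a single log line can be converted into categorical and numeric attributes. The sketch below assumes the Common Log Format; the line and field names are hypothetical:

```python
import re

# One hypothetical line in the (assumed) Common Log Format.
line = '192.168.0.1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Split the text string into categorical (IP, method, URL, status) and
# numeric (bytes) attributes for a multidimensional representation.
pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d+) (\d+)'
m = re.match(pattern, line)
ip, timestamp, method, url, status, size = m.groups()
record = {"ip": ip, "method": method, "url": url,
          "status": status, "bytes": int(size)}
print(record)
```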
Document data: Document data is often available in raw and unstructured form, and the data may
contain rich linguistic relations between different entities. Named-entity recognition is an important
subtask of information extraction. This approach locates and classifies atomic elements in text into
predefined categories such as the names of persons, organizations, locations, actions, and numeric quantities.
Some data collection technologies, such as sensors, are inherently inaccurate because of the
hardware limitations associated with collection and transmission. Sometimes sensors may drop
readings because of hardware failure or battery exhaustion.
Users may not want to specify their information for privacy reasons, or they may specify incorrect
values intentionally. For example, it has often been observed that users sometimes specify their
birthday incorrectly on automated registration sites such as those of social networks. In some cases,
users may choose to leave several fields empty.
Methods are needed to remove or correct missing and erroneous entries from the data. There are
several important aspects of data cleaning:
In the classification problem, a single attribute is treated specially, and the other features are used
to estimate its value. In this case, many classification methods can also be used for missing-value
estimation (such as matrix completion methods).
In a time-series data set, the average of the values at the time stamps just before and after the
missing attribute may be used for estimation.
For the case of spatial data, the estimation process is quite similar: the average of the values at
neighboring spatial locations may be used.
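The time-series case can be sketched as follows, using a hypothetical sensor series in which the missing reading is estimated from its two temporal neighbors:

```python
# Hypothetical sensor series with one missing reading (None).
series = [20.0, 20.4, None, 21.0, 21.2]

# Estimate each missing value as the average of the readings at the
# time stamps just before and after it.
imputed = list(series)
for i, v in enumerate(imputed):
    if v is None and 0 < i < len(imputed) - 1:
        imputed[i] = (imputed[i - 1] + imputed[i + 1]) / 2

print(imputed)  # the gap is filled with (20.4 + 21.0) / 2 = 20.7
```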
In cases where different features represent different scales of reference, they may therefore not be
comparable to one another. For example, an attribute such as age is drawn on a very different scale
than an attribute such as salary. The latter attribute is typically orders of magnitude larger than the
former. As a result, any aggregate function computed on the different features (e.g., Euclidean
distances) will be dominated by the attribute of larger magnitude. To address this problem, it is
common to use standardization.
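A minimal sketch of z-score standardization, assuming hypothetical age and salary values:

```python
# Hypothetical records: age (years) and salary (currency units) live on very
# different scales, so Euclidean distance would be dominated by salary.
ages = [25, 40, 31, 58]
salaries = [30000, 90000, 52000, 75000]

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the std. dev."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# After standardization, each feature has mean 0 and standard deviation 1,
# so no single feature dominates an aggregate distance computation.
print(standardize(ages))
print(standardize(salaries))
```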
Data sampling: In data sampling, the underlying data is sampled to create a much smaller
database. The type of sampling used may vary with the application.
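A minimal sketch of uniform random sampling without replacement, with a stand-in database and an assumed 10% sample size:

```python
import random

# Stand-in for a database of 1000 transactions (hypothetical).
database = list(range(1000))

random.seed(42)                        # fixed seed for reproducibility
sample = random.sample(database, 100)  # assumed 10% sample, no replacement

print(len(sample))                      # 100
print(len(set(sample)) == len(sample))  # True: no duplicates
```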
The association pattern mining problem has a wide variety of applications, such as:
Supermarket data: the determination of frequent itemsets provides useful insight for
targeted marketing and shelf placement of items.
Text mining: frequent pattern mining helps to identify co-occurring terms and keywords.
Generalization to dependency-oriented data types:
The following are some definitions and Properties used in Frequent pattern mining model.
o Definition (Support) The support of an itemset I is defined as the fraction of the transactions in
the database T = {T1, ..., Tn} that contain I as a subset.
o Definition (Frequent Itemset Mining) Given a set of transactions T = {T1, ..., Tn}, where each
transaction Ti is a subset of items from U, determine all itemsets I that occur as a subset of at least a
predefined fraction minsup of the transactions in T.
o Definition (Frequent Itemset Mining: Set-wise Definition) Given a set of sets T = {T1, ..., Tn},
where each element of the set Ti is drawn on the universe of elements U, determine all sets I that occur
as a subset of at least a predefined fraction minsup of the sets in T.
o Property (Support Monotonicity Property) The support of every subset J of I is at least equal
to the support of itemset I: sup(J) ≥ sup(I) ∀ J ⊆ I.
o Property (Downward Closure Property) Every subset of a frequent itemset is also frequent.
o Definition (Association Rules) Let X and Y be two sets of items. Then, the rule X ⇒ Y is said to
be an association rule at a minimum support of minsup and minimum confidence of minconf, if it
satisfies both the following criteria:
1. The support of the itemset X ∪ Y is at least minsup.
2. The confidence of the rule X ⇒ Y, defined as conf(X ⇒ Y) = sup(X ∪ Y)/sup(X), is at least minconf.
o Property (Confidence Monotonicity) Let X1, X2, and I be itemsets such that X1 ⊂ X2 ⊂ I. Then
the confidence of X2 ⇒ I - X2 is at least that of X1 ⇒ I - X1:
conf(X2 ⇒ I - X2) ≥ conf(X1 ⇒ I - X1)
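These definitions can be sketched in code. The nine-transaction database below is hypothetical, chosen to be consistent with the support counts used later in the worked example:

```python
# A small hypothetical transaction database over items I1..I5.
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

def support(itemset):
    """Fraction of transactions that contain itemset as a subset."""
    return sum(itemset <= t for t in T) / len(T)

def confidence(X, Y):
    """conf(X => Y) = sup(X union Y) / sup(X)."""
    return support(X | Y) / support(X)

# Support monotonicity: every subset of an itemset has at least its support.
assert support({"I1"}) >= support({"I1", "I2"}) >= support({"I1", "I2", "I5"})
print(confidence({"I2", "I5"}, {"I1"}))  # 1.0
```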
Let Fk denote the set of frequent k-itemsets, and Ck denote the set of candidate k-itemsets.
This process analyzes customer buying habits by finding associations between the different items
that customers place in their shopping baskets. The discovery of such associations can help retailers
develop marketing strategies by gaining insight into which items are frequently purchased together
by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and
what kind of bread) on the same trip to the supermarket? Such information can lead to increased
sales by helping retailers do selective marketing and plan their shelf space.
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
The join rule for generating C3 from L2 is: since k = 3, two itemsets in L2 must have k-2 (i.e., 1)
elements in common, and this common element must be the first element.
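The join-and-prune step can be sketched as follows; L2 below is the set of frequent 2-itemsets from the worked example, with items kept in sorted order:

```python
from itertools import combinations

# Frequent 2-itemsets from the example, items kept in sorted order.
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]

def apriori_gen(Lk_minus_1, k):
    """Join step: merge two (k-1)-itemsets sharing their first k-2 items,
    then prune any candidate with an infrequent (k-1)-subset."""
    prev = set(Lk_minus_1)
    candidates = []
    for a, b in combinations(Lk_minus_1, 2):
        if a[:k - 2] == b[:k - 2]:            # first k-2 items in common
            c = tuple(sorted(set(a) | set(b)))
            # Prune: every (k-1)-subset of c must itself be frequent.
            if all(s in prev for s in combinations(c, k - 1)):
                candidates.append(c)
    return candidates

print(apriori_gen(L2, 3))  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

Note that {I1, I3, I5}, for instance, is joined but then pruned because its subset {I3, I5} is not in L2, which matches the frequent 3-itemsets listed in the example.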
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let's take l = {I1, I2, I5}.
– All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
• The resulting association rules are shown below, each marked as selected or rejected by the minimum-confidence test.
– R1: I1 ^ I2 => I5
• R1 is Rejected
– R2: I1 ^ I5 => I2
• R2 is Selected.
– R3: I2 ^ I5=>I1
• R3 is Selected.
– R4: I1 => I2 ^ I5
• R4 is Rejected.
– R5: I2 => I1 ^ I5
• R5 is Rejected.
– R6: I5 => I1 ^ I2
• R6 is Selected.
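The rule-generation step above can be sketched as follows. The individual support counts sup(I1) = 6 and sup(I2) = 7 are assumed values chosen to be consistent with the selected/rejected statuses above; the remaining counts appear in the FP-growth section, and minconf is assumed to be 70%:

```python
from itertools import combinations

# Support counts: the pair and triple counts come from the worked example;
# sup(I1) = 6 and sup(I2) = 7 are assumed (not given in the text).
sup = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2}.items()}

l = frozenset({"I1", "I2", "I5"})
minconf = 0.7  # assumed minimum confidence

# For each nonempty proper subset X of l, test the rule X => l - X.
for r in range(1, len(l)):
    for X in map(frozenset, combinations(sorted(l), r)):
        conf = sup[l] / sup[X]
        status = "Selected" if conf >= minconf else "Rejected"
        print(sorted(X), "=>", sorted(l - X), f"conf={conf:.0%}", status)
```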
- Improving Apriori: general ideas
Reduce the number of passes over the transaction database
Shrink the number of candidates
- Completeness: any association rule mining algorithm should obtain the same set of frequent
itemsets.
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base which consists of the set of prefix paths in the FP-Tree
co-occurring with suffix pattern.
3. Then, construct its conditional FP-Tree and perform mining on such a tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the frequent
patterns generated from a conditional FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the required frequent itemset.
• Let's start from I5. I5 is involved in 2 branches, namely {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}.
• Therefore, considering I5 as the suffix, its 2 corresponding prefix paths would be {I2 I1: 1} and {I2 I1 I3:
1}, which form its conditional pattern base.
So, considering the table's last column of frequent pattern generation, we have generated the 3-item
frequent sets {I2, I1, I5: 2} and {I2, I1, I3: 2}. Similarly, the 2-item frequent sets are {I2, I5: 2}, {I1, I5: 2},
{I2, I4: 2}, {I2, I3: 4}, {I1, I3: 4} and {I2, I1: 4}. We can also see that we have arrived at distinct sets:
none of them is a duplicate.
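The conditional-pattern-base step for I5 above can be sketched as:

```python
from collections import Counter

# Prefix paths co-occurring with suffix I5, as in the example above
# (path items -> count along that branch).
conditional_pattern_base = [(("I2", "I1"), 1), (("I2", "I1", "I3"), 1)]

# Sum the count each item accumulates across the prefix paths; items that
# reach the minimum support form the conditional FP-Tree for I5.
min_support = 2
counts = Counter()
for path, count in conditional_pattern_base:
    for item in path:
        counts[item] += count

frequent_with_I5 = [item for item, c in counts.items() if c >= min_support]
print(frequent_with_I5)  # ['I2', 'I1'] -> yields {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
```

I3 accumulates a count of only 1 across the prefix paths, so it is dropped from I5's conditional FP-Tree, which is why {I3, I5} never appears as a frequent set.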
In comparison to the Apriori algorithm, we have generated only the frequent patterns rather
than all the combinations of different items. For example, here we haven't generated
{I3, I4} or {I3, I5}, since these are not frequently bought together, which is the main essence
behind the association rule criteria and the FP-growth algorithm.
G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of
the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering,
17(6), pp. 734–749, 2005.
C. Aggarwal and P. Yu. The IGrid index: reversing the dimensionality curse for similarity indexing in
high-dimensional space. KDD Conference, pp. 119–129, 2000.