Data Mining Group 6
UE : BIG DATA
DATA MINING
Sous la supervision de
Dr MOSKOLAI
Rédigé par
GROUPE 6
1. JOSHUA MotasayObassy
2. SOBGOU TSOBFACK Dieudonné
3. MBUYA JOSIAM Anyam Agwu
Année académique 2020 / 2021
Contents
1. INTRODUCTION
3. References
World Wide Web: The number of documents on the indexed Web is now on the order of billions.
User accesses to such documents create Web access logs at servers and customer behavior profiles at
commercial sites. User access logs can be mined to determine frequent patterns of access or unusual
patterns of possibly unwarranted behavior.
Financial interactions: Most common transactions of everyday life, such as using an automated
teller machine (ATM) card or a credit card, can create data in an automated way. Such transactions
can be mined for many useful insights such as fraud or other unusual activity.
User interactions: Many forms of user interactions create large volumes of data. For example, the
use of a telephone typically creates a record at the telecommunication company with details about
the duration and destination of the call. Many phone companies routinely analyze such data to
determine relevant patterns of behavior that can be used to make decisions about network capacity,
promotions, pricing, or customer targeting.
Sensor technologies and the Internet of Things: A recent trend is the development of low-cost
wearable sensors, smartphones, and other smart devices that can communicate with one another. By
one estimate, the number of such devices exceeded the number of people on the planet in 2008 [1].
Implicit dependencies: In this case, the dependencies between data items are not explicitly
specified but are known to “typically” exist in that domain. For example, consecutive temperature
values collected by a sensor are likely to be extremely similar to one another.
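As a small illustration of such an implicit dependency, the sketch below (the readings and the threshold are assumed values) flags a reading that breaks the expected similarity between consecutive sensor values:

```python
# Hypothetical temperature readings from a single sensor (degrees Celsius).
readings = [21.0, 21.1, 21.1, 21.3, 21.2, 21.4, 25.0, 21.5]

# Implicit dependency: consecutive values are expected to stay close, so a
# large jump (here 21.4 -> 25.0 and back) stands out as a possible anomaly.
diffs = [abs(b - a) for a, b in zip(readings, readings[1:])]
threshold = 1.0  # assumed tolerance for "similar" consecutive readings
anomalies = [i + 1 for i, d in enumerate(diffs) if d > threshold]
print(anomalies)  # [6, 7]: the jumps into and out of the 25.0 reading
```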
Explicit dependencies: This typically refers to graph or network data in which edges are used to
specify explicit relationships.
Text data: In its raw form, a text document corresponds to a string, i.e., a
sequence of characters. It can be used, for example, to analyze the frequency of words in a
document.
Time-series data: This contains values that are generated by continuous measurement over time.
For example, an environmental sensor will measure the temperature continuously, whereas an
electrocardiogram (ECG) will measure the parameters of a subject's heart rhythm.
Image data: In its most primitive form, image data are represented as pixels. At a slightly higher
level, color histograms can be used to represent the features in different segments of an image.
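A minimal sketch of the color-histogram idea, using a hypothetical toy image whose pixels have already been quantized into named color buckets:

```python
from collections import Counter

# A tiny hypothetical 2x3 image; each pixel is quantized to a color bucket.
pixels = [["red", "red", "blue"],
          ["blue", "green", "red"]]

# A color histogram summarizes an image (or image segment) by counting
# how many pixels fall into each color bucket.
histogram = Counter(color for row in pixels for color in row)
print(histogram)  # Counter({'red': 3, 'blue': 2, 'green': 1})
```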
Web logs: Web logs are typically represented as text strings in a prespecified format. It is relatively
easy to convert Web access logs into a multidimensional representation of (the relevant) categorical
and numeric attributes.
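For example, a single log line can be converted into categorical and numeric attributes. The sketch below assumes the Common Log Format; the line and field names are hypothetical:

```python
import re

# One hypothetical line in the (assumed) Common Log Format.
line = '192.168.0.1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Split the text string into categorical (IP, method, URL, status) and
# numeric (bytes) attributes for a multidimensional representation.
pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d+) (\d+)'
m = re.match(pattern, line)
ip, timestamp, method, url, status, size = m.groups()
record = {"ip": ip, "method": method, "url": url,
          "status": status, "bytes": int(size)}
print(record)
```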
Document data: Document data is often available in raw and unstructured form, and the data may
contain rich linguistic relations between different entities. Named-entity recognition is an important
subtask of information extraction. This approach locates and classifies atomic elements in text into
predefined categories such as the names of persons, organizations, locations, actions, and numeric quantities.
Some data collection technologies, such as sensors, are inherently inaccurate because of the
hardware limitations associated with collection and transmission. Sometimes sensors may drop
readings because of hardware failure or battery exhaustion.
Users may not want to specify their information for privacy reasons, or they may specify incorrect
values intentionally. For example, it has often been observed that users sometimes specify their
birthday incorrectly on automated registration sites such as those of social networks. In some cases,
users may choose to leave several fields empty.
Methods are needed to remove or correct missing and erroneous entries from the data. There are
several important aspects of data cleaning:
In the classification problem, a single attribute is treated specially, and the other features are used
to estimate its value. In this case, many classification methods can also be used for missing-value
estimation (such as matrix completion methods).
In a time-series data set, the average of the values at the time stamps just before and after the
missing attribute may be used for estimation.
For the case of spatial data, the estimation process is quite similar: the average of the values at
neighboring spatial locations may be used.
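The time-series case can be sketched as follows, using a hypothetical sensor series in which the missing reading is estimated from its two temporal neighbors:

```python
# Hypothetical sensor series with one missing reading (None).
series = [20.0, 20.4, None, 21.0, 21.2]

# Estimate each missing value as the average of the readings at the
# time stamps just before and after it.
imputed = list(series)
for i, v in enumerate(imputed):
    if v is None and 0 < i < len(imputed) - 1:
        imputed[i] = (imputed[i - 1] + imputed[i + 1]) / 2

print(imputed)  # the gap is filled with (20.4 + 21.0) / 2 = 20.7
```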
In cases where different features represent different scales of reference, they may therefore not be
comparable to one another. For example, an attribute such as age is drawn on a very different scale
than an attribute such as salary. The latter attribute is typically orders of magnitude larger than the
former. As a result, any aggregate function computed on the different features (e.g., Euclidean
distances) will be dominated by the attribute of larger magnitude. To address this problem, it is
common to use standardization.
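A minimal sketch of z-score standardization, assuming hypothetical age and salary values:

```python
# Hypothetical records: age (years) and salary (currency units) live on very
# different scales, so Euclidean distance would be dominated by salary.
ages = [25, 40, 31, 58]
salaries = [30000, 90000, 52000, 75000]

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the std. dev."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# After standardization, each feature has mean 0 and standard deviation 1,
# so no single feature dominates an aggregate distance computation.
print(standardize(ages))
print(standardize(salaries))
```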
Data sampling: In data sampling, the underlying data is sampled to create a much smaller
database. The type of sampling used may vary with the application.
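A minimal sketch of uniform random sampling without replacement, with a stand-in database and an assumed 10% sample size:

```python
import random

# Stand-in for a database of 1000 transactions (hypothetical).
database = list(range(1000))

random.seed(42)                        # fixed seed for reproducibility
sample = random.sample(database, 100)  # assumed 10% sample, no replacement

print(len(sample))                      # 100
print(len(set(sample)) == len(sample))  # True: no duplicates
```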
The association pattern mining problem has a wide variety of applications, such as:
Supermarket data: the determination of frequent itemsets provides useful insight for
targeted marketing and shelf placement of items.
Text mining: frequent pattern mining helps to identify co-occurring terms and keywords.
Generalization to dependency-oriented data types:
The following are some definitions and Properties used in Frequent pattern mining model.
o Definition (Support) The support of an itemset I is defined as the fraction of the transactions in
the database T = {T1, ..., Tn} that contain I as a subset.
o Definition (Frequent Itemset Mining) Given a set of transactions T = {T1, ..., Tn}, where each
transaction Ti is a subset of items from U, determine all itemsets I that occur as a subset of at least a
predefined fraction minsup of the transactions in T.
o Definition (Frequent Itemset Mining: Set-wise Definition) Given a set of sets T = {T1, ..., Tn},
where each element of the set Ti is drawn on the universe of elements U, determine all sets I that occur
as a subset of at least a predefined fraction minsup of the sets in T.
o Property (Support Monotonicity Property) The support of every subset J of I is at least equal
to the support of itemset I: sup(J) ≥ sup(I) ∀ J ⊆ I.
o Property (Downward Closure Property) Every subset of a frequent itemset is also frequent.
o Definition (Association Rules) Let X and Y be two sets of items. Then, the rule X ⇒ Y is said to
be an association rule at a minimum support of minsup and minimum confidence of minconf, if it
satisfies both the following criteria:
1. The support of the itemset X ∪ Y is at least minsup.
2. The confidence of the rule X ⇒ Y, defined as conf(X ⇒ Y) = sup(X ∪ Y)/sup(X), is at least minconf.
o Property (Confidence Monotonicity) Let X1, X2, and I be itemsets such that X1 ⊂ X2 ⊂ I. Then
the confidence of X2 ⇒ I - X2 is at least that of X1 ⇒ I - X1:
conf(X2 ⇒ I - X2) ≥ conf(X1 ⇒ I - X1)
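These definitions can be sketched in code. The nine-transaction database below is hypothetical, chosen to be consistent with the support counts used later in the worked example:

```python
# A small hypothetical transaction database over items I1..I5.
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

def support(itemset):
    """Fraction of transactions that contain itemset as a subset."""
    return sum(itemset <= t for t in T) / len(T)

def confidence(X, Y):
    """conf(X => Y) = sup(X union Y) / sup(X)."""
    return support(X | Y) / support(X)

# Support monotonicity: every subset of an itemset has at least its support.
assert support({"I1"}) >= support({"I1", "I2"}) >= support({"I1", "I2", "I5"})
print(confidence({"I2", "I5"}, {"I1"}))  # 1.0
```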
Let Fk denote the set of frequent k-itemsets, and Ck denote the set of candidate k-itemsets.
This process analyzes customer buying habits by finding associations between the different items
that customers place in their shopping baskets. The discovery of such associations can help retailers
develop marketing strategies by gaining insight into which items are frequently purchased together
by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and
what kind of bread) on the same trip to the supermarket? Such information can lead to increased
sales by helping retailers do selective marketing and plan their shelf space.
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
The join rule for generating C3 from L2 is: since k = 3, two itemsets in L2 must have k-2 (i.e., 1)
elements in common, and this common element must be the first element.
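The join-and-prune step can be sketched as follows; L2 below is the set of frequent 2-itemsets from the worked example, with items kept in sorted order:

```python
from itertools import combinations

# Frequent 2-itemsets from the example, items kept in sorted order.
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]

def apriori_gen(Lk_minus_1, k):
    """Join step: merge two (k-1)-itemsets sharing their first k-2 items,
    then prune any candidate with an infrequent (k-1)-subset."""
    prev = set(Lk_minus_1)
    candidates = []
    for a, b in combinations(Lk_minus_1, 2):
        if a[:k - 2] == b[:k - 2]:            # first k-2 items in common
            c = tuple(sorted(set(a) | set(b)))
            # Prune: every (k-1)-subset of c must itself be frequent.
            if all(s in prev for s in combinations(c, k - 1)):
                candidates.append(c)
    return candidates

print(apriori_gen(L2, 3))  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

Note that {I1, I3, I5}, for instance, is joined but then pruned because its subset {I3, I5} is not in L2, which matches the frequent 3-itemsets listed in the example.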
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let's take l = {I1, I2, I5}.
– All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
• The resulting association rules are shown below, each marked as selected or rejected by the minimum-confidence test.
– R1: I1 ^ I2 => I5
• R1 is Rejected
– R2: I1 ^ I5 => I2
• R2 is Selected.
– R3: I2 ^ I5=>I1
• R3 is Selected.
– R4: I1 => I2 ^ I5
• R4 is Rejected.
– R5: I2 => I1 ^ I5
• R5 is Rejected.
– R6: I5 => I1 ^ I2
• R6 is Selected.
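The rule-generation step above can be sketched as follows. The individual support counts sup(I1) = 6 and sup(I2) = 7 are assumed values chosen to be consistent with the selected/rejected statuses above; the remaining counts appear in the FP-growth section, and minconf is assumed to be 70%:

```python
from itertools import combinations

# Support counts: the pair and triple counts come from the worked example;
# sup(I1) = 6 and sup(I2) = 7 are assumed (not given in the text).
sup = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2}.items()}

l = frozenset({"I1", "I2", "I5"})
minconf = 0.7  # assumed minimum confidence

# For each nonempty proper subset X of l, test the rule X => l - X.
for r in range(1, len(l)):
    for X in map(frozenset, combinations(sorted(l), r)):
        conf = sup[l] / sup[X]
        status = "Selected" if conf >= minconf else "Rejected"
        print(sorted(X), "=>", sorted(l - X), f"conf={conf:.0%}", status)
```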
- Improving Apriori: general ideas
Reduce the number of passes over the transaction database
Shrink the number of candidates
- Completeness: any association rule mining algorithm should obtain the same set of frequent
itemsets.
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base which consists of the set of prefix paths in the FP-Tree
co-occurring with suffix pattern.
3. Then, construct its conditional FP-Tree and perform mining on such a tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the frequent
patterns generated from a conditional FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the required frequent itemset.
• Let's start from I5. I5 is involved in 2 branches, namely {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}.
• Therefore, considering I5 as the suffix, its 2 corresponding prefix paths would be {I2 I1: 1} and {I2 I1 I3:
1}, which form its conditional pattern base.
So, considering the table's last column of frequent pattern generation, we have generated the 3-item
frequent sets {I2, I1, I5: 2} and {I2, I1, I3: 2}. Similarly, the 2-item frequent sets are {I2, I5: 2}, {I1, I5: 2},
{I2, I4: 2}, {I2, I3: 4}, {I1, I3: 4} and {I2, I1: 4}. We can also see that we have arrived at distinct sets:
none of them is a duplicate.
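The conditional-pattern-base step for I5 above can be sketched as:

```python
from collections import Counter

# Prefix paths co-occurring with suffix I5, as in the example above
# (path items -> count along that branch).
conditional_pattern_base = [(("I2", "I1"), 1), (("I2", "I1", "I3"), 1)]

# Sum the count each item accumulates across the prefix paths; items that
# reach the minimum support form the conditional FP-Tree for I5.
min_support = 2
counts = Counter()
for path, count in conditional_pattern_base:
    for item in path:
        counts[item] += count

frequent_with_I5 = [item for item, c in counts.items() if c >= min_support]
print(frequent_with_I5)  # ['I2', 'I1'] -> yields {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
```

I3 accumulates a count of only 1 across the prefix paths, so it is dropped from I5's conditional FP-Tree, which is why {I3, I5} never appears as a frequent set.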
In comparison to the Apriori algorithm, we have generated only the frequent patterns rather
than all the combinations of different items. For example, here we haven't generated
{I3, I4} or {I3, I5}, since these are not frequently bought together, which is the main essence
behind the association rule criteria and the FP-growth algorithm.
G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of
the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering,
17(6), pp. 734–749, 2005.
C. Aggarwal and P. Yu. The IGrid index: reversing the dimensionality curse for similarity indexing in
high-dimensional space. KDD Conference, pp. 119–129, 2000.