Dav Cia 2
Answer-1
1. A) Support measures the frequency of occurrence of an item set in a data set, while confidence
quantifies the reliability of an association rule by assessing the likelihood of one item set's presence
given another's.
1. B) Association rules are "if-then" statements that help show the probability of relationships
between data items within large data sets in various types of databases.
For example, the rule "If a customer buys bread, they are also likely to buy milk."
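The two measures above can be sketched in a few lines of Python. The transactions below are made up for illustration; `support` counts how often an itemset appears, and `confidence` divides the joint support by the antecedent's support:

```python
# Hypothetical transaction data to illustrate support and confidence.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """confidence(A -> C) = support(A union C) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 3 of 5 transactions -> 0.6
print(confidence({"bread"}, {"milk"}))  # 3 of the 4 bread baskets -> 0.75
```

So in this toy data the rule "bread => milk" has support 60% and confidence 75%.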
1. D) Data filtering is the process of examining a dataset to exclude, rearrange, or apportion data
according to certain criteria. For example, data filtering may involve computing the total number of
sales per quarter while excluding records from the last month.
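A minimal filtering sketch, with made-up sales records and field names, showing the exclude-then-aggregate pattern described above:

```python
# Hypothetical sales records; the field names are illustrative.
sales = [
    {"id": 1, "quarter": "Q1", "amount": 120},
    {"id": 2, "quarter": "Q1", "amount": 80},
    {"id": 3, "quarter": "Q2", "amount": 200},
    {"id": 4, "quarter": "Q2", "amount": 50},
]

# Filter: keep only Q1 records, excluding everything else, then aggregate.
q1 = [r for r in sales if r["quarter"] == "Q1"]
total_q1 = sum(r["amount"] for r in q1)
print(total_q1)  # 120 + 80 = 200
```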
1. E) Principal Component Analysis (PCA) plays a key role in reducing data dimensionality while
preserving maximum information. It simplifies analysis by identifying the main axes of variation in
high-dimensional datasets, enabling visualization and pattern detection.
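To make the idea concrete, here is a small sketch of PCA on a made-up 2-D dataset, using the closed-form eigendecomposition of a 2x2 covariance matrix (no external libraries). The largest eigenvector is the main axis of variation; projecting onto it reduces the data to one dimension:

```python
import math

# Toy 2-D dataset lying roughly along the line y = 2x.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]

n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n

# Covariance matrix entries (population covariance for simplicity).
a = sum((p[0] - mx) ** 2 for p in points) / n
c = sum((p[1] - my) ** 2 for p in points) / n
b = sum((p[0] - mx) * (p[1] - my) for p in points) / n

# Largest eigenvalue of [[a, b], [b, c]] (closed form for 2x2 matrices).
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

# Corresponding eigenvector = direction of the first principal component.
vx, vy = b, lam - a
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Project each point onto the first component: a 1-D representation
# that preserves most of the variation in the original 2-D data.
scores = [(p[0] - mx) * vx + (p[1] - my) * vy for p in points]
print(vx, vy)  # roughly (0.45, 0.89), i.e. the y = 2x direction
```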
Section - B
Answer - 2
Several methods have been proposed to improve the efficiency of the Apriori algorithm:
1. Transaction Reduction:
This method removes redundant or irrelevant transactions from the database before mining. This can
significantly reduce the number of database scans required, thereby improving performance.
Examples:
Removing short transactions that are unlikely to contain any frequent itemsets.
2. Hash-Based Techniques:
Instead of linearly scanning the database for each candidate itemset, hash tables can be used to
efficiently check their support.
This technique allows adding new candidate itemsets at any point during the database scan.
This can potentially reduce the number of scans required and improve performance.
Examples:
Using a bitset to keep track of the items appearing in each transaction.
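A sketch of the hash-based idea (in the spirit of PCY-style hashing; the transactions and bucket count are made up): during the first scan, every pair is hashed into a small bucket array, and a pair is kept as a candidate only if its bucket count reaches the support threshold:

```python
from itertools import combinations

# Hypothetical transactions; letters are items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "C"},
    {"B", "C", "D"},
]
min_support = 2
NUM_BUCKETS = 8

# Pass 1: hash every pair in every transaction into a small bucket array.
bucket_counts = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[hash(pair) % NUM_BUCKETS] += 1

# Pass 2: a pair can be frequent only if its bucket count meets the
# threshold, so many pairs are pruned before being counted exactly.
def may_be_frequent(pair):
    return bucket_counts[hash(pair) % NUM_BUCKETS] >= min_support

candidates = {p for t in transactions
              for p in combinations(sorted(t), 2) if may_be_frequent(p)}
print(candidates)
```

Since a bucket's count is always at least the true count of any pair hashed into it, no frequent pair is ever pruned; only infrequent ones can be discarded early.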
Answer 2-OR
CLIQUE divides the data space into a grid by dividing each dimension into equal intervals called units.
The number of units per dimension impacts the algorithm's efficiency and accuracy.
A unit is considered dense if the number of data points within it exceeds a pre-specified density
threshold.
After identifying all dense cells, CLIQUE searches for the largest set of connected dense cells, forming
a cluster.
CLIQUE utilizes the apriori principle to generate clusters from all dense subspaces.
This involves iteratively building clusters from smaller, already identified dense regions.
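The grid/density steps above can be sketched for two dimensions as follows (the points, number of units, and density threshold are all illustrative; full CLIQUE also searches subspaces, which is omitted here):

```python
from collections import defaultdict

# Toy 2-D points and grid parameters (both illustrative).
points = [(0.1, 0.2), (0.15, 0.25), (0.2, 0.1),   # group near the origin
          (0.8, 0.9), (0.85, 0.85), (0.9, 0.95)]  # group near (1, 1)
UNITS_PER_DIM = 4       # each dimension split into 4 equal intervals on [0, 1)
DENSITY_THRESHOLD = 2   # a unit is dense if it holds >= 2 points

# Step 1: assign each point to its grid unit.
counts = defaultdict(int)
for x, y in points:
    cell = (min(int(x * UNITS_PER_DIM), UNITS_PER_DIM - 1),
            min(int(y * UNITS_PER_DIM), UNITS_PER_DIM - 1))
    counts[cell] += 1

# Step 2: keep only the dense units.
dense = {c for c, n in counts.items() if n >= DENSITY_THRESHOLD}

# Step 3: group connected dense units (sharing an edge) into clusters.
clusters, seen = [], set()
for start in dense:
    if start in seen:
        continue
    stack, comp = [start], set()
    while stack:
        cx, cy = stack.pop()
        if (cx, cy) in comp:
            continue
        comp.add((cx, cy))
        seen.add((cx, cy))
        for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if nb in dense and nb not in comp:
                stack.append(nb)
    clusters.append(comp)
print(len(clusters))  # the two point groups yield two clusters
```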
Answer -3
Naive Bayes is called "naive" because it makes a strong and often unrealistic assumption: that all
features are independent of each other.
For example, if a fruit is identified based on its color, shape, and taste, then a red, spherical, and
sweet fruit is recognized as an apple. Each feature contributes individually to identifying it as an
apple, without depending on the other features.
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:
P(h|D) = P(D|h) P(h) / P(D)
P(h): prior probability of h
P(D): prior probability of the data D (evidence)
P(D|h): conditional probability of D given h (likelihood)
P(h|D): conditional probability of h given D (posterior probability)
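A tiny Naive Bayes sketch on the fruit example (the training data is invented, and Laplace smoothing is added to avoid zero probabilities). Each feature's likelihood is multiplied independently, which is exactly the "naive" assumption:

```python
from collections import Counter, defaultdict

# Made-up labeled fruit data: features -> class.
train = [
    ({"color": "red", "shape": "round", "taste": "sweet"}, "apple"),
    ({"color": "red", "shape": "round", "taste": "sweet"}, "apple"),
    ({"color": "yellow", "shape": "long", "taste": "sweet"}, "banana"),
    ({"color": "green", "shape": "round", "taste": "sour"}, "grape"),
]

classes = Counter(label for _, label in train)
feat_counts = defaultdict(Counter)          # (class, feature) -> value counts
for feats, label in train:
    for f, v in feats.items():
        feat_counts[(label, f)][v] += 1

def classify(feats):
    """Pick h maximizing P(h) * product of P(feature=value | h),
    i.e. the unnormalized posterior P(h|D), with Laplace smoothing."""
    scores = {}
    for c, n in classes.items():
        p = n / len(train)                                   # prior P(h)
        for f, v in feats.items():
            p *= (feat_counts[(c, f)][v] + 1) / (n + 2)      # likelihood
        scores[c] = p
    return max(scores, key=scores.get)

print(classify({"color": "red", "shape": "round", "taste": "sweet"}))  # apple
```

Note that P(D) is ignored: it is the same for every class, so it does not change which hypothesis wins.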
Answer -3 (OR)
Trend: The long-term upward or downward movement in the series. This could be due to factors like
technological advancements, population growth, or economic trends.
Seasonality: Repetitive patterns within a year, such as daily, weekly, monthly, or quarterly
fluctuations. Examples include daily traffic patterns, weekly retail sales variations, and monthly
electricity consumption changes.
Cyclical: Fluctuations over longer periods than seasonal variations, typically lasting several years.
These cycles are often associated with economic booms and recessions.
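One common way to separate the trend from short-term fluctuation is a centered moving average; a minimal sketch on an invented monthly series:

```python
# Illustrative monthly series with an upward trend plus short-term wiggle.
series = [10, 12, 11, 14, 13, 16, 15, 18, 17, 20, 19, 22]

def moving_average(xs, window):
    """Centered moving average: smooths out short-term (seasonal)
    fluctuation so the long-term trend becomes visible."""
    half = window // 2
    return [sum(xs[i - half:i + half + 1]) / window
            for i in range(half, len(xs) - half)]

trend = moving_average(series, 3)
print(trend)  # values rise steadily, reflecting the upward trend
```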
Answer -4
Components:
1. Ingest: Data streams are received from various sources like sensors, logs, social media feeds, etc.,
into the processing engine.
2. Process: Data is filtered, transformed, and analyzed using streaming operators like aggregations,
joins, and filtering.
3. Output: Processed data is published to destinations like databases, dashboards, or other
applications for further analysis or triggering actions.
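The ingest -> process -> output pipeline can be sketched with Python generators, so each element is handled as it arrives instead of being stored in full (the source data and threshold are made up):

```python
def ingest():
    """Stand-in source: in practice this reads sensors, logs, feeds, etc."""
    for reading in [3, 7, 2, 9, 4, 8]:
        yield reading

def process(stream, threshold):
    """Streaming operators: a filter plus a running aggregate (max)."""
    running_max = float("-inf")
    for x in stream:
        if x >= threshold:               # filtering operator
            running_max = max(running_max, x)
            yield x, running_max         # aggregation operator

def output(stream):
    """Stand-in sink: would publish to a dashboard or database."""
    return list(stream)

results = output(process(ingest(), threshold=5))
print(results)  # [(7, 7), (9, 9), (8, 9)]
```

Because the stages are generators, memory use stays constant regardless of how long the stream runs.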
Challenges in processing data streams:
1. Unbounded Data and Memory:
Traditional algorithms for large datasets struggle with continuous, potentially infinite data streams.
This leads to growing storage demands for exact query answers, making them impractical.
External-memory algorithms are also not well suited, as they lack real-time response capabilities.
2. Limited Computation Time:
Continuous data streams require prompt processing, demanding low computation time per element.
High latency hinders real-time processing and makes the algorithm incompatible with the data
stream's pace.
3. Approximate Query Answering:
Blocking operators cannot produce any output until they have received all of their input, which
hinders continuous query processing. Since data streams are potentially infinite, a blocking operator
can never receive its entire input, so exact answers are impossible and queries must be answered
approximately.
Section - C
Answer - 5
Apriori algorithm:
The Apriori algorithm solves the frequent itemsets problem: it analyzes a data set to determine which
combinations of items occur together frequently. Apriori is at the core of various data mining
algorithms; the best-known application is finding the association rules that hold in a basket-item relation.
Numerical:
Given:
Minimum support = 60% of 5 transactions = (60/100) × 5 = 3
Minimum confidence = 70%
ITERATION 1:
STEP 1: (C1)
Item Count
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3
STEP 2: (L1)
Item Count
E 4
K 5
M 3
O 3
Y 3
ITERATION 2:
STEP 3: (C2)
Item Count
E,K 4
E,M 2
E,O 3
E,Y 2
K,M 3
K,O 3
K,Y 3
M,O 1
M,Y 2
O,Y 2
STEP 4: (L2)
Item Count
E,K 4
E,O 3
K,M 3
K,O 3
K,Y 3
ITERATION 3:
STEP 5: (C3)
Item Count
E,K,O 3
K,M,O 1
K,M,Y 2
STEP 6: (L3)
Item Count
E,K,O 3
Now, stop since no more combinations can be made in L3.
ASSOCIATION RULES (from L3 = {E, K, O}, minimum confidence = 70%):
1. {E, K} → {O}: confidence = 3/4 = 75% → accepted
2. {E, O} → {K}: confidence = 3/3 = 100% → accepted
3. {K, O} → {E}: confidence = 3/3 = 100% → accepted
4. {E} → {K, O}: confidence = 3/4 = 75% → accepted
5. {K} → {E, O}: confidence = 3/5 = 60% → rejected (below 70%)
6. {O} → {E, K}: confidence = 3/3 = 100% → accepted
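The association rules for L3 = {E, K, O} can be enumerated mechanically from the support counts in the tables above; a short sketch:

```python
from itertools import combinations

# Support counts taken from the C1/C2/C3 tables above.
sup = {
    frozenset("E"): 4, frozenset("K"): 5, frozenset("O"): 3,
    frozenset("EK"): 4, frozenset("EO"): 3, frozenset("KO"): 3,
    frozenset("EKO"): 3,
}
itemset = frozenset("EKO")
min_conf = 0.70

# For every non-empty proper antecedent A of {E,K,O}:
# confidence(A -> itemset \ A) = sup(itemset) / sup(A).
rules = []
for r in (1, 2):
    for antecedent in combinations(sorted(itemset), r):
        a = frozenset(antecedent)
        conf = sup[itemset] / sup[a]
        rules.append((sorted(a), sorted(itemset - a), conf, conf >= min_conf))

for a, c, conf, ok in rules:
    print(a, "->", c, f"{conf:.0%}", "accepted" if ok else "rejected")
```

Of the six possible rules, only {K} → {E, O} falls below the 70% confidence threshold.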