Dav Cia 2


Section - A

Answer-1

1. A) Support measures the frequency of occurrence of an item set in a data set, while confidence
quantifies the reliability of an association rule by assessing the likelihood of one item set's presence
given another's.
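As an illustration (the toy transactions below are hypothetical, not part of the question), both measures can be computed directly from transaction counts:

```python
# Minimal sketch of support and confidence on a hypothetical transaction list.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule: bread -> milk
rule_support = support({"bread", "milk"})            # 2/4 = 0.50
rule_confidence = rule_support / support({"bread"})  # 0.50 / 0.75 ≈ 0.67
print(rule_support, rule_confidence)
```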

1. B) Association rules are "if-then" statements that express the probability of relationships between data items within large data sets in various types of databases.
For example, the rule: "If a customer buys bread, they are also likely to buy milk."

1. C) There are two main types of market basket analysis:

a) Association rule learning: identifies items that are frequently bought together (e.g., bread and butter).
b) Sequential pattern mining: discovers patterns in purchase sequences (e.g., buying diapers before baby wipes).

1. D) Data filtering is the process of examining a dataset to exclude, rearrange, or apportion data
according to certain criteria. For example, data filtering may involve finding out the total number of
sales per quarter and excluding records from last month.
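A minimal sketch of such filtering with pandas, using a small hypothetical sales table:

```python
import pandas as pd

# Minimal sketch of data filtering; the sales table below is hypothetical.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-15", "2024-02-10", "2024-03-05", "2024-03-28"]),
    "amount": [120, 80, 200, 150],
})

# Exclude records from the last month present in the data,
# then total the remaining sales per quarter.
last_month = sales["date"].max().to_period("M")
filtered = sales[sales["date"].dt.to_period("M") != last_month]
per_quarter = filtered.groupby(filtered["date"].dt.to_period("Q"))["amount"].sum()
print(per_quarter)
```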

1. E) Principal Component Analysis (PCA) plays a key role in reducing data dimensionality while
preserving maximum information. It simplifies analysis by identifying the main axes of variation in
high-dimensional datasets, enabling visualization and pattern detection.
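A minimal sketch, assuming scikit-learn is available and using random data purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Minimal sketch of PCA for dimensionality reduction; the data is random.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))           # 100 samples, 10 features

pca = PCA(n_components=2)                # keep the 2 main axes of variation
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (100, 2) -- ready for plotting
print(pca.explained_variance_ratio_)     # how much variation each axis preserves
```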

Section - B

Answer - 2

Several methods have been proposed to improve the efficiency of the Apriori algorithm:

1. Transaction Reduction:

This method removes redundant or irrelevant transactions from the database during mining, which reduces the amount of data scanned in each pass and thereby improves performance.
Example:
A transaction that contains no frequent k-itemsets can be removed from later scans, since it cannot contain any frequent (k+1)-itemsets either.

2. Hash-Based Techniques:

Instead of linearly scanning the database for each candidate itemset, hash tables can be used to check their support efficiently: itemsets are hashed into buckets during a scan, and any bucket whose count falls below the support threshold cannot contain a frequent itemset, so its candidates are pruned (a sketch of this idea appears after this list).

Example: using Bloom filters to quickly identify infrequent itemsets.

3. Dynamic Itemset Counting:

This technique allows adding new candidate itemsets at any point during the database scan.
This can potentially reduce the number of scans required and improve performance.
Examples:
Using a bitset to keep track of the items appearing in each transaction.
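
A minimal sketch of the hash-based idea from point 2, with hypothetical transactions and illustrative bucket/threshold values:

```python
from collections import Counter
from itertools import combinations

# Minimal sketch of hash-based (bucket-counting) candidate pruning.
# Transactions, bucket count and support threshold are illustrative only.
transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
                {"milk", "butter"}, {"bread", "milk"}]
NUM_BUCKETS, MIN_SUPPORT = 7, 3

# While scanning, hash every pair of items into a small bucket table.
bucket_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[hash(pair) % NUM_BUCKETS] += 1

# A pair can be frequent only if its bucket count reaches the threshold,
# so pairs falling into "light" buckets are pruned without a full support scan.
candidate_pairs = {
    pair
    for t in transactions
    for pair in combinations(sorted(t), 2)
    if bucket_counts[hash(pair) % NUM_BUCKETS] >= MIN_SUPPORT
}
print(candidate_pairs)
```
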
Answer 2-OR

Working of CLIQUE Algorithm:

1. Data Space Discretization:

CLIQUE divides the data space into a grid by dividing each dimension into equal intervals called units.
The number of units per dimension impacts the algorithm's efficiency and accuracy.

2. Identifying Dense Units:

A unit is considered dense if the number of data points within it exceeds a pre-specified density
threshold.

3. Progressive Dimensionality Expansion:

The algorithm starts by finding dense units along one dimension. Then it looks for dense units along two dimensions, combining dense units from the previous step. This process continues iteratively through higher dimensions until no further expansion is possible.

4. Cluster Formation and Minimal Description:

After identifying all dense units, CLIQUE searches for the largest sets of connected dense units; each such set forms a cluster, for which a minimal description is then generated.

5. Apriori-based Cluster Generation:

CLIQUE utilizes the apriori principle to generate clusters from all dense subspaces.
This involves iteratively building clusters from smaller, already identified dense regions.
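
A minimal sketch of steps 1 and 2 (grid discretization and one-dimensional dense-unit detection); the data, the number of intervals (xi) and the density threshold (tau) are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of CLIQUE's first two steps: discretize each dimension into
# xi equal-width units and keep the units containing at least tau points.
rng = np.random.default_rng(0)
data = rng.random((200, 2))          # 200 points in 2 dimensions, values in [0, 1)
xi, tau = 10, 25                     # 10 units per dimension, >= 25 points = dense

dense_units = {}
for dim in range(data.shape[1]):
    # Assign each point to one of xi equal intervals along this dimension.
    unit_ids = np.minimum((data[:, dim] * xi).astype(int), xi - 1)
    counts = np.bincount(unit_ids, minlength=xi)
    dense_units[dim] = [u for u in range(xi) if counts[u] >= tau]

print(dense_units)   # candidate 1-D dense units, later joined Apriori-style
```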

Answer -3

Naive Bayes is called "naive" because it makes a strong and often unrealistic assumption: that all
features are independent of each other.

For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature contributes to identifying it as an apple independently of the others.

Major Ideas of Bayesian Classification:

Probabilistic learning: explicit probabilities are calculated for each hypothesis.

Probabilistic prediction: multiple hypotheses can be predicted, weighted by their probabilities.

Meta-classification: the outputs of several classifiers can be combined, e.g., by multiplying the probabilities that all classifiers predict for a given class.

Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:
P(h|D) = P(D|h) × P(h) / P(D)
P(h): prior probability of h, independent of the data
P(D): prior probability of the data D
P(D|h): conditional probability of D given h (the likelihood)
P(h|D): conditional probability of h given D (the posterior probability)
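
A minimal sketch of this computation for the fruit example, with made-up priors and likelihoods:

```python
import math

# Minimal sketch of the naive Bayes computation for the fruit example above.
# Priors and per-feature likelihoods are made-up illustrative numbers.
priors = {"apple": 0.5, "orange": 0.5}
likelihoods = {
    "apple":  {"red": 0.8, "spherical": 0.9, "sweet": 0.7},
    "orange": {"red": 0.1, "spherical": 0.8, "sweet": 0.6},
}
observed = ["red", "spherical", "sweet"]

# Naive assumption: features are conditionally independent given the class,
# so P(D | h) is the product of the individual feature likelihoods.
scores = {h: priors[h] * math.prod(likelihoods[h][f] for f in observed)
          for h in priors}
evidence = sum(scores.values())                             # P(D)
posteriors = {h: s / evidence for h, s in scores.items()}   # P(h | D)
print(posteriors)   # the hypothesis with the highest posterior is predicted
```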

Answer -3 (OR)

Components of Time Series Analysis


Time series analysis involves decomposing a series into its underlying components:

Trend: The long-term upward or downward movement in the series. This could be due to factors like
technological advancements, population growth, or economic trends.

Seasonality: Repetitive patterns within a year, such as daily, weekly, monthly, or quarterly
fluctuations. Examples include daily traffic patterns, weekly retail sales variations, and monthly
electricity consumption changes.

Cyclical: Fluctuations over longer periods than seasonal variations, typically lasting several years.
These cycles are often associated with economic booms and recessions.

Irregular/Random: Unpredictable fluctuations that cannot be explained by the other components. These could be due to random events like natural disasters, strikes, or political turmoil.
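
A minimal sketch of separating these components, assuming statsmodels is available and using a synthetic monthly sales series built only for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Build a synthetic monthly series = trend + seasonal + irregular.
months = pd.date_range("2020-01-01", periods=48, freq="MS")
trend = np.linspace(100, 160, 48)                       # long-term growth
seasonal = 20 * np.sin(2 * np.pi * months.month / 12)   # yearly pattern
noise = np.random.default_rng(1).normal(0, 5, 48)       # irregular component
sales = pd.Series(trend + seasonal + noise, index=months)

# Decompose the series back into its components.
result = seasonal_decompose(sales, model="additive", period=12)
print(result.trend.dropna().head())    # estimated trend
print(result.seasonal.head(12))        # estimated seasonal pattern
print(result.resid.dropna().head())    # what remains: irregular/random
```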

Component Association for Specific Cases:


Heavy sales on the occasion of Diwali: This would be associated with the seasonal component of time
series analysis. Diwali is a specific annual event that causes a predictable surge in sales for certain
categories of products.
Strike in a factory, delaying production for 10 days: This would be associated with the
irregular/random component of time series analysis. This event is unexpected and disrupts the
normal production cycle, leading to a temporary drop in output.

Answer -4

General Stream Processing Model:


Stream processing is a paradigm for processing data in continuous streams, analyzing, filtering,
transforming, and enhancing information as it arrives. It provides real-time insights and enables
immediate actions based on the processed data.

Components:

1. Ingest: Data streams are received from various sources like sensors, logs, social media feeds, etc.,
into the processing engine.
2. Process: Data is filtered, transformed, and analyzed using streaming operators like aggregations,
joins, and filtering.
3. Output: Processed data is published to destinations like databases, dashboards, or other applications for further analysis or triggering actions.
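
A minimal sketch of these three stages using plain Python generators, with simulated sensor readings standing in for a real source:

```python
import random
import time

def ingest(n=10):
    """Ingest: emit a continuous stream of (simulated) sensor readings."""
    for _ in range(n):
        yield {"ts": time.time(), "temp": random.uniform(15, 35)}

def process(stream, threshold=30.0):
    """Process: filter and transform readings as they arrive."""
    for event in stream:
        if event["temp"] >= threshold:           # streaming filter
            yield {**event, "alert": True}       # streaming transform

def output(stream):
    """Output: publish results to a sink (here, just stdout)."""
    for event in stream:
        print(f"ALERT at {event['ts']:.0f}: {event['temp']:.1f} °C")

output(process(ingest()))
```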

Issues with Stream Processing:


1. Complexity: Designing and managing streaming applications can be complex due to their distributed nature and the need for state management.
2. Debugging: Debugging issues in real-time processing pipelines can be challenging due to the
dynamic nature of data.
3. Scalability challenges: Scaling streaming applications horizontally can be complex and require
specialized tools and expertise.
Answer -4 (OR)

Major Issues in Data Stream Query Processing

1. Unbounded Memory Requirements:

Traditional algorithms for large datasets struggle with continuous, potentially infinite data streams.
This leads to growing storage demands for exact query answers, making them impractical.
External memory algorithms are not well-suited due to lack of real-time response capabilities.
2. Limited Computation Time:

Continuous data streams require prompt processing, demanding low computation time per element.
High latency hinders real-time processing and makes the algorithm incompatible with the data
stream's pace.
3. Approximate Query Answering:

Limited memory necessitates approximate answers instead of exact ones. The sliding-window technique, for example, analyzes only recent data and discards older information to produce approximate results (see the sketch after this list).
4. Blocking Operators:

Blocking operators cannot produce output until receiving all input, hindering continuous query
processing.
Data streams are potentially infinite, making it impossible for blocking operators to ever receive their
entire input and produce output.
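
A minimal sketch of the sliding-window idea mentioned in point 3, with made-up stream values:

```python
from collections import deque

# Only the most recent W elements are kept, so memory stays bounded even
# though the stream is unbounded; answers are approximate (recent data only).
W = 5
window = deque(maxlen=W)   # the oldest element is discarded automatically

stream = [3, 8, 2, 7, 4, 9, 1, 6]
for value in stream:
    window.append(value)
    approx_avg = sum(window) / len(window)   # windowed (approximate) average
    print(f"after {value}: windowed average = {approx_avg:.2f}")
```
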

Section - C

Answer - 5

Apriori algorithm:
The Apriori algorithm solves the frequent-itemset problem: it analyzes a data set to determine which combinations of items occur together frequently. The Apriori algorithm is at the core of various data mining algorithms; the best-known problem it addresses is finding the association rules that hold in a basket-item relation.
Numerical:
Given:
Minimum support = 60% → (60/100) × 5 transactions = 3
Minimum confidence = 70%
ITERATION 1:
STEP 1: (C1)

Item  Count
A     1
C     2
D     1
E     4
I     1
K     5
M     3
N     2
O     3
U     1
Y     3
STEP 2: (L1)

Item  Count
E     4
K     5
M     3
O     3
Y     3
ITERATION 2:
STEP 3: (C2)

Itemset  Count
E,K      4
E,M      2
E,O      3
E,Y      2
K,M      3
K,O      3
K,Y      3
M,O      1
M,Y      2
O,Y      2
STEP 4: (L2)

Itemset  Count
E,K      4
E,O      3
K,M      3
K,O      3
K,Y      3
ITERATION 3:
STEP 5: (C3)

Itemset  Count
E,K,O    3
K,M,O    1
K,M,Y    2
STEP 6: (L3)

Itemset  Count
E,K,O    3
Now, stop since no more combinations can be made in L3.
ASSOCIATION RULES:

1. {E,K} → O : confidence = 3/4 = 75%

2. {K,O} → E : confidence = 3/3 = 100%

3. {E,O} → K : confidence = 3/3 = 100%

4. E → {K,O} : confidence = 3/4 = 75%

5. K → {E,O} : confidence = 3/5 = 60%

6. O → {E,K} : confidence = 3/3 = 100%

∴ Rule 5 is discarded because its confidence (60%) is below the 70% minimum.

So, rules 1, 2, 3, 4, and 6 are selected.
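
A minimal sketch that re-checks these confidences from the support counts found in the tables above:

```python
# Rule confidence = support(antecedent ∪ consequent) / support(antecedent).
# The support counts below come from the L1, L2 and L3 tables of this answer.
support = {
    frozenset("E"): 4, frozenset("K"): 5, frozenset("O"): 3,
    frozenset("EK"): 4, frozenset("EO"): 3, frozenset("KO"): 3,
    frozenset("EKO"): 3,
}
rules = [("EK", "O"), ("KO", "E"), ("EO", "K"),
         ("E", "KO"), ("K", "EO"), ("O", "EK")]
MIN_CONFIDENCE = 0.70

for antecedent, consequent in rules:
    conf = support[frozenset("EKO")] / support[frozenset(antecedent)]
    status = "selected" if conf >= MIN_CONFIDENCE else "discarded"
    print(f"{set(antecedent)} -> {set(consequent)}: {conf:.0%} ({status})")
```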
