B SC (IT) VI-DSE3-M5
Data Mining is a technical process by which consistent patterns are identified, explored, sorted, and organized. Data Mining provides two primary advantages:
1. It gives businesses the predictive power to estimate unknown or future values, and
2. It gives businesses the descriptive power to find interesting patterns in the data.
With data mining tools, organizations of any size can extract valuable insights from their datasets, including information about consumers, costs, and future trends.
Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering
patterns and other valuable information from large data sets. Given the evolution of data
warehousing technology and the growth of big data, adoption of data mining techniques has rapidly
accelerated over the last couple of decades, assisting companies by transforming their raw data into
useful knowledge. However, even though the technology continuously evolves to handle data at scale, leaders still face challenges with scalability and automation.
Data mining has improved organizational decision-making through insightful data analyses. The data
mining techniques that underpin these analyses serve two main purposes: they can either describe the target dataset or they can predict outcomes through the use of machine learning
algorithms. These methods are used to organize and filter data, surfacing the most interesting
information, from fraud detection to user behaviors, bottlenecks, and even security breaches.
With this technique, we analyze the data and then convert it into meaningful information. This helps an organization make better, more accurate business decisions. It helps to develop smart marketing decisions, run accurate campaigns, make predictions, and more. With the help of data mining, we can analyze customer behavior and gain insight into it, which leads to greater success and a data-driven business.
To gain a thorough understanding of customer behavior, data mining helps extract patterns from databases. The knowledge thus acquired allows companies to offer the best possible services.
Data mining is not specific to one type of media or data. Data mining should be applicable to any
kind of information repository. However, algorithms and approaches may differ when applied to
different types of data. Indeed, the challenges presented by different types of data vary significantly.
Data mining is being put into use and studied for databases, including relational databases, object-
relational databases and object-oriented databases, data warehouses, transactional databases,
unstructured and semi-structured repositories such as the World Wide Web, social media data,
advanced databases such as spatial databases, multimedia databases, time-series databases and
textual databases, and even flat files.
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data Mining
techniques. It is a field of interest to researchers in various fields, including artificial intelligence,
machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert
systems, and data visualization.
The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge. Data Mining is the core of the KDD procedure and involves inferring algorithms that investigate the data, develop the model, and find previously unknown patterns. The model is used for extracting knowledge from the data, analyzing the data, and making predictions.
The availability and abundance of data today make knowledge discovery and Data Mining a matter of considerable significance and necessity. Given the rapid development of the field, it is not surprising that a wide variety of techniques is now available to specialists and practitioners.
The knowledge discovery process is iterative and interactive, and comprises nine steps. The process is iterative at each stage, meaning that moving back to previous steps may be required. The process also has many creative aspects, in the sense that no single formula or complete scientific categorization can prescribe the correct decisions for every step and application type. Thus, one must understand the process and the different requirements and possibilities at each stage.
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process is iterative, and extracting accurate knowledge from the data typically requires multiple passes through the following steps:
1. Developing an Understanding of the Application Domain
This is the initial preliminary step. It sets the scene for understanding what should be done with the various decisions (transformation, algorithms, representation, etc.). The individuals in charge of a KDD venture need to understand and characterize the objectives of the end-user and the environment in which the knowledge discovery process will occur.
2. Selecting and Creating a Data Set
Once the objectives are defined, the data that will be utilized for the knowledge discovery process must be determined. This includes discovering what data is accessible, obtaining important data, and then integrating all of it into one data set, including the attributes that will be considered for the process. This step is important because Data Mining learns and discovers from the accessible data, which is the evidence base for building the models. If significant attributes are missing, the entire study may be unsuccessful; from this respect, the more attributes considered, the better. On the other hand, organizing, collecting, and operating advanced data repositories is expensive, so there is a trade-off against the opportunity for best understanding the phenomena. This trade-off is one place where the interactive and iterative nature of KDD appears: one begins with the best available data sets and later expands them, observing the effect on knowledge discovery and modelling.
3. Preprocessing and Cleansing
In this step, data reliability is improved. It incorporates data cleaning, for example handling missing values and removing noise or outliers. It might involve complex statistical techniques, or using a Data Mining algorithm in this context. For example, when one suspects that a specific attribute lacks reliability or has much missing data, the attribute itself could become the target of a supervised Data Mining algorithm: a prediction model is created for the attribute, and its missing values can then be predicted. The extent to which one pays attention to this step depends on many factors; regardless, studying these aspects is important, and is often revealing in itself with respect to enterprise data systems.
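As a minimal sketch of that last idea (assuming pandas and scikit-learn; the column names and figures are hypothetical), an unreliable attribute is treated as the target of a supervised algorithm and its missing values are predicted:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical records with an unreliable attribute ("income") that has gaps.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29],
    "tenure": [1, 4, 12, 15, 7, 2],
    "income": [30000, 45000, None, 80000, None, 35000],
})

known = df[df["income"].notna()]      # rows where the target is observed
missing = df[df["income"].isna()]     # rows whose target must be predicted

# Build a prediction model for the unreliable attribute...
model = RandomForestRegressor(random_state=0)
model.fit(known[["age", "tenure"]], known["income"])

# ...and fill the gaps with its predictions.
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age", "tenure"]])
print(df)
```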
4. Data Transformation
In this stage, appropriate data for Data Mining is prepared and developed. Techniques here include dimension reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformations). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the ratio between attributes may often be the most significant factor rather than any attribute by itself. In business, we may need to consider effects beyond our control, as well as efforts and transient issues, for example the cumulative impact of advertising. If we do not apply the right transformation at the start, we may instead obtain a surprising result that hints at the transformation needed in the next iteration. Thus, the KDD process feeds back on itself and leads to an understanding of the transformations required.
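A minimal sketch of two of the transformations named above, discretization of numerical attributes and feature selection, assuming scikit-learn and its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

# Attribute transformation: bin each numeric attribute into 3 ordinal intervals.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)

# Dimension reduction: keep the 2 attributes most associated with the class.
selector = SelectKBest(f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```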
5. Choosing the Appropriate Data Mining Task
We now decide which kind of Data Mining to use, for example classification, regression, or clustering, depending mainly on the KDD objectives and on the previous steps. Most Data Mining techniques rely on inductive learning, whose fundamental premise is that the trained model applies to future cases. The technique also takes into account the level of meta-learning for the specific set of accessible data.
6. Choosing the Data Mining Algorithm
Having chosen the technique, we now decide on the strategies. This stage includes selecting a particular algorithm to be used for searching patterns, possibly from among multiple inducers. For example, considering precision versus understandability: the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning there are several possibilities for how it can be applied. Meta-learning focuses on explaining what causes a Data Mining algorithm to succeed or fail on a specific problem, and thus attempts to understand the conditions under which a Data Mining algorithm is most suitable. Each algorithm also has parameters and learning strategies, such as ten-fold cross-validation or another division into training and testing sets.
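A minimal sketch of the precision-versus-understandability comparison under ten-fold cross-validation; scikit-learn and its bundled breast cancer dataset are assumptions for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An understandable model versus a typically more precise one.
candidates = [
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("neural network", MLPClassifier(max_iter=2000, random_state=0)),
]
for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=10)  # ten-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```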
7. Employing the Data Mining Algorithm
At last, the Data Mining algorithm is applied. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
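A minimal sketch of such tuning, assuming scikit-learn: a grid search over the minimum number of instances allowed in a single leaf of a decision tree, evaluated by cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Rerun the algorithm over several control-parameter settings and keep the best.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},
    cv=10,
)
search.fit(X, y)
print("best setting:", search.best_params_, "accuracy:", round(search.best_score_, 3))
```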
8. Pattern Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step, and we reconsider the preprocessing steps for their impact on the Data Mining results (for example, adding a feature in step 4 and repeating from there). This step focuses on the comprehensibility and utility of the induced model, and the identified knowledge is recorded for further use. The last step is then the use of, and overall feedback on, the discovery results acquired by Data Mining.
9. Using the Discovered Knowledge
Now we are prepared to incorporate the knowledge into another system for further action. The knowledge becomes effective in the sense that we may make changes to the system and measure the effects. The success of this step determines the effectiveness of the entire KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we have worked. For example, the knowledge was discovered from a static snapshot, usually a fixed set of data, but now the data becomes dynamic: data structures may change, certain quantities may become unavailable, and the data domain may be modified, such as an attribute taking a value that was not expected previously.
Advantages of KDD
Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the
data ready for analysis, which saves time and money.
Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer service.
Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and
anomalies in the data that may indicate fraud.
Predictive modeling: KDD can be used to build predictive models that can forecast future
trends and patterns.
Disadvantages of KDD
Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing large
amounts of data, which can include sensitive information about individuals.
Complexity: KDD can be a complex process that requires specialized skills and knowledge to
implement and interpret the results.
Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
Data Quality: The KDD process heavily depends on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
High cost: KDD can be an expensive process, requiring significant investments in hardware,
software, and personnel.
Overfitting: The KDD process can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it negatively impacts its performance on new, unseen data.
Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are two terms that look similar but refer to different kinds of systems. Online transaction
processing (OLTP) captures, stores, and processes data from transactions in real time. Online
analytical processing (OLAP) uses complex queries to analyze aggregated historical data from
OLTP systems.
Note: OLTP provides an immediate record of current business activity, while OLAP
generates and validates insights from that data as it’s compiled over time. That historical
perspective empowers accurate forecasting, but as with all business intelligence, the
insights generated with OLAP are only as good as the data pipeline from which they
emanate.
1. Association
This function seeks to uncover the relationships between data points; it is used to determine
whether a specific action or variable has any traits that can be linked to other actions (e.g.,
business travelers’ room choices and dining habits). A hotelier might use association rule
insights to offer room upgrades or food and beverage promotions to attract additional
business travelers.
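A minimal sketch of association rule mining, assuming the mlxtend library's Apriori implementation; the hotel-stay transactions below are invented to echo the example above:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Hypothetical guest-stay "baskets" of rate types and purchases.
transactions = [
    ["business_rate", "room_service_dinner"],
    ["business_rate", "room_service_dinner", "late_checkout"],
    ["leisure_rate", "pool_bar"],
    ["business_rate", "room_service_dinner"],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets first, then rules linking one behavior to another.
itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```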
2. Anomaly or Outlier Detection
In addition to searching for patterns, data mining seeks to uncover unusual data within a set.
Anomaly detection is the process of finding data that doesn’t conform to the pattern. This
process can help find instances of fraud and help retailers learn more about spikes, or
declines, in the sales of certain products.
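A minimal sketch of anomaly detection on hypothetical daily sales figures, using an Isolation Forest from scikit-learn (one of several possible methods; the library choice and numbers are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical units sold per day; days 5 and 7 are a spike and a decline.
daily_sales = np.array([[102], [98], [105], [97], [101], [430], [99], [12]])

detector = IsolationForest(contamination=0.25, random_state=0)
labels = detector.fit_predict(daily_sales)  # -1 marks non-conforming points
print("anomalous day indices:", np.where(labels == -1)[0])
```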
3. Prediction
Data prediction is a two-step process, similar to that of data classification, although for prediction we do not use the phrase "class label attribute", because the attribute whose values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can be referred to simply as the predicted attribute. Prediction can be viewed as the construction and use of a model to assess the class of an unlabelled object, or to assess the value or value ranges of an attribute that a given object is likely to have.
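A minimal sketch of numeric prediction, assuming scikit-learn; the figures are invented, and the predicted attribute is continuous-valued rather than a discrete class label:

```python
from sklearn.linear_model import LinearRegression

# Labeled objects: years of experience -> salary in thousands (hypothetical).
X_train = [[1], [3], [5], [7], [9]]
y_train = [30, 42, 55, 66, 78]

# Step 1: construct the model; step 2: use it to assess values for new objects.
model = LinearRegression().fit(X_train, y_train)
print("predicted salary for 6 years:", model.predict([[6]])[0])
```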
4. Classification Analysis
With this technique, data points are assigned to groups, or classes, based on a specific
question or problem to address. For instance, if a consumer packaged goods company wants
to optimize its coupon discount strategy for a specific product, it might review inventory
levels, sales data, coupon redemption rates, and consumer behavioral data in order to make
the best decision possible.
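A minimal sketch of classification analysis, assuming scikit-learn; the features and redeem/ignore labels are hypothetical stand-ins for the coupon example:

```python
from sklearn.tree import DecisionTreeClassifier

# [inventory_level, past_redemption_rate] per consumer segment (invented).
X = [[900, 0.10], [150, 0.45], [700, 0.05], [200, 0.60], [120, 0.50]]
y = ["ignore", "redeem", "ignore", "redeem", "redeem"]

# Assign new data points to the predefined classes.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[180, 0.55]]))
```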
5. Clustering Analysis
Clustering looks for similarities within a data set, separating data points that share common
traits into subsets. This is similar to the classification type of analysis in that it groups data
points, but, in clustering analysis, the data is not assigned to previously defined groups.
Clustering is useful for defining traits within a data set, such as the segmentation of
customers based on purchase behavior, need state, life stage, or likely preferences in
marketing communication.
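A minimal sketch of clustering analysis, assuming scikit-learn: k-means groups customers by purchase behavior without predefined classes (the features are invented):

```python
from sklearn.cluster import KMeans

# [annual_spend, visits_per_month] per customer (hypothetical).
X = [[200, 1], [250, 2], [1800, 8], [2100, 9], [950, 4], [1000, 5]]

# Three segments emerge from similarities alone; no labels are supplied.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("segment labels:", kmeans.labels_)
```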
6. Regression Analysis
Regression analysis is about understanding which factors within a data set are most
important, which can be ignored, and how these factors interact. With this technique, data
miners are able to validate theories such as “when a lot of snow is predicted, more bread
and milk will be sold before the storm." While this seems obvious enough, there are a
number of variables that need to be verified and quantified for the store manager to make
sure enough stock is available. For example, how much is “a lot” of snow? How much is
“more milk and bread”? Which types of weather forecasts tend to cause consumer action
and how many days before the storm will consumers start buying? What is the relationship
between inches of snow, units of bread, and units of milk? Through regression analysis,
specific inventory levels of milk and bread (in units/cases) can be recommended for specific
levels of snow forecasted (inches), at specific points in time (days before the storm). In this
way, the use of regression analysis maximizes sales, minimizes out-of-stock instances, and helps avoid the overstocking that results in product spoilage after the storm.
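A minimal sketch of the snowstorm example as a regression, assuming scikit-learn; every figure below is invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# [inches_forecast, days_before_storm] -> [bread_units, milk_units] (invented).
X = np.array([[2, 3], [2, 1], [6, 3], [6, 1], [12, 2], [12, 1]])
y = np.array([[40, 60], [55, 80], [70, 110], [95, 150], [120, 190], [150, 240]])

# Quantify the relationship, then recommend stock for a specific forecast.
model = LinearRegression().fit(X, y)
bread, milk = model.predict([[8, 2]])[0].round()
print(f"8 inches, 2 days out -> stock {bread:.0f} bread, {milk:.0f} milk units")
```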
A data warehouse is a technique for collecting and managing data from varied sources to provide
meaningful business insights. It is a blend of technologies and components which allows the
strategic use of data. A data warehouse is a repository of a large amount of data, designed for query and analysis rather than for transaction processing. It is a process of transforming data into information and making it available to users for analysis.
Data mining, on the other hand, looks for hidden, valid, and potentially useful patterns in huge data sets. It is all about discovering unsuspected, previously unknown relationships amongst the data. It is a multi-disciplinary skill that uses machine learning, statistics, AI, and database technology. The insights extracted via data mining can be used for marketing, fraud detection, scientific discovery, etc.
The two can be contrasted point by point:
Data Mining: Organizations need to spend a lot of resources for training and implementation. Moreover, data mining tools work in different manners due to the different algorithms employed in their design.
Data Warehouse: In a data warehouse, data is pooled from multiple sources. The data needs to be cleaned and transformed, which can be a challenge.
Data Mining: Data mining methods are cost-effective and efficient compared to other statistical data applications.
Data Warehouse: A data warehouse's responsibility is to simplify every type of business data; most of the work on the user's part is inputting the raw data.
Data Mining: Another critical benefit of data mining techniques is the identification of errors that can lead to losses; the generated data can be used to detect a drop in sales.
Data Warehouse: A data warehouse allows users to access critical data from a number of sources in a single place, saving the user's time in retrieving data from multiple sources.
Data Mining: Data mining helps to generate actionable strategies built on data insights.
Data Warehouse: Once you input information into the data warehouse system, you are unlikely to lose track of it again; a quick search helps you find the right statistical information.
Data mining techniques draw on domain knowledge from statistical analysis, artificial intelligence, and database systems in order to analyze data properly across different dimensions and perspectives. Data mining tools discover patterns or trends in large collections of data and transform the data into useful information for making decisions. The most popular data mining tools are Orange, SAS, Rattle, RapidMiner, DataMelt, Oracle BI, etc.
Orange is an open-source, component-based data mining and machine learning software suite written in Python. It is well suited for bioinformatics and text mining, and it is packed with features for data analytics. It can also be used as a Python library.
Orange provides a canvas interface onto which the user places widgets to create a data analysis workflow. The widgets offer fundamental operations, for example reading the data, showing a data table, selecting features, training predictors, comparing learning algorithms, and visualizing data elements. Orange comes with multiple regression and classification algorithms, and can read documents in native and other data formats.
Orange is dedicated to machine learning techniques for classification, or supervised data mining. Two types of objects are used in classification: learners and classifiers. Learners take class-labeled data and return a classifier. Regression methods are very similar to classification in Orange; both are designed for supervised data mining and require class-labeled data. Ensemble learning combines the predictions of individual models for a gain in accuracy; the models can either come from different training data or use different learners on the same data set.
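A minimal sketch of the learner/classifier pattern described above, assuming Orange 3 is installed (pip install orange3) and using its bundled iris dataset:

```python
import Orange

data = Orange.data.Table("iris")               # class-labeled data
learner = Orange.classification.TreeLearner()  # a learner...
classifier = learner(data)                     # ...returns a classifier
prediction = classifier(data[0])               # classify a single instance
print(data[0].get_class(), "->", prediction)
```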
Data Mining is primarily used today by companies with a strong consumer focus, in areas such as retail, finance, communication, and marketing, to "drill down" into their transactional data and determine pricing, customer preferences, product positioning, impact on sales,
customer satisfaction and corporate profits. With data mining, a retailer can use point-of-sale
records of customer purchases to develop products and promotions to appeal to specific customer
segments.
1. Basket Analysis
In its most basic application, retailers use basket analysis to analyze what consumers buy (or
put in their “baskets”). This is a form of the association technique, giving retailers insight
into buying habits and allowing them to recommend other purchases. A less familiar
application is one used by law enforcement, where vast amounts of anonymous consumer
data is analyzed looking for combinations of products one would use in bomb-making or the
production of methamphetamine.
2. Sales Forecasting
Sales forecasting is a form of predictive analysis to which businesses are devoting more of
their budgets. Data mining can help businesses project sales and set targets by examining
historical data such as sales records, financial indicators (e.g., consumer price index, inflation
markers etc.), consumer spending habits, sales attributed to a specific time of year, and
trends which may impact standard assumptions about the business.
3. Bioinformatics
Data Mining approaches seem ideally suited for Bioinformatics, since it is data-rich. Mining
biological data helps to extract useful knowledge from massive datasets gathered in biology,
and in other related life sciences areas such as medicine and neuroscience. Applications of
data mining to bioinformatics include gene finding, protein function inference, disease
diagnosis, disease prognosis, disease treatment optimization, protein and gene interaction
network reconstruction, data cleansing, and protein sub-cellular location prediction.
4. Inventory Planning
Data mining can provide businesses with up-to-date information regarding product
inventory, delivery schedules, and production requirements. Data mining also can help
remove some of the uncertainty that comes with simple supply-and-demand issues within
the supply chain. The speed with which data mining can discern patterns and devise
projections helps companies better manage their product stock and operate more
efficiently.
5. Customer Segmentation
Traditional market research may help to segment customers, but data mining goes deeper and increases marketing effectiveness. Data mining helps align customers into distinct segments so that offerings can be tailored to their needs. Marketing is always about retaining customers: data mining can find segments of customers based on their vulnerability to leaving, so the business can target them with special offers and enhance satisfaction.
6. Customer Relationship Management
Customer Relationship Management is all about acquiring and retaining customers, improving customer loyalty, and implementing customer-focused strategies. To maintain a proper relationship with customers, a business needs to collect and analyze data. This is where data mining plays its part: with data mining technologies, the collected data can be used for analysis, and instead of being unsure where to focus customer-retention efforts, decision-makers get filtered, actionable results.
7. Healthcare
Mining can be used to predict the volume of patients in every category. Processes are
developed that make sure that the patients receive appropriate care at the right place and
at the right time. Data mining can also help healthcare insurers to detect fraud and abuse.
8. Education
There is a new emerging field called Educational Data Mining (EDM), concerned with developing methods that discover knowledge from data originating in educational environments. The goals of EDM include predicting students' future learning behavior, studying the effects of educational support, and advancing scientific knowledge about learning. Data mining can be used by an institution to make accurate decisions and to predict students' results. With these results, the institution can focus on what to teach and how to teach it. Students' learning patterns can be captured and used to develop techniques for teaching them.
9. Intrusion Detection
Any action that will compromise the integrity and confidentiality of a resource is an
intrusion. Defensive measures to avoid an intrusion include user authentication, avoiding programming errors, and information protection. Data mining can help improve intrusion
detection by adding a level of focus to anomaly detection. It helps an analyst to distinguish
an activity from common everyday network activity. Data mining also helps extract data
which is more relevant to the problem.
10. Criminal Investigation
Criminology aims to identify crime characteristics, and crime analysis involves exploring and detecting crimes and their relationships with criminals. The high
volume of crime datasets and also the complexity of relationships between these kinds of
data have made criminology an appropriate field for applying data mining techniques. Text
based crime reports can be converted into word processing files. This information can be
used to perform crime matching process.
11. Fraud Detection
While frequently occurring patterns in data can provide teams with valuable insight,
observing data anomalies is also beneficial, assisting companies in detecting fraud. While
this is a well-known use case within banking and other financial institutions, SaaS-based
companies have also started to adopt these practices to eliminate fake user accounts from
their datasets.
12. Operational Optimization
Process mining leverages data mining techniques to reduce costs across operational
functions, enabling organizations to run more efficiently. This practice has helped to identify
costly bottlenecks and improve decision-making among business leaders.