ML Unit - 1


UNIT -1

INTRODUCTION TO MACHINE LEARNING

Definition:

A subset of artificial intelligence known as machine learning focuses primarily on


the creation of algorithms that enable a computer to independently learn from data and
previous experiences. Arthur Samuel first used the term "machine learning" in 1959. It
could be summarized as follows:

Without being explicitly programmed, machine learning enables a machine to


automatically learn from data, improve performance from experiences, and predict things.

A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks T, as measured by P,
improves with experience E.
Examples
 Handwriting recognition learning problem

 Task T : Recognizing and classifying handwritten words within images

 Performance P : Percent of words correctly classified

 Training experience E : A dataset of handwritten words with given

classifications
 A robot driving learning problem

 Task T : Driving on highways using vision sensors

 Performance P : Average distance traveled before an error

 Training experience E : A sequence of images and steering commands

recorded while observing a human driver

“A computer program which learns from experience is called a machine learning


program or simply a learning program”
How does Machine Learning work

A machine learning system builds prediction models, learns from previous data, and
predicts the output for new data whenever it receives it. The more data that is available,
the better the model that can be built, and hence the more accurate the predicted output.

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology (data-driven means making decisions based on the
analysis and interpretation of data gathered from digital sources).
o Machine learning is similar to data mining, as both deal with huge amounts of data.

Some key points which show the importance of Machine Learning:


o Rapid increase in the production of data

o Solving complex problems, which are difficult for a human


o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data.

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:


1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1. Supervised Learning

In supervised learning, sample labeled data are provided to the machine learning
system for training, and the system then predicts the output based on the training data.

After the training and processing are done, we test the model with sample data to see if it
can accurately predict the output.

The objective of supervised learning is to learn a mapping from the input data to the output data.
Supervised learning depends on supervision: it is similar to a student learning things under
the supervision of a teacher. Spam filtering is an example of supervised learning.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression
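
For illustration, a minimal supervised-learning sketch in Python is given below. It assumes the scikit-learn library is available; the tiny dataset and feature names are made up purely for demonstration and are not part of the original text.

# A minimal supervised-learning sketch (assumes scikit-learn is installed).
# Each row is [weight_in_grams, has_rough_skin]; labels: 0 = apple, 1 = orange.
from sklearn.tree import DecisionTreeClassifier

X_train = [[150, 0], [170, 0], [140, 1], [130, 1]]   # labeled training data
y_train = [0, 0, 1, 1]                                # known outputs (labels)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn the input -> output mapping

print(model.predict([[160, 0]]))     # predict the label of new, unseen data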
2. Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any


supervision.

The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or a group of objects with similar patterns.
We don't have a predetermined result. The machine tries to find useful insights from
the huge amount of data. It can be further classified into two categories of algorithms:

o Clustering
o Association
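
For illustration, a minimal unsupervised-learning (clustering) sketch is given below. scikit-learn is assumed to be available, and the unlabeled points are made up for demonstration only.

# A minimal unsupervised-learning sketch (assumes scikit-learn is installed).
# The points below carry no labels; k-means groups them into clusters on its own.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],
     [10, 2], [10, 4], [10, 0]]      # no labels are provided

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                # cluster assignment found for each point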
3. Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent


gets a reward for each right action and gets a penalty for each wrong action.

The agent learns automatically with these feedbacks and improves its performance.
In reinforcement learning, the agent interacts with the environment and explores it. The
goal of an agent is to get the most reward points, and hence, it improves its performance.

The robotic dog, which automatically learns the movement of its arms, is an example of
reinforcement learning.
4. APPLICATIONS OF ML:
1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is


used to identify objects, persons, places, digital images, etc. The popular use case of
image recognition and face detection is the automatic friend tagging suggestion:

Facebook provides us with a feature of automatic friend tagging suggestions. Whenever we upload
a photo with our Facebook friends, we automatically get tagging suggestions with names, and
the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face
recognition and person identification in the picture.

2. Speech Recognition

While using Google, we get a "Search by voice" option; this comes under speech
recognition, and it's a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also
known as "Speech to text", or "Computer speech recognition." At present, machine
learning algorithms are widely used by various applications of speech recognition. Google
assistant, Siri, and Alexa are using speech recognition technology to follow the voice
instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the
correct path with the shortest route and predicts the traffic conditions.

It predicts the traffic conditions such as whether traffic is cleared, slow-moving, or


heavily congested with the help of two ways:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time.

Everyone who uses Google Maps is helping to make the app better. It takes
information from the user and sends it back to its database to improve the performance.

4. Product recommendations:

Machine learning is widely used by various e-commerce and entertainment


companies such as Amazon, Netflix, etc., for product recommendation to the user.
Whenever we search for a product on Amazon, we start getting advertisements
for the same product while surfing the internet in the same browser, and this is
because of machine learning.

Google understands the user interest using various machine learning algorithms and
suggests the product as per customer interest.

Similarly, when we use Netflix, we find recommendations for entertainment
series, movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars.


Machine learning plays a significant role in self-driving cars. Tesla, a well-known car
manufacturing company, is working on self-driving cars and uses machine learning methods
to train its car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal,
or spam. We always receive important mail in our inbox marked with the important symbol,
and spam emails land in our spam box; the technology behind this is machine learning.
Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree,


and Naïve Bayes classifier are used for email spam filtering and malware detection.

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google


assistant, Alexa, Cortana, Siri. As the name suggests, they help us in finding the
information using our voice instruction. These assistants can help us in various ways just
by our voice instructions such as Play music, call someone, Open an email, Scheduling an
appointment, etc.

These virtual assistants use machine learning algorithms as an important part.

These assistants record our voice instructions, send them to a server on the cloud, decode
them using ML algorithms, and act accordingly.

8. Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, fraud can occur in various ways,
such as fake accounts, fake IDs, or money being stolen in the middle of a transaction. To
detect this, a feed-forward neural network helps by checking whether a transaction is
genuine or fraudulent.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is
always a risk of shares going up and down, so machine learning's long short-term
memory (LSTM) neural network is used for the prediction of stock market trends.
10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical
technology is growing very fast and is able to build 3D models that can predict the exact
position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and we are not aware of the language, it
is not a problem at all, as machine learning helps us by converting the text into a language
we know. Google's GNMT (Google Neural Machine Translation) provides this feature: it is a
neural machine translation system that translates text into our familiar language, and this
is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning
algorithm, which is used with image recognition and translates the text from one language
to another language.

WELL POSED LEARNING PROBLEMS:

Well Posed Learning Problem – A computer program is said to learn from experience E
with respect to some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E.
Any problem can be regarded as a well-posed learning problem if it has three traits –
 Task
 Performance Measure
 Experience
For the checkers learning problem, for example, the final design of such a learning system consists of four modules:
 The performance System — Takes a new board as input and outputs a trace of the

game it played against itself.


 The Critic — Takes the trace of a game as an input and outputs a set of training

examples of the target function.


 The Generalizer — Takes training examples as input and outputs a hypothesis that

estimates the target function. Good generalization to new cases is crucial.


 The Experiment Generator — Takes the current hypothesis (currently learned

function) as input and outputs a new problem (an initial board state) for the
performance system to explore.

The Performance System solves the given performance task, in this case playing checkers,
by using the learned target function(s). The Critic takes as input the history or trace of the
game and produces as output a set of training examples of the target function. The
Generalizer takes as input the training examples and produces an output hypothesis that is
its estimate of the target function. The Experiment Generator takes as input the current
hypothesis (currently learned function) and outputs a new problem (i.e., an initial board
state) for the Performance System to explore. Its role is to pick new practice problems that
will maximize the learning rate of the overall system. To get a successful learning system,
it should be designed properly; several steps may be followed to obtain a perfect and
efficient design.
Certain examples that illustrate well-posed learning problems are –
1. To better filter emails as spam or not
 Task – Classifying emails as spam or not
 Performance Measure – The fraction of emails accurately classified as spam or not
spam
 Experience – Observing you label emails as spam or not spam
2. A checkers learning problem
 Task – Playing the game of checkers
 Performance Measure – Percent of games won against the opponent
 Experience – Playing practice games against itself
3. Handwriting Recognition Problem
 Task – Recognizing handwritten words within images
 Performance Measure – Percent of words accurately classified
 Experience – A dataset of handwritten words with given classifications
4. A Robot Driving Problem
 Task – Driving on public four-lane highways using vision sensors
 Performance Measure – Average distance traveled before an error
 Experience – A sequence of images and steering commands recorded while observing a
human driver
5. Fruit Prediction Problem
 Task – forecasting different fruits for recognition
 Performance Measure – able to predict maximum variety of fruits
 Experience – training machine with the largest datasets of fruits images
6. Face Recognition Problem
 Task – predicting different types of faces
 Performance Measure – able to predict maximum types of faces
 Experience – training machine with maximum amount of datasets of different face
images
7. Automatic Translation of documents
 Task – translating one type of language used in a document to other language
 Performance Measure – able to convert one language to other efficiently
 Experience – training machine with a large dataset of different types of languages
DESIGNING A LEARNING SYSTEM IN MACHINE LEARNING:

According to Tom Mitchell, "A computer program is said to learn from

experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E."
Example: In Spam E-Mail detection,
 Task, T: To classify mails into Spam or Not Spam.
 Performance measure, P: Total percent of mails being correctly classified as being
“Spam” or “Not Spam”.
 Experience, E: A set of mails labeled "Spam" or "Not Spam"

Step 1) Choosing the Training Experience:

The first and most important task is to choose the training data or training
experience which will be fed to the machine learning algorithm. It is important to
note that the data or experience that we feed to the algorithm has a significant impact
on the success or failure of the model, so the training data or experience should be chosen
wisely.
Below are the attributes which impact the success or failure of the model:
 The training experience will be able to provide direct or indirect feedback
regarding choices.
For example: while playing chess, the training experience provides feedback to the learner,
such as "instead of this move, if that move is chosen, the chances of success increase."
 The second important attribute is the degree to which the learner controls the
sequence of training examples.
For example: when training data is first fed to the machine, the accuracy is
very low, but as it gains experience while playing again and again with itself or an
opponent, the machine algorithm gets feedback and controls the chess game
accordingly.

 The third important attribute is how well the training experience represents the
distribution of examples over which the final performance will be measured.
For example: a machine learning algorithm gets experience by going through
a number of different cases and different examples. Thus, the machine learning
algorithm gets more and more experience by passing through more and more
examples, and hence its performance increases.

Step 2- Choosing target function:

The next important step is choosing the target function. It means that, according to the
knowledge fed to the algorithm, the machine learning system will choose a NextMove
function which describes what type of legal moves should be taken.
For example: while playing chess with the opponent, when the opponent plays, the
machine learning algorithm decides which of the possible legal moves should be taken
in order to succeed.

Step 3- Choosing Representation for Target function:

When the machine algorithm knows all the possible legal moves, the next step
is to choose a representation for the optimized move, i.e., using linear
equations, a hierarchical graph representation, a tabular form, etc. The NextMove
function then picks, out of these moves, the target move that provides the higher
success rate.
For example: while playing chess, if the machine has 4 possible moves, it will choose
the optimized move which will lead it to success.
Step 4- Choosing Function Approximation Algorithm:

An optimized move cannot be chosen just with the training data. The learner has to
go through a set of examples, and through these examples it approximates which steps
should be chosen; after that, the machine receives feedback on them.
For example: when the training data of playing chess is fed to the algorithm, it is not
yet known whether the machine algorithm will fail or succeed; from each failure or
success it estimates, for the next move, what step should be chosen and what its
success rate is.

Step 5- Final Design:

The final design is created at the end, after the system has gone through a number of
examples, failures and successes, correct and incorrect decisions, and has learned what
the next step should be.
Example: Deep Blue, an ML-based intelligent computer, won a chess game against the
chess expert Garry Kasparov and became the first computer to beat a human chess expert.
PERSPECTIVES AND ISSUES IN MACHINE LEARNING:

1. Insufficient Training Data:

A major issue that arises while using machine learning algorithms is the lack
of quality as well as quantity of data. Although data plays the main role in the
processing of machine learning algorithms, many data scientists report that
insufficient data, noisy data, and unclean data severely hamper machine
learning algorithms.

For example:

A simple task may require thousands of samples, and an advanced task such
as speech or image recognition may need millions of samples.

Data quality can be affected by some factors as follows:

o Noisy Data- It is responsible for an inaccurate prediction that affects the decision
as well as accuracy in classification tasks.

o Incorrect data- It is also responsible for faulty programming and results obtained
in machine learning models. Hence, incorrect data may affect the accuracy of the
results also.

o Generalizing of output data - Sometimes it is also found that generalizing output
data becomes complex, which results in comparatively poor future actions.

2. Poor quality of data:


As we have discussed above, data plays a significant role in machine learning,
and it must be of good quality as well. Noisy data, incomplete data, inaccurate
data, and unclean data lead to less accuracy in classification and low-quality
results. Hence, data quality can also be considered as a major common problem
while processing machine learning algorithms.

3. Overfitting and Underfitting:

Overfitting:

Overfitting is one of the most common issues faced by Machine Learning engineers
and data scientists. Whenever a machine learning model is trained with a huge amount
of data, it starts capturing noise and inaccurate data into the training data set. It negatively
affects the performance of the model. Let's understand with a simple example where we
have a few training data sets such as 1000 mangoes, 1000 apples, 1000 bananas, and 5000
papayas.

Then there is a considerable probability that an apple will be identified as a papaya,
because we have a massive amount of biased data in the training data set; hence, the
prediction gets negatively affected. A common reason behind overfitting is the use of
highly flexible non-linear methods in machine learning algorithms, as they can build
unrealistic data models.

We can overcome overfitting by using simpler linear and parametric algorithms in the
machine learning models.

Methods to reduce overfitting:

o Increase training data in a dataset.


o Reduce model complexity by simplifying the model by selecting one with fewer
parameters
o Ridge Regularization and Lasso Regularization (Ridge adds an L2 penalty and
Lasso adds an L1 penalty to linear regression models, preventing overfitting; see the sketch after this list)
o Early stopping during the training phase
o Reduce the noise
o Reduce the number of attributes in training data.
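
As mentioned in the list above, Ridge and Lasso regularization are common remedies for overfitting. The sketch below is a minimal illustration using scikit-learn (assumed installed) on synthetic data; it is not a prescribed implementation, and the variable names are our own.

# A minimal sketch of Ridge (L2) and Lasso (L1) regularization with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                        # 50 samples, 10 features
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=50)   # only feature 0 truly matters

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)            # alpha controls the penalty strength
lasso = Lasso(alpha=0.1).fit(X, y)            # L1 drives irrelevant weights to zero

print("plain:", np.round(plain.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))     # weights shrunk toward zero
print("lasso:", np.round(lasso.coef_, 2))     # many weights exactly zero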
Underfitting:

Underfitting is just the opposite of overfitting. Whenever a machine learning model
is trained with too little data, it produces incomplete and inaccurate predictions and
destroys the accuracy of the machine learning model.

Underfitting occurs when our model is too simple to capture the underlying structure
of the data. This generally happens when we have limited data in the data set and we try
to build a linear model with non-linear data.

Methods to reduce underfitting:

o Increase model complexity
o Remove noise from the data
o Train on more and better features
o Reduce the constraints
o Increase the number of epochs (training iterations) to get better results.

4. Monitoring and maintenance

As we know, generalized output data is mandatory for any machine
learning model; hence, regular monitoring and maintenance become compulsory.
Different results for different actions require changes to the data; hence, editing the
code, as well as the resources for monitoring the model, also becomes necessary.

5. Lack of skilled resources

Although Machine Learning and Artificial Intelligence are continuously
growing in the market, these industries are still relatively new compared to others. The
absence of skilled human resources is also an issue. Hence, we need manpower with
in-depth knowledge of mathematics, science, and technology for developing and
managing scientific work in machine learning.

6. Customer Segmentation

Customer segmentation is also an important issue while developing a machine
learning algorithm: it is hard to identify the customers who act on the recommendations
shown by the model and those who do not even check them.

Hence, an algorithm is necessary to recognize customer behaviour and trigger
a relevant recommendation for the user based on past experience.

7. Process Complexity of Machine Learning

The machine learning process is very complex, which is another major
issue faced by machine learning engineers and data scientists. Machine
Learning and Artificial Intelligence are very new technologies, still largely in an
experimental phase and continuously changing over time. Much of the work proceeds
by trial and error, so the probability of error is higher than expected.
Further, the process also includes analyzing the data, removing data bias, training the
model, applying complex mathematical calculations, and so on.

8. Data Bias

Data bias is also a big challenge in machine learning. These errors
exist when certain elements of the dataset are weighted more heavily or given more
importance than others. Biased data leads to inaccurate results, skewed outcomes,
and other analytical errors. We can resolve this error by determining where the data
is actually biased in the dataset and then taking the necessary steps to reduce it.

Methods to remove Data Bias:

o Research more for customer segmentation.


o Combine inputs from multiple sources to ensure data diversity.
o Include bias testing in the development process.
o Analyze data regularly and keep tracking errors to resolve them easily.
Algorithms in Machine Learning

o Linear Regression
o Logistic Regression
o Decision Tree
o Bayes Theorem and Naïve Bayes Classification
o Support Vector Machine (SVM) Algorithm
o K-Nearest Neighbor (KNN) Algorithm
o K-Means
o Gradient Boosting algorithms
o Dimensionality Reduction Algorithms
o Random Forest

CONCEPT LEARNING AND THE GENERAL-TO-SPECIFIC ORDERING:

A concept means the essence of something, the category of something.


For example, as human beings we have learned the concept of a ship. We know what it is, and
no matter how many new, high-tech and fancy ships we are shown, we can still classify
them as ships, even though we might not have seen those specific models before! We can
do this because we have learned the general concept of a ship that encompasses the infinite
number of different kinds of ships that were, are, or will ever be built. What other concepts
can you think of?

1. Bird
2. Stressful situation
3. Cold, hot

Now please note that, on a more technical level we can think of a concept as a Boolean
function! As we know, every function has a domain (i.e., the inputs) and a range (i.e., the
outputs)! So can you think of the domain and range of a Boolean function that represents
a concept such as the concept of bird?
 this function is true for birds and false for anything else (tigers, ants, …)

“A machine learning algorithm tries to infer the general definition of some concept,
through some training examples that are labeled as members or non-members of that
particular concept”
The whole idea is to estimate the true underlying Boolean function (i.e., concept),
which can successfully fit the training examples perfectly and spit out the right output.
This means that if the label for a training example is positive (i.e., a member of the
concept) or negative, we would like for our learned function to correctly determine all
these cases.
Let’s say the task is to learn the following target concept:

Days on which James will play golf

 What is a hypothesis?
 How do we represent a hypothesis to a machine learning algorithm?
We can simply define a hypothesis as a vector of some constraints on the attributes. In
our example below, a hypothesis consists of 3 constraints. Now, a constraint for a given
attribute could have different shapes and forms:

 Totally free of constraint (denoted with a question mark "?"): we really
don’t put any constraint on a particular attribute, as that attribute doesn’t play
an important role in learning the concept of play in this hypothesis.

 Strong constraint on the value of the attribute: specified by the exact
required value for that particular attribute (e.g., Cold).

 No value is acceptable for the attribute (denoted with ∅): no value satisfies
the constraint, meaning that no day is a positive example (i.e., play = yes).

It is important to remind ourselves that if a given training example x (or a test
example, for that matter) satisfies all the constraints in
a given hypothesis h (i.e., h(x) = 1), then we say that hypothesis h classifies x
as a positive example (i.e., play = yes). Otherwise, if x fails to satisfy even one
constraint in h, then h(x) = 0, and hypothesis h classifies input x as a
negative example (i.e., play = no).
So, for a positive training example x, c(x) = 1, and for a negative training example, c(x) = 0.
The goal of learning is to find a hypothesis h such that h(x) = c(x) for every example x.
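
The following small Python sketch (illustrative only; the attribute values are made up) shows how such a hypothesis, with "?" constraints, exact-value constraints, and the ∅ constraint, can be checked against an example.

# A small sketch of the hypothesis representation described above:
# '?' accepts any value, a concrete value is a strong constraint, and
# None stands in for the "no value acceptable" (∅) constraint.
def satisfies(hypothesis, example):
    """Return 1 if the example meets every constraint in the hypothesis, else 0."""
    for constraint, value in zip(hypothesis, example):
        if constraint is None:                 # ∅: nothing can satisfy this attribute
            return 0
        if constraint != '?' and constraint != value:
            return 0
    return 1

h = ('Sunny', '?', '?')                        # hypothetical 3-attribute hypothesis
print(satisfies(h, ('Sunny', 'Warm', 'Strong')))   # -> 1 (classified positive, play = yes)
print(satisfies(h, ('Rainy', 'Warm', 'Strong')))   # -> 0 (classified negative, play = no)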
CONCEPT LEARNING AS SEARCH: THE FIND-S ALGORITHM:

The Find-S algorithm is a machine learning algorithm that seeks to find a maximally
specific hypothesis based on labeled training data. It starts with the most specific
hypothesis and generalizes it by incorporating positive examples. It ignores negative
examples during the learning process.

The algorithm's objective is to discover a hypothesis that accurately represents the


target concept by progressively expanding the hypothesis space until it covers all positive
instances

Symbols used in the Find-S algorithm are described in the "Important Representation" list below.
Inner working of Find-S algorithm
 Initialization − The algorithm starts with the most specific hypothesis, denoted as h.
This initial hypothesis is the most restrictive concept and typically assumes no
positive examples. It may be represented as h = <∅, ∅, ..., ∅>, where ∅ denotes that
no value is acceptable for that attribute.

 Iterative Process − The algorithm iterates through each training example and refines
the hypothesis based on whether the example is positive or negative.
o For each positive training example (an example labeled as the target class), the
algorithm updates the hypothesis by generalizing it to include the attributes of
the example. The hypothesis becomes more general as it covers more positive
examples.
o For each negative training example (an example labeled as a non-target class),
the algorithm ignores it as the hypothesis should not cover negative examples.
The hypothesis remains unchanged for negative examples.

 Generalization − After processing all the training examples, the algorithm produces
a final hypothesis that covers all positive examples while excluding negative
examples. This final hypothesis represents the generalized concept that the algorithm
has learned from the training data.

Let's explore the steps of the algorithm using a practical example −

Suppose, we have a dataset of animals with two attributes: "has fur" and "makes sound."
Each animal is labeled as either a dog or a cat. Here is a sample training dataset .
To apply the Find-S algorithm, we start with the most specific hypothesis, denoted as h,
which initially represents the most restrictive concept. In our example, the initial
hypothesis would be h = <∅, ∅>, indicating that no specific animal matches the concept.
 For each positive training example (an example labeled as the target class), we update
the hypothesis h to include the attributes of that example. In our case, the positive
training examples are dogs. Therefore, h would be updated to h = <Yes, Yes>.
 For each negative training example (an example labeled as a non-target class), we
ignore it as the hypothesis h should not cover those examples. In our case, the
negative training examples are cats, and since h already covers dogs, we don't need
to update the hypothesis.
 After processing all the training examples, we obtain a generalized hypothesis that
covers all positive training examples and excludes negative examples. In our
example, the final hypothesis h = <Yes, Yes> accurately represents the concept of a
dog.

Python program illustrating the Find-S algorithm:
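
The exact program is not reproduced here; the following is a minimal sketch of Find-S, run on the EnjoySport-style examples traced later in this unit (the string 'phi' plays the role of the ϕ constraint and '?' accepts any value).

# A minimal Find-S sketch on the EnjoySport-style training data used in this unit.
training_data = [
    (['sunny', 'warm', 'normal', 'strong', 'warm', 'same'],  'yes'),
    (['sunny', 'warm', 'high',   'strong', 'warm', 'same'],  'yes'),
    (['rainy', 'cold', 'high',   'strong', 'warm', 'change'], 'no'),
    (['sunny', 'warm', 'high',   'strong', 'cool', 'change'], 'yes'),
]

def find_s(examples):
    n = len(examples[0][0])
    h = ['phi'] * n                      # 1. start with the most specific hypothesis
    for x, label in examples:
        if label != 'yes':               # 2. negative examples are ignored
            continue
        for i, value in enumerate(x):    # 3. minimally generalize h to cover x
            if h[i] == 'phi':
                h[i] = value
            elif h[i] != value:
                h[i] = '?'
    return h                             # 4. final, maximally specific hypothesis

print(find_s(training_data))             # -> ['sunny', 'warm', '?', 'strong', '?', '?']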


Important Representation :

1. ? indicates that any value is acceptable for the attribute.


2. specify a single required value ( e.g., Cold ) for the attribute.
3. ϕ indicates that no value is acceptable.
4. The most general hypothesis is represented by: {?,?, ?, ?, ?, ?}
5. The most specific hypothesis is represented by: {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}

Steps Involved In Find-S :

1. Start with the most specific hypothesis.


h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
2. Take the next example and if it is negative, then no changes occur to the hypothesis.
3. If the example is positive and we find that our initial hypothesis is too specific then
we update our current hypothesis to a general condition.
4. Keep repeating the above steps till all the training examples are complete.
5. After we have completed all the training examples, we will have the final hypothesis,
which we can use to classify the new examples.
Example:
Algorithm :

1. Initialize h to the most specific hypothesis in H


2. For each positive training instance x
      For each attribute constraint aᵢ in h
         If the constraint aᵢ is satisfied by x
         Then do nothing
         Else replace aᵢ in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
Version Space Learning for ML

CANDIDATE ELIMINATION ALGORITHM:


The candidate elimination algorithm incrementally builds the version space given
a hypothesis space H and a set E of examples. The examples are added one by one; each
example possibly shrinks the version space by removing the hypotheses that are
inconsistent with the example. The candidate elimination algorithm does this by
updating the general and specific boundary for each new example.
 You can consider this as an extended form of the Find-S algorithm.
 It considers both positive and negative examples.
 Positive examples are used, as in the Find-S algorithm, to generalize the specific boundary.
 Negative examples are used to specialize the general boundary.

Terms Used:

 Concept learning: the learning task of the machine (learning from training data).
 General hypothesis: does not specify constraints on the features;
 G = {‘?’, ‘?’, ’?’, ’?’…}, with one entry per attribute.
 Specific hypothesis: specifies constraints on the features;
 S = {‘ϕ’, ’ϕ’, ’ϕ’…}, with one entry per attribute.
 Version space: the intermediate region between the general and the specific hypotheses.
It contains not just one hypothesis but the set of all possible hypotheses consistent with the
training data set.
Initially : G = [[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?]]
S = [Null, Null, Null, Null, Null, Null]

For instance 1 : <'sunny','warm','normal','strong','warm ','same'> and positive output.


G1 = G
S1 = ['sunny','warm','normal','strong','warm ','same']

For instance 2 : <'sunny','warm','high','strong','warm ','same'> and positive output.


G2 = G
S2 = ['sunny','warm',?,'strong','warm ','same']

For instance 3 : <'rainy','cold','high','strong','warm ','change'> and negative output.


G3 = [['sunny', ?, ?, ?, ?, ?], [?, 'warm', ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, 'same']]
S3 = S2

For instance 4 : <'sunny','warm','high','strong','cool','change'> and positive output.


G4 = G3
S4 = ['sunny','warm',?,'strong', ?, ?]
Output:
G = [['sunny', ?, ?, ?, ?, ?], [?, 'warm', ?, ?, ?, ?]]
S = ['sunny','warm',?,'strong', ?, ?]
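
A hedged Python sketch of the candidate elimination update rules is given below; it reproduces the trace above, taking attribute domains to be the values observed in the data. It is illustrative rather than a definitive implementation (function names are our own).

# A sketch of candidate elimination for conjunctive hypotheses.
# 'phi' is the ϕ constraint and '?' accepts any value.
def covers(h, x):
    return all(a == '?' or a == v for a, v in zip(h, x))

def more_general(h1, h2):
    """Attribute-wise test that h1 is more general than or equal to h2."""
    return all(a == '?' or a == b or b == 'phi' for a, b in zip(h1, h2))

def generalize(s, x):
    """Minimal generalization of s that covers the positive example x."""
    return tuple(v if a == 'phi' else (a if a == v else '?') for a, v in zip(s, x))

def specializations(g, x, domains):
    """Minimal specializations of g that exclude the negative example x."""
    result = []
    for i, a in enumerate(g):
        if a == '?':
            for value in domains[i]:
                if value != x[i]:
                    h = list(g); h[i] = value
                    result.append(tuple(h))
    return result

def candidate_elimination(examples):
    n = len(examples[0][0])
    domains = [sorted({x[i] for x, _ in examples}) for i in range(n)]
    S = {tuple(['phi'] * n)}                  # with this representation S stays a single hypothesis
    G = {tuple(['?'] * n)}
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}                    # drop inconsistent g
            S = {generalize(s, x) for s in S}                     # generalize S
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not covers(s, x)}                # drop inconsistent s
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)
                else:                                             # specialize g
                    for h in specializations(g, x, domains):
                        if any(more_general(h, s) for s in S):
                            new_G.add(h)
            # keep only the maximally general members of G
            G = {g for g in new_G
                 if not any(o != g and more_general(o, g) for o in new_G)}
    return S, G

examples = [
    (('sunny', 'warm', 'normal', 'strong', 'warm', 'same'),  True),
    (('sunny', 'warm', 'high',   'strong', 'warm', 'same'),  True),
    (('rainy', 'cold', 'high',   'strong', 'warm', 'change'), False),
    (('sunny', 'warm', 'high',   'strong', 'cool', 'change'), True),
]
S, G = candidate_elimination(examples)
print("S:", S)   # {('sunny', 'warm', '?', 'strong', '?', '?')}
print("G:", G)   # {('sunny', '?', '?', '?', '?', '?'), ('?', 'warm', '?', '?', '?', '?')}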

The Candidate Elimination Algorithm (CEA) is an improvement over the Find-S


algorithm for classification tasks. While CEA shares some similarities with Find-S, it
also has some essential differences that offer advantages and disadvantages. Here are
some advantages and disadvantages of CEA in comparison with Find-S:
Advantages of CEA over Find-S:

1. Improved accuracy: CEA considers both positive and negative examples to generate
the hypothesis, which can result in higher accuracy when dealing with noisy or
incomplete data.

2. Flexibility: CEA can handle more complex classification tasks, such as those with
multiple classes or non-linear decision boundaries.

3. More efficient: CEA reduces the number of hypotheses by generating a set of general
hypotheses and then eliminating them one by one. This can result in faster processing
and improved efficiency.

4. Better handling of continuous attributes: CEA can handle continuous attributes by


creating boundaries for each attribute, which makes it more suitable for a wider range
of datasets.

Disadvantages of CEA in comparison with Find-S:

1. More complex: CEA is a more complex algorithm than Find-S, which may make it
more difficult for beginners or those without a strong background in machine
learning to use and understand.

2. Higher memory requirements: CEA requires more memory to store the set of
hypotheses and boundaries, which may make it less suitable for memory-constrained
environments.
3. Slower processing for large datasets: CEA may become slower for larger datasets
due to the increased number of hypotheses generated.

4. Higher potential for overfitting: The increased complexity of CEA may make it more
prone to overfitting on the training data, especially if the dataset is small or has a
high degree of noise.
1.1 Version Spaces
Definition (Version space). A concept is complete if it covers all positive
examples.

A concept is consistent if it covers none of the negative examples. The


version space is the set of all complete and consistent concepts. This set
is convex and is fully defined by its least and most general elements.

The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description of
the set of all hypotheses consistent with the training examples.

1.7.1 Representation
The Candidate – Elimination algorithm finds all describable hypotheses that are
consistent with the
observed training examples. In order to define this algorithm precisely,
we begin with a few basic definitions. First, let us say that a hypothesis is
consistent with the training examples if it correctly classifies these
examples.

Definition: A hypothesis h is consistent with a set of training examples


D if and only if h(x) = c(x) for each example (x, c(x)) in D.

Note the difference between the definitions of consistent and satisfies:

An example x is said to satisfy hypothesis h when h(x) = 1,
regardless of whether x is a positive or negative example of the
target concept.
An example x is said to be consistent with hypothesis h iff h(x) = c(x).

Definition (version space): The version space, denoted VS_{H,D}, with
respect to hypothesis space H and training examples D, is the subset of
hypotheses from H consistent with the training examples in D.

1.7.2 The LIST-THEN-ELIMINATE algorithm


The LIST-THEN-ELIMINATE algorithm first initializes the version
space to contain all hypotheses in H and then eliminates any hypothesis
found inconsistent with any training example.

1. VersionSpace ← a list containing every hypothesis in H


2. For each training example, (x, c(x)) remove from VersionSpace any
hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
LIST-THEN-ELIMINATE works in principle, so long as the hypothesis space H is finite.
However, since it requires exhaustive enumeration of all hypotheses, in practice it is
not feasible.

A More Compact Representation for Version Spaces


The version space is represented by its most general and least general
members. These members form general and specific boundary sets that
delimit the version space within the partially ordered hypothesis space.
Definition: The general boundary G, with respect to hypothesis space
H and training data D, is the set of maximally general members of H
consistent with D:

G ≡ { g ∈ H | Consistent(g, D) ∧ ¬(∃ g' ∈ H) [ (g' >g g) ∧ Consistent(g', D) ] }

Definition: The specific boundary S, with respect to hypothesis space
H and training data D, is the set of minimally general (i.e., maximally
specific) members of H consistent with D:

S ≡ { s ∈ H | Consistent(s, D) ∧ ¬(∃ s' ∈ H) [ (s >g s') ∧ Consistent(s', D) ] }

Theorem (Version Space representation theorem): Let X be an arbitrary set of instances
and let H be a set of Boolean-valued hypotheses defined over X. Let c: X → {0, 1} be an
arbitrary target concept defined over X, and let D be an arbitrary set of training examples
{(x, c(x))}. For all X, H, c, and D such that S and G are well defined,

VS_{H,D} = { h ∈ H | (∃ s ∈ S)(∃ g ∈ G)(g ≥g h ≥g s) }

To prove:
1. Every h satisfying the right-hand side of the above expression is in VS_{H,D}.
2. Every member of VS_{H,D} satisfies the right-hand side of the expression.

Sketch of proof:
1. Let g, h, s be arbitrary members of G, H, S respectively with g ≥g h ≥g s.
 By the definition of S, s must be satisfied by all positive examples in D.
Because h ≥g s, h must also be satisfied by all positive examples in D.
 By the definition of G, g cannot be satisfied by any negative
example in D, and because g ≥g h, h cannot be satisfied by any
negative example in D. Because h is satisfied by all positive
examples in D and by no negative examples in D, h is consistent
with D, and therefore h is a member of VS_{H,D}.
2. The second part can be proven by assuming some h in VS_{H,D} that does not
satisfy the right-hand side of the expression, and then showing that
this leads to an inconsistency.
1.7.3 CANDIDATE-ELIMINATION Learning Algorithm

The CANDIDATE-ELIMINATION algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training examples.

Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
• Remove s from S
• Add to S all minimal generalizations h of s such that
• h is consistent with d, and some member of G is more general than
h
• Remove from S any hypothesis that is more general than another
hypothesis in S

• If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than
h
• Remove from G any hypothesis that is less general than another hypothesis in G

CANDIDATE-ELIMINATION algorithm using version spaces

Example :

Example Sky AirTemp Humidity Wind Water Forecast EnjoySport


1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

The CANDIDATE-ELIMINATION algorithm begins by initializing
the version space to the set of all hypotheses in H.

Initializing the G boundary set to contain the most general hypothesis in H:

G0 = <?, ?, ?, ?, ?, ?>

Initializing the S boundary set to contain the most specific (least general)
hypothesis:
S0 = <∅, ∅, ∅, ∅, ∅, ∅>

When the first training example is presented, the CANDIDATE-ELIMINATION
algorithm checks the S boundary and finds that it is overly specific: it fails to cover
the positive example.
The boundary is therefore revised by moving it to the least more
general hypothesis that covers this new example.
No update of the G boundary is needed in response to this training
example because G0 correctly covers this example.
 When the second training example is observed, it has a similar
effect of generalizing S further to S2, leaving G again
unchanged i.e., G2 = G1 =G0
 Consider the third training example. This negative
example reveals that the G boundary of the version
space is overly general, that is, the hypothesis in G
incorrectly predicts that this new example is a positive
example.
 The hypothesis in the G boundary must therefore be
specialized until it correctly classifies this new
negative example.

Given that there are six attributes that could be specified to
specialize G2, why are there only three new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?)
is a minimal specialization of G2 that correctly labels
the new example as a negative example, but it is not
included in G3. The reason this hypothesis is excluded
is that it is inconsistent with the previously
encountered positive examples.
Consider the fourth training example.

 This positive example further generalizes the S boundary of


the version space. It also results in removing one member of
the G boundary, because this member fails to cover the new
positive example

After processing these four examples, the boundary sets S4 and G4
delimit the version space of all hypotheses consistent with the set
of incrementally observed training examples.

Example 2:
Time Weather Temp Company Humidity Wind Goes
Morning Sunny Warm Yes Mild Strong Yes
Evening Rainy Cold No Mild Normal No
Morning Sunny Moderate Yes Normal Normal Yes
Evening Sunny Cold Yes High Strong Yes

Attributes:
Time, weather, temp, company, humidity, wind
Target: goes for walk or not
+ve yes -ve no
Specific hypothesis(s)
General hypothesis (g)
S0 = [Null, Null, Null, Null, Null, Null]
G0 = [?, ?, ?, ?, ?, ?]

Step 1:
X0 = <Morning, Sunny, Warm, Yes, Mild, Strong>   (+ve)
Update the specific hypothesis:
S1 = [Morning, Sunny, Warm, Yes, Mild, Strong]
G1 = [?, ?, ?, ?, ?, ?]

Step 2:
X1 = <Evening, Rainy, Cold, No, Mild, Normal>   (-ve instance)
Update the general hypothesis:
S2 = [Morning, Sunny, Warm, Yes, Mild, Strong]   (S1 and S2 are the same)
G2 = [Morning, ?, ?, ?, ?, ?], [?, Sunny, ?, ?, ?, ?], [?, ?, Warm, ?, ?, ?],
     [?, ?, ?, Yes, ?, ?], [?, ?, ?, ?, ?, Strong]

Step 3:
X2 = <Morning, Sunny, Moderate, Yes, Normal, Normal>   (+ve instance, so update the specific hypothesis)
S3 = [Morning, Sunny, ?, Yes, ?, ?]
G3 = [Morning, ?, ?, ?, ?, ?], [?, Sunny, ?, ?, ?, ?], [?, ?, ?, Yes, ?, ?]

Step 4:
X3 = <Evening, Sunny, Cold, Yes, High, Strong>   (+ve instance, update the specific hypothesis)
S4 = [?, Sunny, ?, Yes, ?, ?]
G4 = [?, Sunny, ?, ?, ?, ?], [?, ?, ?, Yes, ?, ?]

Final result:
Specific hypothesis = [?, Sunny, ?, Yes, ?, ?]
General hypotheses = [?, Sunny, ?, ?, ?, ?], [?, ?, ?, Yes, ?, ?]

DECISION TREE:
A decision tree is a type of supervised learning algorithm that is
commonly used in machine learning to model and predict outcomes based on
input data.
It is a tree-like structure where each internal node tests an attribute, each
branch corresponds to an attribute value, and each leaf node represents the
final decision or prediction. The decision tree algorithm falls under the
category of supervised learning. Decision trees can be used to solve
both regression and classification problems.
Decision Tree Terminologies:

 Root Node: A decision tree’s root node, which represents the original
choice or feature from which the tree branches, is the highest node.

 Internal Nodes (Decision Nodes): Nodes in the tree whose choices are
determined by the values of particular attributes. There are branches on
these nodes that go to other nodes.

 Leaf Nodes (Terminal Nodes): The ends of branches, where choices or
predictions are decided upon. There are no further branches on leaf nodes.

 Branches (Edges): Links between nodes that show how decisions are
made in response to particular circumstances.

 Splitting: The process of dividing a node into two or more sub-nodes


based on a decision criterion. It involves selecting a feature and a
threshold to create subsets of data.

 Parent Node: A node that is split into child nodes. The original node from
which a split originates.

 Child Node: Nodes created as a result of a split from a parent node.

 Decision Criterion: The rule or condition used to determine how the data
should be split at a decision node. It involves comparing feature values
against a threshold.

 Pruning: The process of removing branches or nodes from a decision tree


to improve its generalization and prevent overfitting.

How Decision Tree is formed?


The process of forming a decision tree involves recursively partitioning
the data based on the values of different attributes. The algorithm selects
the best attribute to split the data at each internal node, based on certain
criteria such as information gain or Gini impurity. This splitting process
continues until a stopping criterion is met, at which point a leaf node is created.
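
As an illustration of this structure, the sketch below (a Python dictionary with hypothetical attribute names and values) shows one way a learned tree can be stored and traversed to make a prediction.

# A tiny illustrative sketch of how a learned decision tree can be represented
# and traversed: internal nodes test an attribute, branches carry attribute
# values, and leaves hold the final prediction.
tree = {
    "attribute": "Outlook",                       # root / decision node
    "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",                        # leaf node
        "Rain":     {"attribute": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def predict(node, example):
    while isinstance(node, dict):                 # descend until a leaf is reached
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node                                   # the leaf is the class label

print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes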

Decision Tree Approach

Decision tree uses the tree representation to solve the problem in which
each leaf node corresponds to a class label and attributes are represented
on the internal node of the tree. We can represent any boolean function on
discrete attributes using the decision tree.

Below are some assumptions that we made while using the decision tree:
At the beginning, we consider the whole training set as the root.
 Feature values are preferred to be categorical. If the values are continuous
then they are discretized prior to building the model.
 On the basis of attribute values, records are distributed recursively.
 We use statistical methods for ordering attributes as root or the internal
node

The decision tree works on the Sum of Products form, which is also known as
Disjunctive Normal Form. (The example image referenced here predicted the use of a
computer in the daily life of people.) In the Decision Tree, the major challenge is the
identification of the attribute for the root node at each level. This process is known as
attribute selection. We have two popular attribute selection measures:
1. Information Gain
2. Gini Index
Information Gain:
When we use a node in a decision tree to partition the training instances into
smaller subsets the entropy changes. Information gain is a measure of this
change in entropy.
 Suppose S is a set of instances,
 A is an attribute
 Sv is the subset of S
 v represents an individual value that the attribute A can take and Values(A)
is the set of all possible values of A, then

Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) · Entropy(Sv), where the sum runs over v ∈ Values(A)

Entropy: is the measure of uncertainty of a random variable, it characterizes
the impurity of an arbitrary collection of examples. The higher the entropy
more the information content.
Suppose S is a set of instances, with p+ the proportion of positive examples and
p− the proportion of negative examples in S; then

Entropy(S) = −p+ log2(p+) − p− log2(p−)

Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy(X) = −(3/8) log2(3/8) − (5/8) log2(5/8) ≈ 0.954
Building a Decision Tree using Information Gain – the essentials:


 Start with all training instances associated with the root node
 Use info gain to choose which attribute to label each node with
 Note: No root-to-leaf path should contain the same discrete attribute twice
 Recursively construct each subtree on the subset of training instances that
would be classified down that path in the tree.
 If all positive or all negative training instances remain, label that node
"yes" or "no" accordingly
 If no attributes remain, label with a majority vote of training instances left
at that node
 If no instances remain, label with a majority vote of the parent’s training
instances.

ID 3 ALGORITHM:

ID3, or the Iterative Dichotomiser 3 algorithm, is used in machine
learning for building decision trees from a given dataset. It was developed
in 1986 by Ross Quinlan.

It is a greedy algorithm that builds a decision tree by recursively


partitioning the data set into smaller and smaller subsets until all data points
in each subset belong to the same class.

It employs a top-down approach, recursively selecting features to split


the dataset based on information gain.

How ID3 Algorithms work:

The ID3 algorithm works by building a decision tree, which is a


hierarchical structure that classifies data points into different categories and
splits the dataset into smaller subsets based on the values of the features in the
dataset.

The ID3 algorithm then selects the feature that provides the most
information about the target variable. The decision tree is built top-down,
starting with the root node, which represents the entire dataset. At each node,
the ID3 algorithm selects the attribute that provides the most information gain
about the target variable.

The attribute with the highest information gain is the one that best
separates the data points into different categories.

The ID3 algorithm uses a measure of impurity, such as entropy or Gini


impurity, to calculate the information gain of each attribute. Entropy is a
measure of disorder in a dataset. A dataset with high entropy is a dataset
where the data points are evenly distributed across the different categories. A
dataset with low entropy is a dataset where the data points are concentrated
in one or a few categories.
Entropy(S) = −p+ log2(p+) − p− log2(p−), where
➔ p+ is the probability of the positive class
➔ p− is the probability of the negative class
➔ S is the subset of the training examples

If entropy is low, data is well understood; if high, more information is


needed. Preprocessing data before using ID3 can enhance accuracy.

In sum, ID3 seeks to reduce uncertainty and make informed decisions


by picking attributes that offer the most insight in a dataset.

Information gain:

Information gain assesses how much valuable information an attribute


can provide. We select the attribute with the highest information gain, which
signifies its potential to contribute the most to understanding the data.

If information gain is high, it implies that the attribute offers a


significant insight. ID3 acts like an investigator, making choices that
maximize the information gain in each step.

This approach aims to minimize uncertainty and make well-informed


decisions, which can be further enhanced by preprocessing the data.

What are the steps in ID3 algorithm:

1. Determine the entropy for the overall dataset using the class distribution.

2. For each feature.


 Calculate Entropy for Categorical Values.
 Assess information gain for each unique categorical value of the
feature.

3. Choose the feature that generates the highest information gain.

4. Iteratively apply all above steps to build the decision tree structure.
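
A compact, hedged Python sketch of these steps is given below; it assumes categorical attributes and represents each example as a dictionary (the function names are illustrative, not from the text).

# A compact recursive ID3 sketch following the four steps above.
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attributes:                             # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))   # step 3
    tree = {best: {}}
    for value in {row[best] for row in rows}:      # step 4: recurse on each branch
        subset = [row for row in rows if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree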

EXAMPLE:

Calculate Entropy(S),

where
p+ is the proportion of positive examples in S and
p− is the proportion of negative examples in S:

Entropy(S) = −p+ log2(p+) − p− log2(p−)

For this training set of 14 examples (9 positive, 5 negative),
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

STEP 1:
Attribute: Outlook
Values(Outlook) = Sunny, Overcast, Rain

1. S_sunny ← [2+, 3−]
Entropy(S_sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971

2. S_overcast ← [4+, 0−]
Entropy(S_overcast) = −(4/4) log2(4/4) − (0/4) log2(0/4) = 0

3. S_rain ← [3+, 2−]
Entropy(S_rain) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971

Gain(S, Outlook) = Entropy(S) − (5/14) Entropy(S_sunny) − (4/14) Entropy(S_overcast) − (5/14) Entropy(S_rain)
= 0.940 − (5/14)(0.971) − (4/14)(0) − (5/14)(0.971)
= 0.2464

STEP 2:
Attribute: Temperature
Values(Temperature) = Hot, Mild, Cool

1. S_hot ← [2+, 2−]
Entropy(S_hot) = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1

2. S_mild ← [4+, 2−]
Entropy(S_mild) = −(4/6) log2(4/6) − (2/6) log2(2/6) = 0.9183

3. S_cool ← [3+, 1−]
Entropy(S_cool) = −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.8113

Gain(S, Temperature) = Entropy(S) − (4/14) Entropy(S_hot) − (6/14) Entropy(S_mild) − (4/14) Entropy(S_cool)
= 0.940 − (4/14)(1) − (6/14)(0.9183) − (4/14)(0.8113)
= 0.0289

STEP 3:
Attribute: Humidity
Values(Humidity) = High, Normal

1. S_high ← [3+, 4−]
Entropy(S_high) = −(3/7) log2(3/7) − (4/7) log2(4/7) = 0.9852

2. S_normal ← [6+, 1−]
Entropy(S_normal) = −(6/7) log2(6/7) − (1/7) log2(1/7) = 0.5916

Gain(S, Humidity) = Entropy(S) − (7/14) Entropy(S_high) − (7/14) Entropy(S_normal)
= 0.940 − (7/14)(0.9852) − (7/14)(0.5916)
= 0.1516
STEP 4:
Attribute: Wind
Values(Wind) = Strong, Weak

1. S_strong ← [3+, 3−]
Entropy(S_strong) = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1

2. S_weak ← [6+, 2−]
Entropy(S_weak) = −(6/8) log2(6/8) − (2/8) log2(2/8) = 0.8113

Gain(S, Wind) = Entropy(S) − (6/14) Entropy(S_strong) − (8/14) Entropy(S_weak)
= 0.940 − (6/14)(1.0) − (8/14)(0.8113)
= 0.0478
Summary of gains:
Gain(S, Outlook) = 0.2464
Gain(S, Temperature) = 0.0289
Gain(S, Humidity) = 0.1516
Gain(S, Wind) = 0.0478

Outlook has the highest information gain, so it becomes the root node. The Overcast branch
contains only positive examples and becomes a leaf; the Sunny and Rain branches must be
split further. The gains are therefore recomputed within the Sunny subset S_sunny
(2 positive, 3 negative examples).

STEP 1 (within S_sunny):
Attribute: Temperature
Values(Temperature) = Hot, Mild, Cool

S_sunny ← [2+, 3−]
Entropy(S_sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.97

1. S_hot ← [0+, 2−]
Entropy(S_hot) = 0
2. S_mild ← [1+, 1−]
Entropy(S_mild) = 1
3. S_cool ← [1+, 0−]
Entropy(S_cool) = 0

Gain(S_sunny, Temperature) = Entropy(S_sunny) − (2/5) Entropy(S_hot) − (2/5) Entropy(S_mild) − (1/5) Entropy(S_cool)
= 0.97 − (2/5)(0) − (2/5)(1) − (1/5)(0)
= 0.570

STEP 2 (within S_sunny):
Attribute: Humidity
Values(Humidity) = High, Normal

S_sunny ← [2+, 3−], Entropy(S_sunny) = 0.97

1. S_high ← [0+, 3−]
Entropy(S_high) = 0
2. S_normal ← [2+, 0−]
Entropy(S_normal) = 0

Gain(S_sunny, Humidity) = Entropy(S_sunny) − (3/5) Entropy(S_high) − (2/5) Entropy(S_normal)
= 0.97 − (3/5)(0) − (2/5)(0)
= 0.97

STEP 3 (within S_sunny):
Attribute: Wind
Values(Wind) = Strong, Weak

S_sunny ← [2+, 3−], Entropy(S_sunny) = 0.97

1. S_strong ← [1+, 1−]
Entropy(S_strong) = 1
2. S_weak ← [1+, 2−]
Entropy(S_weak) = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183

Gain(S_sunny, Wind) = Entropy(S_sunny) − (2/5) Entropy(S_strong) − (3/5) Entropy(S_weak)
= 0.97 − (2/5)(1.0) − (3/5)(0.9183)
= 0.020

Summary of gains within the Sunny branch:
Gain(S_sunny, Temperature) = 0.570
Gain(S_sunny, Humidity) = 0.97
Gain(S_sunny, Wind) = 0.020

Humidity has the highest gain, so Humidity is selected as the decision node for the Sunny branch.

Similarly, within the Rain branch, S_rain ← [3+, 2−] with Entropy(S_rain) = 0.97:
Gain(S_rain, Temperature) = Entropy(S_rain) − (0/5) Entropy(S_hot) − (3/5) Entropy(S_mild) − (2/5) Entropy(S_cool)
= 0.97 − (0/5)(0) − (3/5)(0.9183) − (2/5)(1)
= 0.020

Advantages of the Decision Tree:
1. It is simple to understand, as it follows the same process that a human follows
while making any decision in real life.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a problem.
4. There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree:
1. The decision tree contains lots of layers, which makes it complex.
2. It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
3. For more class labels, the computational complexity of the decision tree may
increase.

