Unit - 1, Notes


SCHOOL OF COMPUTING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Machine Learning Techniques – SEC1630


Machine Learning Essentials – SCSA 1415



UNIT I – INTRODUCTION TO MACHINE LEARNING


What is machine learning – Examples of Machine Learning Applications – Types of
Machine Learning Techniques – Learning a Class from Examples – Vapnik-Chervonenkis
Dimension – Probably Approximately Correct Learning – Gradient Descent – Bias and
Variance – Overfitting – Underfitting – Confusion Matrix.

An introduction to Machine Learning


Definition of Machine Learning: Arthur Samuel, an early American leader in the field of
computer gaming and artificial intelligence, coined the term “machine learning” in 1959
while at IBM. He defined machine learning as “the field of study that gives computers the
ability to learn without being explicitly programmed”. However, there is no universally
accepted definition of machine learning; different authors define the term differently. We
give two more definitions below.
• Machine learning is programming computers to optimize a performance criterion using
example data or past experience. We have a model defined up to some parameters,
and learning is the execution of a computer program to optimize the parameters of the
model using the training data or past experience. The model may be predictive, to make
predictions in the future, or descriptive, to gain knowledge from data.
• The field of study known as machine learning is concerned with the question of how to
construct computer programs that automatically improve with experience.

Machine learning is a subfield of artificial intelligence that involves the development of
algorithms and statistical models that enable computers to improve their performance in
tasks through experience. These algorithms and models are designed to learn from data
and make predictions or decisions without explicit instructions. There are several types of
machine learning, including supervised learning, unsupervised learning, and
reinforcement learning. Supervised learning involves training a model on labeled data,
while unsupervised learning involves training a model on unlabeled data. Reinforcement
learning involves training a model through trial and error. Machine learning is used in a
wide variety of applications, including image and speech recognition, natural language
processing, and recommender systems.
Definition of learning: A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P if its performance at tasks in T, as
measured by P, improves with experience E.
Examples
• Handwriting recognition learning problem
o Task T: Recognizing and classifying handwritten words within images
o Performance P: Percent of words correctly classified
o Training experience E: A dataset of handwritten words with given
classifications
• A robot driving learning problem
o Task T: Driving on highways using vision sensors
o Performance P: Average distance traveled before an error
o Training experience E: A sequence of images and steering commands
recorded while observing a human driver
Definition: A computer program which learns from experience is called a machine learning
program or simply a learning program.

Applications of Machine learning


Machine learning is a buzzword in today's technology, and it is growing rapidly. We use
machine learning in daily life, often without knowing it, in products such as Google Maps,
Google Assistant, and Alexa. Below are some widely used real-world applications of
machine learning:

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used
to identify objects, persons, places, and so on in digital images. A popular use case of
image recognition and face detection is automatic friend tagging suggestion:
Facebook provides a feature that suggests friend tags automatically. Whenever we upload a
photo with our Facebook friends, we get a tagging suggestion with their names, and the
technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "DeepFace," which is responsible for face
recognition and person identification in the picture.
2. Speech Recognition:
While using Google, we get an option to "Search by voice"; this comes under speech
recognition, a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also
known as "speech to text" or "computer speech recognition." At present, machine
learning algorithms are widely used in speech recognition applications. Google
Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow
voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the
shortest route and predicts the traffic conditions.
It predicts whether traffic is clear, slow-moving, or heavily congested in two ways:
o Real-time location of vehicles from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps helps to make the app better: it takes information
from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment companies such as
Amazon and Netflix to recommend products to the user. Whenever we search for a product
on Amazon, we start seeing advertisements for the same product while browsing the
internet in the same browser, and this is because of machine learning.
Google infers the user's interests using various machine learning algorithms and
suggests products accordingly.
Similarly, when we use Netflix, we find recommendations for series, movies, and so on,
and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, in which
machine learning plays a significant role. Tesla, a well-known car manufacturer, is
working on self-driving cars and trains models to detect people and objects while
driving. Object detection of this kind is typically trained with supervised learning on
labeled sensor data.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is automatically filtered as important, normal, or
spam. Important mail arrives in our inbox marked with the important symbol, spam goes
to the spam box, and the technology behind this is machine learning. Below
are some spam filters used by Gmail:
o Content filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Machine learning algorithms such as the multi-layer perceptron, decision tree,
and Naïve Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana,
and Siri. As the name suggests, they help us find information using voice instructions.
These assistants can help us in various ways through voice commands such as playing
music, calling someone, opening an email, or scheduling an appointment.
Machine learning algorithms are an important part of these virtual assistants.
The assistants record our voice instructions, send them to a server in the cloud,
decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning makes online transactions safer by detecting fraudulent transactions.
Whenever we perform an online transaction, fraud can occur in various ways, such as
fake accounts, fake IDs, and money stolen in the middle of a transaction. To detect this,
a feed-forward neural network checks whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these
values become the input for the next round. Genuine transactions follow a specific
pattern, which changes for a fraudulent transaction; the network detects this change,
making online transactions more secure.
9. Stock Market trading:
Machine learning is widely used in stock market trading. Share prices always carry the
risk of ups and downs, so long short-term memory (LSTM) neural networks are used to
predict stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical
technology is advancing quickly and can build 3D models that predict the position of
lesions in the brain.
This helps in finding brain tumors and other brain-related diseases more easily.
11. Automatic Language Translation:
Nowadays, if we visit a new place and do not know the language, it is much less of a
problem, because machine learning helps by converting text into a language we know.
Google's GNMT (Google Neural Machine Translation) provides this feature: it is a neural
machine translation system that translates text into our familiar language, which is
called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning
algorithm, which can be combined with image recognition to translate text from one
language to another.

Machine Learning Applications and Examples


1. Social Media Features
Social media platforms use machine learning algorithms and approaches to create some
attractive and excellent features. For instance, Facebook notices and records your
activities, chats, likes, and comments, and the time you spend on specific kinds of
posts. Machine learning learns from your own experience and makes friends and page
suggestions for your profile.

2. Product Recommendations
Product recommendation is one of the most popular and best-known applications of machine
learning. It is a standard feature of almost every e-commerce website today and an
advanced application of machine learning techniques.
Using machine learning and AI, websites track your behavior through your previous
purchases, search patterns, and cart history, and then make product
recommendations.

3. Image Recognition
Image recognition, which is an approach for cataloging and detecting a feature or an object
in the digital image, is one of the most significant and notable machine learning and AI
techniques. This technique is being adopted for further analysis, such as pattern
recognition, face detection, and face recognition.

4. Sentiment Analysis
Sentiment analysis is an important application of machine learning. It determines the
emotion or opinion of a speaker or writer, often in real time. For instance, if someone
has written a review or email (or any other document), a sentiment analyzer estimates
the underlying opinion and tone of the text. Sentiment analysis can be used in
review-based websites, decision-making applications, etc.

5. Automating Employee Access Control
Organizations are actively implementing machine learning algorithms to determine the
level of access employees need in various areas, depending on their job profiles.
6. Marine Wildlife Preservation
Machine learning algorithms are used to develop behavior models for endangered
cetaceans and other marine species, helping scientists regulate and monitor their
populations.
7. Regulating Healthcare Efficiency and Medical Services
Major healthcare sectors are actively looking at using machine learning algorithms
to manage their services better. For example, models predict the waiting times of
patients in emergency waiting rooms across hospital departments. These models use
factors such as staffing details at various times of day, patient records, complete logs
of department chats, and the layout of emergency rooms. Machine learning algorithms
also come into play in detecting disease, planning therapy, and predicting the course
of a disease.
8. Predict Potential Heart Failure
An algorithm designed to scan a doctor’s free-form e-notes and identify patterns in a
patient’s cardiovascular history is making waves in medicine. Instead of a physician
digging through multiple health records to arrive at a sound diagnosis, redundancy is now
reduced with computers making an analysis based on available information.
9. Banking Domain
Banks are now using the latest advanced technology machine learning has to offer to help
prevent fraud and protect accounts from hackers. The algorithms determine what factors
to consider to create a filter to keep harm at bay. Various sites that are unauthentic will
be automatically filtered out and restricted from initiating transactions.
10. Language Translation
One of the most common machine learning applications is language translation. Machine
learning plays a significant role in the translation of one language to another. We are
amazed at how websites can translate from one language to another effortlessly and give
contextual meaning as well. The technology behind the translation tool is called ‘machine
translation.’ It has enabled people to interact with others from all around the world;
without it, life would not be as easy as it is now. It has provided confidence to travelers
and business associates to safely venture into foreign lands with the conviction that
language will no longer be a barrier.

Types of Machine Learning


Machine learning implementations are classified into four major categories, depending on
the nature of the learning “signal” or “response” available to the learning system:
A. Supervised learning:
Supervised learning is the machine learning task of learning a function that maps an input
to an output based on example input-output pairs. The given data is labeled.
Both classification and regression problems are supervised learning problems.
• Example: Consider the following data regarding patients entering a clinic. The data
consists of the gender and age of the patients, and each patient is labeled as “healthy”
or “sick”.

gender   age   label
M        48    sick
M        67    sick
F        53    healthy
M        49    sick
F        32    healthy
M        34    healthy
M        21    healthy
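
To make the idea of learning from labeled pairs concrete, the sketch below fits a very simple rule to the patient table above. The rule family ("sick if age is at least some threshold, optionally restricted to one gender") and the names `rule_predict` and `fit_rule` are assumptions chosen for illustration; the notes do not prescribe a particular learner.

```python
# Labeled training data copied from the table above: (gender, age, label).
patients = [
    ("M", 48, "sick"), ("M", 67, "sick"), ("F", 53, "healthy"),
    ("M", 49, "sick"), ("F", 32, "healthy"), ("M", 34, "healthy"),
    ("M", 21, "healthy"),
]


def rule_predict(gender, age, threshold, only_gender):
    """Predict 'sick' if age >= threshold and the optional gender filter matches."""
    matches = only_gender is None or gender == only_gender
    return "sick" if (age >= threshold and matches) else "healthy"


def fit_rule(data):
    """Try every (gender filter, threshold) pair; keep the one with fewest training errors."""
    best = None
    for only_gender in (None, "M", "F"):
        for threshold in sorted({age for _, age, _ in data}):
            errors = sum(
                rule_predict(g, a, threshold, only_gender) != y for g, a, y in data
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, only_gender)
    return best


train_errors, threshold, only_gender = fit_rule(patients)
print("training errors:", train_errors, "rule:", threshold, only_gender)
print("new patient (M, 55):", rule_predict("M", 55, threshold, only_gender))
```

With only seven examples, a rule that fits the training table perfectly is not guaranteed to generalize to new patients; the point here is only the workflow of fitting on labeled data and then predicting.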

B. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw inferences
from datasets consisting of input data without labeled responses. In unsupervised
learning algorithms, classification or categorization is not included in the observations.
Example: Consider the following data regarding patients entering a clinic. The data
consists of the gender and age of the patients.

gender   age
M        48
M        67
F        53
M        49
F        34
M        21
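
As an illustration of drawing structure from unlabeled data, the sketch below groups the ages from the table above into two clusters with a minimal one-dimensional k-means. The choice of k-means, of two clusters, and of the smallest and largest values as starting centroids are all assumptions made for this example.

```python
# Ages from the unlabeled table above (gender is ignored to keep the sketch 1-D).
ages = [48, 67, 53, 49, 34, 21]


def kmeans_1d(values, iterations=10):
    """Two-cluster k-means on a list of numbers.

    Assumption: centroids start at the smallest and largest value.
    """
    centroids = [float(min(values)), float(max(values))]
    clusters = [[], []]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[], []]
        for v in values:
            nearest = min((0, 1), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


centroids, clusters = kmeans_1d(ages)
print(centroids, clusters)
```

No labels are used anywhere: the algorithm only sees the ages and separates a younger group from an older one.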

C. Reinforcement learning:
Reinforcement learning is the problem of getting an agent to act in the world so as to
maximize its rewards.
Unlike in most forms of machine learning, the learner is not told which actions to take
but must instead discover which actions yield the most reward by trying them. For
example, consider teaching a dog a new trick: we cannot tell it what to do or what not to
do, but we can reward or punish it when it does the right or wrong thing.
D. Semi-supervised learning:
Here an incomplete training signal is given: a training set with some (often many) of the
target outputs missing. A special case of this principle, known as transduction, arises
when the entire set of problem instances is known at learning time, except that part of
the targets are missing. Semi-supervised learning combines a small amount of labeled
data with a large amount of unlabeled data during training, and so falls between
unsupervised learning and supervised learning.
These ML algorithms help to solve different business problems such as regression,
classification, forecasting, clustering, and association.
Based on the methods and way of learning, machine learning is divided into four main
types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. In the
supervised learning technique, we train the machine using a "labelled" dataset, and
based on this training the machine predicts the output. Here, labelled data means that
the inputs in the training set are already mapped to their outputs. More precisely, we
first train the machine with inputs and their corresponding outputs, and then we ask the
machine to predict the output for a test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset
of cat and dog images. First, we train the machine to understand the images using
features such as the shape and size of the tail, the shape of the eyes, colour, and
height (dogs are generally taller, cats smaller). After training, we input the
picture of a cat and ask the machine to identify the object and predict the output. The
trained machine checks the features of the object, such as height, shape, colour, eyes,
ears, and tail, finds that it is a cat, and puts it in the Cat category. This is how the
machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable (x)
to the output variable (y). Some real-world applications of supervised learning
are risk assessment, fraud detection, and spam filtering.
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems, in which the output
variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue. The
classification algorithms predict the category to which a new input belongs. Some
real-world examples are spam detection and email filtering.
Some popular classification algorithms are given below:
o Random Forest Algorithm
o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm
b) Regression
Regression algorithms are used to solve regression problems, in which the output
variable is continuous and depends on the input variables; the relationship may be
linear, as in linear regression, or non-linear. They are used to predict continuous
output variables, such as market trends or weather quantities.
Some popular Regression algorithms are given below:
o Simple Linear Regression Algorithm
o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression
Advantages and Disadvantages of Supervised Learning
Advantages:
o Since supervised learning works with a labelled dataset, we have an exact
idea of the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior
experience.
Disadvantages:
o These algorithms depend on labelled data, so they struggle with complex tasks for
which enough labelled examples are hard to obtain.
o They may predict the wrong output if the test data differs from the training data.
o Training can require a lot of computational time.
Applications of Supervised Learning
Some common applications of Supervised Learning are given below:
o Image Segmentation:
Supervised learning algorithms are used in image segmentation. In this process,
classification is performed on image data using pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is
done using medical images and past data labelled with disease conditions. With
such a process, the machine can identify a disease in new patients.
o Fraud Detection:
Supervised learning classification algorithms are used for identifying fraudulent
transactions, fraudulent customers, etc. This is done by using historic data
to identify the patterns that can indicate possible fraud.
o Spam Detection:
Classification algorithms are used in spam detection and filtering. These
algorithms classify an email as spam or not spam, and spam emails are sent
to the spam folder.
o Speech Recognition:
Supervised learning algorithms are also used in speech recognition. The algorithm
is trained with voice data and can then be used for tasks such as voice-activated
passwords and voice commands.
2. Unsupervised Machine Learning
Unsupervised learning differs from supervised learning in that, as its name suggests,
there is no need for supervision. In unsupervised machine learning, the machine is
trained using an unlabeled dataset and produces its output without any supervision.
The models are trained with data that is neither classified nor labelled, and the model
acts on that data without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the
unsorted dataset according to similarities, patterns, and differences. The machine
is instructed to find the hidden patterns in the input dataset.
Let's take an example to understand this more precisely. Suppose there is a basket of
fruit images, and we input it into the machine learning model. The images are totally
unknown to the model, and the task of the machine is to find the patterns and categories
of the objects.
The machine discovers patterns and differences, such as differences in colour and
shape, and uses them to group new images when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups in the data.
It is a way to group objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the objects of
other groups. An example of clustering is grouping customers by their
purchasing behaviour.
Some of the popular clustering algorithms are given below:
o K-Means Clustering algorithm
o Mean-shift algorithm
o DBSCAN Algorithm
Principal Component Analysis and Independent Component Analysis are often listed
alongside these. Strictly, they are unsupervised dimensionality-reduction and
signal-separation techniques rather than clustering algorithms.
2) Association
Association rule learning is an unsupervised learning technique that finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm
is to find the dependency of one data item on another and to map those variables
accordingly, so that the relationships can be exploited, for example to increase profit.
It is mainly applied in market basket analysis, web usage mining, continuous
production, etc.
Some popular algorithms for association rule learning are the Apriori algorithm, Eclat,
and the FP-growth algorithm.

Advantages and Disadvantages of Unsupervised Learning Algorithm


Advantages:
o These algorithms can be used for more complicated tasks than supervised
ones because they work on unlabeled datasets.
o Unsupervised algorithms are preferable for various tasks because an unlabeled
dataset is easier to obtain than a labelled one.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not
labelled and the algorithm is not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as the unlabelled
dataset provides no mapping to the desired output.
Applications of Unsupervised Learning
o Network Analysis: Unsupervised learning is used in document network analysis of
text data, for example to identify plagiarism and copyright issues in scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised
learning techniques to build recommendation features for web
applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised
learning, which can identify unusual data points within a dataset. It is used to
discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition (SVD) is used to
extract particular information from a database, for example, information about
each user located at a particular location.
3. Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that lies between
supervised and unsupervised machine learning. It represents the intermediate ground
between supervised learning (with labelled training data) and unsupervised learning
(with no labelled training data), and uses a combination of labelled and unlabeled
datasets during training.
Although semi-supervised learning operates on data that includes a few labels, the
data is mostly unlabeled, because labels are costly to obtain and an organization may
have only a few of them. It differs from supervised and unsupervised learning, which
are defined by the presence and absence of labels respectively.
The concept of semi-supervised learning was introduced to overcome the drawbacks of
supervised and unsupervised learning algorithms. Its main aim is to use all the
available data effectively, rather than only the labelled data as in supervised
learning. Initially, similar data points are clustered using an unsupervised learning
algorithm, and the clusters then help to assign labels to the unlabeled data. This
matters because labelled data is comparatively more expensive to acquire
than unlabeled data.
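
The idea of spreading a few labels over many unlabeled points can be sketched as follows. Note that this example swaps the cluster-then-label procedure described above for a simpler nearest-neighbour self-training loop: at each step, the unlabeled point closest to any already-labeled point inherits that point's label. The data values, the class names "A" and "B", and the function name `self_train` are hypothetical.

```python
def self_train(labeled, unlabeled):
    """Propagate labels from a few labeled 1-D points to unlabeled ones.

    labeled:   dict mapping a point to its known label
    unlabeled: list of points without labels
    """
    labeled = dict(labeled)      # copy so the caller's data is not modified
    remaining = list(unlabeled)
    while remaining:
        # Pick the (unlabeled, labeled) pair with the smallest distance:
        # this is the assignment we are most confident about.
        point, neighbour = min(
            ((u, l) for u in remaining for l in labeled),
            key=lambda pair: abs(pair[0] - pair[1]),
        )
        labeled[point] = labeled[neighbour]
        remaining.remove(point)
    return labeled


# Two labeled points and six unlabeled ones (hypothetical measurements).
result = self_train({10: "A", 90: "B"}, [15, 22, 30, 70, 81, 86])
print(result)
```

Because each newly labeled point can pass its label on, an early mistake can spread, which is one reason results of such iterative schemes may be unstable.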
We can picture these algorithms with an example. Supervised learning is where a student
is under the supervision of an instructor at home and at college. If that student
analyses the same concept alone, without any help from the instructor, it comes under
unsupervised learning. Under semi-supervised learning, the student revises the concept
independently after studying it under the guidance of an instructor at college.
Advantages and disadvantages of Semi-supervised Learning
Advantages:
o The algorithm is simple and easy to understand.
o It is highly efficient.
o It addresses drawbacks of supervised and unsupervised learning algorithms.
Disadvantages:
o Iteration results may not be stable.
o These algorithms cannot be applied to network-level data.
o Accuracy can be lower than that of supervised learning on fully labelled data.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process in which an AI agent (a
software component) automatically explores its surroundings by trial and error: taking
actions, learning from experience, and improving its performance. The agent is rewarded
for each good action and punished for each bad action; hence the goal of a reinforcement
learning agent is to maximize its rewards.
In reinforcement learning there is no labelled data as in supervised learning, and agents
learn only from their experience.
The reinforcement learning process is similar to how a human being learns; for example,
a child learns various things through day-to-day experience. An example of reinforcement
learning is playing a game, where the game is the environment, the moves of the agent at
each step determine the states, and the goal of the agent is to get a high score. The
agent receives feedback in terms of punishments and rewards.
Because of the way it works, reinforcement learning is employed in fields such
as game theory, operations research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalized as a Markov Decision
Process (MDP). In an MDP, the agent constantly interacts with the environment and
performs actions; after each action, the environment responds with a reward and a new
state.
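
The agent-environment loop of an MDP can be sketched with tabular Q-learning on a tiny, made-up environment. Everything specific here is an assumption for illustration: the five-state corridor, the reward of 1 for reaching the last state, the learning rate and discount values, and the choice of Q-learning itself, which the notes do not name.

```python
import random

N_STATES = 5              # states 0..4 on a line; state 4 is the goal (terminal)
ACTIONS = ("left", "right")
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor (illustrative values)


def step(state, action):
    """Environment: move one cell; reward 1 only when the goal is reached."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def q_learning(episodes=300, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Explore uniformly at random; Q-learning is off-policy, so it can
            # still learn the values of the greedy policy from random behaviour.
            a = rng.randrange(2)
            nxt, reward = step(state, ACTIONS[a])
            target = reward + GAMMA * max(q[nxt])
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
    return q


q = q_learning()
# Greedy policy for the non-terminal states: the action with the larger value.
policy = [ACTIONS[row.index(max(row))] for row in q[:-1]]
print(policy)
```

The agent is never told that moving right is correct; it only observes states and rewards, and the learned values end up favouring the actions that lead to the reward.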
Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods/algorithms:
o Positive Reinforcement Learning: Positive reinforcement increases the tendency
of the required behaviour to occur again by adding something, such as a reward. It
strengthens the behaviour of the agent and affects it positively.
o Negative Reinforcement Learning: Negative reinforcement works in the opposite
way to positive RL. It increases the tendency of a specific behaviour
to occur again by removing or avoiding a negative condition.
Real-world Use cases of Reinforcement Learning
o Video Games:
RL algorithms are very popular in gaming applications, where they are used to
achieve super-human performance. Well-known game-playing systems built with RL
are AlphaGo and AlphaGo Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how
to use RL to automatically learn to allocate and schedule computer resources among
waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely used in robotics applications. Robots are used in industrial
and manufacturing settings, and these robots are made more capable with
reinforcement learning. Various industries aim to build intelligent robots using
AI and machine learning technology.
o Text Mining:
Text mining, one of the major applications of NLP, is now being implemented with
the help of reinforcement learning by Salesforce.
Advantages and Disadvantages of Reinforcement Learning
Advantages
o It helps in solving complex real-world problems that are difficult to solve with
general techniques.
o The learning model of RL is similar to the way human beings learn, so it can
reach very good results.
o It helps in achieving long-term results.
Disadvantages
o RL algorithms are not preferred for simple problems.
o RL algorithms require large amounts of data and computation.
o Too much reinforcement learning can lead to an overload of states, which can
weaken the results.
o The curse of dimensionality limits reinforcement learning for real physical systems.
Learning Associations
• Basket analysis:
P(Y | X) is the probability that somebody who buys X also buys Y, where X and Y are
products/services.
Example: P(chips | beer) = 0.7, that is, 70 percent of customers who buy beer also
buy chips.
We may want to make a distinction among customers and, toward this, estimate
P(Y | X, D), where D is the set of customer attributes, for example, gender, age, marital
status, and so on, assuming that we have access to this information. If this is a
bookseller instead of a supermarket, products can be books or authors. In the case
of a Web portal, items correspond to links to Web pages, and we can estimate the
links a user is likely to click and use this information to download such pages in
advance for faster access.
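
The conditional probabilities above can be estimated by counting. The sketch below does this on ten made-up transactions, first for P(Y | X) and then for P(Y | X, D) with a single customer attribute. The transaction data, the attribute "age group", and the function name `cond_prob` are hypothetical.

```python
# Hypothetical transactions: (customer age group, set of items bought).
transactions = [
    ("young", {"beer", "chips"}),
    ("young", {"beer", "chips"}),
    ("young", {"beer", "chips", "milk"}),
    ("young", {"beer"}),
    ("older", {"beer", "chips"}),
    ("older", {"beer"}),
    ("older", {"beer", "bread"}),
    ("older", {"chips", "milk"}),
    ("young", {"milk", "bread"}),
    ("older", {"beer", "chips"}),
]


def cond_prob(y, x, group=None):
    """Estimate P(y | x), or P(y | x, D=group) when a customer group is given."""
    baskets_with_x = [
        items for g, items in transactions
        if x in items and (group is None or g == group)
    ]
    if not baskets_with_x:
        return None  # no evidence: the probability cannot be estimated
    return sum(y in items for items in baskets_with_x) / len(baskets_with_x)


print(cond_prob("chips", "beer"))                 # all customers
print(cond_prob("chips", "beer", group="young"))  # conditioned on an attribute
print(cond_prob("chips", "beer", group="older"))
```

In this toy data the estimate differs between the two groups, which is exactly the distinction among customers that conditioning on D is meant to capture.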
Classification:
• Example: Credit scoring
• Differentiating between low-risk and high-risk customers from their income and
savings
• Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
Fig. 1.3 Classification
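
The discriminant above translates directly into code. In the sketch below, the threshold values for θ1 and θ2 and the applicant figures are placeholders chosen for illustration; in practice the thresholds would be learned from past customer data.

```python
def credit_risk(income, savings, theta1=30_000, theta2=5_000):
    """IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk.

    theta1 and theta2 are hypothetical thresholds, not values from the notes.
    """
    return "low-risk" if income > theta1 and savings > theta2 else "high-risk"


# Hypothetical applicants: (income, savings).
applicants = [(45_000, 12_000), (45_000, 1_000), (20_000, 9_000)]
for income, savings in applicants:
    print(income, savings, credit_risk(income, savings))
```

Both conditions must hold for a low-risk decision, so the low-risk region is the upper-right quadrant of the income-savings plane.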

Prediction - Regression:
• Example: Predict Price of a used car
• x : car attributes
y : price
y = g (x | θ )
where, g ( ) is the model, θ are the parameters

Fig. 1.4 Price Prediction


A training dataset of used cars and the function fitted. For simplicity, mileage is
taken as the only input attribute and a linear model is used.
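A minimal sketch of such a fit, assuming a linear model g(x | θ) = w1·x + w0 with θ = (w0, w1) estimated by ordinary least squares. The mileage and price figures are invented for illustration and lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w1 * x + w0 for a single input attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def g(x, theta):
    """The model g(x | theta) with theta = (w0, w1)."""
    w0, w1 = theta
    return w1 * x + w0


# Invented data: mileage in thousands of km, price in thousands of currency units.
mileage = [10.0, 50.0, 100.0, 150.0]
price = [19.0, 15.0, 10.0, 5.0]

theta = fit_line(mileage, price)
predicted = g(80.0, theta)
print(theta, predicted)  # roughly (20.0, -0.1) and 12.0
```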
Supervised Learning:
Learning a Class from Examples:

• Class C of a “family car”
o Prediction: Is car x a family car?
o Knowledge extraction: What do people expect from a family car?
• Output:
o Positive (+) and negative (–) examples
• Input representation:
o x1: price, x2: engine power

Fig 1.5 Learning a Class from Example


Vapnik-Chervonenkis dimension

Introduction

The Vapnik-Chervonenkis dimension, more commonly known as the VC dimension, is a
measure of a model's capacity used in statistics and machine learning. It is used
frequently to guide the model selection process while developing machine learning
applications. To understand the VC dimension, we must first understand shattering.

Shattering

A model shatters a set of points if it can classify them perfectly for every possible
assignment of labels. In other words, however the points are split into two classes, the
model can create a function that separates the classes without error. It is different from
simple classification, which deals with a single labeling, because it considers all possible
combinations of labels upon those points. Later in this section, we'll see this concept in
action while computing the VC dimension. The VC dimension of a classifier is defined by
Vapnik and Chervonenkis to be the cardinality (size) of the largest set of points that the
classification algorithm can shatter [1].
Shattering a set of points
A configuration of N points on the plane is just any placement of N points. In order to have
a VC dimension of at least N, a classifier must be able to shatter a single configuration
of N points. In order to shatter a configuration of points, the classifier must be able to, for
every possible assignment of positive and negative for the points, perfectly partition the
plane such that the positive points are separated from the negative points. For a
configuration of N points, there are 2^N possible assignments of positive or negative, so
the classifier must be able to properly separate the points in each of these.
In the example below, we show that the VC dimension of a linear classifier in the plane is
at least 3, since it can shatter this configuration of 3 points. In each of the 2³ = 8 possible
assignments of positive and negative, the classifier is able to perfectly separate the two classes.
Now, we show that the VC dimension of a linear classifier is lower than 4. In this
configuration of 4 points, the classifier is unable to separate the positive and negative
classes in at least one assignment. Two lines would be necessary to separate the two
classes in this situation. Strictly, we need to prove that no 4-point configuration can be
shattered, but the same logic applies to other configurations, so, for brevity's sake, this
example is good enough.

Since we have now shown that the linear classifier's VC dimension is at least 3, and lower
than 4, we can finally conclude that its VC dimension is exactly 3. Again, remember that
in order to have a VC dimension of N, the classifier need only shatter a single configuration
of N points; there will likely be many other configurations of N points that the classifier
cannot shatter.
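The argument above can be checked by brute force. The sketch below tries every labeling of a point set and tests linear separability with a perceptron. The perceptron and its epoch cap are an assumption of this sketch: it is guaranteed to stop on separable data, but reporting "not separable" when the cap is reached is a heuristic, not a proof. The point coordinates are chosen for illustration.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron with an epoch cap; True if a separating line is found."""
    w = [0.0, 0.0, 0.0]  # weights for x1, x2 and the bias term
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + w[2]) <= 0:
                # Misclassified (or on the boundary): nudge the line toward the point.
                w[0] += y * x1
                w[1] += y * x2
                w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # cap reached: treated as "not separable" (heuristic)


def shatters(points):
    """True if every one of the 2^N labelings of the points is separable."""
    return all(
        linearly_separable(points, labels)
        for labels in product([-1, 1], repeat=len(points))
    )


three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]  # corners of a square

print(shatters(three_points))  # True
print(shatters(four_points))   # False: the XOR-style labeling cannot be separated
```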
Applications of VC dimension
Now that you know what the VC dimension is, and how to find it, it is important to also
understand what its practical implications are. In most cases, the exact VC dimension of a
classifier is not so important. Rather, it is used more to classify different types of
algorithms by their complexities; for example, the class of simple classifiers could include
basic shapes like lines, circles, or rectangles, whereas a class of complex classifiers could
include classifiers such as multilayer perceptrons, boosted trees, or other nonlinear
classifiers. The complexity of a classification algorithm, which is directly related to its VC
dimension, is related to the trade-off between bias and variance.
In this image, we visualize the effects of model complexity. On the bottom,
each S_i represents a set of models that are similar in VC dimension, or complexity. On
the graph above, VC dimension is measured on the x-axis as h. Observe that as complexity
increases, you transition from underfitting to overfitting; adding complexity is good up until
a certain point, after which you begin to overfit on the training data.
Another way of thinking about this is through bias and variance. A low complexity model
will have a high bias and low variance; while it has low expressive power leading to high
bias, it is also very simple, so it has very predictable performance leading to a low variance.
Conversely, a complex model will have a lower bias since it has more expressiveness, but
will have a higher variance as there are more parameters to tune based on the sample
training data. Generally, a model with a higher VC dimension will require more training
data to properly train, but will be able to identify more complex relationships in the data.
At some level of model complexity there will exist an ideal balance between bias and
variance, denoted by the dotted vertical line, at which you are neither underfitting nor
overfitting to your data. In other words, you should aim to choose a classifier with a level
of complexity that is just enough for your classification task — any less would lead to
underfitting, and any more would lead to overfitting.
Probably Approximately Correct (PAC):
• Cannot expect a learner to learn a concept exactly.
• Cannot always expect to learn a close approximation to the target concept
• Therefore, the only realistic expectation of a good learner is that with high
probability it will learn a close approximation to the target concept.
• In Probably Approximately Correct (PAC) learning, one requires that, given
small parameters ε and δ, with probability at least (1 − δ) a learner produces
a hypothesis with error at most ε.
• The reason we can hope for that is the Consistent Distribution assumption.
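To make the roles of the accuracy parameter ε and the confidence parameter δ concrete, here is a small sketch using a standard textbook bound that the notes do not state: for a finite hypothesis class H and a learner that outputs a hypothesis consistent with the training data, m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice. The function name and the example values are hypothetical.

```python
import math


def pac_sample_size(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite class H.

    Uses m >= (ln|H| + ln(1/delta)) / epsilon, a standard bound that is
    an assumption of this sketch rather than something from the notes.
    """
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


# |H| = 1000 hypotheses, error at most 0.1, with probability at least 0.95.
m = pac_sample_size(1000, epsilon=0.1, delta=0.05)
print(m)  # 100
```

Halving ε roughly doubles the required sample size, while tightening δ only adds a logarithmic term.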
PAC Learnability

Gradient Descent in Machine Learning


Gradient Descent is one of the most commonly used optimization algorithms for
training machine learning models by minimizing the error between actual and
predicted results. It is also used to train neural networks.
In mathematical terminology, optimization refers to the task of
minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in
machine learning, optimization is the task of minimizing the cost function parameterized
by the model's parameters. The main objective of gradient descent is to minimize this
cost function through iterative parameter updates; when the function is convex, any
minimum it reaches is the global one. Once these machine learning
models are optimized, they can be used as powerful tools for Artificial Intelligence
and various computer science applications.
This section covers gradient descent in detail: the role of the cost function as a barometer
of model performance, the types of gradient descent, learning rates, etc.
What is Gradient Descent or Steepest Descent?
Gradient descent was first proposed by Augustin-Louis Cauchy in the mid-19th
century (1847). Gradient Descent is defined as one of the most commonly used iterative
optimization algorithms of machine learning to train the machine learning and
deep learning models. It helps in finding the local minimum of a function.
o If we move in the direction of the negative gradient, that is, away from the gradient of
the function at the current point, we approach a local minimum of that function.
o Whenever we move in the direction of the positive gradient, that is, towards the gradient
of the function at the current point, we approach a local maximum of that function.
Moving in the direction of the positive gradient is known as Gradient Ascent; moving against
it is Gradient Descent, which is also known as steepest descent. The main objective of using
a gradient descent algorithm is to minimize the cost function using iteration. To achieve
this goal, it performs two steps iteratively:
o Calculate the first-order derivative of the function to compute the gradient or slope
of that function.
o Take a step in the direction opposite to the gradient, with a length of alpha times the
slope at the current point, where alpha is defined as the Learning Rate. It is a tuning
parameter in the optimization process which helps to decide the length of the steps.
What is Cost-function?
The cost function is defined as the measurement of difference or error between
actual values and predicted values at the current position, expressed in the form
of a single real number. It helps to improve machine learning efficiency by
providing feedback to the model so that it can minimize error and find the local or global
minimum. Gradient descent iterates along the direction of the negative gradient
until the cost function stops decreasing, that is, until the gradient approaches zero. At
this point, the model stops learning further. Although cost function and loss function are
often used synonymously, there is a minor difference between them: the loss function
refers to the error of one training example, while a cost function
calculates the average error across an entire training set.
The cost function is calculated after making a hypothesis with initial parameters and
modifying these parameters using gradient descent algorithms over known data to reduce
the cost function.
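For example, for simple linear regression with slope m and intercept c over n training
examples (x_i, y_i):
Hypothesis: ŷ = m·x + c
Parameters: m, c
Cost function: J(m, c) = (1/n) Σ_{i=1..n} (y_i − (m·x_i + c))²
Goal: minimize J(m, c) over m and c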
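The distinction between a loss and a cost can be sketched in a few lines. Squared error is used as the per-example loss here; that choice, the function names, and the sample values are assumptions made for illustration.

```python
def squared_loss(y_actual, y_predicted):
    """Loss: the error of a single training example."""
    return (y_actual - y_predicted) ** 2


def cost(actual_values, predicted_values):
    """Cost: the average loss across the whole training set."""
    losses = [squared_loss(a, p) for a, p in zip(actual_values, predicted_values)]
    return sum(losses) / len(losses)


actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

print([squared_loss(a, p) for a, p in zip(actual, predicted)])  # [1.0, 0.0, 4.0]
print(cost(actual, predicted))  # (1 + 0 + 4) / 3, about 1.667
```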
How does Gradient Descent work?
Before looking at the working principle of gradient descent, we should recall how to
find the slope of a line from linear regression. The equation for simple
linear regression is given as:
Y = mX + c
where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.

The starting point (shown in the above fig.) is just an arbitrary point at which the
performance is first evaluated. At this starting point, we compute the first
derivative, that is, the slope of the tangent line, to measure the steepness of the curve.
This slope then informs the updates to the parameters (weights and bias).
The slope is steep at the starting point, but as new parameters are generated the
steepness gradually reduces, until the slope approaches zero at the lowest point of the
curve, which is called the point of convergence.
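The procedure described above can be sketched for the line Y = mX + c. The sketch assumes mean squared error as the cost function; the learning rate, the number of steps, and the noise-free data are illustrative choices, not values from the notes.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(steps):
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) with respect to m and c.
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2.0 / n) * sum(errors)
        # Step against the gradient, scaled by the learning rate alpha.
        m -= alpha * grad_m
        c -= alpha * grad_c
    return m, c


# Noise-free illustration data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, c = gradient_descent(xs, ys)
print(round(m, 4), round(c, 4))  # close to 2.0 and 1.0
```

Near the minimum both gradients shrink toward zero, so the updates become tiny; this is the point of convergence described above.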
The main objective of gradient descent is to minimize the cost function, that is, the error
between expected and actual values. To minimize the cost function, two quantities are required:
o Direction and Learning Rate
These two factors determine the partial derivative calculations of future iterations
and allow the algorithm to reach the point of convergence, a local or global minimum.
Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is typically
a small value that is evaluated and updated based on the behavior of the cost function. If
the learning rate is high, it results in larger steps but risks overshooting
the minimum. A low learning rate gives small step sizes, which
compromises overall efficiency, since more iterations are needed, but gives the advantage
of more precision.
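The effect of the learning rate is easy to see on a one-dimensional example. The sketch below minimizes f(x) = x², whose derivative is 2x; the function and the three learning rates are chosen purely for illustration.

```python
def descend(alpha, x0=1.0, steps=50):
    """Run gradient descent on f(x) = x^2 and return the final x."""
    x = x0
    for _ in range(steps):
        gradient = 2.0 * x  # derivative of x^2
        x -= alpha * gradient
    return x


slow = descend(alpha=0.01)      # small steps: still far from the minimum at 0
good = descend(alpha=0.1)       # moderate steps: very close to 0
diverged = descend(alpha=1.1)   # steps too large: overshoots further each time

print(slow, good, diverged)
```

With alpha = 1.1 every update multiplies x by −1.2, so the iterate jumps across the minimum and moves further away each step.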
Types of Gradient Descent
Based on how much of the training data is used to compute each update, the Gradient
Descent learning algorithm can be divided into Batch gradient descent, stochastic gradient
descent, and mini-batch gradient descent. Let's understand these different types of gradient descent:
1. Batch Gradient Descent:
Batch gradient descent (BGD) computes the error for each point in the training set
and updates the model only after evaluating all training examples. One such pass over the
full training set is known as a training epoch. In simple words, we have to sum over
all examples for each update.
Advantages of Batch gradient descent:
o It produces less noise in comparison to other gradient descent variants.
o It produces stable gradient descent convergence.
o It is computationally efficient, as each update is computed from all training samples at once.
2. Stochastic gradient descent
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training
example per iteration. In other words, within each training epoch it visits the examples of
the dataset one at a time and updates the model's parameters after each one. As it
requires only one training example at a time, it is easier to fit in allocated
memory. However, it loses some computational efficiency in comparison to batch
gradient descent, because its frequent updates cannot process many examples at once.
Further, due to the frequent updates, its gradient is noisy. However,
this noise can sometimes be helpful for escaping a local minimum and finding the global
minimum.
Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example, which gives it
a few advantages over other gradient descent variants.
o It is easier to fit in allocated memory.
o Each update is faster to compute than in batch gradient descent.
o It is more efficient for large datasets.
3. Mini-Batch Gradient Descent:
Mini-batch gradient descent is the combination of both batch gradient descent and
stochastic gradient descent. It divides the training dataset into small batches and then
performs an update on each batch separately. Splitting the training dataset into smaller
batches strikes a balance between the computational efficiency of batch gradient descent
and the speed of stochastic gradient descent. Hence, we get a type of gradient
descent with higher computational efficiency and a less noisy gradient.
Advantages of Mini Batch gradient descent:
o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.
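The three variants differ only in how many examples feed each update, so a single routine with a `batch_size` parameter can sketch all of them. The data, learning rate, epoch count, and seed below are illustrative assumptions, and mean squared error is assumed as the cost.

```python
import random


def minibatch_gd(xs, ys, batch_size, alpha=0.02, epochs=2000, seed=0):
    """Fit y = m*x + c; batch_size selects batch, mini-batch or stochastic GD."""
    rng = random.Random(seed)
    m, c = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(m * xs[i] + c) - ys[i] for i in batch]
            # Gradient of the mean squared error over this batch only.
            grad_m = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_c = 2.0 * sum(errors) / len(batch)
            m -= alpha * grad_m
            c -= alpha * grad_c
    return m, c


# Noise-free illustration data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

batch_fit = minibatch_gd(xs, ys, batch_size=len(xs))  # one update per epoch
mini_fit = minibatch_gd(xs, ys, batch_size=2)         # a few updates per epoch
sgd_fit = minibatch_gd(xs, ys, batch_size=1)          # one update per example

print(batch_fit, mini_fit, sgd_fit)  # each close to (2.0, 1.0)
```

On noisy real data the stochastic and mini-batch variants would hover around the minimum instead of settling exactly, which is the noise the text refers to.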
Bias and Variance in Machine Learning
Machine learning is a branch of Artificial Intelligence which allows machines to perform
data analysis and make predictions. However, if the machine learning model is not
accurate, it makes prediction errors, and the two main sources of these errors are known
as Bias and Variance. In machine learning, these errors will always be present as there is
always a slight difference between the model predictions and the actual values. The main
aim of ML/data science analysts is to reduce these errors in order to get more accurate
results. In this topic, we are going to discuss bias and variance, Bias-variance trade-off,
Underfitting and Overfitting. But before starting, let's first understand what errors in
machine learning are.

Errors in Machine Learning
In machine learning, an error is a measure of how accurately an algorithm can make
predictions for a previously unseen dataset. On the basis of these errors, the machine
learning model that performs best on the particular dataset is selected. There are
mainly two types of errors in machine learning:
o Reducible errors: These errors can be reduced to improve the model accuracy. Such
errors can further be classified into bias and variance.
o Irreducible errors: These errors will always be present in the model
regardless of which algorithm has been used. They are caused by unknown
variables whose effect can't be reduced.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it and makes
predictions. While training, the model learns these patterns in the dataset and applies
them to test data for prediction. While making predictions, a difference occurs
between the prediction values made by the model and the actual values/expected
values, and this difference is known as bias error or error due to bias. It can be
defined as the inability of machine learning algorithms such as Linear Regression to
capture the true relationship between the data points. Each algorithm begins with some
amount of bias because bias arises from assumptions in the model, which make the
target function simpler to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about the form of the
target function.
o High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias model
also cannot perform well on new data.
Generally, a linear algorithm has a high bias; its strong assumptions are also what make
it fast to learn. The simpler the algorithm, the more bias it is likely to introduce,
whereas a nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest
Neighbours and Support Vector Machines. At the same time, algorithms with
high bias are Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Ways to reduce High Bias:
High bias mainly occurs due to an overly simple model. Below are some ways to reduce
high bias:
o Increase the input features as the model is underfitted.
o Decrease the regularization term.
o Use more complex models, such as including some polynomial features.
What is a Variance Error?
Variance specifies the amount of variation in the prediction if different
training data were used. In simple words, variance tells how much a random
variable differs from its expected value. Ideally, a model should not vary too much
from one training dataset to another, which means the algorithm should be good at
understanding the hidden mapping between input and output variables. Variance errors
are either low variance or high variance.
Low variance means there is a small variation in the prediction of the target function with
changes in the training data set. At the same time, High variance shows a large variation
in the prediction of the target function with changes in the training dataset.
A model that shows high variance learns a lot and performs well with the training dataset,
but does not generalize well to an unseen dataset. As a result, such a model gives good
results with the training dataset but shows high error rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to
overfitting of the model. A model with high variance has the below problems:
o A high variance model leads to overfitting.
o It increases model complexity.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the data, have high variance.

Some examples of machine learning algorithms with low variance are, Linear Regression,
Logistic Regression, and Linear discriminant analysis. At the same time, algorithms
with high variance are decision tree, Support Vector Machine, and K-nearest
neighbours.
Ways to Reduce High Variance:
o Reduce the input features or the number of parameters, as the model is overfitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the Regularization term.
Different Combinations of Bias-Variance
There are four possible combinations of bias and variances, which are represented by the
below diagram:

1. Low-Bias, Low-Variance: The combination of low bias and low variance shows an ideal
machine learning model. However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are
inconsistent but accurate on average. This case occurs when the model learns with
a large number of parameters, and it leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are
consistent but inaccurate on average. This case occurs when a model does not learn
well from the training dataset or uses very few parameters. It leads
to underfitting problems in the model.
4. High-Bias, High-Variance: With high bias and high variance, predictions are
inconsistent and also inaccurate on average.
How to identify High variance or High Bias?
High variance can be identified if the model has:
o Low training error and high test error.
High Bias can be identified if the model has:
o High training error, with the test error almost similar to the training error.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the
model has a large number of parameters, it will have high variance and low bias. So, it is
required to make a balance between bias and variance errors, and this balance between
the bias error and variance error is known as the Bias-Variance trade-off.

For an accurate prediction of the model, algorithms need a low variance and low bias. But
this is not possible because bias and variance are related to each other:
o If we decrease the variance, it will increase the bias.
o If we decrease the bias, it will increase the variance.
Bias-Variance trade-off is a central issue in supervised learning. Ideally, we need a model
that accurately captures the regularities in training data and simultaneously generalizes
well to the unseen dataset. Unfortunately, doing both perfectly is not possible,
because a high variance algorithm may perform well with training data, but it may
overfit to noisy data, whereas a high bias algorithm generates an overly simple model
that may not even capture important regularities in the data. So, we need to find a sweet
spot between bias and variance to make an optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot that balances
bias and variance errors.
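The trade-off can be observed in a small simulation. The sketch below repeatedly draws noisy training sets from a sine curve, fits a simple (degree 1) and a flexible (degree 7) polynomial with NumPy, and estimates squared bias and variance of the predictions. The target function, noise level, degrees, and sample sizes are all assumptions chosen for illustration.

```python
import numpy as np


def bias_variance(degree, trials=200, n_train=20, noise=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 50)
    true_test = np.sin(2 * np.pi * x_test)  # assumed "true" function

    predictions = np.empty((trials, x_test.size))
    for t in range(trials):
        # A fresh noisy training set for every trial.
        y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, noise, n_train)
        coefficients = np.polyfit(x_train, y_train, degree)
        predictions[t] = np.polyval(coefficients, x_test)

    mean_prediction = predictions.mean(axis=0)
    bias_squared = float(np.mean((mean_prediction - true_test) ** 2))
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_squared, variance


simple_bias, simple_variance = bias_variance(degree=1)
flexible_bias, flexible_variance = bias_variance(degree=7)

print("degree 1:", simple_bias, simple_variance)      # higher bias, lower variance
print("degree 7:", flexible_bias, flexible_variance)  # lower bias, higher variance
```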

Underfitting and Overfitting


When we talk about a Machine Learning model, we actually talk about how well it
performs, that is, its accuracy, which is measured through prediction errors. Let us
consider that we are designing a machine learning model. A model is said to be a good
machine learning model if it generalizes properly to any new input data from the problem
domain. This helps us to make predictions about future data that the model has never
seen. Now, suppose we want to check how well our machine learning model learns and
generalizes to new data. For that, we look at overfitting and underfitting, which are
majorly responsible for the poor performance of machine learning algorithms.
Before diving further let's understand two important terms:
• Bias: Assumptions made by a model to make a function easier to learn. In practice it
shows up as the error rate on the training data. When the error rate has a high value, we
call it high bias, and when the error rate has a low value, we call it low bias.
• Variance: The difference between the error rate on training data and on testing data is
called variance. If the difference is high then it's called high variance, and when the
difference of errors is low then it's called low variance. Usually, we want low
variance so that our model generalizes.

Underfitting: A statistical model or a machine learning algorithm is said to be
underfitting when it cannot capture the underlying trend of the data, i.e., it performs
poorly on the training data as well as on the testing data. (It's just like trying to fit
undersized pants!) Underfitting destroys the accuracy of our machine learning model. Its
occurrence simply means that our model or the algorithm does not fit the data well
enough. It usually happens when we have too little data to build an accurate model, or
when we try to fit a linear model to non-linear data. In such cases, the rules of
the machine learning model are too simple and inflexible to describe the data,
and therefore the model will probably make a lot of wrong predictions. Underfitting can be
avoided by using more data and by giving the model more capacity, for example through
additional informative features.
In a nutshell, underfitting refers to a model that can neither perform well on the training
data nor generalize to new data.
Reasons for Underfitting:
1. High bias and low variance
2. The size of the training dataset used is not enough.
3. The model is too simple.
4. Training data is not cleaned and also contains noise in it.
Techniques to reduce underfitting:

1. Increase model complexity
2. Increase the number of features, performing feature engineering
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.
Overfitting: A statistical model is said to be overfitted when it fits the training data
closely but does not make accurate predictions on testing data. When a model is trained
for too long or with too much flexibility, it
starts learning from the noise and inaccurate data entries in our data set. Testing
with test data then results in high variance, and the model does not categorize the
data correctly because of too many details and noise. Overfitting is most common with
non-parametric and non-linear methods, because these types of machine learning
algorithms have more freedom in building the model based on the dataset and therefore
they can build unrealistic models. A solution to avoid overfitting is using a linear
algorithm if we have linear data, or limiting parameters like the maximal depth if we are
using decision trees.
In a nutshell, overfitting is a problem where the performance of a machine learning
algorithm on training data is much better than on unseen data.
Reasons for Overfitting are as follows:
1. High variance and low bias
2. The model is too complex
3. The size of the training data is too small

Examples:
Techniques to reduce overfitting:
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (monitor the loss on held-out data during
training and stop as soon as it begins to increase).
4. Ridge Regularization and Lasso Regularization
5. Use dropout for neural networks to tackle overfitting.

Good Fit in a Statistical Model: Ideally, a model that makes its predictions
with 0 error is said to have a good fit on the data. In practice, a good fit is found at a spot
between overfitting and underfitting. In order to understand it, we will have to look at the
performance of our model with the passage of time, while it is learning from the training
dataset.
With the passage of time, our model will keep on learning, and thus the error of the model
on the training and testing data will keep on decreasing. If it learns for too long, the
model will become more prone to overfitting due to the presence of noise and less useful
details, and the error on the testing data will start to increase. In order to get a good fit,
we will stop at a point just before the error starts increasing. At this point, the model
is said to perform well on the training dataset as well as on our unseen testing dataset.
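Stopping just before the error starts increasing is usually called early stopping. A minimal sketch follows, assuming we already have the loss on held-out data for each epoch; the loss values and the `patience` parameter are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch with the lowest held-out loss, stopping once the loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # the loss keeps rising: stop training here
    return best_epoch


# Made-up held-out losses: they fall, then rise as the model starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.60]
print(early_stopping_epoch(losses))  # 3
```

In a real training loop the parameters saved at the returned epoch would be the ones kept.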

Confusion Matrix in Machine Learning


The confusion matrix is a matrix used to determine the performance of
classification models for a given set of test data. It can only be determined if the true
values for the test data are known. The matrix itself can be easily understood, but the
related terminology may be confusing. Since it shows the errors in the model performance
in the form of a matrix, it is also known as an error matrix. Some features of the confusion
matrix are given below:
o For a classifier with 2 prediction classes, the matrix is a 2×2 table; for 3 classes, it
is a 3×3 table, and so on.
o The matrix is divided into two dimensions, predicted values and actual
values, along with the total number of predictions.
o Predicted values are the values predicted by the model, and actual
values are the true values for the given observations.
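o It looks like the below table:

                  Predicted: No      Predicted: Yes
Actual: No        True Negative      False Positive
Actual: Yes       False Negative     True Positive

The above table has the following cases: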
o True Negative: The model has predicted No, and the actual value was also No.
o True Positive: The model has predicted Yes, and the actual value was also Yes.
o False Negative: The model has predicted No, but the actual value was Yes. It is also
called a Type-II error.
o False Positive: The model has predicted Yes, but the actual value was No. It is also
called a Type-I error.
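The four cases can be counted directly from paired lists of actual and predicted labels. The label lists below are invented for illustration, and the function name is hypothetical.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, TN, FP and FN for a two-class problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1  # predicted Yes, actually Yes
        elif p == positive:
            counts["FP"] += 1  # predicted Yes, actually No (Type-I error)
        elif a == positive:
            counts["FN"] += 1  # predicted No, actually Yes (Type-II error)
        else:
            counts["TN"] += 1  # predicted No, actually No
    return counts


# Invented labels for eight observations.
actual = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "No"]
predicted = ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
```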
Need for Confusion Matrix in Machine learning
o It evaluates the performance of the classification models, when they make
predictions on test data, and tells how good our classification model is.
o It not only tells the error made by the classifiers but also the type of errors such as
it is either type-I or type-II error.
o With the help of the confusion matrix, we can calculate the different parameters for
the model, such as accuracy, precision, etc.
Example: We can understand the confusion matrix using an example.
Suppose we are trying to create a model that can predict whether
a person has a particular disease or not. The confusion matrix for this is given as:

n = 100           Predicted: No      Predicted: Yes
Actual: No        TN = 65            FP = 8
Actual: Yes       FN = 3             TP = 24

From the above example, we can conclude that:
o The table is given for the two-class classifier, which has two predictions "Yes" and
"No." Here, Yes means that the patient has the disease, and No means that the patient
does not have that disease.
o The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are
correct predictions, and 11 are incorrect predictions.
o The model has predicted "Yes" 32 times and "No" 68 times, whereas
the actual "Yes" occurred 27 times and the actual "No" 73 times.
Calculations using Confusion Matrix:
We can perform various calculations for the model, such as the model's accuracy, using
this matrix. These calculations are given below:
o Classification Accuracy: It is one of the important parameters for evaluating
classification problems. It defines how often the model predicts the
correct output. It can be calculated as the ratio of the number of correct predictions
made by the classifier to the total number of predictions made by the classifier. The
formula is given below:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

o Misclassification rate: It is also termed the Error rate, and it defines how often the
model gives wrong predictions. The value of the error rate can be calculated as the ratio
of the number of incorrect predictions to the total number of predictions made by the
classifier. The formula is given below:
Error rate = (FP + FN) / (TP + TN + FP + FN)

o Precision: Out of all the instances that the model predicted as positive, how many
of them were actually positive. It can be calculated using the below formula:
Precision = TP / (TP + FP)
o Recall: Out of all the actually positive instances, how many our model predicted
correctly. The recall must be as high as possible.
Recall = TP / (TP + FN)

o F-measure: If two models have low precision and high recall or vice versa, it is
difficult to compare these models. So, for this purpose, we can use the F-score. This
score helps us to evaluate the recall and precision at the same time. For a given sum of
the two, the F-score is maximum if the recall is equal to the precision. It can be
calculated using the below formula:
F-measure = (2 × Recall × Precision) / (Recall + Precision)
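These calculations can be sketched for the disease example. The totals quoted there (100 predictions, 89 correct, 32 predicted "Yes", 27 actual "Yes") imply TP = 24, TN = 65, FP = 8 and FN = 3; the formulas used are the standard definitions of each measure.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard measures derived from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Counts implied by the totals quoted in the disease example.
metrics = classification_metrics(tp=24, tn=65, fp=8, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy 0.8900, error_rate 0.1100, precision 0.7500,
# recall 0.8889, f_measure 0.8136
```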

Other important terms used in Confusion Matrix:


o Null Error rate: It defines how often our model would be incorrect if it always
predicted the majority class. As per the accuracy paradox, the best classifier for a
particular application will sometimes have a higher error rate than the null error rate.
o ROC Curve: The ROC is a graph displaying a classifier's performance for all possible
thresholds. The graph is plotted between the true positive rate (on the Y-axis) and
the false Positive rate (on the x-axis).
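The points of a ROC curve can be computed by sweeping a threshold over the classifier's scores. The sketch below assumes the classifier outputs a score per example and predicts positive when the score is at or above the threshold; the labels and scores are invented for illustration.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Invented labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with the false positive rate on the x-axis and the true positive rate on the y-axis gives the ROC curve; the last point is always (1, 1), where everything is predicted positive.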
