Machine Learning Unit1


Machine Learning

Techniques [KAI-601]
CO-PO (Course Outcome- Program Outcome)

CO No.          Course Outcomes According to Bloom's Cognitive Level

KAI601.1 (CO1)  Define [L1: Remember] the characteristics of machine learning that make it useful to real-world problems.
KAI601.2 (CO2)  Describe [L2: Understand] a collection of machine learning algorithms and problems.
KAI601.3 (CO3)  Apply [L3: Apply] machine learning algorithms: theoretical foundations of decision trees, Instance-Based Learning, Reinforcement Learning Algorithms and Genetic Algorithms.
KAI601.4 (CO4)  Analyze [L4: Analysis] the working of machine learning and neural network with deep learning algorithms and models.
Text books:

1. Tom M. Mitchell, "Machine Learning", McGraw-Hill Education (India) Private Limited, 2013.
2. Ethem Alpaydin, "Introduction to Machine Learning" (Adaptive Computation and Machine Learning), MIT Press.
3. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press, 2009.
4. Christopher M. Bishop, "Pattern Recognition and Machine Learning", Berlin: Springer-Verlag.
5. M. Gopal, "Applied Machine Learning", McGraw Hill Education.
6. NPTEL, Coursera
Detailed Syllabus
UNIT-1

INTRODUCTION – Learning, Types of Learning, Well-defined learning
problems, Designing a Learning System, History of ML,
Introduction of Machine Learning Approaches – (Artificial Neural
Network, Clustering, Reinforcement Learning, Decision Tree
Learning, Bayesian networks, Support Vector Machine, Genetic
Algorithm), Issues in Machine Learning and Data Science Vs
Machine Learning;
UNIT-2

REGRESSION: Linear Regression and Logistic Regression


BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes
Optimal Classifier, Naïve Bayes classifier, Bayesian belief networks,
EM algorithm.
SUPPORT VECTOR MACHINE: Introduction, Types of support
vector kernel – (Linear kernel, polynomial kernel, and
Gaussian kernel), Hyperplane – (Decision surface), Properties of
SVM, and Issues in SVM
UNIT-3

DECISION TREE LEARNING - Decision tree learning algorithm,


Inductive bias, Inductive inference with decision trees, Entropy and
information theory, Information gain, ID-3 Algorithm, Issues in
Decision tree learning.
INSTANCE-BASED LEARNING – k-Nearest Neighbour Learning,
Locally Weighted Regression, Radial basis function networks,
Case-based learning.
UNIT-4

ARTIFICIAL NEURAL NETWORKS – Perceptrons, Multilayer


perceptron, Gradient descent and the Delta rule, Multilayer networks,
Derivation of Backpropagation Algorithm, Generalization,
Unsupervised Learning – SOM Algorithm and its variant;
DEEP LEARNING - Introduction, concept of convolutional neural
network, Types of layers – (Convolutional Layers, Activation
function, Pooling, Fully connected), Concept of Convolution (1D
and 2D) layers, Training of network, Case study of CNN, e.g., on
Diabetic Retinopathy, Building a smart speaker, Self-driving car etc.
UNIT-5

REINFORCEMENT LEARNING – Introduction to Reinforcement
Learning, Learning Task, Example of Reinforcement Learning in
Practice, Learning Models for Reinforcement – (Markov Decision
Process, Q Learning - Q Learning function, Q Learning Algorithm),
Application of Reinforcement Learning, Introduction to Deep Q
Learning.
GENETIC ALGORITHMS: Introduction, Components, GA cycle of
reproduction, Crossover, Mutation, Genetic Programming, Models of
Evolution and Learning, Applications.
Learning

Machine learning is an application of artificial intelligence (AI) that
provides systems the ability to automatically learn and improve from
experience without being explicitly programmed. Machine learning
focuses on the development of computer programs that can access
data and use it to learn for themselves.

The process of learning begins with observations or data, such as
examples, direct experience, or instruction, in order to look for
patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow computers
to learn automatically, without human intervention or assistance,
and adjust actions accordingly.
Traditional Learning v/s Machine Learning
Learning System
Example of Machine Learning

What is Machine Learning?


Have you ever shopped online? While checking for a product, did you
notice when it recommends a product similar to what you are looking
for? Or did you notice the "people who bought this product also
bought this" combination of products? How are they making these
recommendations? This is machine learning.
Example of Machine Learning

Did you ever get a call from a bank or finance company asking you to
take a loan or an insurance policy? Do they call everyone? No, they
call only a few selected customers who they think will purchase their
product. How do they select them? This is target marketing, which can
be implemented using Clustering. This is machine learning.
Types of Learning

Machine learning is sub-categorized into three types:

• Supervised Learning – Train me!
• Unsupervised Learning – I am self-sufficient in learning.
• Reinforcement Learning – My life, my rules! (hit and trial)
What is Supervised Learning?

Supervised Learning is learning that is guided by a teacher. We have a
dataset which acts as the teacher, and its role is to train the model or
the machine. Once the model is trained, it can start making predictions
or decisions when new data is given to it.
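As a minimal sketch of this idea (assuming scikit-learn is available; the toy fruit dataset and the choice of a decision-tree classifier are illustrative, not from the original notes), the labeled dataset acts as the "teacher" and the trained model then predicts for unseen data:

from sklearn.tree import DecisionTreeClassifier

# Labeled training data: [weight_grams, smoothness_0_to_10] -> fruit label
X_train = [[150, 8], [170, 9], [130, 7], [300, 3], [350, 2], [320, 4]]
y_train = ["apple", "apple", "apple", "mango", "mango", "mango"]

# The labeled dataset "teaches" the model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Once trained, the model can make predictions for new data
print(model.predict([[160, 8]]))  # expected: ['apple']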
Block Diagram of Supervised Learning
What is Unsupervised Learning?

The model learns through observation and finds structures in the data.
Once the model is given a dataset, it automatically finds patterns and
relationships in the dataset by creating clusters in it. What it cannot do
is add labels to the clusters: it cannot say "this is a group of apples" or
"mangoes", but it will separate all the apples from the mangoes.

Suppose we presented images of apples, bananas and mangoes to the
model. Based on some patterns and relationships, it creates clusters
and divides the dataset into those clusters. Now if new data is fed to
the model, it adds it to one of the created clusters.
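A minimal sketch of this behaviour, assuming scikit-learn and purely illustrative feature vectors: the model groups unlabeled points and assigns a new point to one of the discovered clusters (the cluster indices carry no names):

from sklearn.cluster import KMeans

# Unlabeled data: [weight_grams, color_score]; no fruit names are given
X = [[150, 8], [160, 9], [140, 7], [300, 2], [320, 3], [310, 2]]

# Ask for 2 clusters; the model discovers the grouping on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)              # e.g. [0 0 0 1 1 1] -- unnamed groups

# New data is simply placed into one of the created clusters
print(kmeans.predict([[155, 8]]))  # joins the same cluster as the first points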
Block Diagram of Unsupervised Learning
What is Reinforcement Learning?

It is the ability of an agent to interact with the environment and find
out what the best outcome is. It follows the concept of the hit-and-trial
method. The agent is rewarded or penalized with a point for a correct
or a wrong answer, and on the basis of the positive reward points
gained, the model trains itself. Once trained, it is again ready to
predict when new data is presented to it.
Block Diagram of Reinforcement Learning
Semi-supervised machine learning

Semi-supervised machine learning uses both unlabeled and labeled data sets to train
algorithms. Generally, during semi-supervised machine learning, algorithms are first
fed a small amount of labeled data to help direct their development and then fed
much larger quantities of unlabeled data to complete the model.

For example, an algorithm may be fed a smaller quantity of labeled speech data and
then trained on a much larger set of unlabeled speech data in order to create a
machine learning model capable of speech recognition.

Semi-supervised machine learning is often employed to train algorithms for
classification and prediction purposes in cases where large volumes of labeled
data are unavailable.
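As a minimal sketch (assuming scikit-learn; the tiny dataset is illustrative), scikit-learn's SelfTrainingClassifier follows exactly this recipe: a small labeled set guides the model, and unlabeled samples (marked with -1) are progressively pseudo-labeled:

import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# A small labeled set plus a larger unlabeled set (label -1 = unlabeled)
X = np.array([[0.0], [0.2], [0.9], [1.1], [0.1], [1.0], [0.3], [0.8]])
y = np.array([0,     0,     1,     1,     -1,    -1,    -1,    -1])

# The base classifier is trained on the labeled data first, then its most
# confident predictions on unlabeled data are added as pseudo-labels.
model = SelfTrainingClassifier(SVC(probability=True))
model.fit(X, y)

print(model.predict([[0.15], [0.95]]))  # expected: [0 1]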
Supervised Learning vs Unsupervised Learning
Well-Posed Learning Problems

• Definition:
A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.

Well-Posed Learning Problems : Examples

• A checkers learning problem


– Task T : playing checkers
– Performance measure P : percent of games won against opponents
– Training experience E : playing practice games against itself

• A handwriting recognition learning problem


– Task T : recognizing and classifying handwritten words within images
– Performance measure P : percent of words correctly classified
– Training experience E : a database of handwritten words with given
classifications

Well-Posed Learning Problems : Examples (cont.)

• A robot driving learning problem


– Task T : driving on public four-lane highways using vision sensors
– Performance measure P : average distance traveled before an error (as
judged by human overseer)
– Training experience E : a sequence of images and steering commands
recorded while observing a human driver

Designing a Learning System

• Choosing the Training Experience


• Choosing the Target Function
• Choosing a Representation for the Target Function
• Choosing a Function Approximation Algorithm
• The Final Design

Choosing the Training Experience

• Whether the training experience provides direct or indirect


feedback regarding the choices made by the performance system:
• Example:
– Direct training examples in learning to play checkers consist of
individual checkers board states and the correct move for each.
– Indirect training examples in the same game consist of the
move sequences and final outcomes of various games played in
which information about the correctness of specific moves
early in the game must be inferred indirectly from the fact that
the game was eventually won or lost – credit assignment
problem.

Choosing the Training Experience (cont.)

• The degree to which the learner controls the sequence of


training examples:
• Example:
– The learner might rely on the teacher to select informative
board states and to provide the correct move for each
– The learner might itself propose board states that it finds
particularly confusing and ask the teacher for the correct move.
Or the learner may have complete control over the board states
and (indirect) classifications, as it does when it learns by
playing against itself with no teacher present.

Choosing the Training Experience (cont.)

• How well it represents the distribution of examples over which the


final system performance P must be measured: In general learning
is most reliable when the training examples follow a distribution
similar to that of future test examples.
• Example:
– If the training experience in playing checkers consists only of
games played against itself, the learner might never encounter
certain crucial board states that are very likely to be played
by the human checkers champion. (Note, however, that most
current theory of machine learning rests on the crucial
assumption that the distribution of training examples is
identical to the distribution of test examples.)

Choosing the Target Function

• To determine what type of knowledge will be learned and


how this will be used by the performance program:
• Example:
– In playing checkers, the program needs to learn to choose the best
move among the legal moves: ChooseMove: B → M, which
accepts as input any board from the set of legal board states B
and produces as output some move from the set of legal moves
M.

Choosing the Target Function (cont.)

• Since the target function such as ChooseMove turns


out to be very difficult to learn given the kind of
indirect training experience available to the system,
an alternative target function is then an evaluation
function that assigns a numerical score to any given
board state, V: B → R.
Choosing a Representation for the Target
Function

• Given the ideal target function V, we choose a


representation that the learning system will use
to describe V' that it will learn:
• Example:
– In playing checkers,
V'(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
– where wi is the numerical coefficient (weight) that determines the
relative importance of the various board features, and xi is the
number of objects of the i-th type on the board.
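A minimal sketch of this linear evaluation function in Python (the weight and feature values below are illustrative placeholders, not the actual checkers features from the original design):

# Linear evaluation function V'(b) = w0 + w1*x1 + ... + w6*x6
def evaluate(weights, features):
    """weights: [w0, w1, ..., w6]; features: [x1, ..., x6] for a board b."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

weights = [0.5, 1.0, -1.0, 0.4, -0.4, 2.0, -2.0]   # hypothetical weights
board_features = [12, 11, 1, 0, 3, 2]              # hypothetical x1..x6
print(evaluate(weights, board_features))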
Choosing a Function Approximation Algorithm

• Each training example is given by <b, Vtrain(b)> where Vtrain(b) is the


training value for a board b.
• Estimating Training Values:
Vtrain(b) ← V' (Successor(b)).

• Adjusting the weights: To specify the learning algorithm for choosing the
weights wi to best fit the set of training examples {<b, Vtrain(b)>}, which
minimizes the squared error E between the training values and the values
predicted by the hypothesis V‘

• E = Σ over <b, Vtrain(b)> ∈ training examples of (Vtrain(b) − V'(b))²
Choosing a Function Approximation Algorithm
(cont.)
• E = Σ over <b, Vtrain(b)> ∈ training examples of (Vtrain(b) − V'(b))²

• To minimize E, the following rule is used:

LMS weight update rule

For each training example <b, Vtrain(b)>


Use the current weights to calculate V'(b)
For each weight wi , update it as
wi ← wi + η (Vtrain(b) – V'(b)) xi
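A minimal sketch of this LMS weight update in Python, reusing the evaluate() helper sketched above (the learning rate and the training value of 100.0 are illustrative assumptions):

def lms_update(weights, features, v_train, lr=0.01):
    """One LMS step: w_i <- w_i + lr * (V_train(b) - V'(b)) * x_i."""
    error = v_train - evaluate(weights, features)
    weights[0] += lr * error                  # bias term uses x0 = 1
    for i, x in enumerate(features, start=1):
        weights[i] += lr * error * x
    return weights

# One training example <b, Vtrain(b)>: a board's features and its training value
weights = lms_update(weights, board_features, v_train=100.0)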
The Final Design

• Performance System: To solve the given performance task by using
the learned target function(s). It takes an instance of a new problem
(new game) as input and produces a trace of its solution (game
history) as output.

• Critic: To take as input the history or trace of the game and


produce as output a set of training examples of the target function.
The Final Design (cont.)

• Generalizer: To take as input the training examples and produce


an output hypothesis that is its estimate of the target function. It
generalizes from the specific training examples, hypothesizing a
general function that covers these examples and other cases
beyond the training examples.
The Final Design (cont.)

• Experiment Generator: To take as input the current hypothesis


(currently learned function) and outputs a new problem (i.e., initial
board state) for Performance System to explore. Its role is to pick
new practice problems that will maximize the learning rate of the
overall system.
The Final Design (cont.)

[Figure 1: Final design of the checkers learning program. The Experiment
Generator feeds a new problem (an initial game board) to the Performance
System; the Performance System outputs a solution trace (game history) to
the Critic; the Critic produces training examples
{<b1, Vtrain(b1)>, <b2, Vtrain(b2)>, ...} for the Generalizer; and the
Generalizer outputs a hypothesis (V′) back to the Experiment Generator.]


Choices in Designing the Checkers Learning
Problem

• Determine type of training experience: games against experts, games
against self, a table of correct moves, ...
• Determine target function: Board → move, Board → value, ...
• Determine representation of learned function: polynomial, linear
function of six features, artificial neural network, ...
• Determine learning algorithm: gradient descent, linear programming, ...
• Completed design
History of ML

1950s: Samuel’s Checker-Playing Program


1960s: Neural Network: Rosenblatt’s Perceptron (Inventor of ANN)
Pattern Recognition
Minsky & Papert Prove Limitations of Perceptron
1970s: Symbolic Concept Introduction
Expert Systems & Knowledge Acquisition Bottleneck
Quinlan’s ID3
NLP
1980s: Advanced Decision Trees & Rule Learning
Focus on experimental methodology
Resurgence of Neural Network
History of ML

1990s: ML & Statistics


Support Vector Machines
Data Mining
Adaptive Agents & Web Applications
Text Learning
Reinforcement Learning
Ensembles
Bayes Net Learning
1994: Self Driving Cars Road Test
1997: Deep Blue defeated Garry Kasparov in Chess Exhibition Match.
2009: Google Builds Self Driving Cars
2011: Watson wins Jeopardy
2014: Human vision surpassed by ML systems on some image recognition benchmarks
History of ML

Machine Learning the Game of Checkers


Arthur Samuel of IBM developed a computer program for playing
checkers in the 1950s. Since the program had a very small amount of
computer memory available, Samuel initiated what is
called alpha-beta pruning. His design included a scoring function
using the positions of the pieces on the board. The scoring function
attempted to measure the chances of each side winning. The program
chose its next move using a minimax strategy, which eventually
evolved into the minimax algorithm.
Samuel also designed a number of mechanisms allowing his program
to become better. In what Samuel called rote learning, his program
recorded/remembered all positions it had already seen and combined
this with the values of the reward function. Arthur Samuel first came
up with the phrase “Machine Learning” in 1952.
History of ML : The Perceptron

In 1957, Frank Rosenblatt – at the Cornell Aeronautical Laboratory –


combined Donald Hebb’s model of brain cell interaction with Arthur
Samuel’s Machine Learning efforts and created the perceptron. The
perceptron was initially planned as a machine, not a program. The software,
originally designed for the IBM 704, was installed in a custom-built machine
called the Mark 1 perceptron, which had been constructed for image
recognition. This made the software and the algorithms transferable and
available for other machines.
Described as the first successful neuro-computer, the Mark I perceptron
nevertheless failed to live up to expectations. Although the
perceptron seemed promising, it could not recognize many kinds of visual
patterns (such as faces), causing frustration and stalling neural network
research. It would be several years before the frustrations of investors and
funding agencies faded. Neural network/Machine Learning research
struggled until a resurgence during the 1990s.
History of ML
Multilayers Provide the Next Step
In the 1960s, the discovery and use of multilayers opened a new path
in neural network research. It was discovered that providing and using
two or more layers in the perceptron offered significantly more
processing power than a perceptron using one layer. Other versions of
neural networks were created after the perceptron opened the door to
“layers” in networks, and the variety of neural networks continues to
expand. The use of multiple layers led to feedforward neural
networks and backpropagation.
Backpropagation, developed in the 1970s, allows a network to adjust
its hidden layers of neurons/nodes to adapt to new situations. It
describes “the backward propagation of errors,” with an error being
processed at the output and then distributed backward through the
network’s layers for learning purposes. Backpropagation is now being
used to train deep neural networks.
History of ML

An Artificial Neural Network (ANN) has hidden layers which are
used to respond to more complicated tasks than the earlier perceptrons
could. ANNs are a primary tool used for Machine Learning. Neural
networks use input and output layers and, normally, include a hidden
layer (or layers) designed to transform input into data that can be used
by the output layer. The hidden layers are excellent for finding
patterns too complex for a human programmer to detect, meaning a
human could not find the pattern and then teach the device to
recognize it.
History of ML

The Nearest Neighbor Algorithm


In 1967, the nearest neighbor algorithm was conceived, which was the
beginning of basic pattern recognition. This algorithm was used for
mapping routes and was one of the earliest algorithms used in finding
a solution to the traveling salesperson’s problem of finding the most
efficient route. Using it, a salesperson enters a selected city and
repeatedly has the program visit the nearest cities until all have been
visited. Marcello Pelillo has been given credit for inventing the
“nearest neighbor rule.”
Machine Learning and Artificial Intelligence take
Separate Paths

In the late 1970s and early 1980s, Artificial Intelligence research had
focused on using logical, knowledge-based approaches rather than
algorithms. Additionally, neural network research was abandoned by
computer science and AI researchers. This caused a schism between
Artificial Intelligence and Machine Learning. Until then, Machine
Learning had been used as a training program for AI.

The Machine Learning industry, which included a large number of


researchers and technicians, was reorganized into a separate field
and struggled for nearly a decade. The industry goal shifted from
training for Artificial Intelligence to solving practical problems in
terms of providing services. Its focus shifted from the approaches
inherited from AI research to methods and tactics used in probability
theory and statistics.
Machine Learning and Artificial Intelligence take
Separate Paths (continued)

During this time, the ML industry maintained its focus on neural


networks and then flourished in the 1990s. Most of this success was a
result of Internet growth, benefiting from the ever-growing
availability of digital data and the ability to share its services by way
of the Internet.
History of ML (Boosting)
“Boosting” was a necessary development for the evolution of Machine
Learning. Boosting algorithms are used to reduce bias during supervised
learning and include ML algorithms that transform weak learners into strong
ones. The concept of boosting was first presented in a 1990 paper titled “The
Strength of Weak Learnability,” by Robert Schapire. Schapire states, “A set
of weak learners can create a single strong learner.” Weak learners are
defined as classifiers that are only slightly correlated with the true
classification (still better than random guessing). By contrast, a strong
learner is a classifier that is well-aligned with the true classification.
Most boosting algorithms consist of repeatedly learning weak
classifiers, which are then added to a final strong classifier. After being
added, they are normally weighted in a way that reflects each weak
learner's accuracy. Then the data weights are "re-weighted": input data
that is misclassified gains a higher weight, while data classified
correctly loses weight.
Boosting

This environment allows future weak learners to focus more
extensively on the data points that previous weak learners
misclassified.
The basic difference between the various types of boosting algorithms
is the technique used in weighting training data points. AdaBoost is
a popular Machine Learning algorithm and historically significant,
being the first algorithm capable of working with weak learners. More
recent algorithms include BrownBoost, LPBoost, MadaBoost,
TotalBoost, XGBoost, and LogitBoost. A large number of boosting
algorithms work within the AnyBoost framework.
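As a minimal sketch (assuming scikit-learn; the generated dataset is illustrative), AdaBoost combines weak decision-stump learners into a strong classifier exactly as described above:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy binary classification problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each round fits a weak learner (a depth-1 decision stump by default),
# re-weights misclassified points upward, and adds the stump to the ensemble.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))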
History of ML

Speech Recognition
Currently, much of speech recognition training is being done by a
Deep Learning technique called Long Short-Term Memory (LSTM), a
neural network model described by Jürgen Schmidhuber and Sepp
Hochreiter in 1997. LSTM can learn tasks that require memory of
events that took place thousands of discrete steps earlier, which is
quite important for speech.
Around the year 2007, Long Short-Term Memory started
outperforming more traditional speech recognition programs. In 2015,
the Google speech recognition program reportedly had a significant
performance jump of 49 percent using a CTC-trained LSTM.
History of ML
Facial Recognition Becomes a Reality
In 2006, the Face Recognition Grand Challenge – a National Institute
of Standards and Technology program – evaluated the popular face
recognition algorithms of the time. 3D face scans, iris images, and
high-resolution face images were tested. Their findings suggested the
new algorithms were ten times more accurate than the facial
recognition algorithms from 2002 and 100 times more accurate than
those from 1995. Some of the algorithms were able to outperform
human participants in recognizing faces and could uniquely identify
identical twins.
In 2012, Google’s X Lab developed an ML algorithm that can
autonomously browse and find videos containing cats. In 2014,
Facebook developed DeepFace, an algorithm capable of recognizing
or verifying individuals in photographs with the same accuracy as
humans.
Machine Learning at Present

Recently, Machine Learning was defined by Stanford University as


“the science of getting computers to act without being explicitly
programmed.” Machine Learning is now responsible for some of the
most significant advancements in technology, such as the new
industry of self-driving vehicles. Machine Learning has prompted a
new array of concepts and technologies, including supervised and
unsupervised learning, new algorithms for robots, the Internet of
Things, analytics tools, chatbots, and more.
Machine Learning at Present

Listed below are seven common ways the world of business is currently
using Machine Learning:
Analyzing Sales Data: Streamlining the data
Real-Time Mobile Personalization: Promoting the experience
Fraud Detection: Detecting pattern changes
Product Recommendations: Customer personalization
Learning Management Systems: Decision-making programs
Dynamic Pricing: Flexible pricing based on a need or demand
Natural Language Processing: Speaking with humans
Machine Learning models have become quite adaptive in continuously
learning, which makes them increasingly accurate the longer they operate.
ML algorithms combined with new computing technologies promote
scalability and improve efficiency. Combined with business analytics,
Machine Learning can resolve a variety of organizational complexities.
Modern ML models can be used to make predictions ranging from outbreaks
of disease to the rise and fall of stocks.
Artificial Neural Networks

Your first step in Deep Learning.


Deep Learning is the most exciting and powerful branch of Machine
Learning. It's a technique that teaches computers to do what comes
naturally to humans: learn by example. Deep learning is a key
technology behind driverless cars, enabling them to recognize a stop
sign or to distinguish a pedestrian from a lamppost. It is the key to
voice control in consumer devices like phones, tablets, TVs, and
hands-free speakers. Deep learning is getting lots of attention lately
and for good reason. It’s achieving results that were not possible
before.
In deep learning, a computer model learns to perform classification
tasks directly from images, text, or sound. Deep learning models can
achieve state-of-the-art accuracy, sometimes exceeding human-level
performance. Models are trained by using a large set of labeled data
and neural network architectures that contain many layers.
Artificial Neural Network

Deep Learning models can be used for a variety of complex tasks:

• Artificial Neural Networks (ANN) for regression and classification
• Convolutional Neural Networks (CNN) for computer vision
• Recurrent Neural Networks (RNN) for time series analysis
• Self-organizing maps for feature extraction
• Deep Boltzmann machines for recommendation systems
• Auto encoders for recommendation systems
Artificial Neural Network : Definition

Artificial Neural Networks (ANN) is an information processing
paradigm that is inspired by the way a biological nervous system,
such as the brain, processes information. It is composed of a large
number of highly interconnected processing elements (neurons)
working in unison to solve a specific problem.
Perceptron

The following diagram represents the general model of ANN which is


inspired by a biological neuron. It is also called Perceptron.
A single layer neural network is called a Perceptron. It gives a single
output.
Explanation of Perceptron

In the above figure, for one single observation, x0, x1, x2,
x3...x(n) represent the various inputs (independent variables) to the
network. Each of these inputs is multiplied by a connection weight
or synapse. The weights are represented as w0, w1, w2,
w3....w(n). A weight shows the strength of a particular node.
b is a bias value. A bias value allows you to shift the activation
function up or down.
In the simplest case, these products are summed, fed to a transfer
function (activation function) to generate a result, and this result is
sent as output.
Mathematically: x1·w1 + x2·w2 + x3·w3 + ... + xn·wn = Σ xi·wi
Now the activation function is applied to this weighted sum plus the
bias: φ(Σ xi·wi + b)
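A minimal NumPy sketch of this forward pass (the weights, bias, and the choice of a step activation are illustrative assumptions):

import numpy as np

def perceptron(x, w, b):
    """Weighted sum of inputs plus bias, passed through a step activation."""
    z = np.dot(x, w) + b          # sum(x_i * w_i) + b
    return 1 if z > 0 else 0      # phi: threshold activation

x = np.array([1.0, 0.5, -0.2])   # inputs (independent variables)
w = np.array([0.4, -0.6, 0.9])   # connection weights (synapses)
print(perceptron(x, w, b=0.1))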
Activation function

The Activation function is important for an ANN to learn and make


sense of something really complicated. Their main purpose is to
convert an input signal of a node in an ANN to an output signal. This
output signal is used as input to the next layer in the stack.

Activation function decides whether a neuron should be activated or


not by calculating the weighted sum and further adding bias to it.
The motive is to introduce non-linearity into the output of a neuron.

If we do not apply an activation function then the output signal would
simply be a linear function (a one-degree polynomial). A linear
function is easy to solve, but it is limited in its complexity and has
less power. Without an activation function, our model cannot learn
and model complicated data such as images, videos, audio, speech, etc.
Now the question arises why do we need
Non-Linearity?

Non-Linear functions are those which have a degree more than one
and they have a curvature. Now we need a neural network to learn and
represent almost anything and any arbitrary complex function that
maps an input to output.

Neural Network is considered “Universal Function


Approximators”. It means they can learn and compute any function
at all.
Types of Activation Functions:

1. Threshold Activation Function — (Binary step function)


A Binary step function is a threshold-based activation function. If
the input value is above a certain threshold, the neuron is
activated and sends exactly the same signal to the next layer.

A Binary step function

Activation function: A = "activated" if y > threshold, else not;
equivalently, A = 1 if y > threshold, 0 otherwise.
This function is fine for creating a binary classifier (1 or 0), but if
you want multiple such neurons to be connected to bring in more
classes (Class1, Class2, Class3, etc.), all neurons may output 1, so
we cannot decide which class to choose.
Types of Activation Functions:

Sigmoid Activation Function — (Logistic function)


A Sigmoid function is a mathematical function having a characteristic
“S”-shaped curve or sigmoid curve which ranges between 0 and 1,
therefore it is used for models where we need to predict the
probability as an output.

Sigmoid curve
Types of Activation Functions:

The Sigmoid function is differentiable, meaning we can find the slope
of the curve at any point.
The drawback of the Sigmoid activation function is that it can cause
the neural network to get stuck during training if strong negative input
is provided.
Types of Activation Functions:

Hyperbolic Tangent Function — (tanh)


It is similar to Sigmoid but better in performance. It is nonlinear in
nature, which is great because we can stack layers. The function
ranges between (-1, 1).

Hyperbolic tangent function


Types of Activation Functions:

The main advantage of this function is that strong negative inputs will
be mapped to negative outputs and only zero-valued inputs are mapped
to near-zero outputs, so the network is less likely to get stuck during
training.
Types of Activation Functions:

Rectified Linear Units — (ReLU)


ReLU is the most used activation function in CNNs and ANNs. It
ranges from zero to infinity: [0, ∞).

ReLU
Types of Activation Functions:

It gives an output 'x' if x is positive and 0 otherwise. It looks like it
has the same problem as a linear function, since it is linear in the
positive axis, but ReLU is non-linear in nature and a combination of
ReLUs is also non-linear. In fact, it is a good approximator: any
function can be approximated with a combination of ReLUs.

ReLU has been reported to give roughly a 6-times improvement in
training convergence over the hyperbolic tangent function.

It should only be applied to the hidden layers of a neural network. For
the output layer, use a softmax function for classification problems
and a linear function for regression problems.
Types of Activation Functions:

One problem here is that some gradients are fragile during training and
can die: a weight update can make a neuron never activate on any data
point again. Basically, ReLU can result in dead neurons.
To fix the problem of dying neurons, Leaky ReLU was introduced.
Leaky ReLU introduces a small slope to keep the updates alive.
Leaky ReLU ranges from -∞ to +∞.

ReLu vs Leaky ReLu


Types of Activation Functions:

The leak helps to increase the range of the ReLU function. Usually, the
value of a is around 0.01.
When a is not fixed at 0.01 but chosen at random, it is called
Randomized ReLU.
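A minimal NumPy sketch of the activation functions discussed above (the leaky slope a = 0.01 follows the text; everything else is standard):

import numpy as np

def step(y, threshold=0.0):
    return np.where(y > threshold, 1.0, 0.0)   # binary step

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))            # ranges in (0, 1)

def tanh(y):
    return np.tanh(y)                          # ranges in (-1, 1)

def relu(y):
    return np.maximum(0.0, y)                  # ranges in [0, inf)

def leaky_relu(y, a=0.01):
    return np.where(y > 0, y, a * y)           # small slope for y < 0

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z), sigmoid(z), tanh(z), relu(z), leaky_relu(z), sep="\n")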
How does the Neural network work?

Let us take the example of the price of a property and to start with we have different factors assembled in a
single row of data: Area, Bedrooms, Distance to city and Age.
How does the Neural network work?

The input values go through the weighted synapses straight over to the
output layer. All four will be analyzed, an activation function will be
applied, and the results will be produced.
This is simple enough but there is a way to amplify the power of the
Neural Network and increase its accuracy by the addition of a hidden
layer that sits between the input and output layers.

A neural network with a hidden layer (only showing non-0 values)


How does the Neural network work?

Now in the above figure, all 4 variables are connected to neurons via a
synapse. However, not all of the synapses are weighted: they will
either have a 0 value or a non-0 value.
Here, a non-0 value indicates importance, while a 0 value means the
input will be discarded.
Take the example where Area and Distance to City are non-zero for the first neuron; this means they are
weighted and matter to the first neuron. The other two variables, Bedrooms and Age, aren't weighted and so are
not considered by the first neuron.

You may wonder why that first neuron is only considering two of the
four variables. In this case, it is common on the property market that
larger homes become cheaper the further they are from the city. That's
a basic fact. So what this neuron may be doing is looking specifically
for properties that are large but are not so far from the city.
How does the Neural network work?

Now, this is where the power of neural networks comes from. There
are many of these neurons, each doing similar calculations with
different combinations of these variables.

Once this criterion has been met, the neuron applies the activation function and does its calculations. The next
neuron down may have weighted synapses for Distance to the city and Bedrooms.
This way the neurons work and interact in a very flexible way, allowing the network to look for specific things
and therefore make a comprehensive search for whatever it is trained for.
How do Neural networks learn?

Looking at an analogy may be useful in understanding the


mechanisms of a neural network. Learning in a neural network is
closely related to how we learn in our regular lives and activities —
we perform an action and are either accepted or corrected by a trainer
or coach to understand how to get better at a certain task. Similarly,
neural networks require a trainer in order to describe what should have
been produced as a response to the input. Based on the difference
between the actual value and the predicted value, an error value also
called Cost Function is computed and sent back through the system.

Cost Function: one half of the squared difference between the actual
and the predicted value, i.e., C = ½ (ŷ − y)².
How do Neural networks learn?

For each layer of the network, the cost function is analyzed and used
to adjust the threshold and weights for the next input. Our aim is to
minimize the cost function. The lower the cost function, the closer the
actual value to the predicted value. In this way, the error keeps
becoming marginally lesser in each run as the network learns how to
analyze values.
We feed the resulting data back through the entire neural network.
The weighted synapses connecting input variables to the neuron are
the only thing we have control over.
How do Neural networks learn?

As long as there exists a disparity between the actual value and the
predicted value, we need to adjust those weights. Once we tweak them
a little and run the neural network again, A new Cost function will be
produced, hopefully, smaller than the last.
We need to repeat this process until we scrub the cost function down
to as small as possible.
How do Neural networks learn?

The procedure described above is known as Back-propagation and is


applied continuously through a network until the error value is kept at
a minimum.

Back-propagation
How do Neural networks learn?

There are basically 2 ways to adjust weights: —


1. Brute-force method
2. Batch-Gradient Descent
Brute-force method
Best suited for a single-layer feed-forward network. Here you take a
number of possible weights. In this method, we want to eliminate all
the other weights except the one right at the bottom of the U-shaped
(cost) curve.
The optimal weight can be found using simple elimination techniques.
This process of elimination works if you have one weight to optimize.
But if you have a complex NN with a large number of weights, this
method fails because of the Curse of Dimensionality.
The alternative approach is called Batch Gradient Descent.
How do Neural networks learn?

Batch-Gradient Descent
It is a first-order iterative optimization algorithm and its responsibility
is to find the minimum cost value(loss) in the process of training the
model with different weights or updating weights.

Gradient Descent
How do Neural networks learn?

In Gradient Descent, instead of going through every weight one at a
time, and ticking every wrong weight off as you go, we instead look at
the slope of the cost curve.
If the slope is negative, we move the weight to the right (downhill);
if the slope is positive, we move the weight to the left. In both cases
we step in the direction that reduces the cost.
This way a vast number of incorrect weights are eliminated without
testing them all. Note, however, that batch gradient descent uses the
whole dataset for each update: if we have 3 million samples, we have
to loop through all 3 million for every single weight update.
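A minimal sketch of batch gradient descent for a single weight on a quadratic cost (the toy data, learning rate, and one-parameter model y = w·x are illustrative assumptions):

import numpy as np

# Toy dataset for a one-parameter model y = w * x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])    # the true w is 2

w, lr = 0.0, 0.05
for epoch in range(100):
    y_pred = w * X
    # Cost: mean of 1/2 * (prediction - actual)^2 over the whole batch
    grad = np.mean((y_pred - y) * X)  # dC/dw computed over ALL samples
    w -= lr * grad                    # step downhill along the slope

print(w)  # converges toward 2.0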
How do Neural networks learn?

Stochastic Gradient Descent(SGD)


Gradient Descent works fine when we have a convex curve just like in
the above figure. But if we don't have a convex curve, Gradient
Descent fails.
The word ‘stochastic‘ means a system or a process that is linked with
a random probability. Hence, in Stochastic Gradient Descent, a few
samples are selected randomly instead of the whole data set for each
iteration.
Stochastic Gradient Descent
How do Neural networks learn?

In SGD, we take one row of data at a time, run it through the neural
network, then adjust the weights. For the second row, we run it, then
compare the cost function, and then again adjust the weights. And so
on...
SGD helps us to avoid the problem of local minima. It is much faster
than Gradient Descent because it runs one row at a time and it
doesn't have to load the whole dataset in memory to do the
computation.
One thing to note is that, as SGD is generally noisier than typical
Gradient Descent, it usually takes a higher number of iterations to
reach the minima, because of the randomness in its descent. Even
though it requires a higher number of iterations to reach the minima
than typical Gradient Descent, it is still computationally much less
expensive than typical Gradient Descent. Hence, in most scenarios,
SGD is preferred over Batch Gradient Descent for optimizing a
learning algorithm.
Training ANN with Stochastic Gradient Descent

Step-1 → Randomly initialize the weights to small numbers close to 0


but not 0.
Step-2 → Input the first observation of your dataset in the input layer,
each feature in one node.
Step-3 → Forward-Propagation: From left to right, the neurons are
activated in a way that the impact of each neuron's activation is
limited by the weights. Propagate the activations until getting the
predicted value.
Step-4 → Compare the predicted result to the actual result and
measure the generated error(Cost function).
Step-5 → Back-Propagation: from right to left, the error is
backpropagated. Update the weights according to how much they are
responsible for the error. The learning rate decides how much we
update weights.
Training ANN with Stochastic Gradient Descent

Step-6 → Repeat steps 1 to 5 and update the weights after each
observation (Reinforcement Learning).
Step-7 → When the whole training set has passed through the ANN,
that makes an epoch. Redo more epochs.
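A minimal sketch of this per-observation (stochastic) update loop for the one-weight model sketched above, following Steps 1-7 (the initial weight range, learning rate, and data are illustrative):

import random

# Toy data for the one-parameter model y = w * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w, lr = random.uniform(-0.1, 0.1), 0.05   # Step 1: small random weight
for epoch in range(50):                    # Step 7: redo more epochs
    random.shuffle(data)
    for x, target in data:                 # Step 2: one observation at a time
        y_pred = w * x                     # Step 3: forward propagation
        error = y_pred - target            # Step 4: measure the error
        w -= lr * error * x                # Steps 5-6: update after EACH row

print(w)  # converges toward 2.0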
Clustering in Machine Learning

It is basically a type of unsupervised learning method. An
unsupervised learning method is a method in which we draw
references from datasets consisting of input data without labelled
responses. Generally, it is used as a process to find meaningful
structure, explanatory underlying processes, generative features, and
groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a
number of groups such that data points in the same group are more
similar to each other than to the data points in other groups. It is
basically a collection of objects on the basis of similarity and
dissimilarity between them.
Example of Clustering

For ex– The data points in the graph below clustered together can be
classified into one single group. We can distinguish the clusters, and
we can identify that there are 3 clusters in the below picture.
DBSCAN: Density-based Spatial Clustering of Applications with Noise

These data points are clustered by using the basic concept that each data point lies
within a given constraint from the cluster centre. Various distance methods and
techniques are used for the calculation of outliers.

Why Clustering?

Clustering is very important as it determines the intrinsic grouping among the
unlabelled data present. There are no universal criteria for a good clustering; it
depends on the user and what criteria they may use to satisfy their need. For
instance, we could be interested in finding representatives for homogeneous groups
(data reduction), in finding "natural clusters" and describing their unknown
properties ("natural" data types), in finding useful and suitable groupings ("useful"
data classes) or in finding unusual data objects (outlier detection). The algorithm
must make some assumptions about what constitutes the similarity of points, and
each assumption makes different and equally valid clusters.
Clustering Methods :

Density-Based Methods: These methods consider the clusters as the
dense region having some similarity, different from the lower-density
region of the space. These methods have good accuracy and the
ability to merge two clusters. Examples: DBSCAN (Density-Based
Spatial Clustering of Applications with Noise), OPTICS (Ordering
Points To Identify Clustering Structure), etc.
Hierarchical Based Methods: The clusters formed in this method
form a tree-type structure based on the hierarchy. New clusters are
formed using the previously formed ones. It is divided into two
categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
Examples: CURE (Clustering Using REpresentatives), BIRCH
(Balanced Iterative Reducing and Clustering using Hierarchies), etc.
Other Methods

Partitioning Methods: These methods partition the objects into k
clusters and each partition forms one cluster. This method is used to
optimize an objective criterion similarity function, such as when
distance is a major parameter. Examples: K-means, CLARANS
(Clustering Large Applications based upon RANdomized Search), etc.

Grid-based Methods: In this method the data space is formulated
into a finite number of cells that form a grid-like structure. All the
clustering operations done on these grids are fast and independent of
the number of data objects. Examples: STING (STatistical INformation
Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
Clustering Algorithms :

K-means clustering algorithm – It is the simplest unsupervised
learning algorithm that solves the clustering problem. The K-means
algorithm partitions n observations into k clusters, where each
observation belongs to the cluster whose mean serves as the
prototype of the cluster.
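A minimal NumPy sketch of the K-means iteration just described: assign each observation to the nearest mean, then recompute the means (the toy data and fixed iteration count are illustrative, and this sketch has no empty-cluster handling):

import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # initial means
    for _ in range(iters):
        # Assign each point to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean as the centroid of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 9], [9, 8]])
labels, centers = kmeans(X, k=2)
print(labels, centers, sep="\n")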
Applications of Clustering in different fields

Marketing : It can be used to characterize & discover customer


segments for marketing purposes.
Biology : It can be used for classification among different species of
plants and animals.
Libraries : It is used in clustering different books on the basis of
topics and information.
Insurance : It is used to acknowledge the customers, their policies
and identifying the frauds.
City Planning: It is used to make groups of houses and to study their
values based on their geographical locations and other factors present.
Earthquake studies: By learning the earthquake-affected areas we
can determine the dangerous zones.
Reinforcement Learning

Reinforcement learning is an area of Machine Learning. It is about taking suitable
action to maximize reward in a particular situation. It is employed by various
software and machines to find the best possible behavior or path to take in a
specific situation. Reinforcement learning differs from supervised learning in that
in supervised learning the training data has the answer key with it, so the
model is trained with the correct answer itself, whereas in reinforcement learning
there is no answer: the reinforcement agent decides what to do to perform the
given task. In the absence of a training dataset, it is bound to learn from its
experience.

Example: The problem is as follows: we have an agent and a reward, with many
hurdles in between. The agent is supposed to find the best possible path to reach the
reward. The following example illustrates the problem.
Example: Reinforcement Learning

The above image shows a robot, a diamond, and fire. The goal of the robot is to
get the reward, the diamond, while avoiding the hurdles, the fire. The robot
learns by trying all the possible paths and then choosing the path which gives it
the reward with the fewest hurdles. Each right step gives the robot a reward and
each wrong step subtracts from the robot's reward. The total reward is
calculated when it reaches the final reward, the diamond.
Main points in Reinforcement learning –

Input: The input should be an initial state from which the model will
start.
Output: There are many possible outputs, as there are a variety of
solutions to a particular problem.
Training: The training is based upon the input. The model will return a
state and the user will decide to reward or punish the model based on
its output.
The model continues to learn.
The best solution is decided based on the maximum reward.
Supervised Learning vs Reinforcement Learning
Types of Reinforcement: There are two types of
Reinforcement:
Positive –
Positive Reinforcement occurs when an event, occurring due to a particular behavior,
increases the strength and frequency of that behavior. In other words, it has a positive
effect on behavior. Advantages of positive reinforcement learning are:
Maximizes performance
Sustains change for a long period of time
Disadvantage of positive reinforcement learning:
Too much reinforcement can lead to an overload of states, which can diminish the results
Negative –
Negative Reinforcement is defined as the strengthening of a behavior because a negative
condition is stopped or avoided. Advantages of negative reinforcement learning:
Increases behavior
Provides defiance to a minimum standard of performance
Disadvantage of negative reinforcement learning:
It only provides enough to meet the minimum behavior
Various Practical applications of Reinforcement
Learning –

RL can be used in robotics for industrial automation.


RL can be used in machine learning and data processing
RL can be used to create training systems that provide custom
instruction and materials according to the requirement of students.
RL can be used in large environments in the following situations:
A model of the environment is known, but an analytic solution is not
available;
Only a simulation model of the environment is given (the subject of
simulation-based optimization)
The only way to collect information about the environment is to
interact with it.
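Although Q-learning is covered in detail in Unit 5, a minimal sketch of its core update rule illustrates how an agent learns from reward alone (the tiny grid world, reward values, and hyperparameters here are illustrative assumptions, not from the original notes):

import random

# Tiny 1-D world: states 0..4, reward only at the rightmost state
N_STATES, ACTIONS = 5, [-1, +1]          # move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly exploit, sometimes explore (hit and trial)
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Core update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# After training, the greedy action in every state should be +1 (move right)
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)})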
Decision Tree Learning

Decision tree learning is one of the predictive modelling approaches


used in statistics, data mining and machine learning. It uses a decision
tree (as a predictive model) to go from observations about an item
(represented in the branches) to conclusions about the item's target
value (represented in the leaves). Tree models where the target
variable can take a discrete set of values are called classification
trees; in these tree structures, leaves represent class labels and
branches represent conjunctions of features that lead to those class
labels. Decision trees where the target variable can take continuous
values (typically real numbers) are called regression trees. Decision
trees are among the most popular machine learning algorithms given
their intelligibility and simplicity.
In decision analysis, a decision tree can be used to visually and
explicitly represent decisions and decision making.
[Figure: A tree showing survival of passengers on the Titanic ("sibsp" is the
number of spouses or siblings aboard). The figures under the leaves show the
probability of survival and the percentage of observations in the leaf.
Summarizing: your chances of survival were good if you were (i) a female or
(ii) a male younger than 9.5 years with strictly fewer than 3 siblings.]
Decision Tree Learning

• Decision tree learning is a method for approximating


discrete-valued target functions.
• The learned function is represented by a decision tree.
– A learned decision tree can also be re-represented as a set of if-then rules.
• Decision tree learning is one of the most widely used and
practical methods for inductive inference.
• It is robust to noisy data and capable of learning disjunctive
expressions.
• Decision tree learning methods search a completely
expressive hypothesis space.
– Avoids the difficulties of restricted hypothesis spaces.
– Its inductive bias is a preference for small trees over large trees.
• Decision tree algorithms such as ID3 and C4.5 are very
popular inductive inference algorithms, and they have been
successfully applied to many learning tasks.
Decision Tree for PlayTennis

Outlook
├── Sunny → Humidity
│     ├── High → No
│     └── Normal → Yes
├── Overcast → Yes
└── Rain → Wind
      ├── Strong → No
      └── Weak → Yes
Decision Tree
• Decision trees represent a disjunction of conjunctions of constraints
on the attribute values of instances.
• Each path from the tree root to a leaf corresponds to a conjunction
of attribute tests, and
• The tree itself is a disjunction of these conjunctions.
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)

(Each path to a "Yes" leaf of the PlayTennis tree above contributes one
conjunction.)
Decision Tree
• Decision trees classify instances by sorting them down the tree from the
root to some leaf node, which provides the classification of the
instance.
• Each node in the tree specifies a test of some attribute of the instance.
• Each branch descending from a node corresponds to one of the
possible values for the attribute.
• Each leaf node assigns a classification.
• The instance
(Outlook=Sunny, Temperature=Hot, Humidity=High,
Wind=Strong) is classified as a negative instance.
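A minimal sketch of how such a tree sorts an instance down from the root to a leaf, using a nested-dict encoding of the PlayTennis tree above (the encoding itself is an illustrative choice):

# PlayTennis decision tree as nested dicts: attribute -> {value: subtree/label}
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(node, instance):
    """Sort the instance down the tree until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(tree, instance))  # "No" -- a negative instance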
When to Consider Decision Trees
• Instances are represented by attribute-value pairs.
– Fixed set of attributes, and the attributes take a small number of disjoint possible values.
• The target function has discrete output values.
– Decision tree learning is appropriate for a boolean classification, but it easily extends to learning
functions with more than two possible output values.
• Disjunctive descriptions may be required.
– decision trees naturally represent disjunctive expressions.
• The training data may contain errors.
– Decision tree learning methods are robust to errors, both errors in classifications of the training
examples and errors in the attribute values that describe these examples.
• The training data may contain missing attribute values.
– Decision tree methods can be used even when some training examples have unknown values.
• Decision tree learning has been applied to problems such as learning to classify
– medical patients by their disease,
– equipment malfunctions by their cause, and
– loan applicants by their likelihood of defaulting on payments.
Advantages

It can be used for both classification and regression problems.
Decision trees are simple, so they require less effort to understand
the algorithm.
It can capture nonlinear relationships: decision trees can be used to
classify non-linearly separable data.
They are very fast and efficient compared to KNN and other
classification algorithms.
• Easy to understand, interpret, and visualize.
• Decision trees can handle any type of data, whether it is numerical,
categorical, or boolean.
• Normalization is not required for a decision tree.
• Useful in data exploration.
Disadvantages

Splitting on numerical variables over millions of records: finding
split points takes more time, so training-time complexity increases as
the input grows.
Growing the tree fully from the training set leads to overfitting;
remedies include pruning (pre- and post-pruning) and ensemble
methods such as random forests.
Overfitting: overfitting is one of the most difficult problems for
decision tree models.
Issues of DT

How deeply to grow the decision tree, handling continuous attributes,
choosing an appropriate attribute selection measure, handling training
data with missing attribute values, handling attributes with differing
costs, and improving computational efficiency.
Avoiding Overfitting the Data
There are several approaches to avoiding overfitting in decision tree
learning. These can be grouped into two classes:
- Pre-pruning (avoidance): Stop growing the tree earlier, before it
reaches the point where it perfectly classifies the training data
- Post-pruning (recovery): Allow the tree to overfit the data, and then
post-prune the tree
Underfitting
Criterion used to determine the correct final tree size
Reduced Error Pruning
Top-Down Induction of Decision Trees -- ID3
1. A ← the “best” decision attribute for next node
2. Assign A as decision attribute for node
3. For each value of A create new descendant node
4. Sort training examples to leaf node according to the attribute value
of the branch
5. If all training examples are perfectly classified (same value of
target attribute) STOP,
else iterate over new leaf nodes.
Which Attribute is ”best”?
• We would like to select the attribute that is most useful for
classifying examples.
• Information gain measures how well a given attribute separates
the training examples according to their target classification.
• ID3 uses this information gain measure to select among the
candidate attributes at each step while growing the tree.
• In order to define information gain precisely, we use a
measure commonly used in information theory, called
entropy
• Entropy characterizes the (im)purity of an arbitrary collection
of examples.
Entropy
• Given a collection S, containing positive and negative examples
of some target concept, the entropy of S relative to this boolean
classification is:

Entropy(S) = −p+ log2 p+ − p− log2 p−

• S is a sample of training examples


• p+ is the proportion of positive examples
• p- is the proportion of negative examples
Entropy
ID3 - Training Examples – [9+,5-]
Day Outlook Temp. Humidity Wind Play Tennis

D1 Sunny Hot High Weak No


D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
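A minimal sketch that computes the entropy of this [9+,5−] sample and the information gain of an attribute, matching the formula Entropy(S) = −p+ log2 p+ − p− log2 p− (the table is encoded directly from the slide):

from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    n, remainder = len(labels), 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# PlayTennis data: (Outlook, Temp, Humidity, Wind) -> PlayTennis
rows = [("Sunny","Hot","High","Weak"), ("Sunny","Hot","High","Strong"),
        ("Overcast","Hot","High","Weak"), ("Rain","Mild","High","Weak"),
        ("Rain","Cool","Normal","Weak"), ("Rain","Cool","Normal","Strong"),
        ("Overcast","Cool","Normal","Weak"), ("Sunny","Mild","High","Weak"),
        ("Sunny","Cool","Normal","Weak"), ("Rain","Mild","Normal","Strong"),
        ("Sunny","Mild","Normal","Strong"), ("Overcast","Mild","High","Strong"),
        ("Overcast","Hot","Normal","Weak"), ("Rain","Mild","High","Strong")]
labels = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]

print(entropy(labels))              # ~0.940 for the [9+,5-] sample
print(info_gain(rows, labels, 0))   # Gain(S, Outlook) ~ 0.246 -> ID3's root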
Inductive Bias in ID3
• ID3 search strategy
– selects in favor of shorter trees over longer ones,
– selects trees that place the attributes with highest information gain closest to the
root.
– because ID3 uses the information gain heuristic and a hill climbing strategy, it
does not always find the shortest consistent tree, and it is biased to favor trees
that place attributes with high information gain closest to the root.

Inductive Bias of ID3:


– Shorter trees are preferred over longer trees.
– Trees that place high information gain attributes close to the
root are preferred over those that do not.
Overfitting
Given a hypothesis space H, a hypothesis h∈H is said to OVERFIT the
training data if there exists some alternative hypothesis h'∈H, such
that h has smaller error than h' over the training examples, but h' has a
smaller error than h over the entire distribution of instances.

Reasons for overfitting:


– Errors and noise in training examples
– Coincidental regularities (especially small number of examples
are associated with leaf nodes).
Overfitting

• As ID3 adds new nodes to grow the decision tree, the accuracy of the tree measured
over the training examples increases monotonically.
• However, when measured over a set of test examples independent of the
training examples, accuracy first increases, then decreases.
Avoid Overfitting
How can we avoid overfitting?
– Stop growing when data split not statistically significant
• stop growing the tree earlier, before it reaches the point where it perfectly
classifies the training data
– Grow full tree then post-prune
• allow the tree to overfit the data, and then post-prune the tree.
• Whether the correct tree size is found by stopping early or by
post-pruning, a key question is what criterion should be used to
determine the correct final tree size:
– Use a separate set of examples, distinct from the training examples, to evaluate the
utility of post-pruning nodes from the tree.
– Use all the available data for training, but apply a statistical test to estimate
whether expanding a particular node is likely to produce an improvement beyond
the training set (e.g., a chi-square test).
Avoid Overfitting - Reduced-Error Pruning
• Split data into training and validation set
• Do until further pruning is harmful:
– Evaluate impact on validation set of pruning each possible
node (plus those below it)
– Greedily remove the one that most improves the validation
set accuracy
• Pruning of nodes continues until further pruning is harmful
(i.e., decreases accuracy of the tree over the validation set).
• Using a separate set of data to guide pruning is an effective
approach provided a large amount of data is available.
– The major drawback of this approach is that when data is limited, withholding part
of it for the validation set reduces even further the number of examples available
for training.
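
A rough sketch of this idea, applied to the nested-dict trees grown by the
id3 sketch earlier (a simplification, not reduced-error pruning exactly as
stated above: it prunes bottom-up rather than greedily picking the single
best node, and it uses the validation majority label as the candidate leaf;
all names are illustrative):

from collections import Counter

def classify(tree, example, default):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example[attr], default)
    return tree

def accuracy(tree, val_x, val_y, default):
    hits = sum(classify(tree, ex, default) == lab for ex, lab in zip(val_x, val_y))
    return hits / len(val_y)

def prune(tree, val_x, val_y, default):
    # Replace a subtree by a leaf whenever that does not hurt validation accuracy
    if not isinstance(tree, dict):
        return tree
    attr = next(iter(tree))
    for value, sub in tree[attr].items():
        tree[attr][value] = prune(sub, val_x, val_y, default)
    leaf = Counter(val_y).most_common(1)[0][0]
    if accuracy(leaf, val_x, val_y, default) >= accuracy(tree, val_x, val_y, default):
        return leaf
    return tree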
Main Points with Decision Tree Learning
• Decision tree learning provides a practical method for concept
learning and for learning other discrete-valued functions.
– decision trees are inferred by growing them from the root downward, greedily
selecting the next best attribute.
• ID3 searches a complete hypothesis space.
• The inductive bias in ID3 includes a preference for smaller trees.
• Overfitting training data is an important issue in decision tree learning.
– Pruning decision trees (or rules derived from them) is important.
• A large variety of extensions to the basic ID3 algorithm has
been developed. These extensions include methods for
– post-pruning trees, handling real-valued attributes, accommodating training
examples with missing attribute values, using attribute selection measures other
than information gain, and considering costs associated with instance attributes.
Bayesian Networks

Bayesian networks are a type of Probabilistic Graphical Model that
can be used to build models from data and/or expert opinion.
They can be used for a wide range of tasks including prediction,
anomaly detection, diagnostics, automated insight, reasoning, time
series prediction and decision making under uncertainty. These
capabilities span the four major analytics disciplines: descriptive,
diagnostic, predictive and prescriptive analytics.
Descriptive, diagnostic, predictive & prescriptive
analytics with Bayesian networks

Bayesian Networks

They are also commonly referred to as Bayes nets, Belief networks
and sometimes Causal networks.

A Bayes net is a model. It reflects the states of some part of a world
that is being modeled, and it describes how those states are related by
probabilities. The model might be of your house, or your car, your
body, your community, an ecosystem, a stock market, etc. Absolutely
anything can be modeled by a Bayes net. All the possible states of the
model represent all the possible worlds that can exist, that is, all the
possible ways that the parts or states can be configured. The car
engine can be running normally or giving trouble. Its tires can be
inflated or flat. Your body can be sick or healthy, and so on.
Bayesian Networks

So where do the probabilities come in? Well, typically some states
will tend to occur more frequently when other states are present. Thus,
if you are sick, the chances of a runny nose are higher. If it is cloudy,
the chances of rain are higher, and so on.
Here is a simple Bayes net that illustrates these concepts. In this
simple world, let us say the weather can have three states: sunny,
cloudy, or rainy, also that the grass can be wet or dry, and that the
sprinkler can be on or off. Now there are some causal links in this
world. If it is rainy, then it will make the grass wet directly. But if it is
sunny for a long time, that too can make the grass wet, indirectly, by
causing us to turn on the sprinkler.
Bayesian Networks

When actual probabilities are entered into this net that reflect the
reality of real weather, lawn, and sprinkler-use-behavior, such a net
can be made to answer a number of useful questions, like, "if the lawn
is wet, what are the chances it was caused by rain or by the sprinkler",
and "if the chance of rain increases, how does that affect my having to
budget time for watering the lawn".
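
A query like the first one can be answered by brute-force enumeration.
Below is a tiny self-contained Python sketch of that computation; every
probability number in it is made up for illustration and is not from
the slides:

P_rain = 0.2
P_sprinkler_given_rain = {True: 0.01, False: 0.4}
P_wet_given = {  # P(grass wet | sprinkler, rain)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint_with_wet(rain):
    # P(rain, grass wet), summing out the sprinkler
    p_r = P_rain if rain else 1 - P_rain
    total = 0.0
    for s in (True, False):
        p_s = P_sprinkler_given_rain[rain] if s else 1 - P_sprinkler_given_rain[rain]
        total += p_r * p_s * P_wet_given[(s, rain)]
    return total

p_wet = joint_with_wet(True) + joint_with_wet(False)
print(joint_with_wet(True) / p_wet)   # P(rain | grass wet), about 0.36 here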
Here is another simple Bayes net called Asia. It is a popular example
for introducing Bayes nets, taken from Lauritzen & Spiegelhalter
(1988). Note, it is for example purposes only, and should not be used
for real decision making.
Bayesian Networks

It is a simplified version of a network that could be used to diagnose patients
arriving at a clinic. Each node in the network corresponds to some condition of the
patient; for example, "Visit to Asia" indicates whether the patient recently visited
Asia. The arrows (also called links) between any two nodes indicate that there are
probability relationships that are known to exist between the states of those two
nodes. Thus, smoking increases the chances of getting lung cancer and of getting
bronchitis. Both lung cancer and bronchitis increase the chances of getting dyspnea
(shortness of breath). Both lung cancer and tuberculosis, but not usually bronchitis,
can cause an abnormal lung x-ray. And so on.
Bayesian Networks

The direction of the link arrows roughly corresponds to "causality": the
nodes higher up in the diagram tend to influence those below rather than,
or at least more so than, the other way around.
In a Bayes net, the links may form loops, but they may not form cycles. This is not
an expressive limitation; it does not limit the modeling power of these nets. It only
means we must be more careful in building our nets. In the left diagram below, there
are numerous loops. These are fine. In the right diagram, the addition of the link
from D to B creates a cycle, which is not permitted.

(Left figure: a valid Bayes net, with loops. Right figure: not a Bayes
net, because it contains a cycle.)
Bayesian Networks

The key advantage of not allowing cycles is that it makes possible
very fast update algorithms, since there is no way for probabilistic
influence to "cycle around" indefinitely.
To diagnose a patient, values could be entered for some of the nodes
when they are known. This would allow us to re-calculate the
probabilities for all the other nodes. Thus if we take a chest x-ray and
the x-ray is abnormal, then the chances of the patient having TB or
lung-cancer rise. If we further learn that our patient visited Asia, then
the chances that they have tuberculosis would rise further, and of
lung-cancer would drop (since the X-ray is now better explained by
the presence of TB than of lung-cancer).
Support Vector Machines

Support Vector Machine (SVM) is a relatively simple supervised
machine learning algorithm used for classification and/or regression.
It is preferred for classification, but is sometimes very useful for
regression as well. Basically, SVM finds a hyperplane that creates a
boundary between the types of data. In 2-dimensional space, this
hyperplane is nothing but a line.
In SVM, we plot each data item in the dataset in an N-dimensional
space, where N is the number of features/attributes in the data, and
then find the optimal hyperplane to separate the data. Note that,
inherently, SVM can only perform binary classification (i.e., choose
between two classes); however, there are various techniques for
handling multi-class problems.
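
A minimal sketch of a linear SVM in Python using scikit-learn (this
assumes scikit-learn is available; the tiny dataset is made up for
illustration):

from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [0, 1], [1, 2], [2, 3]]  # 2-D feature vectors
y = [0, 0, 0, 1, 1, 1]                                # two classes

clf = svm.SVC(kernel="linear")    # find the separating hyperplane (a line in 2-D)
clf.fit(X, y)
print(clf.predict([[0.5, 1.5]]))  # expected: [1]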
Support Vector Machine for Multi-Class
Problems

To perform SVM on multi-class problems, we can create a binary
classifier for each class of the data. The two results of each classifier
will be:
• The data point belongs to that class, OR
• The data point does not belong to that class.
For example, in a class of fruits, to perform multi-class classification,
we can create a binary classifier for each fruit. Say, for the ‘mango’
class, there will be a binary classifier to predict if it IS a mango OR it
is NOT a mango. The classifier with the highest score is chosen as the
output of the SVM.
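
This one-vs-rest scheme can be sketched with scikit-learn (assuming
scikit-learn; the feature vectors and fruit labels below are invented
for illustration):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = [[1, 0], [0, 1], [1, 1], [5, 5], [6, 5], [9, 1], [9, 2]]
y = ["mango", "mango", "mango", "apple", "apple", "banana", "banana"]

# One binary SVC per class: "is a mango" vs "is not a mango", and so on;
# the class whose classifier scores highest wins.
clf = OneVsRestClassifier(SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict([[5.5, 5.0]]))  # expected: ['apple']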
SVM for Complex (Non-Linearly Separable) Data

SVM works very well without any modifications for linearly
separable data. Linearly separable data is any data that can be
plotted in a graph and can be separated into classes using a straight
line.

(Figure A: linearly separable data. Figure B: non-linearly separable data.)
Kernelized SVM

We use kernelized SVM for non-linearly separable data. Say we
have some non-linearly separable data in one dimension. We can
transform this data into two dimensions, and the data will become
linearly separable in two dimensions. This is done by mapping each
1-D data point to a corresponding 2-D ordered pair.
So for any non-linearly separable data in any dimension, we can just
map the data to a higher dimension and then make it linearly
separable. This is a very powerful and general transformation.
A kernel is nothing but a measure of similarity between data points.
The kernel function in a kernelized SVM tells you, given two data
points in the original feature space, what the similarity is between the
points in the newly transformed feature space.
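
The 1-D to 2-D idea can be shown in a few lines (a sketch under an
assumed feature map x -> (x, x^2); the data is invented for illustration):

import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])  # class 1 at the extremes: no single
                                     # threshold separates the classes in 1-D

phi = np.column_stack([x, x ** 2])   # map each point to the ordered pair (x, x^2)
# In 2-D, the horizontal line x2 = 2.5 separates the classes: class 1 iff x^2 > 2.5
pred = (phi[:, 1] > 2.5).astype(int)
print((pred == y).all())             # True: linearly separable after the mapping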
Advantages of SVM

• Works well when there is a clear margin of separation between
classes.
• Effective in high-dimensional spaces.
• Effective in instances where the number of dimensions is larger than
the number of samples.
• Memory efficient, since the decision function uses only a subset of
the training points (the support vectors).
Disadvantages of SVM

• Not suitable for large datasets, as training can be slow.
• Does not perform very well when the dataset has more noise, i.e.
when the target classes are overlapping.
• Can suffer from overfitting or underfitting if parameters are poorly
chosen.
• There is no direct probabilistic interpretation of the classification.
Applications

• Face detection.
• Text and hypertext categorization.
• Bioinformatics.
• Handwriting recognition.
• Generalized predictive control.
Genetic Algorithms

Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the
larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of
natural selection and genetics. They are an intelligent exploitation of random search,
provided with historical data to direct the search into the region of better
performance in the solution space. They are commonly used to generate high-quality
solutions for optimization problems and search problems.

Genetic algorithms simulate the process of natural selection, in which those
species that can adapt to changes in their environment are able to survive,
reproduce and go on to the next generation. In simple words, they simulate "survival
of the fittest" among individuals of consecutive generations while solving a problem.
Each generation consists of a population of individuals, and each individual
represents a point in the search space and a possible solution. Each individual is
represented as a string of characters/integers/floats/bits. This string is analogous to
the chromosome.
Foundation of Genetic Algorithms

Genetic algorithms are based on an analogy with the genetic structure and
behavior of chromosomes in a population. The foundation of GAs, based
on this analogy, is as follows:
• Individuals in a population compete for resources and mates.
• Those individuals who are successful (fittest) then mate to create more
offspring than others.
• Genes from the "fittest" parents propagate through the generation; that
is, sometimes parents create offspring that are better than either
parent.
• Thus each successive generation is better suited to its environment.
Search space

The population of individuals is maintained within the search space.
Each individual represents a solution in the search space for the given
problem. Each individual is coded as a finite-length vector (analogous
to a chromosome) of components. These variable components are
analogous to genes. Thus a chromosome (individual) is composed of
several genes (variable components).
Fitness Score

A fitness score is given to each individual, and it shows the ability of an
individual to "compete". Individuals having an optimal (or near-optimal)
fitness score are sought.
The GA maintains a population of n individuals (chromosomes/solutions) along
with their fitness scores. The individuals having better fitness scores are given more
chance to reproduce than others: they are selected to mate and produce better
offspring by combining the chromosomes of the parents. The population size is
static, so room has to be created for new arrivals. So, some individuals die and get
replaced by new arrivals, eventually creating a new generation once all the mating
opportunities of the old population are exhausted. It is hoped that over successive
generations better solutions will arrive while the least fit die.
Each new generation has, on average, more "good genes" than the individuals
(solutions) of previous generations; thus each new generation has better "partial
solutions" than previous generations. Once the offspring produced are not
significantly different from the offspring produced by previous populations, the
population has converged, and the algorithm is said to have converged to a set of
solutions for the problem.
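
The whole cycle (selection by fitness, crossover, mutation, replacement)
fits in a short sketch. Below is a minimal, self-contained Python example
that evolves bit-strings to maximize the number of 1-bits; all names and
parameter values are illustrative:

import random

POP_SIZE, GENES, GENERATIONS = 20, 16, 50

def fitness(chrom):
    return sum(chrom)                    # fitness score: count of 1-bits

def select(pop):
    # Fitter individuals get more chances to reproduce (tournament selection)
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):
    cut = random.randint(1, GENES - 1)   # single-point crossover
    return p1[:cut] + p2[cut:]

def mutate(chrom, rate=0.05):
    return [g ^ 1 if random.random() < rate else g for g in chrom]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP_SIZE)]
print(max(map(fitness, pop)))            # best fitness found after evolution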
Advantages of Genetic Algorithm

• They are robust, probabilistic in nature, and require relatively little
problem-specific information.
• They provide optimization over large search spaces.
• Unlike traditional AI systems, they do not break down on slight
changes in the input or in the presence of noise.
• Global optimization.
• Parallelism.
Disadvantages of Genetic Algorithm

• Can be slow, as many fitness evaluations are needed.
• Expensive to implement.
• Difficult to understand.
• Difficult to debug.
• Sometimes difficult to tune.
Applications of Genetic Algorithm

• Recurrent neural networks.
• Mutation testing.
• Code breaking and debugging.
• Filtering and signal processing.
• Learning fuzzy rule bases, etc.
Issues in Machine Learning

• What algorithms exist for learning general target functions from
specific training examples?

• How does the number of training examples influence accuracy?

• When and how can prior knowledge held by the learner guide
the process of generalizing from examples?
Issues in Machine Learning (cont.)

• What is the best strategy for choosing a useful next training
experience, and how does the choice of this strategy alter the
complexity of the learning problem?

• What is the best way to reduce the learning task to one or
more function approximation problems?

• How can the learner automatically alter its representation
to improve its ability to represent and learn the target
function?
Data Science

Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from structured and unstructured data. Data science is related
to data mining, machine learning and big data.

Data science continues to evolve as one of the most promising and
in-demand career paths for skilled professionals. Today, successful
data professionals understand that they must advance past the
traditional skills of analyzing large amounts of data, data mining, and
programming. In order to uncover useful intelligence for their
organizations, data scientists must master the full spectrum of the data
science life cycle and possess a level of flexibility and understanding
to maximize returns at each phase of the process.
Common Disciplines of a Data Scientist
The Data Science Life Cycle

The data science life cycle has five stages:
• Capture: data acquisition, data entry, signal reception, data extraction.
• Maintain: data warehousing, data cleansing, data staging, data processing,
data architecture.
• Process: data mining, clustering/classification, data modeling, data
summarization.
• Analyze: exploratory/confirmatory analysis, predictive analysis, regression,
text mining, qualitative analysis.
• Communicate: data reporting, data visualization, business intelligence,
decision making.
Data Science vs Machine Learning
In short: data science is the broad, interdisciplinary practice of
extracting insights from data, while machine learning is one of its
core tools, focused on algorithms that learn patterns from data.
THANK YOU
