Machine Learning Unit1
Techniques [KAI-601]
CO-PO (Course Outcome- Program Outcome)
KAI601.1 (CO1): Define [L1: Remember] the characteristics of machine learning that make it useful to real-world problems.
KAI601.4 (CO4): Analyze [L4: Analysis] the working of machine learning and neural network with deep learning algorithms and models.
Text books:
Supervised Learning is learning that can be thought of as guided by a
teacher. A labeled dataset acts as the teacher, and its role is to train
the model or the machine. Once the model is trained, it can make a
prediction or decision when new data is given to it.
Block Diagram of Supervised Learning
What is Unsupervised Learning?
The model learns through observation and finds structures in the data.
Once the model is given a dataset, it automatically finds patterns and
relationships in the dataset by creating clusters in it. What it cannot do
is add labels to the clusters: it cannot say that this is a group of apples or
mangoes, but it will separate all the apples from the mangoes.
Semi-supervised machine learning uses both unlabeled and labeled data sets to train
algorithms. Generally, during semi-supervised machine learning, algorithms are first
fed a small amount of labeled data to help direct their development and then fed
much larger quantities of unlabeled data to complete the model.
For example, an algorithm may be fed a smaller quantity of labeled speech data and
then trained on a much larger set of unlabeled speech data in order to create a
machine learning model capable of speech recognition.
• Definition:
A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.
Well-Posed Learning Problems : Examples
Well-Posed Learning Problems : Examples (cont.)
Designing a Learning System
Choosing the Training Experience
Choosing the Training Experience (cont.)
Choosing the Training Experience (cont.)
Choosing the Target Function
Choosing the Target Function (cont.)
• Adjusting the weights: Specify the learning algorithm for choosing the
weights wi that best fit the set of training examples {<b, Vtrain(b)>}, i.e. that
minimize the squared error E between the training values and the values
predicted by the hypothesis V'.
• E = \sum_{\langle b, V_{train}(b) \rangle \in \text{training examples}} \left( V_{train}(b) - V'(b) \right)^2
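The squared error above can be reduced step by step. The sketch below assumes a linear hypothesis V'(b) = w0 + w1·x1 + w2·x2 and the LMS (least mean squares) update rule; neither is spelled out on this slide. The feature vectors, training values and learning rate `eta` are made-up illustration values.

```python
def v_hat(weights, features):
    """Linear hypothesis V'(b) = w0 + sum_i w_i * x_i (assumed form)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def squared_error(weights, examples):
    """E = sum over training examples of (V_train(b) - V'(b))^2."""
    return sum((v_train - v_hat(weights, x)) ** 2 for x, v_train in examples)


def lms_epoch(weights, examples, eta=0.01):
    """One pass of the LMS rule: w_i <- w_i + eta * (V_train(b) - V'(b)) * x_i."""
    weights = list(weights)
    for x, v_train in examples:
        error = v_train - v_hat(weights, x)
        weights[0] += eta * error              # x_0 is implicitly 1
        for i, x_i in enumerate(x, start=1):
            weights[i] += eta * error * x_i
    return weights


# Hypothetical (features, V_train) pairs; not taken from the slides.
examples = [((3.0, 0.0), 100.0), ((0.0, 3.0), -100.0), ((1.0, 1.0), 0.0)]
weights = [0.0, 0.0, 0.0]
initial_error = squared_error(weights, examples)
for _ in range(200):
    weights = lms_epoch(weights, examples, eta=0.01)
final_error = squared_error(weights, examples)
print(initial_error, final_error)
```

Each update nudges every weight in proportion to the error on one example and to the feature value, so features that contributed more to the error are corrected more.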
Choosing a Function Approximation Algorithm (cont.)
• E = \sum_{\langle b, V_{train}(b) \rangle \in \text{training examples}} \left( V_{train}(b) - V'(b) \right)^2
[Figure: final design of the learning system. Recoverable labels: Experiment Generator, Performance System, Generalizer; New problem (initial game board); Hypothesis (V′).]
[Figure: summary of design choices. Target function: Board → move, Board → value, …; Determine Representation of Learned Function: Polynomial, Linear function of six features, Artificial neural network, …; Determine Learning Algorithm: Gradient descent, Linear programming, …; Completed Design.]
History of ML
In the late 1970s and early 1980s, Artificial Intelligence research had
focused on using logical, knowledge-based approaches rather than
algorithms. Additionally, neural network research was abandoned by
computer science and AI researchers. This caused a schism between
Artificial Intelligence and Machine Learning. Until then, Machine
Learning had been used as a training program for AI.
Speech Recognition
Currently, much of speech recognition training is being done by a
Deep Learning technique called Long Short-Term Memory (LSTM), a
neural network model described by Jürgen Schmidhuber and Sepp
Hochreiter in 1997. LSTM can learn tasks that require memory of
events that took place thousands of discrete steps earlier, which is
quite important for speech.
Around the year 2007, Long Short-Term Memory started
outperforming more traditional speech recognition programs. In 2015,
the Google speech recognition program reportedly had a significant
performance jump of 49 percent using a CTC-trained LSTM.
Facial Recognition Becomes a Reality
In 2006, the Face Recognition Grand Challenge – a National Institute
of Standards and Technology program – evaluated the popular face
recognition algorithms of the time. 3D face scans, iris images, and
high-resolution face images were tested. Their findings suggested the
new algorithms were ten times more accurate than the facial
recognition algorithms from 2002 and 100 times more accurate than
those from 1995. Some of the algorithms were able to outperform
human participants in recognizing faces and could uniquely identify
identical twins.
In 2012, Google’s X Lab developed an ML algorithm that can
autonomously browse and find videos containing cats. In 2014,
Facebook developed DeepFace, an algorithm capable of recognizing
or verifying individuals in photographs with the same accuracy as
humans.
Machine Learning at Present
Listed below are seven common ways the world of business is currently
using Machine Learning:
Analyzing Sales Data: Streamlining the data
Real-Time Mobile Personalization: Promoting the experience
Fraud Detection: Detecting pattern changes
Product Recommendations: Customer personalization
Learning Management Systems: Decision-making programs
Dynamic Pricing: Flexible pricing based on a need or demand
Natural Language Processing: Speaking with humans
Machine Learning models have become quite adaptive in continuously
learning, which makes them increasingly accurate the longer they operate.
ML algorithms combined with new computing technologies promote
scalability and improve efficiency. Combined with business analytics,
Machine Learning can resolve a variety of organizational complexities.
Modern ML models can be used to make predictions ranging from outbreaks
of disease to the rise and fall of stocks.
Artificial Neural Networks
In the above figure, for one single observation, x0, x1, x2,
x3 ... x(n) represent the various inputs (independent variables) to the
network. Each of these inputs is multiplied by a connection weight,
or synapse. The weights are represented as w0, w1, w2,
w3 ... w(n). A weight shows the strength of a particular connection.
b is a bias value. A bias value allows you to shift the activation
function left or right.
In the simplest case, these products are summed, fed to a transfer
function (activation function) to generate a result, and this result is
sent as output.
Mathematically, x1.w1 + x2.w2 + x3.w3 + ... + xn.wn = ∑ xi.wi
The activation function is then applied: 𝜙(∑ xi.wi)
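A minimal sketch of the computation described above: the weighted sum of the inputs, plus the bias, is passed through an activation function. The function names, the choice of sigmoid as 𝜙, and all numeric values are illustrative assumptions, not taken from the slides.

```python
import math


def sigmoid(z):
    """One possible activation function (assumed here for illustration)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Compute activation(sum_i x_i * w_i + b) for a single neuron."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical values for one observation.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = -0.3
y = neuron_output(x, w, b, sigmoid)
print(y)
```

Passing the activation in as a parameter keeps the weighted-sum step separate from the choice of 𝜙, which is the part that varies between network designs.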
Activation function
Non-linear functions are those whose graph is curved rather than a
straight line (for polynomials, a degree greater than one). A neural
network needs them in order to learn and represent almost any
arbitrarily complex function that maps an input to an output.
Sigmoid curve
Types of Activation Functions:
The main advantage of this function is that strongly negative inputs are
mapped to negative outputs and only zero-valued inputs are mapped
to near-zero outputs, so the network is less likely to get stuck during training.
ReLu
One problem here is that some neurons are fragile during training and can
die: a large weight update can leave a neuron in a state where it never
activates on any data point again. In short, ReLu can result in dead neurons.
To fix the problem of dying neurons, Leaky ReLu was introduced.
Leaky ReLu introduces a small slope for negative inputs to keep the updates alive.
Leaky ReLu ranges from -∞ to +∞.
The leak helps to increase the range of the ReLu function. Usually, the
value of a is 0.01 or so.
When a is instead chosen at random during training, the function is called Randomized ReLu.
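The activation functions discussed in this part of the unit can be written in a few lines each. This is a sketch using their standard textbook definitions; the parameter name `a` for the leak follows the slide, and the default of 0.01 is the usual value mentioned there.

```python
import math


def sigmoid(z):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any input into (-1, 1); negative inputs give negative outputs."""
    return math.tanh(z)


def relu(z):
    """Passes positive inputs unchanged and outputs 0 otherwise."""
    return max(0.0, z)


def leaky_relu(z, a=0.01):
    """Like ReLu, but keeps a small slope a for negative inputs."""
    return z if z > 0 else a * z


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

Because `leaky_relu` never returns exactly zero for a negative input, its gradient there is `a` rather than 0, which is what keeps a neuron from going permanently inactive.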
How does the Neural network work?
Let us take the example of predicting the price of a property. To start with, we have different factors assembled in a
single row of data: Area, Bedrooms, Distance to city and Age.
The input values go through the weighted synapses straight over to the
output layer. All four will be analyzed, an activation function will be
applied, and the results will be produced.
This is simple enough but there is a way to amplify the power of the
Neural Network and increase its accuracy by the addition of a hidden
layer that sits between the input and output layers.
Now in the above figure, all 4 variables are connected to the neurons via
synapses. However, not all of the synapses are weighted. They will
either have a 0 value or a non-0 value:
non-0 value → indicates the importance of the input
0 value → the input will be discarded.
For example, suppose Area and Distance to City are non-zero for the first neuron, which means they are
weighted and matter to the first neuron. The other two variables, Bedrooms and Age, aren't weighted and so are
not considered by the first neuron.
You may wonder why that first neuron is only considering two of the
four variables. In this case, it is common on the property market that
larger homes become cheaper the further they are from the city. That’s
a basic fact. So what this neuron may be doing is looking specifically
for properties that are large but are not so far from the city.
Now, this is where the power of neural networks comes from. There
are many of these neurons, each doing similar calculations with
different combinations of these variables.
Once its criterion has been met, a neuron applies the activation function and does its calculations. The next
neuron down may have weighted synapses for Distance to the city and Bedrooms.
In this way the neurons work and interact very flexibly, allowing the network to look for specific things and
therefore make a comprehensive search for whatever it is trained for.
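To make the idea of zero and non-zero synapse weights concrete, here is a small sketch of one hidden neuron for the property example. The feature values, the weights and the use of ReLu as the activation are all hypothetical choices for illustration.

```python
def relu(z):
    return max(0.0, z)


def hidden_neuron(features, weights, bias=0.0):
    """Weighted sum over named features, followed by the activation."""
    z = sum(weights[name] * value for name, value in features.items()) + bias
    return relu(z)


# One row of (already normalised) data; the numbers are made up.
row = {"area": 0.8, "bedrooms": 0.5, "distance_to_city": 0.2, "age": 0.4}

# First neuron: only Area and Distance to city carry non-zero weights,
# so Bedrooms and Age are effectively discarded by this neuron.
first_neuron_weights = {
    "area": 1.0,
    "bedrooms": 0.0,
    "distance_to_city": -1.0,
    "age": 0.0,
}

activation = hidden_neuron(row, first_neuron_weights)
print(activation)
```

With these weights the neuron responds strongly to rows that are large in area and close to the city, and changing Bedrooms or Age has no effect on its output.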
How do Neural networks learn?
The cost function is evaluated and used to adjust the thresholds and
weights of each layer for the next input. Our aim is to
minimize the cost function. The lower the cost function, the closer the
predicted value is to the actual value. In this way, the error becomes
marginally smaller in each run as the network learns how to
analyze values.
We feed the resulting data back through the entire neural network.
The weighted synapses connecting the input variables to the neurons are
the only thing we have control over.
As long as there is a disparity between the actual value and the
predicted value, we need to adjust those weights. Once we tweak them
a little and run the neural network again, a new value of the cost function will be
produced, hopefully smaller than the last.
We repeat this process until the cost function has been driven down
to as small a value as possible.
Back-propagation
Batch-Gradient Descent
Gradient descent is a first-order iterative optimization algorithm. Its job
is to find the minimum of the cost (loss) by repeatedly updating the
weights while the model is trained. In the batch variant, each update uses the whole training set.
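A minimal sketch of batch gradient descent, assuming the simplest possible model (one weight `w` and one bias `b`, prediction `w * x + b`) and mean squared error as the cost. The model, data, learning rate and epoch count are illustrative assumptions; the slides do not fix any of them.

```python
def batch_gradient_descent(xs, ys, lr=0.05, epochs=500):
    """Fit y ~ w * x + b by using the whole dataset for every update."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []  # cost before each update
    for _ in range(epochs):
        preds = [w * x + b for x in xs]
        cost = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        history.append(cost)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        grad_b = (2.0 / n) * sum(p - y for p, y in zip(preds, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, history = batch_gradient_descent(xs, ys)
print(w, b, history[0], history[-1])
```

Every update looks at all rows before changing the weights, which gives a smooth decrease of the cost but requires the full dataset for each step.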
Gradient Descent
In Stochastic Gradient Descent (SGD), we take one row of data at a time, run it through the neural
network, then adjust the weights. For the second row, we run it, then
compare the cost function and adjust the weights again. And so
on…
SGD can help us avoid getting stuck in local minima. Each update is much
cheaper than in batch Gradient Descent because it processes one row at a time and
doesn't have to load the whole dataset into memory for the computation.
One thing to note is that, because SGD is generally noisier than typical
Gradient Descent, it usually takes a higher number of iterations to
reach the minimum, owing to the randomness in its descent. Even
though it requires more iterations to reach the minimum
than typical Gradient Descent, it is still computationally much less
expensive. Hence, in most scenarios,
SGD is preferred over Batch Gradient Descent for optimizing a
learning algorithm.
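For comparison with the batch variant, here is a sketch of stochastic gradient descent on an assumed one-weight linear model with squared error. The weights are adjusted after every single row. The model, data, learning rate and fixed random seed are illustrative assumptions.

```python
import random


def sgd(xs, ys, lr=0.02, epochs=300, seed=0):
    """Fit y ~ w * x + b, updating the weights after each individual row."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    rows = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(rows)       # visit the rows in a random order
        for x, y in rows:
            error = (w * x + b) - y
            # Gradient of this one row's squared error.
            w -= lr * 2.0 * error * x
            b -= lr * 2.0 * error
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = sgd(xs, ys)
print(w, b)
```

Only one row is needed in memory per update, which is where the lower cost per step comes from; the price is a noisier path towards the minimum.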
Training ANN with Stochastic Gradient Descent
For example, the data points in the graph below that are clustered together can be
classified into one single group. We can distinguish the clusters, and
we can identify that there are 3 clusters in the picture below.
Why Clustering ?
Clustering is important because it determines the intrinsic grouping among the
unlabeled data present. There is no single criterion for a good clustering. It depends on the
user and on which criteria satisfy their need. For instance, we
could be interested in finding representatives for homogeneous groups (data
reduction), in finding “natural clusters” and describing their unknown properties
(“natural” data types), in finding useful and suitable groupings (“useful” data
classes) or in finding unusual data objects (outlier detection). A clustering algorithm must
make some assumptions about what constitutes the similarity of points, and each
assumption produces different and equally valid clusters.
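As an illustration of grouping unlabeled points, the sketch below uses k-means with Euclidean distance as the similarity assumption. The slides do not name a specific algorithm at this point, so k-means is an assumed choice, and the points and starting centroids are made up so that three groups are visible.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabeled points forming three visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8), (1, 8)])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

The algorithm separates the three groups but attaches no names to them; a different distance measure or a different number of centroids would give a different, equally valid grouping.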
Clustering Methods :
Example: The problem is as follows: We have an agent and a reward, with many
hurdles in between. The agent is supposed to find the best possible path to reach the
reward. The following example illustrates the problem.
Example: Reinforcement Learning
The above image shows the robot, the diamond, and the fire. The goal of the robot is to
get the reward, which is the diamond, and avoid the hurdles, which are the fire. The robot
learns by trying all the possible paths and then choosing the path that gives it
the reward with the fewest hurdles. Each right step gives the robot a reward and
each wrong step subtracts from the robot's reward. The total reward is
calculated when it reaches the final reward, the diamond.
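The description above can be mirrored literally on a tiny grid: enumerate every path from the robot to the diamond and keep the one with the highest total reward. This is a brute-force sketch of the idea, not a practical reinforcement learning algorithm. The grid layout and the reward values (-1 per ordinary step, -10 for fire, +10 for the diamond) are assumptions made for illustration.

```python
GRID = [
    "S.F",   # S = robot start, F = fire, D = diamond, . = empty cell
    ".F.",
    "..D",
]
REWARDS = {".": -1, "S": 0, "F": -10, "D": 10}
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def find(symbol):
    """Return the (row, column) of the first cell holding symbol."""
    for r, row in enumerate(GRID):
        for c, cell in enumerate(row):
            if cell == symbol:
                return (r, c)
    raise ValueError(symbol)


def all_paths(position, goal, visited):
    """Yield every path from position to goal that never revisits a cell."""
    if position == goal:
        yield [position]
        return
    for dr, dc in MOVES:
        nxt = (position[0] + dr, position[1] + dc)
        inside = 0 <= nxt[0] < len(GRID) and 0 <= nxt[1] < len(GRID[0])
        if inside and nxt not in visited:
            for rest in all_paths(nxt, goal, visited | {nxt}):
                yield [position] + rest


def total_reward(path):
    """Sum the reward of every cell entered after the start cell."""
    return sum(REWARDS[GRID[r][c]] for r, c in path[1:])


start, goal = find("S"), find("D")
best_path = max(all_paths(start, goal, {start}), key=total_reward)
best_reward = total_reward(best_path)
print(best_path, best_reward)
```

Exhaustive search only works because the grid is tiny; on larger problems the agent has to estimate which actions are good from the rewards it has seen so far instead of trying every path.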
Main points in Reinforcement learning –
Input: The input should be an initial state from which the model will
start.
Output: There are many possible outputs, as there are a variety of solutions
to a particular problem.
Training: The training is based upon the input. The model will return a
state, and the user will decide whether to reward or punish the model based on
its output.
The model continues to learn.
The best solution is decided based on the maximum reward.
Supervised Learning vs Reinforcement Learning
Types of Reinforcement: There are two types of
Reinforcement:
Positive –
Positive Reinforcement occurs when an event that follows a particular behavior
increases the strength and the frequency of that behavior. In other words, it has a positive
effect on behavior. Advantages of positive reinforcement learning are:
Maximizes performance
Sustains change for a long period of time
Disadvantages of positive reinforcement learning:
Too much reinforcement can lead to an overload of states, which can diminish the results
Negative –
Negative Reinforcement is defined as the strengthening of a behavior because a negative
condition is stopped or avoided. Advantages of negative reinforcement learning:
Increases behavior
Provides defiance to a minimum standard of performance
Disadvantages of negative reinforcement learning:
It only provides enough to meet the minimum behavior
Various Practical applications of Reinforcement Learning –
[Figure: decision tree fragment with test nodes Humidity and Wind, each with the leaves No and Yes.]
Decision Tree
• Decision trees classify instances by sorting them down the tree from the
root to some leaf node, which provides the classification of the
instance.
• Each node in the tree specifies a test of some attribute of the instance.
• Each branch descending from a node corresponds to one of the
possible values for the attribute.
• Each leaf node assigns a classification.
• The instance (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong)
is classified as a negative instance.
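Sorting an instance down a tree can be sketched in a few lines. The tree below is assumed to be the standard PlayTennis tree (Outlook at the root, then Humidity or Wind); the figure in these slides is only partly recoverable, so the exact structure is an assumption. The nested-tuple representation is likewise just one convenient choice.

```python
# An internal node is (attribute, {value: subtree}); a leaf is a label string.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})


def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


instance = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
label = classify(TREE, instance)
print(label)
```

Temperature is never tested on this path, which shows that an instance only has to supply values for the attributes that lie between the root and its leaf.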
When to Consider Decision Trees
• Instances are represented by attribute-value pairs.
– Fixed set of attributes, and the attributes take a small number of disjoint possible values.
• The target function has discrete output values.
– Decision tree learning is appropriate for a boolean classification, but it easily extends to learning
functions with more than two possible output values.
• Disjunctive descriptions may be required.
– decision trees naturally represent disjunctive expressions.
• The training data may contain errors.
– Decision tree learning methods are robust to errors, both errors in classifications of the training
examples and errors in the attribute values that describe these examples.
• The training data may contain missing attribute values.
– Decision tree methods can be used even when some training examples have unknown values.
• Decision tree learning has been applied to problems such as learning to classify
– medical patients by their disease,
– equipment malfunctions by their cause, and
– loan applicants by their likelihood of defaulting on payments.
Advantages
• As ID3 adds new nodes to grow the decision tree, the accuracy of the tree measured
over the training examples increases monotonically.
• However, when measured over a set of test examples independent of the
training examples, accuracy first increases, then decreases.
Avoid Overfitting
How can we avoid overfitting?
– Stop growing when data split not statistically significant
• stop growing the tree earlier, before it reaches the point where it perfectly
classifies the training data
– Grow full tree then post-prune
• allow the tree to overfit the data, and then post-prune the tree.
• Whether the correct tree size is found by stopping early or by post-pruning, a
key question is what criterion is to be used to determine the correct
final tree size.
– Use a separate set of examples, distinct from the training examples, to evaluate the
utility of post-pruning nodes from the tree.
– Use all the available data for training, but apply a statistical test to estimate
whether expanding a particular node is likely to produce an improvement beyond
the training set. ( chi-square test )
Avoid Overfitting - Reduced-Error Pruning
• Split data into training and validation set
• Do until further pruning is harmful:
– Evaluate impact on validation set of pruning each possible
node (plus those below it)
– Greedily remove the one that most improves the validation
set accuracy
• Pruning of nodes continues until further pruning is harmful
(i.e., decreases accuracy of the tree over the validation set).
• Using a separate set of data to guide pruning is an effective
approach provided a large amount of data is available.
– The major drawback of this approach is that when data is limited, withholding part
of it for the validation set reduces even further the number of examples available
for training.
Main Points with Decision Tree Learning
• Decision tree learning provides a practical method for concept
learning and for learning other discrete-valued functions.
– decision trees are inferred by growing them from the root downward, greedily
selecting the next best attribute.
• ID3 searches a complete hypothesis space.
• The inductive bias in ID3 includes a preference for smaller trees.
• Overfitting training data is an important issue in decision tree learning.
– Pruning decision trees or rules is important.
• A large variety of extensions to the basic ID3 algorithm has
been developed. These extensions include methods for
– post-pruning trees, handling real-valued attributes, accommodating training
examples with missing attribute values, using attribute selection measures other
than information gain, and considering costs associated with instance attributes.
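Information gain, the attribute selection measure mentioned above, can be computed directly from label counts. The sketch uses the standard entropy-based definition; the four training examples are a made-up toy set chosen so that one attribute is perfectly informative and the other is useless.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, target):
    """Reduction in entropy of target obtained by splitting on attribute."""
    gain = entropy([e[target] for e in examples])
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain


# Hypothetical toy data, not taken from the slides.
examples = [
    {"Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Humidity": "High", "Wind": "Strong", "Play": "No"},
    {"Humidity": "Normal", "Wind": "Weak", "Play": "Yes"},
    {"Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
]
gain_humidity = information_gain(examples, "Humidity", "Play")
gain_wind = information_gain(examples, "Wind", "Play")
print(gain_humidity, gain_wind)
```

A greedy learner such as ID3 would pick Humidity here, since splitting on it removes all uncertainty about the label while splitting on Wind removes none.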
Bayesian Networks
When actual probabilities are entered into this net that reflect the
reality of real weather, lawn, and sprinkler-use-behavior, such a net
can be made to answer a number of useful questions, like, "if the lawn
is wet, what are the chances it was caused by rain or by the sprinkler",
and "if the chance of rain increases, how does that affect my having to
budget time for watering the lawn".
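A question such as "if the lawn is wet, what are the chances it was caused by rain" can be answered by summing the joint distribution. The sketch assumes a three-node net in which Rain and Sprinkler are independent causes of a wet lawn; the actual net in the figure may be structured differently. All probabilities are hypothetical numbers, not real weather data.

```python
from itertools import product

# Hypothetical probabilities for illustration only.
P_RAIN = 0.2
P_SPRINKLER = 0.3
P_WET = {  # P(wet lawn | rain, sprinkler)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) under the assumed net structure."""
    p = (P_RAIN if rain else 1 - P_RAIN) * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)


def posterior(query, evidence):
    """P(query is True | evidence), summing the joint over all consistent worlds."""
    names = ("rain", "sprinkler", "wet")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=3):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator


p_rain_given_wet = posterior("rain", {"wet": True})
p_sprinkler_given_wet = posterior("sprinkler", {"wet": True})
print(p_rain_given_wet, p_sprinkler_given_wet)
```

Enumerating every combination is only feasible for very small nets; it is shown here because it makes the meaning of the query explicit.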
Here is another simple Bayes net called Asia. It is an example which
is popular for introducing Bayes nets and is
from Lauritzen & Spiegelhalter (1988). Note, it is for example purposes
only, and should not be used for real decision making.
The direction of the link arrows roughly corresponds to "causality". That is, the
nodes higher up in the diagram tend to influence those below rather than, or at least
more so than, the other way around.
In a Bayes net, the links may form loops, but they may not form cycles. This is not
an expressive limitation; it does not limit the modeling power of these nets. It only
means we must be more careful in building our nets. In the left diagram below, there
are numerous loops. These are fine. In the right diagram, the addition of the link
from D to B creates a cycle, which is not permitted.
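Whether a set of links is acceptable can be checked mechanically by looking for a directed cycle. The sketch uses a depth-first search; the node names and links are a hypothetical layout chosen to match the description (loops are fine, adding the link from D to B is not), not a transcription of the diagrams.

```python
def has_cycle(edges):
    """Return True if the directed graph given by (parent, child) pairs has a cycle."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)
        graph.setdefault(child, [])

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for child in graph[node]:
            if state[child] == IN_PROGRESS:   # reached a node on the current path
                return True
            if state[child] == UNSEEN and visit(child):
                return True
        state[node] = DONE
        return False

    return any(state[n] == UNSEEN and visit(n) for n in graph)


# A -> B -> D and A -> C -> D form a loop when directions are ignored, but no cycle.
loops_only = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
with_cycle = loops_only + [("D", "B")]
print(has_cycle(loops_only), has_cycle(with_cycle))
```

The first graph passes because following the arrows never returns to a node already on the current path; the extra link from D to B makes B reachable from itself.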
• Face observation.
• Text and Hypertext arrangement.
• Bioinformatics.
• Handwriting Recognition.
• Generalized Predictive Control.
Genetic Algorithms
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the
larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of
natural selection and genetics. They are an intelligent exploitation of random search,
using historical data to direct the search into regions of better
performance in the solution space. They are commonly used to generate high-quality
solutions for optimization problems and search problems.
Genetic algorithms simulate the process of natural selection, in which those
species that can adapt to changes in their environment are able to survive,
reproduce and go on to the next generation. In simple words, they simulate “survival of the
fittest” among the individuals of consecutive generations for solving a problem. Each
generation consists of a population of individuals, and each individual represents a
point in the search space and a possible solution. Each individual is represented as a string
of characters/integers/floats/bits. This string is analogous to the chromosome.
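A minimal genetic algorithm on bit-string chromosomes, as a sketch of the ideas above. The fitness function (count of 1 bits), truncation selection, one-point crossover, bit-flip mutation, elitism and every parameter value are assumptions chosen for illustration; the slides do not prescribe any of them.

```python
import random


def fitness(chromosome):
    """Toy fitness: the number of 1 bits in the chromosome."""
    return sum(chromosome)


def evolve(pop_size=20, length=16, generations=40, mutation_rate=0.05, seed=1):
    """Run a simple GA and return the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_per_generation = [max(map(fitness, population))]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]       # selection: keep the fitter half
        children = [ranked[0][:]]               # elitism: carry the best one over
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)    # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best_per_generation.append(max(map(fitness, population)))
    return best_per_generation


history = evolve()
print(history[0], history[-1])
```

Because the best individual is always copied unchanged into the next generation, the best fitness can never decrease from one generation to the next.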
Foundation of Genetic Algorithms
• When and how can prior knowledge held by the learner guide
the process of generalizing from examples ?
Issues in Machine Learning (cont.)
The image represents the five stages of the data science life cycle: Capture (data
acquisition, data entry, signal reception, data extraction); Maintain (data
warehousing, data cleansing, data staging, data processing, data
architecture); Process (data mining, clustering/classification, data modeling, data
summarization); Analyze (exploratory/confirmatory, predictive analysis, regression,
text mining, qualitative analysis); Communicate (data reporting, data visualization,
business intelligence, decision making).
Data Science vs Machine Learning