Deep Learning Unit-II


DEEP LEARNING

UNIT-II
Topics: Machine Learning and Deep Learning, Representation Learning, Width and
Depth of Neural Networks. Activation functions: ReLU, Leaky ReLU, ELU.
Unsupervised Training of Neural Networks, Restricted Boltzmann Machines.

Machine Learning and Deep Learning


Machine Learning and Deep Learning are two core concepts of Data Science and subsets
of Artificial Intelligence. Many people treat machine learning, deep learning and
artificial intelligence as interchangeable buzzwords, but in reality these terms are
distinct, though closely related to each other.

What is Machine Learning?


Machine learning is a branch of artificial intelligence and a rapidly growing technology
that enables machines to learn from past data and perform a given task automatically.

Machine learning allows computers to learn from experience on their own, using
statistical methods to improve performance and predict outputs without being
explicitly programmed.

Popular applications of ML include email spam filtering, product recommendations and online
fraud detection.

Some useful ML algorithms are:

o Decision Tree algorithm


o Naïve Bayes
o Random Forest
o K-means clustering
o KNN algorithm
o Apriori Algorithm, etc.
How does Machine Learning work?
The working of a machine learning model can be understood through the example of
identifying an image as a cat or a dog. To do this, the ML model takes images of both
cats and dogs as input, extracts different features of the images such as shape, height,
nose and eyes, applies a classification algorithm, and predicts the output.
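
A minimal sketch of this two-step pipeline (hand-crafted features, then a classifier) using scikit-learn's DecisionTreeClassifier is shown below; the feature values and labels are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Hand-crafted features per image (hypothetical): [ear_pointiness, snout_length, body_height]
X_train = [
    [0.9, 0.2, 0.3],  # cat
    [0.8, 0.3, 0.3],  # cat
    [0.3, 0.8, 0.6],  # dog
    [0.2, 0.9, 0.7],  # dog
]
y_train = ["cat", "cat", "dog", "dog"]

# Classical ML: the feature extraction above is done manually; the model only classifies.
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.predict([[0.85, 0.25, 0.3]]))  # expected: ['cat']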

What is Deep Learning?


Deep Learning is a subset of machine learning, or can be described as a special kind of
machine learning. It works in technically the same way as machine learning, but with
different capabilities and approaches. It is inspired by the functionality of human brain
cells, called neurons, which leads to the concept of artificial neural networks. It is also
called a deep neural network or deep neural learning.

In deep learning, models use different layers to learn and discover insights from the data.

Some popular applications of deep learning are self-driving cars, language translation,
natural language processing, etc.

Some popular deep learning models are:

o Convolutional Neural Network


o Recurrent Neural Network
o Autoencoders
o Classic Neural Networks, etc.

How does Deep Learning work?


We can understand the working of deep learning with the same example of identifying
a cat vs. a dog. The deep learning model takes the images as input and feeds them
directly to the algorithm, without requiring any manual feature-extraction step. The
images pass through the different layers of the artificial neural network, which predicts
the final output.
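
To make the end-to-end idea concrete, here is a minimal Keras sketch, assuming TensorFlow is available and using illustrative input dimensions: raw pixels go straight in, and the convolutional layers learn the features themselves instead of relying on a hand-built extractor.

import tensorflow as tf

# Raw 64x64 RGB images go straight in; no manual feature extraction.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 1 = dog, 0 = cat (hypothetical labels)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(images, labels, epochs=5)  # images: (N, 64, 64, 3) array, labels: (N,) of 0/1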


Key comparisons between Machine Learning and Deep
Learning
Let's understand the key differences between these two terms based on different
parameters:

Data Dependency
o Machine Learning: can still work reasonably well with a smaller amount of data.
o Deep Learning: algorithms depend heavily on a large amount of data, so a large amount of data must be fed for good performance.

Execution Time
o Machine Learning: takes less time to train the model than deep learning, but a long time to test the model.
o Deep Learning: takes a long time to train the model, but less time to test it.

Hardware Dependencies
o Machine Learning: since the models do not need much data, they can work on low-end machines.
o Deep Learning: the model needs a huge amount of data to work efficiently, so it needs GPUs and hence high-end machines.

Feature Engineering
o Machine Learning: needs a feature-extraction step carried out by an expert before the model proceeds further.
o Deep Learning: being an enhanced version of machine learning, it does not need a separate feature extractor for each problem; instead, it tries to learn high-level features from the data on its own.

Problem-Solving Approach
o Machine Learning: the traditional ML approach breaks the problem into sub-parts, solves each part, and then combines them to produce the final result.
o Deep Learning: takes the input for a given problem and produces the end result directly, following an end-to-end approach.

Interpretation of Result
o Machine Learning: the result for a given problem is easy to interpret; we can usually explain why a particular result occurred and what the process was.
o Deep Learning: the result is very difficult to interpret; the model may give a better result than machine learning, but we cannot easily tell why that particular outcome occurred or what the reasoning was.

Type of Data
o Machine Learning: mostly requires data in a structured form.
o Deep Learning: can work with both structured and unstructured data, as it relies on the layers of an artificial neural network.

Suitable For
o Machine Learning: suitable for solving simple or moderately complex problems.
o Deep Learning: suitable for solving complex problems.

Which one to select among ML and Deep Learning?

Having seen a brief introduction to ML and DL along with some comparisons, the question
now is which one should be chosen to solve a particular problem.

In short: if you have lots of data and strong hardware capabilities, go with deep learning.
If you lack either of them, choose an ML model to solve your problem.

Conclusion: In conclusion, we can say that deep learning is machine learning with more
capabilities and a different working approach. Selecting either of them to solve a
particular problem depends on the amount of data available and the complexity of the problem.
Representation learning
Although traditional unsupervised learning techniques will always be staples of machine
learning pipelines, representation learning has emerged as an alternative approach to
feature extraction with the continued success of deep learning. In representation
learning, features are extracted from unlabeled data by training a neural network on a
secondary, supervised learning task.

Due to its popularity, word2vec has become the de facto "Hello, world!" application of
representation learning. When applying deep learning to natural language processing
(NLP) tasks, the model must simultaneously learn several language concepts:

1. the meanings of words


2. how words are combined to form concepts (i.e., syntax)
3. how concepts relate to the task at hand.

Representation Learning is concerned with training machine learning algorithms to learn


useful representations, e.g. those that are interpretable, have latent features, or can be
used for transfer learning.

Deep neural networks can be considered representation learning models that encode
information by projecting it into a different subspace. These representations are then
usually passed on to a simple model, for instance a linear classifier trained on top of them.

Representation learning can be divided into:

• Supervised representation learning: learning representations on task A using
annotated data, which are then used to solve task B.
• Unsupervised representation learning: learning representations on a task in an
unsupervised way (label-free data). These are then used to address downstream
tasks, reducing the need for annotated data when learning new tasks.
Powerful models like GPT and BERT leverage unsupervised representation
learning to tackle language tasks.

More recently, self-supervised learning (SSL) has become one of the main drivers behind
unsupervised representation learning in fields like computer vision and NLP.
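
As a small illustration of the supervised case (a representation learned on task A being reused for task B), the hedged Keras sketch below freezes the convolutional base of a network pretrained on ImageNet and reuses it as a feature extractor for a hypothetical 10-class downstream task.

import tensorflow as tf

# Representation learned on task A (ImageNet classification), reused for task B.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # freeze the learned representation

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),  # hypothetical 10-class downstream task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")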
Activation Functions
When we have so much information, the challenge is to separate the relevant from the
irrelevant.

When our brain is fed with a lot of information simultaneously, it tries hard to understand
and classify the information into "useful" and "not-so-useful" information. We need a similar
mechanism for classifying incoming information as "useful" or "less useful" in the case of
neural networks.

This matters for how a network learns, because not all information is equally useful; some
of it is just noise. This is where activation functions come into the picture: they help the
network use the important information and suppress the irrelevant data points.

Let us go through these activation functions, learn how they work, and figure out which
activation function fits well with which kind of problem statement.

Brief overview of neural networks

Before delving into the details of activation functions, let us quickly go through the concept
of neural networks and how they work. A neural network is a very powerful machine
learning mechanism that basically mimics how a human brain learns.

The brain receives a stimulus from the outside world, processes the input, and then
generates an output. As the task gets complicated, multiple neurons form a complex
network, passing information among themselves.

An artificial neural network tries to mimic similar behaviour. It is a network of
interconnected neurons, where each neuron is characterized by its weights, bias and
activation function.

The input is fed to the input layer, and each neuron performs a linear transformation on this
input using its weights and bias:

x = (weight * input) + bias

After that, an activation function is applied to the above result.

Finally, the output from the activation function moves to the next hidden layer and the
same process is repeated. This forward movement of information is known as forward
propagation.

What if the output generated is far away from the actual value? Using the output from
forward propagation, the error is calculated. Based on this error value, the weights and
biases of the neurons are updated. This process is known as back-propagation.
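
A minimal numpy sketch of one forward pass through a single layer (with made-up sizes and inputs) shows the linear transformation above followed by a ReLU activation:

import numpy as np

rng = np.random.default_rng(0)

# One layer with 3 inputs and 2 neurons (hypothetical sizes).
inputs = np.array([0.5, -1.2, 2.0])
weights = rng.normal(size=(2, 3))   # one row of weights per neuron
biases = np.zeros(2)

x = weights @ inputs + biases       # x = (weight * input) + bias, per neuron
output = np.maximum(0, x)           # activation function (ReLU) applied to the result
print(output)                       # this would then feed the next layer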

Can we do without an activation function?

We understand that using an activation function introduces an additional step at each layer
during forward propagation. Now the question is: if the activation function increases the
complexity so much, can we do without one?

Imagine a neural network without activation functions. In that case, every neuron would
only be performing a linear transformation on the inputs using the weights and biases.
Although linear transformations make the neural network simpler, the network would be
less powerful and would not be able to learn complex patterns from the data.

A neural network without an activation function is essentially just a linear regression model.
Thus we apply a non-linear transformation to the inputs of the neuron, and this
non-linearity in the network is introduced by an activation function.

In the next section we will look at different types of activation functions, their
mathematical equations, graphical representation and Python code.

Popular types of activation functions and when to use them


1. ReLU

The ReLU function is a non-linear activation function that has gained popularity in the
deep learning domain. ReLU stands for Rectified Linear Unit. The main advantage of using
the ReLU function over other activation functions is that it does not activate all the neurons
at the same time.

This means that a neuron is deactivated only if the output of the linear transformation is
less than 0:

f(x) = max(0, x)

For negative input values the result is zero, which means the neuron does not get
activated. Since only a certain number of neurons are activated, the ReLU function is far
more computationally efficient than the sigmoid and tanh functions. Here is the Python
function for ReLU:

def relu_function(x):
    if x < 0:
        return 0
    else:
        return x

relu_function(7), relu_function(-7)

Output:

(7, 0)

Let’s look at the gradient of the ReLU function.

f'(x) = 1, x>=0
= 0, x<0

On the negative side, the gradient value is zero. Because of this, during the
back-propagation process the weights and biases of some neurons are not updated. This
can create dead neurons which never get activated. This is taken care of by the 'Leaky'
ReLU function.

2. Leaky ReLU
Leaky ReLU is nothing but an improved version of the ReLU function. As we saw, for the
ReLU function the gradient is 0 for x < 0, which deactivates the neurons in that region.

Leaky ReLU is defined to address this problem. Instead of defining the ReLU function as 0
for negative values of x, we define it as an extremely small linear component of x. Here is
the mathematical expression:

f(x) = 0.01x, x<0
     = x, x>=0

By making this small modification, the gradient on the left side of the graph becomes a
non-zero value. Hence we no longer encounter dead neurons in that region.

Here is the derivative of the Leaky ReLU function

f'(x) = 1, x>=0
=0.01, x<0
Since Leaky ReLU is a variant of ReLU, the Python code can be implemented with a small
modification:

def leaky_relu_function(x):
    if x < 0:
        return 0.01*x
    else:
        return x

leaky_relu_function(7), leaky_relu_function(-7)

Output:

(7, -0.07)

Apart from Leaky ReLU, there are a few other variants of ReLU; the two most popular are
the Parameterised ReLU function and the Exponential Linear Unit (ELU).

3. Exponential Linear Unit

Exponential Linear Unit, or ELU for short, is also a variant of the Rectified Linear Unit (ReLU)
that modifies the slope of the negative part of the function. Unlike the Leaky ReLU and
Parametric ReLU functions, instead of a straight line, ELU uses an exponential curve for
defining the negative values. It is defined as

f(x) = x, x>=0
     = a(e^x - 1), x<0

Let’s define this function in Python:

import numpy as np

def elu_function(x, a):
    if x < 0:
        return a*(np.exp(x)-1)
    else:
        return x

elu_function(5, 0.1), elu_function(-5, 0.1)

Output:

(5, -0.09932620530009145)
The derivative of the ELU function for values of x greater than 0 is 1, like all the ReLU
variants. But for values of x < 0, the derivative is a·e^x:

f'(x) = 1, x>=0
      = a(e^x), x<0
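
As a small vectorized sketch in numpy (with an illustrative alpha of 0.1 for ELU), the three activations above can be compared side by side on the same inputs:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.where(x < 0, 0.01 * x, x)

def elu(x, a=0.1):
    return np.where(x < 0, a * (np.exp(x) - 1), x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))        # [0. 0. 0. 1. 5.]
print(leaky_relu(x))  # [-0.05 -0.01  0.    1.    5.  ]
print(elu(x))         # [-0.09932621 -0.06321206  0.    1.    5.  ]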

Unsupervised Training of Neural Networks


Unsupervised Learning
Unsupervised learning means you’re only exposing a machine to input data. There is no
corresponding output data to teach the system the answers it should be arriving at.
With unsupervised learning, you train the machine with unlabeled data that offers it no
hints about what it’s seeing. Because it doesn’t know which pictures show cats and
which show dogs, it can’t learn how to tell them apart. Instead, it can learn the
similarities between all the pictures you expose it to.
That doesn’t help with classifying images (this neural network will never tell you when
a picture contains a dog or a cat). But it is helpful for lots of other tasks. It can let you
know when a new picture is so different from what it’s previously been exposed to that
it’s confident the picture contains neither dogs nor cats.
It can take large images of cats or dogs and distill them down to lists of
characteristics (like ‘pointy ears’ or ‘soft’) that take up less space for storage, and then
expand them out to pictures again.
It can even dream up new images of cats or dogs.
Unsupervised learning can be compared to the way children learn about the world
without the insights of adult supervision. No one teaches children to be surprised and
curious about a species of animal they’ve never seen before.
No one needs to teach children to associate a quality like softness with an animal’s fur,
only how to articulate the association they’ve already made themselves from patterns
of experience.
You may not be able to identify that a child’s finger-painting represents a dog, but
they’re still able to draw a picture that, to them, expresses what they’ve learned about
how dogs appear.
ThreatWarrior™ and Unsupervised Neural Networks
Supervised learning is great when you have a large, curated library of labelled
examples. When you can provide thousands and thousands of examples of what a
machine should learn, you can supervise machine learning. However, that’s not always
feasible. It can take a long time and a lot of manual labour to build that kind of library.
And sometimes problems just aren’t suited to it. That’s when you turn to unsupervised
learning. Unsupervised neural networks are particularly useful in areas like digital art,
fraud detection and cybersecurity.
ThreatWarrior is the first solution to use unsupervised neural networks for cyber
defense. We applied unsupervised neural networks because we’re seeking threats for
which we have no prior experiences.
While we also have supervised neural networks that we utilize for prior lessons learned
and experiences we can pass down (our customers provide the supervision through
human oversight in their environments), many threats don’t have signatures that we
can simply recognize. For this, we need the machine to self-learn patterns of
behaviour, so that it can develop its own instincts.
By learning what’s ‘normal’ for a network, ThreatWarrior also learns what’s abnormal. If
there is activity or behaviours that fall outside the learned pattern, ThreatWarrior will
alert to these anomalies.
Another big advantage of neural networks is that they excel at feature extraction:
building complex hierarchies of meaning to express information from raw data.
Apply this to cybersecurity, and you can derive information from raw traffic like, “who
talked to whom about what” to conceptualize higher-order patterns in the
environment.
Using unsupervised neural networks to perform deep learning allows you to observe
significantly more detail, so what you see is a better, more accurate picture of your
security environment.
Antiquated solutions can require manual work for programmers to codify examples of
what's normal into their platforms, taking up valuable time and resources. ThreatWarrior
does this without any supervision and with no feature engineering, meaning our solution
is trained uniquely on your network data.

Restricted Boltzmann Machine


Nowadays, the Restricted Boltzmann Machine is an undirected graphical model that plays a
major role in the deep learning framework. It was initially introduced by Paul Smolensky
in 1986 as the Harmonium, and it gained huge popularity in recent years in the context
of the Netflix Prize, where RBMs achieved state-of-the-art performance in collaborative
filtering and beat most of the competition.
Many hidden layers can be learned efficiently by composing restricted Boltzmann
machines, using the feature activations of one as the training data for the next. RBMs are
basically neural networks that belong to the so-called energy-based models. The algorithm
is used for dimensionality reduction, classification, regression, collaborative filtering,
feature learning, and topic modeling.

Autoencoders vs. Restricted Boltzmann Machine


Autoencoders are simply neural networks with three layers, where the output layer is
connected back to the input layer. They have far fewer hidden units than visible units, and
they are trained to minimize the reconstruction error. In simple words, training helps
discover an efficient way of representing the input data.

RBM shares a similar idea, but instead of using a deterministic distribution it uses
stochastic units with a particular distribution. It trains the model to understand the
association between the two sets of variables.
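
A hedged Keras sketch of the three-layer autoencoder described above is given below; the 784-dimensional input and 32-unit bottleneck are illustrative choices (e.g. for flattened 28x28 grayscale images).

import tensorflow as tf

# Three layers: visible input, a smaller hidden (bottleneck) layer, and a reconstruction of the input.
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(784,)),  # far fewer hidden units than visible units
    tf.keras.layers.Dense(784, activation="sigmoid"),                  # output reconstructs the input
])

# Training minimizes the reconstruction error between input and output.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=10)  # note: the input is also the target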


RBM has two biases, which is one of the most important aspects that distinguishes it from
other autoencoders. The hidden bias helps the RBM produce the activations on the
forward pass, while the visible-layer bias helps the RBM learn the reconstruction on the
backward pass.

Layers in Restricted Boltzmann Machine


Restricted Boltzmann Machines are shallow; they are basically two-layer neural nets that
constitute the building blocks of deep belief networks. The first layer of the RBM is the
input layer, also known as the visible layer, and the second layer is the hidden layer. Each
node represents a neuron-like unit, and nodes are interconnected across the different
layers.

However, no two nodes of the same layer are linked, which means there is no intra-layer
communication; this is the only restriction in a restricted Boltzmann machine. Each node
performs its calculation by simply processing its inputs and making a stochastic decision
about whether or not to transmit the input.

Working of Restricted Boltzmann Machine


Each visible node takes a low-level feature from an item in the dataset to be learned; for
example, from a dataset of grayscale images, each visible node would receive one pixel
value for each pixel in one image.

Let's follow a single pixel value X through the two-layer net. At the first node of the hidden
layer, X gets multiplied by a weight, and the result is added to the bias. The result is then
passed to the activation function, which produces the output of that node, i.e. the strength
of the signal passing through it given the input X.

Next, let's look at how several inputs combine at one hidden node. Each X is multiplied by a
distinct weight, the products are summed, the bias is added, and the result is again passed
through the activation function to produce the output of that node.

Each input X is multiplied by an individual weight w at each hidden node. In other words,
in an example with 3 hidden nodes a single input would encounter three weights, giving a
total of 12 weights (4 input nodes x 3 hidden nodes). The weights between the two layers
always form a matrix whose rows correspond to the input nodes and whose columns
correspond to the output nodes.
Each hidden node therefore receives four inputs multiplied by their separate weights; these
products are summed, added to the bias, and the result is passed through the activation
function to produce one output per hidden node.
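
A minimal numpy sketch of this forward pass, assuming the same illustrative sizes (4 visible nodes, 3 hidden nodes) and a sigmoid activation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

v = np.array([0.2, 0.8, 0.1, 0.5])   # 4 visible units (e.g. pixel values)
W = rng.normal(size=(4, 3))          # weight matrix: rows = visible nodes, columns = hidden nodes
b_hidden = np.zeros(3)               # hidden bias

# Each hidden node sums its weighted inputs, adds the bias, and applies the activation.
hidden_prob = sigmoid(v @ W + b_hidden)
print(hidden_prob)  # three activation values, one per hidden node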

Training of Restricted Boltzmann Machine


The training of a Restricted Boltzmann Machine is quite different from training an ordinary
neural network via stochastic gradient descent.

Following are the two main training steps:

o Gibbs Sampling

Gibbs sampling is the first part of the training. Given an input vector v, we use p(h|v) to
predict the hidden values h. Given the hidden values h, we then use p(v|h) to predict new
input values v.

This process is repeated k times, so that after k iterations we obtain an input vector v_k
that is recreated from the original input vector v_0.

o Contrastive Divergence Step

During the contrastive divergence step, the weight matrix gets updated. Using the vectors
v_0 and v_k, we compute the activation probabilities of the hidden values h_0 and h_k,
i.e. p(h_0|v_0) and p(h_k|v_k).

The update matrix is computed as the difference of the outer products (⊗) of these
probabilities with the input vectors v_0 and v_k:

ΔW = v_0 ⊗ p(h_0|v_0) - v_k ⊗ p(h_k|v_k)

With the help of this update matrix, the new weights are obtained by moving the old
weights a small step in the direction of ΔW:

W_new = W_old + learning_rate * ΔW
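
A hedged numpy sketch of one CD-1 (contrastive divergence with k = 1) update step is shown below, under the usual binary-unit assumptions; the sizes, the example vector and the learning rate are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 4, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

v0 = np.array([1.0, 0.0, 1.0, 1.0])                # one training example

# Gibbs sampling: v0 -> h0 -> v1 (= v_k with k = 1) -> h1
p_h0 = sigmoid(v0 @ W + b_h)
h0 = (rng.random(n_hidden) < p_h0).astype(float)   # stochastic hidden states
p_v1 = sigmoid(h0 @ W.T + b_v)
v1 = (rng.random(n_visible) < p_v1).astype(float)  # reconstructed input
p_h1 = sigmoid(v1 @ W + b_h)

# Contrastive divergence update: difference of outer products
delta_W = np.outer(v0, p_h0) - np.outer(v1, p_h1)
W += lr * delta_W
b_v += lr * (v0 - v1)
b_h += lr * (p_h0 - p_h1)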

Training to Prediction
Step 1: Train the network on the data of all users.

Step 2: At inference time, take the training data of a specific user.

Step 3: Use this data to obtain the activations of the hidden neurons.

Step 4: Use the hidden neuron values to get the activations of the input neurons.

Step 5: The new values of the input neurons show the ratings the user would give.
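
Continuing the training sketch above (and reusing its W, b_h, b_v and sigmoid), inference for one user would follow Steps 2 to 5 roughly like this; the user vector is hypothetical.

# Hypothetical user vector: 1 = liked, 0 = not liked / unseen (W, b_h, b_v from the training sketch)
user_v = np.array([1.0, 0.0, 0.0, 1.0])

hidden = sigmoid(user_v @ W + b_h)           # Step 3: hidden activations from the user's data
reconstructed = sigmoid(hidden @ W.T + b_v)  # Step 4: activations of the input neurons

print(reconstructed)  # Step 5: predicted preferences, including items the user has not rated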
