Deep Learning Unit-II
UNIT-II
Topics: Machine Learning and Deep Learning, Representation Learning, Width and
Depth of Neural Networks, Activation Functions (ReLU, LReLU, ELU),
Unsupervised Training of Neural Networks, Restricted Boltzmann Machines.
Machine Learning allows computers to learn from experience on their own, using
statistical methods to improve performance and predict the output without being
explicitly programmed.
Popular applications of ML are email spam filtering, product recommendations, and online
fraud detection.
In deep learning, models use different layers to learn and discover insights from the data.
Some popular applications of deep learning are self-driving cars, language translation,
natural language processing, etc.
Having seen a brief introduction to ML and DL along with some comparisons, the next
question is which of the two should be chosen to solve a particular problem. A simple
rule of thumb is the following:
Hence, if you have lots of data and high hardware capabilities, go with deep learning. But
if you have neither of them, choose an ML model to solve your problem.
Conclusion: In conclusion, we can say that deep learning is machine learning with more
capabilities and a different working approach, and selecting one of them to solve a
particular problem depends on the amount of data and the complexity of the problem.
Representation learning
Although traditional unsupervised learning techniques will always be staples of machine
learning pipelines, representation learning has emerged as an alternative approach to
feature extraction with the continued success of deep learning. In representation
learning, features are extracted from unlabeled data by training a neural network on a
secondary, supervised learning task.
Due to its popularity, word2vec has become the de facto "Hello, world!" application of
representation learning. When applying deep learning to natural language processing
(NLP) tasks, the model must learn several language concepts simultaneously; learning
word representations on unlabeled text first, as word2vec does, removes part of this burden.
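As an illustration, here is a minimal sketch of learning word representations with the gensim library's Word2Vec implementation; the toy corpus and the hyperparameters below are arbitrary choices for the example, not taken from the text above.

from gensim.models import Word2Vec

# A tiny illustrative corpus: each sentence is a list of tokens.
sentences = [
    ["deep", "learning", "learns", "representations", "from", "data"],
    ["word2vec", "learns", "word", "representations", "from", "unlabeled", "text"],
    ["neural", "networks", "learn", "useful", "features"],
]

# Train word vectors; vector_size, window, min_count and epochs are example values.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

# The learned embedding of a word is its representation.
vector = model.wv["representations"]         # a 50-dimensional vector
similar = model.wv.most_similar("learning")  # nearest words in the embedding space
print(vector.shape, similar[:3])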
Deep neural networks can be considered representation learning models that encode
information by projecting it into a different subspace. These representations are then
usually passed on to a linear layer, for instance to train a classifier.
More recently, self-supervised learning (SSL) has become one of the main drivers behind
unsupervised representation learning in fields like computer vision and NLP.
Activation Functions
When we have so much information, the challenge is to separate the relevant from the
irrelevant.
When our brain is fed with a lot of information simultaneously, it tries hard to understand
and classify the information into "useful" and "not-so-useful". We need a similar mechanism
for classifying incoming information as useful or not-so-useful in the case of Neural
Networks.
This is important to the way a network learns, because not all the information is equally
useful; some of it is just noise. This is where activation functions come into the picture.
Activation functions help the network use the important information and suppress the
irrelevant data points.
Let us go through these activation functions, learn how they work, and figure out which
ones suit which kind of problem.
Before delving into the details of activation functions, let us quickly go through the concept
of neural networks and how they work. A neural network is a very powerful machine
learning mechanism that roughly mimics how the human brain learns.
The brain receives a stimulus from the outside world, processes the input,
and then generates an output. As the task gets complicated, multiple neurons form a
complex network and pass information among themselves.
The input is fed to the input layer, the neurons perform a linear transformation on this input
using the weights and biases, and an activation function is then applied to the result.
Finally, the output from the activation function moves to the next hidden layer and the
same process is repeated. This forward movement of information is known as the forward
propagation.
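As a rough illustration of forward propagation, the sketch below pushes one input vector through a single hidden layer; the layer sizes, the random weights, and the choice of a sigmoid activation are assumptions made purely for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                          # example input with 3 features

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # input -> hidden (4 neurons)
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # hidden -> output (1 neuron)

# Each layer applies a linear transformation followed by the activation function.
h = sigmoid(W1 @ x + b1)                             # hidden-layer output
y = sigmoid(W2 @ h + b2)                             # network output
print(y)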
What if the generated output is far away from the actual value? Using the output from the
forward propagation, the error is calculated. Based on this error value, the weights and biases
are updated. This process is known as back-propagation.
We understand that using an activation function introduces an additional step at each layer
during the forward propagation. Now the question is: if the activation function adds this
extra computation, can we do without one? Without an activation function, each neuron would
only be performing a linear transformation on the inputs using the weights and biases.
Although a linear transformation makes the neural network simpler, the network would
be less powerful and would not be able to learn the complex patterns in the data.
A neural network without an activation function is essentially just a linear regression model.
Thus we apply a non-linear transformation to the inputs of the neuron, and this non-linearity
is introduced by the activation function.
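To see why a network without activation functions reduces to a linear model, the short check below (with arbitrary random weights) shows that two stacked linear layers are equivalent to a single linear layer.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 1))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))

# Two linear layers applied one after the other, with no activation in between...
out_two_layers = W2 @ (W1 @ x + b1) + b2

# ...collapse into a single equivalent linear layer W x + b.
W, b = W2 @ W1, W2 @ b1 + b2
out_one_layer = W @ x + b

print(np.allclose(out_two_layers, out_one_layer))   # True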
In the next sections we will look at different activation functions, how they work, and their
pros and cons.
1. ReLU
The ReLU function is a non-linear activation function that has gained popularity in
the deep learning domain. ReLU stands for Rectified Linear Unit. The main advantage of
using the ReLU function over other activation functions is that it does not activate all the
neurons at the same time.
This means that a neuron is deactivated only if the output of the linear
transformation is less than 0. The function is defined as:
f(x)=max(0,x)
For negative input values the result is zero, which means the neuron does not get
activated. Since only a certain number of neurons are activated, the ReLU function is far
more computationally efficient than the sigmoid and tanh functions. Here is
the Python implementation:
def relu_function(x):
    # Output 0 for negative inputs, the input itself otherwise.
    if x < 0:
        return 0
    else:
        return x

relu_function(7), relu_function(-7)
Output:
(7, 0)
The derivative of the ReLU function is:
f'(x) = 1, x >= 0
      = 0, x < 0
If you look at the negative side of the graph, you will notice that the gradient value is zero.
Due to this, during the backpropagation process, the weights and biases of some
neurons are not updated. This can create dead neurons that never get activated. This is
known as the dying ReLU problem.
2. Leaky ReLU
The Leaky ReLU function is nothing but an improved version of the ReLU function. As we saw,
the gradient of the ReLU function is 0 for x < 0, which deactivates the neurons
in that region.
Leaky ReLU is defined to address this problem. Instead of defining the ReLU function as 0
for negative values of x, we define it as an extremely small linear component of x:
f(x) = x,     x >= 0
     = 0.01x, x < 0
By making this small modification, the gradient on the left side of the graph becomes
non-zero, and hence we no longer encounter dead neurons in that region. The derivative is:
f'(x) = 1,    x >= 0
      = 0.01, x < 0
Since Leaky ReLU is a variant of ReLU, the python code can be implemented with a small
modification-
def leaky_relu_function(x):
    # Scale negative inputs by 0.01 instead of clamping them to 0.
    if x < 0:
        return 0.01 * x
    else:
        return x

leaky_relu_function(7), leaky_relu_function(-7)
Output:
(7, -0.07)
Apart from Leaky ReLU, there are a few other variants of ReLU; the two most popular are
Parametric ReLU and the Exponential Linear Unit (ELU).
3. ELU
The Exponential Linear Unit, or ELU for short, is also a variant of the Rectified Linear Unit
(ReLU) that modifies the slope of the negative part of the function. Unlike the Leaky ReLU and
Parametric ReLU functions, instead of a straight line, ELU uses an exponential curve for
defining the negative values:
f(x) = x,          x >= 0
     = a(e^x - 1), x < 0
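Following the style of the earlier snippets, a sketch of the ELU function is given below; the value a = 0.1 and the inputs 5 and -5 are inferred from the output shown underneath, not stated explicitly in the text.

import numpy as np

def elu_function(x, a):
    # Exponential curve for negative inputs, identity for non-negative inputs.
    if x < 0:
        return a * (np.exp(x) - 1)
    else:
        return x

elu_function(5, 0.1), elu_function(-5, 0.1)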
Output:
(5, -0.09932620530009145)
The derivative of the ELU function is 1 for values of x greater than or equal to 0, like all the
ReLU variants; for negative values it is a·e^x:
f'(x) = 1,      x >= 0
      = a(e^x), x < 0
Restricted Boltzmann Machines
A Restricted Boltzmann Machine (RBM) shares a similar idea with an autoencoder, but instead
of using deterministic units, it uses stochastic units with a particular distribution. It trains
the model to understand the association between two sets of variables: the visible and the
hidden units.
RBM has two biases, which is one of the most important aspects that distinguish it
from other autoencoders. The hidden bias helps the RBM produce the activations on the
forward pass, while the visible-layer bias helps the RBM learn the reconstruction on the
backward pass.
No two nodes of the same layer are linked, which means there is no intra-layer
communication; this is the only restriction in the Restricted Boltzmann Machine. At each
node, the calculation takes place by simply processing the inputs and making a stochastic
decision about whether to transmit the input or not.
Let's follow a single pixel value x through the two-layer net. At the first node of the
hidden layer, x gets multiplied by a weight, and the bias is added. The result is then passed
to the activation function, which produces the output of that node, i.e., the strength of the
signal passing through it given the input x.
Next, let us look at how several inputs combine at one particular hidden node. Each x is
multiplied by a distinct weight, the products are summed, the bias is added, and the result
is again passed through the activation function to produce the output of that node.
Each input x is multiplied by an individual weight w at each hidden node. In other
words, a single input encounters three weights, which results in a total of 12 weights
(4 input nodes x 3 hidden nodes). The weights between the two layers always form a matrix
whose rows equal the number of input nodes and whose columns equal the number of output
nodes.
Here each hidden node receives the four inputs multiplied by their separate weights; these
products are summed, added to the bias, and the result is passed through the activation
function to produce one output for each hidden node.
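As a sketch of this computation, the snippet below computes the hidden activations for the 4-visible, 3-hidden example above; the random weights, the sigmoid activation, and the binary units are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=(1, 4)).astype(float)   # 4 visible units (e.g. pixel values)
W = rng.normal(scale=0.1, size=(4, 3))              # 4 x 3 weight matrix (12 weights)
b_h = np.zeros((1, 3))                              # hidden bias

# Each hidden node sums its weighted inputs, adds the hidden bias,
# and passes the result through the activation function.
p_h = sigmoid(v @ W + b_h)                          # activation probabilities p(h = 1 | v)
h = (rng.random(p_h.shape) < p_h).astype(float)     # stochastic binary hidden states
print(p_h, h)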
o Gibbs Sampling
Gibbs sampling is the first part of the training. Whenever we are given an input vector v,
we use p(h | v) to predict the hidden values h. Conversely, given the hidden values h, we
use p(v | h) to predict new input values v.
This process is repeated k times, so that after the k-th iteration we obtain an input vector
v_k that is recreated from the original input value v_0.
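A rough sketch of k Gibbs sampling steps for a binary RBM is shown below, reusing the row-vector convention of the previous snippet; treating the units as binary and sampling with a sigmoid are assumptions of this sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sampling(v0, W, b_h, b_v, k, rng):
    # Alternate between sampling hidden and visible units, starting from v_0.
    v = v0
    for _ in range(k):
        p_h = sigmoid(v @ W + b_h)                        # p(h | v): predict hidden values
        h = (rng.random(p_h.shape) < p_h).astype(float)   # sample binary hidden states
        p_v = sigmoid(h @ W.T + b_v)                      # p(v | h): predict new input values
        v = (rng.random(p_v.shape) < p_v).astype(float)   # reconstructed input, eventually v_k
    return v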
o Contrastive Divergence
During the contrastive divergence step, the weight matrix gets updated. To compute the
activation probabilities for the hidden values h_0 and h_k, it uses the vectors v_0 and v_k.
The update matrix is calculated as the difference between the outer products of these
probabilities with the input vectors v_0 and v_k:
ΔW = v_0 ⊗ p(h_0 | v_0) - v_k ⊗ p(h_k | v_k)
With the help of this update matrix, the new weights are obtained by moving the old weights
along the update direction, scaled by a learning rate:
W_new = W_old + (learning rate) · ΔW
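In code, the update described above could be sketched as follows; the learning rate value is an illustrative hyperparameter, and the row-vector convention of the earlier snippets is assumed.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contrastive_divergence_update(W, b_h, v0, vk, learning_rate=0.01):
    # Hidden activation probabilities for the original input and its reconstruction.
    p_h0 = sigmoid(v0 @ W + b_h)
    p_hk = sigmoid(vk @ W + b_h)
    # Update matrix: difference of the outer products with v_0 and v_k.
    delta_W = v0.T @ p_h0 - vk.T @ p_hk
    # Move the weights along the update direction.
    return W + learning_rate * delta_W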
Training to Prediction
Step 1: Train the network on the data of all the users.
Step 2: Take the training data of a specific user during inference time.
Step 3: Use this data to obtain the activations of the hidden neurons.
Step 4: Use the hidden neuron values to get the activations of the input neurons.
Step 5: The new values of the input neurons show the rating the user would give.
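A sketch of inference steps 2 to 5 for a single user is given below; W, b_h, and b_v are assumed to come from an already trained RBM, and ratings are treated as values between 0 and 1 for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_ratings(user_ratings, W, b_h, b_v):
    v = user_ratings                 # Step 2: the training data of a specific user
    p_h = sigmoid(v @ W + b_h)       # Step 3: activations of the hidden neurons
    p_v = sigmoid(p_h @ W.T + b_v)   # Step 4: activations of the input neurons
    return p_v                       # Step 5: predicted ratings for the user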