A Modified Adam Algorithm For Deep Neural Network Optimization
https://doi.org/10.1007/s00521-023-08568-z
ORIGINAL ARTICLE
Received: 8 January 2022 / Accepted: 5 April 2023 / Published online: 25 April 2023
© The Author(s) 2023
Abstract
Deep neural networks (DNNs) are widely regarded as the most effective learning tool for dealing with large datasets, and they have been successfully used in thousands of applications in a variety of fields. Based on these large datasets, they are trained to learn the relationships between various variables. The adaptive moment estimation (Adam) algorithm, a highly efficient adaptive optimization algorithm, is widely used as a learning algorithm in various fields for training DNN models. However, its generalization performance still needs improvement, especially when training with large-scale datasets. Therefore, in this paper we propose HN_Adam, a modified version of the Adam algorithm, to improve its accuracy and convergence speed. The HN_Adam algorithm automatically adjusts the step size of the parameter updates over the training epochs. This automatic adjustment is based on the norm value of the parameter update formula, according to the gradient values obtained during the training epochs. Furthermore, a hybrid mechanism is created by combining the standard Adam algorithm and the AMSGrad algorithm. As a result of these changes, the HN_Adam algorithm has good generalization performance like the stochastic gradient descent (SGD) algorithm, and it achieves fast convergence like other adaptive algorithms. To test its performance, the proposed HN_Adam algorithm is used to train a deep convolutional neural network (CNN) model that classifies images using two different standard datasets: MNIST and CIFAR-10. The results are compared to the basic Adam algorithm and the SGD algorithm, in addition to five other recent adaptive SGD algorithms. In most comparisons, the HN_Adam algorithm outperforms the compared algorithms in terms of accuracy and convergence speed. AdaBelief is the most competitive of the compared algorithms. In terms of testing accuracy and convergence speed (represented by the consumed training time), the HN_Adam algorithm outperforms the AdaBelief algorithm by 1.0% and 0.29% for the MNIST dataset, and by 0.93% and 1.68% for the CIFAR-10 dataset, respectively.
Keywords Optimizer · Adaptive moment estimation (Adam) · AMSGrad · RMSprop · Nesterov accelerated Adam · Deep neural networks
1 Introduction
volume of data necessitates the development of deep neural networks to keep up with the growing volume of data. Furthermore, significant progress in DNN training methods has aided in the development of deep learning models that are now used in a wide range of applications. In the vast majority of applications, these models have delivered elegant results. All of these factors propelled DNNs to the forefront of the machine learning field, inspiring many researchers to work on improving their training methods [1, 4].

Deep neural networks can be optimized in a variety of ways, including optimizing the network model's structural design, determining the optimal parameters such as weights and biases of a predefined network structure, pre-processing of the datasets, and choosing the best optimization technique during the learning process. There are currently no established criteria for developing an ideal deep neural architecture [5]. Any optimizer's goal is to minimize an objective function, also known as a loss function, which is the difference between the expected and computed values. The minimization procedure determines the best set of parameters for designing DNNs for classification, prediction, and clustering tasks.

Many researchers [6, 7] present several optimization algorithms for deep neural networks in the literature. The gradient descent approach, which is a first-order differential method used to obtain an array of weights that satisfy the error criteria, is used by the majority of these algorithms. The most widely used optimization algorithms for deep neural networks in the literature are gradient descent techniques such as back-propagation and adaptive moment estimation (Adam) algorithms [8–10].

Training procedures remain relatively simple in comparison to the increasing complexity of deep neural network topologies. The majority of practical optimization approaches for DNNs employ the stochastic gradient descent (SGD) technique. However, as a hyper-parameter, the SGD learning rate is frequently difficult to tune and must be adjusted throughout the training process. To address this issue, several adaptive SGD variants have been developed, including adaptive gradient (AdaGrad), adaptive delta (Adadelta), root mean square propagation (RMSProp), and Adam. Based on the gradient statistics, these SGD variants aim to automatically adapt the learning rate of the parameter updates. Even so, the plain SGD method is still used in training the most recent DNN models, particularly the feed-forward type [11].

The main goal of this paper is to create an optimization method that has good generalization performance like the SGD method while also achieving fast convergence like the adaptive methods. To address the shortcomings of current optimization algorithms, this paper proposes a modified Adam algorithm that does not require any additional parameters. The modified algorithm's main contribution is an increase in both convergence speed and accuracy. A mathematical proof shows the differences between the modified and basic algorithms. Extensive experiments are carried out to demonstrate the proposed algorithm's superiority to state-of-the-art optimization algorithms, trained on two different datasets. The results show that the proposed algorithm outperforms the other algorithms in terms of convergence speed and accuracy.

The sections of the paper are organized as follows. The second section summarizes recent reviews of adaptive optimization techniques for deep neural network optimization. Section 3 describes deep neural network architectures. The convolutional neural network (CNN) is reviewed in Sect. 4. The concept and the mathematical proof of the modified Adam algorithm are discussed in detail in Sect. 5. The experiments and results discussion are given in Sect. 6. Section 7 concludes the paper and gives the future work.

2 Literature review of deep neural network optimization

DNNs have been a hot topic in the machine learning community in recent years. The optimization methods used to train DNNs can be divided into two types: first-order optimization methods and second-order optimization methods [12]. The first-order derivative values of the objective function are used to direct the search process towards the steepest decreasing direction in first-order methods. It should be noted that the gradient denotes the first-order derivative of a multivariate objective function [12]. The gradient descent (GD) optimization algorithm is a popular first-order optimization algorithm that uses the objective function's negative gradient to find its minimum.

Since tuning the SGD algorithm's learning rate as a hyper-parameter is difficult, it is adjusted throughout the training process [14]. The adaptive variants of SGD attempt to automatically adapt the learning rate for parameter updates based on gradient statistics. Although these adaptive variants simplify learning rate settings and increase convergence speed, in some applications their overall performance is significantly worse than the basic SGD algorithm. As a result, the SGD (possibly with momentum) algorithm is still used in training cutting-edge deep neural models such as feed-forward DNNs [15, 16]. Furthermore, recent studies have shown that the ability of DNN models to fit noisy data is dependent on the optimization methods used [17, 18].

The RMSProp algorithm [19] and the AdaGrad algorithm [20] are two optimization methods that are strongly attributable to
Adam. These connections will be demonstrated later. Other stochastic optimization methods discussed include vSGD [21], AdaDelta [22], and the natural Newton method [23]. All of these optimizers use the first derivative (gradient) of the loss function to estimate the curvature of the loss surface and determine the optimal learning rate step sizes. As in the natural gradient descent (NGD) method, some variants of Adam use a preconditioner (like AdaGrad) that adjusts to the geometry of the data based on an approximation of the diagonal of the Fisher information matrix [24]. The adaptation mechanism in Adam's preconditioner is more conventional than vanilla NGD [25]. Other variants of Adam have also been proposed, such as NosAdam [26], Sadam [27], and Adax [28].

Throughout this paper, we attempt to improve the learning rate update step during the training process for first-order optimization methods. Other approaches, such as Lookahead [29], update the weights slowly and quickly separately; Lookahead is regarded as a wrapper that can be combined with other optimizers.

When compared to the SGD algorithm, adaptive gradient methods like Adam typically converge quickly in the early training phases, but they still have poor generalization performance [31, 32]. Recent advances have attempted to combine the advantages of adaptive methods and the SGD method, such as switching from Adam to SGD with a hard schedule, as in SWATS [33], or with a smooth transition, as in AdaBound [34]. Other Adam modifications have also been proposed. The AMSGrad algorithm [35] solves Adam's convergence analysis problem. Dokkyun et al. [36] solve the problem of being trapped in a local minimum for non-convex cost functions; the Adam algorithm's parameter update formula is modified to include the cost function. The evolved gradient direction optimizer (EVGO) is a novel gradient-based algorithm introduced by the authors of [37]. It solves the vanishing gradient problem by updating the weights of the DNNs using the first-order gradient and a proposed hyperplane. The authors of [38] create YOGI, an adaptive optimization approach that takes into account the training dataset's mini-batch size. The MSVAG algorithm [39] segregates Adam's sign update and magnitude scaling, the RAdam algorithm [40] corrects learning rate variance, the Fromage algorithm [41] controls function space distance, and the AdamW algorithm [42] decouples weight decay from gradient descent.

Although these modifications outperform Adam in terms of accuracy, they perform worse in terms of generalization on large-scale datasets like ImageNet [43]. Furthermore, many optimizers are empirically unstable when training generative adversarial networks (GANs) compared to Adam [44].

Aside from the first-order methods, there are second-order methods that use the objective function's second-order derivative values (also known as the Hessian matrix) to minimize it. They provide additional information about the objective function's curvature surface, which aids in estimating a better step size for the learning rate. Newton's method, the quasi-Newton method, the Gauss–Newton method [45, 46], and conjugate gradient [47] are some common examples of second-order optimization methods. To train deep auto-encoders without using pre-training, Hessian-free optimization (HFO) [48] is used. The sum-of-functions optimizer (SFO) [49] is a quasi-Newton method that employs mini-batches, which are small subsets of the dataset. Its performance is determined by the number of mini-batches generated from the dataset. This method is frequently impractical on memory-constrained systems such as GPUs.

Second-order optimization methods are not widely used because they require more computations to obtain second-order derivatives [49]. Table 1 summarizes the survey results for some selected optimization methods for improving the performance of deep neural networks that have recently appeared in the literature.

Recently, other new versions of Adam have arisen. A new version of Adam based on combining adaptive coefficients and composite gradients using randomized block coordinate descent is proposed in [50]. It enhances the performance of the Adam algorithm to a certain extent in terms of accuracy and convergence speed, but the effect of the second-order momentum and the use of different learning rates on the performance of the original algorithm was not considered. In [51], an Adam-style algorithm, denoted Amos, is introduced. It uses adaptive learning-rate decay and weight decay to improve the performance of the original algorithm, and it utilizes model-specific information to establish the initial learning rate and decaying schedules. In [52], a faster version of the Adam algorithm named Adan is suggested to accelerate the training process of deep neural networks effectively. It develops a new Nesterov momentum estimation method to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms like Adam. This method increases the convergence speed of the Adam algorithm.

3 Deep neural network (DNN) architectures

A deep neural network (DNN) is a type of artificial neural network (ANN) with multiple hidden layers between the input and output layers [1]. DNN structures vary, but they all share the same basic building blocks, such as neurons, synapses, weights, biases, and activation functions [1]. They can be trained to perform functions similar to human
brains using supervised or unsupervised learning algorithms [53].

3.1 Activation functions

Because of the matrix operations in artificial neural networks, the network and its components are linear. The established linear structure is transformed into a nonlinear one using activation functions. Choosing appropriate activation functions makes it simple to increase the network's computation speed. The common activation functions used in deep neural networks are the sigmoid, hyperbolic tangent (Tanh), rectified linear unit (ReLU) and leaky rectified linear unit (Leaky ReLU) [53, 54].

3.2 Training of deep neural networks

Deep neural network training (or learning) is the process of determining the weights of the neuron connections to achieve the required relationships between inputs and outputs with a certain precision. There are two types of learning methods used to train neural networks [53]: supervised learning and unsupervised learning. In most practical machine learning applications, where the network model has a training dataset of inputs and outputs, supervised learning is used. This type of learning is used to provide an approximation of a mapping function to represent the relationship between inputs and outputs. Classification and regression are two common problems addressed by supervised learning. Unsupervised learning is used when the network model only has input data and no corresponding outputs, such as in clustering problems. The goal of this type of learning is to learn more about the data by modeling the underlying structure or distribution [13, 54].

The basic learning algorithm used to train DNNs for supervised learning is back-propagation, which has two operating phases, forward propagation and backward propagation [53, 60]. It is based on the gradient descent (GD) algorithm, which calculates gradients across the entire dataset. This results in a large number of iterations and increases the risk of becoming trapped in local
optimums with early convergence. Due to these issues, the mini-batch gradient descent method was proposed. In this method, the training dataset is divided into fixed-size batches for use during the training process. The total error is computed, and the weights are updated for each sub-batch. When the mini-batch size is set to 1, the stochastic gradient descent algorithm is obtained. In this case, the error is calculated for one sample at a time, and the weights are also updated, resulting in faster convergence through direct data vectoring. The error value is then propagated back through the network, and the weights are updated using GD in the opposite direction of the curvature. The network parameters (θ) to be optimized are updated according to the following formula [10]:

θ_{t+1} = θ_t − η·∇_θ J(θ_t)    (1)

where η is the learning rate and ∇_θ J(θ_t) is the gradient of the loss function J(θ_t) with respect to θ_t.

The updated weight values are affected by the value of the learning rate. The method converges to a global minimum for convex surfaces and to a local minimum for non-convex surfaces. Batch gradient descent is also known as vanilla gradient descent. With extremely large training datasets it performs intensive calculations that take up a lot of memory space, making it difficult to use. Furthermore, it provides numerous redundant updates that we do not require. As a result, several methods based on stochastic gradient descent have been developed for use in practical applications. Because the network only processes one training sample at a time, stochastic gradient descent is easy to fit in memory and fast in computation. This suits large datasets, as it updates the parameters more frequently and converges faster. Some of the improved algorithms that are based on stochastic gradient descent are illustrated in the following section.

1) Stochastic gradient descent (SGD)

The stochastic gradient descent (SGD) algorithm calculates the loss function for a single training sample at a time rather than considering all training data samples. Memory deficiency problems can be avoided in this manner. SGD was created to address the shortcomings of the batch gradient descent algorithm. The problem with using SGD is determining the proper learning rate value to avoid oscillations and reach the global optimum. For parameter updating, it employs the following equation [6, 10]:

θ_{t+1} = θ_t − η·∇_θ J(θ_t; x^(i), y^(i))    (2)

where η is the learning rate and ∇_θ J(θ_t) is the gradient of the loss function J(θ_t) with respect to θ_t. Also, x^(i) and y^(i) represent the training data in the form of input–output pairs. If the loss function curve has saddle points where one dimension slopes up and the other dimension slopes down, the SGD algorithm does not perform well [6].

2) Gradient descent with momentum

The learning steps in gradient descent methods should move faster towards the best result. When the learning steps are very large, the global optimum cannot always be reached, and these large steps can have a direct impact on the time required to achieve the global optimum. To address these issues, the momentum gradient descent method has been proposed. It limits the speed of the next learning step by using the average speed of the previous learning steps. In this method, an exponentially decaying dynamic average of the past gradients (m_{t−1}) is kept, and the update direction is determined by taking this dynamic average into account. In this way, the learning steps move faster towards the best result with less deviation [6, 10]. We can express the updating rules as [10]:

m_t = γ·m_{t−1} + η·∇_θ J(θ_t)    (3)
θ_{t+1} = θ_t − m_t    (4)

where γ is the momentum parameter; it is usually set to 0.9 or a similar value.

3) Nesterov accelerated gradient (NAG)

Nesterov accelerated gradient (NAG) is a method to give the momentum term this kind of prediction. The NAG algorithm determines the first step in the direction of the average gradient for the current position before measuring the new position. The momentum term γ·m_{t−1} will be used to move the parameters θ_t. Computing the term θ_t − γ·m_{t−1} gives an approximation of the next position of the parameters, and this is considered a rough idea of where our parameters are going to be [6, 10]. The parameters are updated based on the following two equations [10]:

m_t = γ·m_{t−1} + η·∇_θ J(θ_t − γ·m_{t−1})    (5)
θ_{t+1} = θ_t − m_t    (6)

4) Adaptive gradient algorithm (AdaGrad)

The adaptive gradient (AdaGrad) algorithm divides the learning rate component by the square root of v_t, which is the sum of the current and past squared gradients up to time instant t. The gradient component, like in SGD, remains unchanged. AdaGrad makes different updates for each parameter by using different learning rates for each step. The most significant advantage of using AdaGrad is that the learning rate does not have to be manually adjusted, as in other adaptive learning systems. The update equations of AdaGrad can be expressed as [10]:
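The AdaGrad update equations themselves fall on a page that is not reproduced in this excerpt. As a minimal illustration of the update rules discussed in this subsection, the following NumPy sketch implements Eqs. (1)–(6) and the standard textbook AdaGrad rule that the paragraph above describes; the hyper-parameter values are only examples, not the paper's settings.

```python
import numpy as np

def sgd_step(theta, grad, eta=0.01):
    # Eqs. (1)-(2): plain (stochastic) gradient descent
    return theta - eta * grad

def momentum_step(theta, m, grad, eta=0.01, gamma=0.9):
    # Eqs. (3)-(4): gradient descent with momentum
    m = gamma * m + eta * grad
    return theta - m, m

def nag_step(theta, m, grad_fn, eta=0.01, gamma=0.9):
    # Eqs. (5)-(6): Nesterov accelerated gradient; the gradient is evaluated
    # at the look-ahead point theta - gamma * m
    m = gamma * m + eta * grad_fn(theta - gamma * m)
    return theta - m, m

def adagrad_step(theta, v, grad, eta=0.01, eps=1e-8):
    # Textbook AdaGrad: accumulate squared gradients and scale the learning
    # rate per parameter by the square root of that accumulator
    v = v + grad ** 2
    return theta - eta * grad / (np.sqrt(v) + eps), v

# Tiny usage example on f(theta) = theta^2, whose gradient is 2 * theta
theta, m = np.array([1.0]), np.zeros(1)
for _ in range(100):
    theta, m = nag_step(theta, m, lambda th: 2.0 * th)
print(theta)  # close to the minimum at 0
```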
made up of a set of convolutional filters known as kernels. The input image is convolved with these filters to map an output feature, which is expressed as N-dimensional matrices. The kernel is a grid of discrete values representing the kernel weights. The convolutional operation is carried out in the following order. The CNN input format is first described. The vector format is the traditional neural network's input, whereas the multi-channeled image is the CNN's input. For instance, single-channeled is the format of the gray-scale image, while the RGB image format is three-channeled. In the CNN model, a convolutional layer is often incorporated with the ReLU activation function as one layer, and it is then followed by a pooling layer [56].

(B) Pooling layer

The pooling layer's primary purpose is to sub-sample the feature maps. These maps are created by using convolutional operations. The pooling layer is available in several variations, but its general purpose is to replace the output of the convolutional layer with a summary statistic of the neighboring outputs. There are several types of pooling methods that can be used in different pooling layers. Tree pooling, gated pooling, average pooling, min pooling, max pooling, global average pooling (GAP), and global max pooling are examples of these methods. The most common and widely used pooling methods are the max, min, and GAP pooling.

(C) Fully connected layer

This layer is typically found at the end of the CNN architecture. Each neuron in this layer is connected to all neurons in the previous layer, which is known as the fully connected (FC) approach. It serves as the CNN classifier. It adheres to the fundamental layers of the conventional multiple-layer perceptron. Because the CNN is a feed-forward ANN, the input to the FC layer comes from the previous pooling or convolutional layer. This input takes the form of a vector, which is generated after flattening the feature maps. As shown in Fig. 1, the FC layer output represents the final CNN output.

There are many reasons to use a CNN instead of a standard multi-layer perceptron network for classifying images [55]. The main reason is the weight sharing feature, which reduces the number of trainable network parameters, enhances the generalization performance, and prevents overfitting of the network model. Moreover, concurrently learning the feature extraction layers and the classification layer causes the model output to be both highly organized and highly reliant on the extracted features.

5 The proposed modification of the Adam algorithm

In this section, we present our proposed modified algorithm, which is based on the standard Adam optimizer. Adam is one of the best optimization algorithms for training DNNs, and it is gaining popularity [53]. As a result of some issues that arose when it was used in some applications, such as the generalization performance problem and the convergence problem, several trials were conducted to improve its performance, as in the case of the SGD optimizer with momentum. Algorithm 1 describes the pseudo-code for the basic Adam algorithm.

Throughout this paper, we attempt to tackle the convergence problem associated with the standard Adam
algorithm in order to achieve a high convergence speed. The proposed modified algorithm, denoted HN_Adam, is based on the adaptive norm technique and a hybrid technique between the original Adam algorithm and the AMSGrad algorithm, with the letters ''H'' and ''N'' referring to the hybrid mechanism and the adaptive norm, respectively. To improve the generalization performance of the basic Adam algorithm, we use a hybrid mechanism with some modifications between the Adam algorithm and the AMSGrad algorithm.

The main challenge that our proposed algorithm attempts to overcome is having good generalization performance like the SGD while also achieving quick convergence like the adaptive methods. The basic idea behind our modification is to automatically adjust the learning rate step size based on the adaptive norm for each current and past gradient, where the norm function for any two points is considered the Euclidean distance between them. The adaptive norm means that the norm value is changed dynamically based on the gradient values obtained in each epoch. Furthermore, a hybrid mechanism between the original Adam algorithm and the AMSGrad algorithm has also been made to enhance the generalization performance and achieve a high speed of convergence for most DNN architectures. We validate the proposed algorithm in extensive experiments of image classification through two different standard datasets.

At first, the modified algorithm, HN_Adam, trains the network model using the Adam algorithm but with the adaptive (or dynamic) norm function to increase the step size of the learning rate and avoid dropping into a local minimum. The switching between the two algorithms is then decided according to the absolute value of the gradients (|g_t|) and the exponential moving average of the past gradients m_{t−1}.

The pseudo-code of the modified algorithm, HN_Adam, is described in Algorithm 2. The modifications made compared to the original Adam algorithm are shown in bold. As the HN_Adam algorithm uses a dynamic norm value, the absolute value of the gradient must be taken before the power is calculated. This is done to ensure that only positive values are added when possibly odd values are used for the norm.

The threshold value of the norm (K_{t0}) is randomly chosen in the range from 2 to 4, and then the norm value K(t) is adaptively computed depending on the value of the absolute gradient |g_t| and the exponential moving average of the past gradients m_{t−1}, as described in the following equation:

K(t) = K_{t0} − m_{t−1} / m_max    (20)

where m_max is the maximum value between |g_t| and m_{t−1}.
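Algorithm 2 itself (the full HN_Adam pseudo-code) is not reproduced in this excerpt. As a rough, illustrative sketch of the mechanism described above — an Adam-style update whose second-moment accumulator is built from |g_t|^K with the adaptive norm K(t) of Eq. (20), falling back to an AMSGrad-style maximum accumulator when K(t) < 2 — one step might look like the following. The structure, the per-step scalar reduction of K(t), and the bias-correction choices are assumptions based on the description, not the authors' exact implementation.

```python
import numpy as np

def hn_adam_like_step(theta, grad, m, v, v_max, t,
                      eta=0.001, beta1=0.9, beta2=0.99, eps=1e-8, k0=2.5):
    """One illustrative HN_Adam-style update (a sketch, not the published Algorithm 2).

    k0 plays the role of K_t0, drawn once from [2, 4] in the paper; fixed here.
    """
    abs_g = np.abs(grad)
    # Eq. (20): K(t) = K_t0 - m_{t-1} / max(|g_t|, m_{t-1}); reduced to a scalar here.
    m_prev_mag = np.abs(m).max()
    m_max = max(abs_g.max(), m_prev_mag) + 1e-12
    k = k0 - m_prev_mag / m_max            # stays within roughly [k0 - 1, k0] ⊆ [1, 4]

    m = beta1 * m + (1.0 - beta1) * grad   # first moment, as in Adam
    m_hat = m / (1.0 - beta1 ** t)

    if k >= 2.0:
        # Adam-like branch with the squared gradient generalized to |g|^K
        v = beta2 * v + (1.0 - beta2) * abs_g ** k
        v_hat = v / (1.0 - beta2 ** t)
        denom = v_hat ** (1.0 / k) + eps
    else:
        # AMSGrad-like branch (K(t) < 2): keep the running maximum of the second moment
        # (the two accumulators share one variable here purely for brevity)
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        v_max = np.maximum(v_max, v)
        denom = np.sqrt(v_max) + eps

    theta = theta - eta * m_hat / denom
    return theta, m, v, v_max
```

How exactly the published algorithm enters and leaves the AMSGrad branch, and whether bias correction is applied in the K-norm branch, is not visible in this excerpt; the sketch simply follows Eq. (20) and the K(t) < 2 switching condition stated in the next subsection.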
Fig. 2 Example of plotting the Adam search on a contour plot with different norm values for the loss function f(x, y) = x² + y²
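Figure 2 itself is not reproduced here. The kind of experiment it describes can be sketched as follows: run an Adam-style update on f(x, y) = x² + y² while generalizing the squared-gradient accumulator to a K-norm, and count how many updates are needed to reach a small loss for each norm value. The learning rate, starting point and tolerance below are assumptions, not the authors' settings; the printed counts give a quick way to check the trend the text reports for Fig. 2.

```python
import numpy as np

def steps_to_converge(k, eta=0.1, beta1=0.9, beta2=0.99, eps=1e-8,
                      tol=1e-3, max_steps=10_000):
    """Adam-style search on f(x, y) = x^2 + y^2 with the second moment built from |g|^k."""
    theta = np.array([2.0, -1.5])          # arbitrary start point
    m = np.zeros(2)
    v = np.zeros(2)
    for t in range(1, max_steps + 1):
        g = 2.0 * theta                     # gradient of x^2 + y^2
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * np.abs(g) ** k
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (v_hat ** (1.0 / k) + eps)
        if np.sum(theta ** 2) < tol:        # loss below tolerance
            return t
    return max_steps

for k in range(1, 6):
    print(f"norm value K = {k}: {steps_to_converge(k)} updates")
```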
better results in terms of accuracy and convergence speed. HN_Adam is designed to adjust the adaptation of the norm value for the standard Adam algorithm by changing its power value at every update. This is done based on the information of the previous gradient updates.

It is highly recommended to keep the norm value in the range between 1 and 4, since smaller values generally lead to bad results and higher values hardly lead to improvements while more expensive computations are exerted; see Fig. 2.

In Eq. (20), the ratio m_{t−1}/m_max is less than or equal to 1, which implies that the norm value will be in the range from 1 to 4. The sequence is switched to the AMSGrad algorithm under the condition that K(t) < 2. This means that HN_Adam uses the modified Adam algorithm, with more exploration ability of the search, as long as the norm value is within the range from 2 to 4. Otherwise, it uses the AMSGrad algorithm, with more exploitation ability.

Figure 2 illustrates the effect of increasing the norm value from 1 to 5 for the standard Adam algorithm using the loss function f(x, y) = x² + y². It shows that increasing the norm value leads to a decrease in the number of updates, the number of epochs and the learning period.

5.1 Comparison to Adam

In this part, the comparison between the standard Adam algorithm and the modified algorithm, HN_Adam, will be highlighted, and the differences between them will be explained. Also, the enhancement of the HN_Adam algorithm in terms of accuracy and convergence speed will be
discussed. As shown in Algorithm (1) and Algorithm (2), the update direction of the original Adam algorithm is m̂_t/√v̂_t, where m̂_t is the bias-corrected exponential moving average (EMA) of the gradient (g_t) and v̂_t is the bias-corrected exponential moving average (EMA) of the squared gradient (g_t²). The update direction of the HN_Adam algorithm is m_t/v_t^(1/K), where v_t is the EMA of g_t^K and K is the adaptive norm value.

We can observe that HN_Adam takes a small step size of the learning rate when the absolute gradient |g_t| is close to m_max, like Adam, and a large step size when the gradient significantly deviates from m_max.

Fig. 3 Curvature of the loss function for an ideal optimizer [57]

Now, we will demonstrate that HN_Adam can use the curvature information of the loss function to choose a proper step size of the learning rate in order to enhance the training process. For explanation, let us consider the loss function shown in Fig. 3 [57]. We use three regions on the graph to explain the behavior of the HN_Adam algorithm with respect to the amount of parameter updates while searching the loss function to find the global minimum. These regions are used the same as in [57]. The learning rate can be expressed in terms of the step size, which is responsible for the amount of change in the parameter updates. So, we will clarify that the HN_Adam algorithm can choose an appropriate value of the step size and match the ideal behavior to make a suitable amount of change for the parameter updates.

Figure 3 shows how an ideal optimizer considers the curvature information to determine the proper step size for the three tested regions. We use it as a reference in evaluating the HN_Adam algorithm. Furthermore, we make a comparison between the HN_Adam algorithm and two other algorithms, SGD and Adam. The step size formulas for SGD, Adam, and HN_Adam can be written as:

Δθ_t^SGD = η·m_t    (21)
Δθ_t^Adam = η·m̂_t/√v̂_t    (22)
Δθ_t^HN_Adam = η·m_t/v_t^(1/K)    (23)

where |Δθ_t| is the step size for the parameter update at instant t.

The first, second, and third regions are denoted as 1, 2, and 3, respectively, in Fig. 3. Now, for these three regions, we will compare the step sizes of HN_Adam, SGD, and Adam to the ideal optimizer's step size. The gradient is close to 0 in the first region because the loss function is flat. To increase its learning rate, the ideal optimizer should take large steps. The SGD algorithm, unlike the ideal optimizer, will take small steps because its step size is proportional to the EMA of the gradient m_t, while both the Adam algorithm and HN_Adam will make large steps like the ideal optimizer because v̂_t is a small value and the norm value K is a large value.

In the second region, both |g_t| and m_t are large, since the loss function in this region oscillates in a steep and narrow valley. To reach the global optimum, the ideal optimizer should decrease its learning rate and make small steps. The SGD algorithm, unlike the ideal optimizer, will take large steps because its learning rate is proportional to m_t, while both Adam and HN_Adam make small steps like the ideal optimizer because v̂_t is a large value and the norm value K is a small value.

Finally, in the third region, where the loss function has a large |g_t| with a small curvature, the ideal optimizer should increase its learning rate and apply a large step size. Unlike the ideal optimizer, the Adam algorithm will make a small step size because the denominator √v̂_t in its update formula is large. Although |g_t| and v_t are large, the norm value K(t) is also large; this happens because the ratio of the exponential moving average of the past gradients to the current absolute gradient is small, so HN_Adam will use a large step size, as the ideal optimizer does. The SGD algorithm will also take a large step size.

We summarize these three cases in Table 2, where S and L refer to small and large values, respectively, and |Δθ_t|_ideal is the step size for the parameter update of the ideal optimizer. The HN_Adam algorithm matches the behavior of the ideal optimizer over the three tested regions.

5.2 Mathematical illustration of the learning rate step size for HN_Adam

The HN_Adam algorithm's update term differs slightly from the standard Adam algorithm. It is based on a
Fig. 4 Architecture of the deep CNN model using the MNIST dataset
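Figure 4 itself is not reproduced here. Based on the architecture description given in Sect. 6 (two 3×3 convolutional layers with 32 kernels, 2×2 max-pooling, two 3×3 convolutional layers with 64 kernels, 2×2 max-pooling, a flatten step producing 1024 features, dense layers of 512, 128, 256 and 32 units with ReLU, dropout of 0.1, and a 10-unit softmax output), a Keras sketch of such a model could look like the following. The 'valid' padding and the placement of the ReLU activations are assumptions; with these choices the layer sizes and the total of 697,034 trainable parameters match the figures reported later.

```python
# A rough Keras sketch of the MNIST model described in Sect. 6 (and depicted in Fig. 4).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mnist_cnn():
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                      # 4 x 4 x 64 = 1024 features
        layers.Dense(512, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(10, activation="softmax"),
    ])
    return model

model = build_mnist_cnn()
# Built-in Adam shown only as a stand-in; the paper swaps in each compared optimizer
# (HN_Adam, AdaBelief, SGD, ...) with the hyper-parameters listed in Sect. 6.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                       beta_2=0.99, epsilon=1e-8),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()   # total parameters: 697,034
```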
between the parameters of the two algorithms, we use the subscript x for HN_Adam's hyper-parameters and the subscript y for Adam's hyper-parameters. The same symbols are used for the learning rate (η) and the smoothing term (ε). Taking all of the above into account, Eq. 31 can be simplified to

Σ_{i=0}^{t} g_i(θ_i)·β_{1,x}^{t−i} / [ (Σ_{i=0}^{t} |g_i(θ_i)|^K·β_{2,x}^{t−i})^{1/K} + ε ]  ≟  Σ_{i=0}^{t} g_i(θ_i)·β_{1,y}^{t−i} / [ (Σ_{i=0}^{t} g_i(θ_i)²·β_{2,y}^{t−i})^{1/2} + ε ]    (32)

If we assume that β_{1,x} = β_{1,y}, the condition that makes the above examined equality true is:

(Σ_{i=0}^{t} |g_i(θ_i)|^K·β_{2,x}^{t−i})^{1/K} = (Σ_{i=0}^{t} g_i(θ_i)²·β_{2,y}^{t−i})^{1/2}    (33)

By inspection, K = 2 makes the two sides of Eq. 33 the same if β_{2,x} = β_{2,y}. So we will choose a different norm value K ≠ 2 with t > 0. For simplicity, let us try K = 1 and t = 1; Eq. 33 becomes

Σ_{i=0}^{1} |g_i(θ_i)|·β_{2,x}^{1−i} = (Σ_{i=0}^{1} g_i(θ_i)²·β_{2,y}^{1−i})^{1/2}    (34)

After expanding the summation, Eq. (34) becomes

|g_0(θ_0)|·β_{2,x} + |g_1(θ_1)| = ( g_0(θ_0)²·β_{2,y} + g_1(θ_1)² )^{1/2}    (35)

By squaring both sides,

|g_0(θ_0)|²·β_{2,x}² + 2·|g_0(θ_0)|·β_{2,x}·|g_1(θ_1)| + |g_1(θ_1)|² = g_0(θ_0)²·β_{2,y} + g_1(θ_1)²    (36)

By omitting |g_1(θ_1)|² from both sides and dividing both sides by g_0(θ_0)², we obtain

β_{2,x}² + 2·β_{2,x}·|g_1(θ_1)| / |g_0(θ_0)| = β_{2,y}    (37)

This means that, in order for the Adam algorithm to behave like the HN_Adam algorithm, its hyper-parameter β_2 would need to be modified to depend on the current and past gradients, whereas the HN_Adam algorithm keeps its own hyper-parameter β_2 fixed. This ensures that the modified algorithm, HN_Adam, depends on the gradients obtained in each epoch and makes use of the loss function's curvature information.

6 Experiments and results

The modified algorithm, HN_Adam, is tested by using it to train a deep convolutional neural network on two different datasets, CIFAR-10 [30] and MNIST [13]. Each of these datasets contains ten classes. The experiments are carried out using the Python programming language together with two open-source libraries, TensorFlow and Keras. All experiments and results are obtained using the same hardware device, a digital computer equipped with a CPU Core i5-5300U (2.30 GHz) and 8.00 GB of RAM.

The HN_Adam algorithm is compared to the basic Adam algorithm and the SGD algorithm, as well as five other adaptive SGD algorithms: AdaBelief [30], Adam, RMSprop, AMSGrad, and Adagrad. We use the default parameter settings, where β1 = 0.9, β2 = 0.99, ε = 10⁻⁸, and η = 0.001. For all compared algorithms, the training, validation, and testing datasets are batched with a size of 128. The experimental findings are divided into two sections, one for each dataset.

a) The first experiment: training a deep CNN model using the MNIST dataset

1) The MNIST dataset

The MNIST dataset [13] contains 60,000 handwritten digit images. It is divided into three sets: the first set of 40,000 images is the training set, the second set of 10,000 images is the validation set, and the third set of 10,000 images is the testing set. The digits have been centered in a fixed-size (28 × 28 pixel) image with values ranging from 0 to 255. All images are converted to the float32 data type with size-normalized values in the range from 0 to 1.

2) Network architecture

The convolutional neural network built in the first experiment is shown in Fig. 4. It begins with two 3 × 3 convolutional layers of 32 kernels each, followed by a max-pooling layer with a 2 × 2 window. Following that, a ReLU activation function is used. Then, two more convolutional layers with 64 kernels of size 3 × 3 are added, followed by a max-pooling layer with a 2 × 2 window. A ReLU activation function is also used. Following that, the max-pooling layer's 2-dimensional output is converted to a 1-dimensional vector with a size of 1024 × 1 using a flatten module from the TensorFlow package. The converted vector is then passed through four hidden layers. These hidden layers have 512, 128, 256, and 32 nodes, respectively. The ReLU activation function is applied after each hidden layer. Then a dropout layer is included with a default probability value of 0.1. Finally, a hidden layer of 10 nodes is used and the Softmax activation function is applied to produce the output from the output layer.

3) Experimental setup

The MNIST dataset is used to train a deep CNN model with a total of 697,034 parameters. The model is trained using the optimization algorithms HN_Adam, AdaBelief, Adam, AMSGrad, SGD, RMSProp, and AdaGrad individually. These algorithms are used as learning algorithms to train the CNN model. The performance of each
compared algorithm is measured in terms of the minimum training loss function and the testing accuracy.

4) Results and discussions

The response curves of the compared algorithms during the training process are shown in Figs. 5 and 6. Figure 5 shows the accuracy curves of the compared algorithms during the training process of the CNN model. We focus on the basic Adam algorithm and the AdaBelief algorithm, as they are the most competitive of the compared algorithms. Figure 6 shows the loss function minimization curves of the compared algorithms through the training process of the CNN model.

Fig. 5 Training accuracy of the CNN model for the compared algorithms, case of using MNIST dataset

Fig. 6 Loss function minimization during the training process for HN_Adam, Adam and AdaBelief, case of using MNIST dataset

To demonstrate the differences between these response curves, the response characteristics in terms of the minimum loss function during the training process and the accuracy on the test dataset are calculated and listed in Table 3. For simplicity, the minimum training loss function and testing accuracy are determined after 200 epochs for 5 independent runs with randomly shuffled training data. The best achieved value for each response characteristic is highlighted in bold.

Table 3 Accuracy results, case of using MNIST dataset

Algorithm        Minimum training loss    Test accuracy
HN_Adam          1.471718                 98.59%
AdaBelief [30]   1.48115                  97.6%
Adam [8]         1.483735                 97.04%
AMSGrad [35]     1.484545                 97.09%
SGD [33]         2.296593                 96.84%
RMSprop [19]     1.476105                 97.09%
Adagrad [20]     2.279421                 96.97%

Bold indicates the best achieved value for each response characteristic

Table 4 The consumed training time for the compared algorithms, case of using MNIST dataset

Algorithm        Training time
HN_Adam          1048 s
AdaBelief [30]   1051 s
Adam [8]         1067 s
AMSGrad [35]     1051 s
SGD [33]         1320 s
RMSprop [19]     1075 s
Adagrad [20]     1050 s

As shown in Figs. 5 and 6, HN_Adam achieves fast convergence like the adaptive methods with better accuracy. The results illustrated in Table 3 confirm this, as it
outperforms the other compared algorithms and achieves values of 1.471718 and 98.59% for the minimum training loss function and the testing accuracy, respectively. Table 4 also lists the training time in seconds consumed by the compared algorithms during the training process, demonstrating the increase in convergence speed. The learning algorithms use these values of the training time to train the CNN model and achieve the accuracy results reported in Table 3, where 10 epochs are considered for simplicity.

With a minimum training time of 1048 s, the HN_Adam algorithm clearly outperforms the other optimizers and achieves a high speed of convergence.

b) The second experiment: training a deep CNN model using the CIFAR-10 dataset

Like the first experiment, the second one is conducted on another convolutional neural network, with slight differences in architecture from the previous model and using a different type of input data. The CIFAR-10 dataset [30] is used to train the CNN model. It consists of 60,000 color images divided into 10 classes, with 6000 images in each. The dataset is split into three sets: the training set of 40,000 images, the validation set of 10,000 images and the testing set of 10,000 images. The images have been centered in a fixed-size image (32 × 32 pixels) with values ranging from 0 to 255. All image values are normalized on a scale of 0 to 1.

(A) Network architecture

In this experiment, the CNN model is constructed as shown in Fig. 7. It starts with two convolutional layers of 32 kernels of size 3 × 3, followed by a max-pooling layer with a 2 × 2 window. Following that, a ReLU activation function is used. After that, two more convolutional layers with 64 kernels of size 3 × 3 are added, followed by a max-pooling layer with a 2 × 2 window. A ReLU activation function is also used. The max-pooling layer's 2-dimensional output is then converted to a 1-dimensional vector with a size of 1600 × 1 using a flatten module from the TensorFlow package. The converted vector is then passed through four hidden layers. These hidden layers have 512, 128, 32, and 10 nodes, respectively. Following each of these hidden layers, the ReLU activation function is used. Finally, the output layer is generated using a hidden layer of ten nodes and the Softmax activation function.

Fig. 7 Architecture of the deep CNN model using the CIFAR-10 dataset

(B) Experimental setup

The CIFAR-10 dataset is used to train a deep CNN model with a total of 955,512 parameters. The model is trained using the optimization algorithms HN_Adam, AdaBelief, Adam, AMSGrad, SGD, RMSProp, and AdaGrad individually. These algorithms are used as learning algorithms to train the CNN model. The performance of each compared algorithm is measured in terms of the minimum training loss function and the testing accuracy.

(C) Results and discussions

The response curves of the compared algorithms during the training process are shown in Figs. 8 and 9. Figure 8 shows the accuracy curves of the compared algorithms during the training process of the CNN model. We focus on the basic Adam algorithm and the AdaBelief algorithm, as they are the most competitive of the compared algorithms. Figure 9 shows the loss function minimization curves of the compared algorithms through the training process of the CNN model.

To illustrate the differences between these response curves, the response characteristics in terms of the minimum training loss function and the testing accuracy are calculated and listed in Tables 5 and 6.
Fig. 8 Training accuracy of the CNN model for the compared algorithms, case of using CIFAR-10 dataset

Fig. 9 Loss function minimization during the training process for HN_Adam, Adam and AdaBelief, case of using CIFAR-10 dataset

Table 5 Accuracy results, case of using CIFAR-10 dataset

Algorithm        Minimum training loss    Test accuracy
HN_Adam          0.0188                   97.51%
AdaBelief [30]   0.0101                   96.60%
Adam [8]         0.0292                   96.0431%
AMSGrad [35]     0.0281                   96.096%
SGD [33]         0.0318                   96.042%
RMSprop [19]     0.0577                   96.091%
Adagrad [20]     0.387                    96.97%

Bold indicates the best achieved value for each response characteristic

Table 6 The consumed training time for the compared algorithms, case of using CIFAR-10 dataset

Algorithm        Training time
HN_Adam          2737 s
AdaBelief [30]   2784 s
Adam [8]         2767 s
AMSGrad [35]     2822 s
SGD [33]         2788 s
RMSprop [19]     2851 s
Adagrad [20]     2780 s

Bold indicates the best achieved value for each response characteristic
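As a quick cross-check, the improvement figures quoted in the abstract relative to AdaBelief (1.0% and 0.29% for MNIST, 0.93% and 1.68% for CIFAR-10) appear to follow from Tables 3–6 as relative differences; whether each difference is expressed relative to AdaBelief's or HN_Adam's value is not stated, and both conventions give figures close to those quoted. The short sketch below reproduces the arithmetic with one of these conventions assumed.

```python
# Relative improvement of HN_Adam over AdaBelief, computed from Tables 3-6.
def rel_improvement(hn_adam, adabelief):
    return 100.0 * abs(hn_adam - adabelief) / adabelief

print(round(rel_improvement(98.59, 97.6), 2), "% MNIST test accuracy")      # ~1.01 (abstract: 1.0)
print(round(rel_improvement(1048, 1051), 2), "% MNIST training time")        # ~0.29 (abstract: 0.29)
print(round(rel_improvement(97.51, 96.60), 2), "% CIFAR-10 test accuracy")   # ~0.94 (abstract: 0.93)
print(round(rel_improvement(2737, 2784), 2), "% CIFAR-10 training time")     # ~1.69 (abstract: 1.68)
```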
Similarly, the HN_Adam algorithm outperforms the other optimizers with a minimum training time value of 2737 s, and thus it achieves a high speed of convergence.

Remark 1. Without loss of generality, the modified algorithm, HN_Adam, is applied to deep CNN models of the sequential architecture. It can also be applied to more complex and diverse deep CNN architectures such as LeNet-5 [58], ResNet [59] and AlexNet [60]. The authors of [37], for example, use the EVGO algorithm to train three different CNN models based on these architectures. The first model employs the LeNet-5 architecture, which has a total of 81,194 parameters. The second model employs the AlexNet architecture, which has a total of 1,250,666 parameters (1,249,866 trainable and 800 non-trainable). The final model employs the ResNet architecture, which has a total of 271,690 parameters. The first model, like ours, is trained on the MNIST dataset [13], while the other two models are trained on the CIFAR-10 dataset [30]. Their results in terms of maximum training accuracy, minimum training cost, maximum validation accuracy, and minimum validation cost are (99.90%, 9.69E-06, 97.98%, and 0.066) for the first model, (98.11%, 0.0534, 80.42%, and 0.066) for the second model, and (91.06%, 0.6192, 87.25%, and 0.4666) for the third model, as shown in [37]. To ensure that the HN_Adam algorithm can be used efficiently with a variety of CNN model architectures, we used it to train the same CNN model architectures as in [37]. Based on the results, HN_Adam outperforms the EVGO algorithm for all three architectures tested. The maximum training accuracy, minimum training cost, maximum validation accuracy, and minimum validation cost are (100%, 5.81E-06, 99.23%, and 0.0388) for the LeNet-5 architecture, (99.29%, 0.0230, 97.89%, and 0.0827) for the AlexNet architecture, and (98.00%, 0.2689, 95.49%, and 0.3382) for the ResNet architecture. This demonstrates that the HN_Adam algorithm can deal with various CNN architectures while achieving high performance results.

Remark 2. It should be noted that while advanced computational devices can train deep CNN models quickly, they cannot solve the convergence problem for the more complex deep neural network models with different architectures that can be handled by the proposed algorithm. Furthermore, the proposed algorithm can be easily applied to computational devices with limited hardware resources.

Remark 3. To ensure the good performance of the modified algorithm, HN_Adam, over large-scale datasets, we evaluate it using the ImageNet dataset, which contains 3.2 million cleanly annotated images spread over 5247 categories [64]. This dataset is used to train a deep CNN model of the ResNet-18 architecture [65], which has a total of 11,196,042 parameters (11,186,442 trainable parameters and 9,600 non-trainable parameters). We use HN_Adam, AdaBelief [30], Adam [8], SGD [33], Yogi [38], RAdam [40] and MSVAG [39] as learning algorithms during the training process of the ResNet-18 deep network model. The results are obtained in terms of the top-1 accuracy on the testing dataset for 100 epochs. The top-1 accuracy represents the conventional accuracy considering the class with the highest probability (the top one). The top-1 accuracy results for the learning algorithms are listed in Table 7. The results of the compared algorithms are taken to be the same as in [30, 66]. The results of our proposed HN_Adam algorithm are obtained considering the parameter settings for the mini-batch size, learning rate (η), β1, β2, and ε to be the same as in [30].

Table 7 Top-1 accuracy results using ImageNet dataset

Algorithm        Top-1 accuracy
HN_Adam          73.20%
AdaBelief [30]   70.08%
Adam [8]         63.79%
SGD [33]         70.23%
Yogi [38]        68.23%
RAdam [40]       67.62%
MSVAG [39]       65.99%

Bold indicates the best achieved value for each response characteristic

As illustrated in Table 7, HN_Adam achieves the highest top-1 accuracy with a value of 73.2% and outperforms the other adaptive methods. This confirms that HN_Adam has good generalization performance for different deep CNN models over different sizes of datasets.

7 Conclusion and future work

We proposed a simple and intuitive approach for modifying the basic Adam algorithm to address its generalization performance and convergence issues. The modified algorithm, denoted HN_Adam, can improve the basic Adam algorithm's generalization performance and reduce training time without increasing its complexity. HN_Adam is used to train a deep CNN model over two different benchmark datasets. To evaluate the HN_Adam algorithm, it is compared to the following learning algorithms: AdaBelief, Adam, AMSGrad, SGD, RMSProp, and AdaGrad. The results are presented in terms of the minimum training cost, maximum training accuracy, minimum validation cost, maximum validation accuracy, maximum test accuracy, and the training time consumed, where the minimum training and validation costs are the least values of the loss function
that are attained by the learning algorithms during the training and validation processes, respectively. Moreover, the accuracy curves during the training and validation processes are also given. The results demonstrate that HN_Adam outperforms the compared algorithms for the majority of the compared items.

For future work, the modified algorithm can be used to enhance the learning stability of other more complex deep learning models, such as generative adversarial networks (GANs) and autoencoder networks.

Funding Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).

Data availability All data generated or analyzed during this study are included in this published article. Derived data supporting the findings of this study are available from the corresponding author on request.

Declarations

Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Alzubaidi L, Zhang J, Humaidi AJ (2021) Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8:53
2. Michael G, Kaldewey T, Tam D (2017) Optimizing the efficiency of deep learning through accelerator virtualization. IBM J Res Dev 61:121–1211. https://doi.org/10.1147/JRD.2017.2716598
3. Maurizio C, Beatrice B, Alberto M, Muhammad S, Guido M (2020) An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks. J Fut Inter 12:113
4. Pouyanfar S, Sadiq S, Yan Y (2018) A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv 51:5
5. Hassen L, Slim B, Ali L, Chih CH, Lamjed BS (2021) Deep convolutional neural network architecture design as a bi-level optimization problem. J Neuro Comput 439:44–62
6. Shiliang S, Zehui C, Han Z, Jing Z (2019) A survey of optimization methods from a machine learning perspective, supported by NSFC Project 61370175 and Shanghai Sailing Program 17YF1404600
7. Qbal I, Sarker H (2021) Machine learning: algorithms, real-world applications and research directions. J SN Comput Sci
8. Kingma DP, Jimmy B (2015) Adam: a method for stochastic optimization. Presented at the International Conference on Learning Representations (ICLR)
9. Liangchen L, Yuanhao X, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate, arXiv preprint arXiv:1902.09843
10. Ruder S, Park SM, Sim KB (2017) An overview of gradient descent optimization algorithms, arXiv:1609.04747v2 [cs.LG]
11. Sebastian B, Josef G, Martin W (2018) An improvement of the convergence proof of the Adam-optimizer, CoRR, abs/1804.10587
12. Agnes L, Sagayaraj F (2019) A survey of optimization techniques for deep learning networks. Int J Res Eng Appl Manag (IJREAM) 5:2
13. Zhang Z (2018) Improved Adam optimizer for deep neural networks. IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pp. 1–2
14. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems
15. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9
16. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778
17. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2017) Understanding deep learning requires rethinking generalization. In ICLR 2017
18. Arpit D, Jastrzębski S, Ballas N, Krueger D, Bengio E, Kanwal MS, Maharaj T, Fischer A, Courville A, Bengio Y (2017) A closer look at memorization in deep networks, arXiv preprint arXiv:1706.05394
19. Tieleman T, Hinton G (2012) Lecture 6.5—RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning
20. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159
21. Tom S, Sixin Z, Yann L (2012) No more pesky learning rates, arXiv preprint arXiv:1206.1106
22. Zeiler MD (2012) Adadelta: an adaptive learning rate method, arXiv preprint arXiv:1212.5701
23. Nicolas RL, Andrew FW (2010) A fast natural Newton method. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 623–630
24. Razvan P, Yoshua B (2013) Revisiting natural gradient for deep networks, arXiv preprint arXiv:1301.3584
25. Amari S (1998) Natural gradient works efficiently in learning. Neural Comput 10(2):251–276
26. Huang H, Wang C, Dong B (2018) Nostalgic Adam: weighting more of the past gradients when designing the adaptive learning rate, arXiv preprint arXiv:1805.07557
27. Wang G, Lu S, Tu W, Zhang (2019) SAdam: a variant of Adam for strongly convex functions, arXiv preprint arXiv:1905.02957
28. Li W, Zhang Z, Wang X, Luo P (2020) AdaX: adaptive gradient descent with exponential long term memory, arXiv preprint arXiv:2004.09740
29. Zhang M, Lucas J, Ba J, Hinton GE (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pp. 9593–9604
30. Zhuang J, Tang T, Ding Y, Tatikonda S, Dvornek N, Papademetris X, Duncan JS (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. 34th Conference on Neural Information Processing Systems (NeurIPS)
31. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158
32. Lyu K, Li J (2019) Gradient descent maximizes the margin of homogeneous neural networks, arXiv preprint arXiv:1906.05890
33. Keskar NS, Socher R (2017) Improving generalization performance by switching from Adam to SGD, arXiv preprint arXiv:1712.07628
34. Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate, arXiv preprint arXiv:1902.09843
35. Reddi SJ, Kale S, Kumar S (2019) On the convergence of Adam and beyond, arXiv preprint arXiv:1904.09237
36. Yi D, Ahn J, Ji S (2020) An effective optimization method for machine learning based on ADAM. Appl Sci 10:1073. https://doi.org/10.3390/app10031073
37. Karabayir I, Akbilgic O, Tas N (2020) A novel learning algorithm to optimize deep neural networks: evolved gradient direction optimizer (EVGO). IEEE Transactions on Neural Networks and Learning Systems
38. Manzil Z, Sashank R, Devendra S, Satyen K, Sanjiv K (2018) Adaptive methods for nonconvex optimization. Adv Neural Inf Process Syst 9793–9803
39. Balles L, Hennig P (2017) Dissecting Adam: the sign, magnitude and variance of stochastic gradients, arXiv preprint arXiv:1705.07774
40. Liu L, Jiang H, He P, Chen W, Liu X, Gao J, Han J (2019) On the variance of the adaptive learning rate and beyond, arXiv preprint arXiv:1908.03265
41. Bernstein J, Vahdat A, Yue Y, Liu M (2019) On the distance between two neural networks and the stability of learning, arXiv preprint arXiv:2002.03432
42. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101
43. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
44. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Adv Neural Inf Process Syst 2672–2680
45. Wedderburn RW (1974) Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61(3):439–447
46. Nocedal J (1980) Updating quasi-Newton matrices with limited storage. Math Comput 35(151):773–782
47. Pascanu R, Bengio Y (2013) Revisiting natural gradient for deep networks, arXiv preprint arXiv:1301.3584
48. Martens J (2010) Deep learning via Hessian-free optimization. ICML 27:735–742
49. Jascha SD, Ben P, Surya G (2014) Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. Proceedings of the 31st International Conference on Machine Learning, Beijing, China
50. Miaomiao L, Dan Y, Zhigang L, Jingfeng G, Jing C (2023) An improved Adam optimization algorithm combining adaptive coefficients and composite gradients based on randomized block coordinate descent. Hindawi Computational Intelligence and Neuroscience, Article ID 4765891
51. Ran T, Ankur PP (2023) Amos: an Adam-style optimizer with adaptive weight decay towards model-oriented scale. Conference paper at ICLR
52. Xingyu X, Pan Z, Huan L, Zhouchen L, Shuicheng Y (2022) Adan: adaptive Nesterov momentum algorithm for faster optimizing deep models. arXiv:2208.06677v3 [cs.LG]
53. Keijsers NLW (2010) Neural networks. In Encyclopedia of Movement Disorders
54. Yang ZR, Yang Z (2014) Bioinformatics. In Comprehensive Biomedical Physics
55. Jiuxiang G, Zhenhua W, Jason K, Lianyang M, Amir S, Bing S, Ting L, Xingxing W, Wang L, Gang W, Jianfei C, Tsuhan C (2017) Recent advances in convolutional neural networks. Adv Neural Inf Process Syst 4148–4158
56. Wang B, Sun Y, Xue B, Zhang M (2018) Evolving deep convolutional neural networks by variable-length particle swarm optimization for image classification, arXiv preprint arXiv:1803.06492
57. Toussaint M (2012) Lecture notes: some notes on gradient descent
58. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
59. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778
60. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In Proc Adv Neural Inf Process Syst 1097–1105
61. Wang S, Sun J, Xu Z (2019) HyperAdam: a learnable task-adaptive Adam for network training. The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)
62. Yao Z, Gholami A, Shen S, Keutzer K, Mahoney MW (2020) AdaHessian: an adaptive second order optimizer for machine learning, arXiv preprint arXiv:2006.00719
63. Yuan W, Gao K (2020) EAdam optimizer: how epsilon impacts Adam, arXiv preprint arXiv:2011.02150
64. Li J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255
65. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778
66. Jinghui C, Quanquan G (2018) Closing the generalization gap of adaptive gradient methods in training deep neural networks, arXiv preprint arXiv:1806.06763

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.