Universal Approximation Using Feedforward Neural Networks: A Survey of Some Existing Methods, and Some New Results

F. Scarselli and Ah Chung Tsoi

Neural Networks, pp. 15–37, 1998. © 1998 Elsevier Science Ltd. All rights reserved.
Abstract—In this paper, we present a review of some recent works on approximation by feedforward neural networks. A particular emphasis is placed on the computational aspects of the problem, i.e. we discuss the possibility of realizing a feedforward neural network which achieves a prescribed degree of accuracy of approximation, and the determination of the number of hidden layer neurons required to achieve this accuracy. Furthermore, a unifying framework is introduced to understand existing approaches to investigate the universal approximation problem using feedforward neural networks. Some new results are also presented. Finally, two training algorithms are introduced which can determine the weights of feedforward neural networks, with sigmoidal activation neurons, to any degree of prescribed accuracy. These training algorithms are designed so that they do not suffer from the problems of local minima which commonly affect neural network learning algorithms. © 1998 Elsevier Science Ltd. All rights reserved.

Keywords—Approximation by neural networks, Approximation of polynomials, Constructive approximation, Feedforward neural networks, Multilayer neural networks, Radial basis functions, Universal approximation.
[Footnote 3] Note, however, that this approach considers only the approximation aspects of the problem. It is an open and challenging question to define how different nodes can participate in approximating more than one output, and how learning algorithms and the number of necessary nodes in the network can be influenced by this fact.

[Footnote 4] Here, "computable" or not is defined in the sense used in computational complexity theory. The construction of a network is computable if there exists an algorithm on the Turing machine that takes a representation of the function as input and returns a representation of the approximating network.
The set of continuous functions on R^d is represented by C(R^d) and, without further notice, is equipped with the supremum norm. The sets of the functions with continuous derivatives up to order r and of the infinitely differentiable functions on R^d are denoted by C^r(R^d) and C^∞(R^d), respectively. For functions defined almost everywhere in R^d with respect to the Lebesgue measure, three common norms are used:

   ||f||_{L^∞(R^d)} = ess sup_{x ∈ R^d} |f(x)|

   ||f||_{L^p(R^d)} = ( ∫_{R^d} |f(x)|^p dx )^{1/p}

   ||f||_{L^p(μ)} = ( ∫_{R^d} |f(x)|^p dμ(x) )^{1/p}

where μ is a finite measure. The sets of functions essentially bounded according to these three norms, i.e. those for which the corresponding norm is finite, are denoted by L^∞(R^d), L^p(R^d) and L^p(μ), respectively. Similar norms ||·||_{L^∞(K)} and ||·||_{L^p(K)} and corresponding sets of functions are defined on a compact domain K ⊂ R^d. When we say that a set of functions is dense in, respectively, L^∞(R^d), L^p(R^d) or L^p(μ), without further notice we mean that the set is dense with respect to the norm used to define the function class. Furthermore, the sets of locally bounded functions, denoted by L^∞_loc(R^d) or L^p_loc(R^d), are the functions f for which ||f||_{L^∞(K)}, ||f||_{L^p(K)} are finite for every compact subset K.

To measure the smoothness of a function, we will use the modulus of continuity. For a function f on K ⊂ R, the modulus of continuity is defined by

   ω(f, δ) = max{ |f(x + ε) − f(x)| : x + ε, x ∈ K, 0 < ε ≤ δ }

It has the property that lim_{δ→0} ω(f, δ) = 0 if and only if f is continuous. Furthermore, if f satisfies the Lipschitz condition then ω(f, δ) is O(δ) as δ → 0. The definition of the modulus of continuity is easily extended to d-variable functions. In fact, let f be a function from K to R, where K is a subset of R^d, and consider the functions h_{i,x}(y) = f(x_1, …, x_{i−1}, x_i + y, x_{i+1}, …, x_d), where x_1, …, x_d are the components of x and 1 ≤ i ≤ d. The modulus of continuity of f is then defined in terms of the moduli of continuity of the h_{i,x}:

   ω(f, δ) = max{ Σ_{i=1}^{d} ω(h_{i,x}, δ) : x ∈ K }

Finally, ||v|| is the Euclidean norm of the vector v and |a| is the modulus of the scalar a.
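To make the definition concrete, the short Python sketch below (ours, not part of the original paper) estimates ω(f, δ) numerically on a grid for a univariate function; for the Lipschitz function f(x) = |x| the estimate behaves like δ, as stated above.

```python
import numpy as np

def modulus_of_continuity(f, a, b, delta, grid=2000):
    """Crude numerical estimate of w(f, delta) = max |f(x + e) - f(x)|
    over x, x + e in [a, b] and 0 < e <= delta."""
    xs = np.linspace(a, b, grid)
    fx = f(xs)
    step = (b - a) / (grid - 1)
    best = 0.0
    for k in range(1, int(delta / step) + 1):
        best = max(best, float(np.max(np.abs(fx[k:] - fx[:-k]))))
    return best

# f(x) = |x| satisfies the Lipschitz condition, so w(f, delta) ~ delta:
for d in (0.2, 0.1, 0.05):
    print(d, modulus_of_continuity(np.abs, -1.0, 1.0, d))
```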
To avoid clumsy terminology and without further mention, FNNs are three-layered with a linear activation function neuron in the output layer. The symbol f denotes the function we wish to approximate, f̂ the function computed by the network, and K ⊂ R^d a compact subset where f is defined. According to whether the activation function g is a RBF or a ridge function, the symbol S^l_g may be replaced by S^l_k or S^l_σ, and the term "activation function" itself may be used to indicate the functions k and σ of eqns (1) and (2), respectively.

3. REVIEW OF CURRENT LITERATURE

There does not appear to be any analytical formula which describes generic functions in terms of sigmoids, RBFs or other commonly used activation functions. Approximation properties are therefore studied by indirect strategies, for example with proofs by contradiction (reduction to absurdity), or by showing that FNNs can approximate a certain set of functions which, in turn, is dense in C(R^d) or in C(K). In addition, at least as it stands at present, approximation by ridge functions and by RBFs requires different approaches. We begin by reviewing works on ridge functions first.

Simple and intuitive methods are available to understand the approximation capabilities of the following particular FNN classes:

1. hidden layer neurons with step activation functions and a total of four layers, i.e. two hidden layers;
2. hidden layer neurons with step activation functions, a total of three layers and one input only.

Hence it is worthwhile to commence our review with these particular cases first, because the results involved will be referred to later in our discussion of more complicated examples.

3.1. FNNs with Step-Activation Function and Two Hidden Layers

Linear combinations of characteristic functions are commonly used for function approximation in the theory of integration (Rudin, 1991). A characteristic function of a set A equals 1 for points in A and 0 otherwise. Step functions are a type of characteristic function which are often used to discuss approximation properties of FNNs (Blum and Li, 1991; Kurkova, 1992; Geva and Sitte, 1992; Chen et al., 1995).

Assume that the FNN hidden layer neurons have the following step activation function:

   step(x) = 1 if x ≥ 0,   0 if x < 0

Let us consider the hypercube I = [a_1, b_1) × [a_2, b_2) × … × [a_d, b_d), where [a_i, b_i) denotes an interval closed on the left and open on the right boundary. Observe that

   hstep_I(x) = 1 if x ∈ I,   0 otherwise

can be re-written, after some simple algebraic manipulations, as

   hstep_I(x) = step( Σ_{i=1}^{d} (step(x_i − a_i) − step(x_i − b_i)) − d )
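This identity is easy to check numerically. The following short Python sketch (ours, not code from the paper) implements the right-hand side with d pairs of step units and verifies that it matches the characteristic function of a hypercube on random points; the hypercube bounds below are arbitrary.

```python
import numpy as np

step = lambda x: (x >= 0.0).astype(float)   # step(x) = 1 if x >= 0 else 0

def hstep(x, a, b):
    """Indicator of [a1,b1) x ... x [ad,bd) built from 2d + 1 step units,
    following the identity above."""
    inner = np.sum(step(x - a) - step(x - b), axis=-1)
    return step(inner - len(a))

rng = np.random.default_rng(0)
a = np.array([0.2, -0.5, 0.0])
b = np.array([0.7,  0.5, 0.9])
x = rng.uniform(-1, 1, size=(10000, 3))
direct = np.all((x >= a) & (x < b), axis=-1).astype(float)
print(np.all(hstep(x, a, b) == direct))   # True
```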
FIGURE 2. The function f(x, y) = x·exp(−2x² − 2y²).

[...]
I_{i+1} = [a_i, a_{i+1}), 0 ≤ i ≤ n − 1, where a_i = a + i·p and p = (b − a)/n. We further add a degenerate segment to deal with the right extreme of the interval, i.e. I_{n+1} = {b}. (This artifact is necessary because we define each segment to be closed on the left and open on the right; hence, in order to account for the closed interval K, it is necessary to add at the end this degenerate segment, which consists of only the single point {b}.)

We approximate f by a staircase function. In each interval I_i the staircase function has a constant value and represents f in the middle of the interval. The staircase function is easily written as a linear combination of step functions which, in turn, can be interpreted as representing the output of a three-layered FNN with step activation functions:

   stair_{f,K,n}(x) = Σ_{i=1}^{n+1} (f(z_i) − f(z_{i−1})) step(x − a_i)    (5)

where z_i, 1 ≤ i ≤ n + 1, is the center of the interval I_i and z_0 = a_0. Note that this three-layered FNN has n + 1 hidden layer step activation neurons and one output layer neuron with linear activation. The weight connecting the ith hidden layer neuron to the output neuron is f(z_i) − f(z_{i−1}). The error, with respect to the uniform norm, is bounded by

   |f(x) − stair_{f,K,n}(x)| ≤ ω(f, (b − a)/(2n))

This error is O(1/n) when f satisfies the Lipschitz condition.

Note that the case of approximating f: R → R is considerably simpler than the corresponding case of approximating f: R^d → R. This observation will be explored later in Section 4.
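As an illustration of this construction, the sketch below (ours; the indexing of the thresholds and the additive output bias are our reading of the garbled formula and may differ from eqn (5) in inessential details) builds the staircase network for f(x) = sin(πx) on K = [0, 1] and checks that the uniform error stays below the bound ω(f, (b − a)/(2n)) ≤ π/(2n), decreasing roughly as O(1/n).

```python
import numpy as np

step = lambda x: (x >= 0.0).astype(float)

def staircase(f, a, b, n):
    """Staircase approximation in the spirit of eqn (5): constant value f(z_i)
    on each interval I_i, realised with step hidden units and a linear output."""
    p = (b - a) / n
    edges = a + p * np.arange(n + 1)                      # a_0, ..., a_n (= b)
    z = np.concatenate(([a], edges[:-1] + p / 2, [b]))    # z_0, centres z_1..z_n, z_{n+1} = b
    w = f(z[1:]) - f(z[:-1])                              # weights f(z_i) - f(z_{i-1})
    # Output = bias f(a) plus the step units; on I_j the sum telescopes to f(z_j).
    return lambda x: f(a) + (w[:, None] * step(x[None, :] - edges[:, None])).sum(axis=0)

f = lambda x: np.sin(np.pi * x)
x = np.linspace(0.0, 1.0, 5001)
for n in (10, 20, 40, 80):
    err = np.abs(f(x) - staircase(f, 0.0, 1.0, n)(x)).max()
    print(n, round(float(err), 4), round(np.pi / (2 * n), 4))  # observed error vs. bound
```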
The above reasoning can be extended to FNNs whose activation functions, in the limit, behave like the step function. For example, the logistic sigmoid function lsig has the property that lim_{a→∞} lsig(ax + ab) = step(x + b) for x ≠ −b. Substituting the step functions by lsig functions in eqn (5) and passing to the limit, we obtain the function

   lim_{a→∞} Σ_{i=1}^{n+1} (f(z_i) − f(z_{i−1})) lsig(ax − a·a_i)    (6)

which is equal to stair_{f,K,n}(x) for every x ≠ a_i, 0 ≤ i ≤ n. Furthermore, since a_i approaches z_{i+1} for n → ∞ and stair_{f,K,n}(a_i) = f(z_{i+1}), stair_{f,K,n} converges to f also at the points a_i, 0 ≤ i ≤ n.

This idea has been widely deployed in the literature (Blum and Li, 1991; Geva and Sitte, 1992; Kurkova, 1992; Leshno et al., 1993; Chen et al., 1995), where it is used as a starting point to derive more refined results. The approach applies to all sigmoids which have lower and upper bounds (bounded sigmoids).

In Chen et al. (1995), for example, it is shown that for every bounded sigmoid σ, S^3_σ is dense in C(R), and the proof follows the above-mentioned guideline closely. Geva and Sitte (1992) used a reasoning similar to what was described in the previous section about hstep to prove constructively that four-layered FNNs with logistic sigmoid activation functions are universal approximators. Observe that in this case the input to hidden node weights in the resulting network have large magnitudes, as they are the product of the limiting process for a → ∞ in eqn (6). In Leshno et al. (1993), the authors note that if σ has a point of discontinuity, say at b, and is continuous in [b − ε, b) and [b, b + ε), where ε is real and ε > 0, then

   lim_{a→0} σ(ax − ab) = y_l if x < b,   y_r if x > b    (7)

where y_l and y_r are, respectively, the left and right limits of σ at b. Thus, by employing small input to hidden node weights it is possible to realize the step function by a
point of discontinuity in the activation function, and to prove that S^3_σ is dense in L^p(μ). It is further interesting to note that this approach uses a local property of the activation function: σ can have any behavior far away from b, and the function can be realized using only a small part of σ. Of course, the approximation must be limited to a compact K, but this is not a problem in practice. Furthermore, the weights of the network can be bounded in magnitude using a small value of a in eqn (7). This is an important issue, since neither biological nor artificial neurons normally have unbounded weights. However, this technique does not resolve the problem completely; the above approach (Leshno et al., 1993) uses a very large number of neurons, which is not realistic in practice. In addition, it is sensitive to noise.

3.3. The Hahn–Banach Theorem

The elegant approach employed by Cybenko (1989) is based on a well-known theorem which is often used to establish the denseness of a set of functions, i.e. the Hahn–Banach theorem. Let I = [0,1]^d. A function σ is called discriminatory on I if, for a signed regular Borel measure μ,

   ∫_I σ(a′x + b) dμ(x) = 0

for all a ∈ R^d and b ∈ R implies that μ = 0.

Using the Hahn–Banach theorem, Cybenko (1989) proves that

Proposition 1. If σ is discriminatory on I then S^3_σ is dense in C(I). Furthermore, any bounded sigmoid is discriminatory.

The proof is an existence proof, since the Hahn–Banach theorem is itself an existence result.

In the work of Hornik (1990, 1993) this result is further extended. In particular, in Hornik (1993) some theorems are presented which encompass almost all recent results on FNNs with ridge functions. The theorems state that three-layered FNNs are universal approximators under very weak assumptions on the activation function, and suggest that nonpolynomiality of the activation function is the key property. He also proves that the approximation can be performed with weights bounded as close to 0 as required and that, for some activation functions, a single threshold for the hidden layer is sufficient (more precisely, by a network with a single threshold in the hidden layer we mean that all the thresholds of the hidden layer neurons have the same value). More precisely, Hornik's theorem states that

Proposition 2. If σ is a Riemann integrable and nonpolynomial ridge function on some closed interval B, and A contains a neighborhood of the origin, then S^3_σ(A, B) is dense in C(K) with respect to the uniform norm for all compact K ⊂ R^d.

Here S^3_σ(A, B) denotes the class of functions which can be realized by three-layered FNNs [see eqn (3)], with the additional constraints a_i ∈ A, b_i ∈ B, 1 ≤ i ≤ n.

Another theorem in Hornik (1993) states that if σ is essentially bounded, instead of being Riemann integrable, then S^3_σ(A, B) is dense in L^p(μ) for any compactly supported measure μ.

Similar results about the importance of nonpolynomiality have been developed by Leshno et al. (1993), using a different approach. They assume that σ ∈ L^p_loc(R) and that the closure of the set of points of discontinuity of σ has Lebesgue measure 0. Then they prove that S^3_σ is dense in L^p(μ) and in C(R^d) if and only if σ is not a polynomial.

Furthermore, Hornik (1993) proves that if σ is analytic, instead of Riemann integrable, then there exists b ∈ B such that S^3_σ(A, {b}), where {b} denotes a single value, is dense in C(K) and in L^p(μ).

The fact that a single threshold in the hidden layer is sufficient raises interesting questions. Can one effectively use a single fixed threshold? In this case, the problem becomes one of discovering whether and when the learning algorithms perform better than the ones which use several threshold values in the hidden layer.

This is a powerful and elegant approach which allows information to be derived on how we can constrain parameters while retaining the universal approximation property. However, it has the shortcoming of being an existence proof. It leaves open questions regarding the possibility of effectively realizing the approximation. The networks whose existence is guaranteed by the theorem could require a prohibitively long time to compute or, worse, they may not be computable at all.

An alternative strategy, which can instead give constructive solutions, consists of reducing the problem of approximation in R^d to a problem of approximation in R, which, as we have seen in Section 3.2, is considerably simpler. This can be achieved, for example, by using a formula which describes functions in R^d in terms of a set of functions in R. This is the case for the approaches based on the Fourier transform, the Radon transform and Kolmogorov's theorem.

3.4. Kolmogorov's Theorem

The earliest work where it is shown that FNNs are universal approximators is Hecht-Nielsen (1989). Hecht-Nielsen uses an improved version of Kolmogorov's theorem due to Sprecher (1965), which states that every continuous function f: [0,1]^d → R can be written as

   f(x) = Σ_{h=1}^{2d+1} F_h( Σ_{k=1}^{d} λ^h φ(x_k + εh) + h )    (8)
where the real λ and the continuous monotonically increasing function φ are independent of f, the constant ε is a positive rational number that can be chosen as close to 0 as desired, and the continuous functions F_h, 1 ≤ h ≤ 2d + 1, depend on f.

Eqn (8) can be interpreted as representing a three-layered FNN where the hth hidden node computes the function z_h = Σ_{k=1}^{d} λ^h φ(x_k + εh) + h, the output node computes the function Σ_{h=1}^{2d+1} F_h(z_h), and z_h is the output of the hidden layer. However, this is not one of the network architectures commonly used in practice and, furthermore, the proof of the Sprecher–Kolmogorov theorem was not constructive in nature; it only asserted the existence of the functions and parameters in eqn (8), but did not prescribe how they can be obtained.

Katsuura and Sprecher (1994) resolve part of the problem by defining a method to realize the functions F_h and φ of eqn (8). They suppose that the function σ: R → [0,1] is continuous and that σ(x) = 0 for x ≤ 0 and σ(x) = 1 for x ≥ 1 hold. Roughly speaking, the idea consists of approximating each of the functions F_h, 1 ≤ h ≤ 2d + 1, and φ by three-layered FNNs with activation function σ and of combining these networks together into a four-layered FNN. More precisely, their results can be stated as follows.

Proposition 3. Given a continuous function f, there are three-layered networks N_{F_{h,i}} and N_{φ_i}, where 1 ≤ h ≤ 2d + 1 and i > 0, that realize the functions F_{h,i} and φ_i, and there are four-layered networks N_{f̂_i}, where i > 0, that realize the functions f̂_i = Σ_{h=1}^{2d+1} F_{h,i}( Σ_{k=1}^{d} λ^h φ_i(x_k + εh) + h ) such that:

1. lim_{i→∞} F_{h,i} = F_h and lim_{i→∞} φ_i = φ; moreover, lim_{i→∞} f̂_i = f uniformly with respect to the norm L^∞([0,1]^d).
2. For each i > 0, the network N_{f̂_i} can be realized by putting together a number of instances of the networks N_{F_{h,i}} and N_{φ_i}, 1 ≤ h ≤ 2d + 1.
3. All the parameters of the networks are independent of f, except for the hidden layer to output layer weights of the networks N_{F_{h,i}} and N_{f̂_i}.

An interesting aspect of the approach is that the four-layered network N_{f̂_i} is constructed by putting together instances of simpler networks. Thus, the network N_{f̂_i} can be represented by a number of parameters that is lower than the total number of weights. In fact, the complexity of the network N_{f̂_i} depends on the sum of the complexities of the networks N_{F_{h,i}}, 1 ≤ h ≤ 2d + 1, and N_{φ_i}.

The reals λ and ε and the networks N_{F_{h,i}} and N_{φ_i}, 1 ≤ h ≤ 2d + 1, are computable, so that the networks N_{f̂_i} are constructible. On the other hand, a bound on the error is not given and it is not known how many neurons must be used to obtain a desired degree of accuracy. Kurkova (1992) defines a bound using an approximated version of Kolmogorov's theorem. Accordingly, every continuous function f is representable by a formula like eqn (8), where the range of the index h of the first summation and the number of functions F_h are not fixed a priori, but depend on f. Kurkova's result states the following:

Proposition 4. For each real ε and integers n and v such that n ≥ 2d + 1, d/(n − d) + v < ε/||f||_{L^∞} and ω(f, 1/n) < v(n − d)/(2n − d), there is a four-layered FNN, with dn(n + 1) nodes in the first hidden layer and n²(n + 1)^d in the second hidden layer, that approximates f with an accuracy ε with respect to the uniform norm. Furthermore, only the weights from the second hidden layer to the output depend on f, while the other weights are fixed for all functions h for which ||h||_{L^∞} ≤ ||f||_{L^∞} and ω(h, δ) ≤ ω(f, δ) for each δ > 0.

Since the above inequalities give ||f||_{L^∞} < ε(n − d)/[n + ω(f, 1/n)(2n − d)], for a function which satisfies the Lipschitz condition the error decreases as O(n^{−1/(d+2)}). This is a result which prevents use of the method for non-small values of d. In fact, the proof of Kurkova (1992) provides a partition of the domain, and every node in the second layer deals with one element of the partition. For this reason, the number of nodes grows exponentially with respect to the dimension d.
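To give a feel for the sizes involved, the short arithmetic illustration below (ours, using only the node counts quoted in Proposition 4) tabulates dn(n + 1) and n²(n + 1)^d for a few illustrative values of d and n; the second hidden layer clearly grows exponentially with d.

```python
# Node counts quoted in Proposition 4: dn(n+1) units in the first hidden layer
# and n^2 (n+1)^d units in the second.  Values of d and n are illustrative only.
for d in (2, 3, 6):
    for n in (10, 20):
        first = d * n * (n + 1)
        second = n ** 2 * (n + 1) ** d
        print(f"d={d:2d}  n={n:3d}  first layer: {first:8d}  second layer: {second:,}")
```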
On the other hand, an interesting aspect of the works of Kurkova (1992) and Katsuura and Sprecher (1994) is that only the weights from the second hidden layer to the output layer depend on f. Even if this was not explicitly mentioned by the authors, this fact, together with the assumption that the output layer nodes use a linear activation function, allows an easy realization of the network. In fact, we can realize the first two layers of the network by the proofs of Kurkova (1992) or Katsuura and Sprecher (1994); this part of the network is fixed and can be re-used. Then we can adjust the second hidden layer to output layer weights depending on f. If we adopt the L² norm, as is common in practice, then the error function is quadratic with respect to these weights and can be adjusted whilst avoiding the problems with local minima. We will come back to this simple property in Section 4, where we will use it to propose two new algorithms.

3.5. Polynomials

Chui and Li (1992) prove that if σ is a sigmoid and S^3_σ(Z^d, Z) is the set of FNNs with integer parameters, where Z denotes the set of integers, then S^3_σ(Z^d, Z) is dense in C(K). In fact, polynomials are known to be dense in C(R^d). Chui and Li construct their proof by showing that it is possible to realize polynomials as a sum of ridge functions. Then they follow the arguments used by Cybenko (1989) for the approximation of ridge functions by sigmoids.

The ridge functions, in this case, are powers of linear combinations of the inputs, i.e. functions of the form h(a′x) = (a′x)^r = (a_1 x_1 + … + a_d x_d)^r, r ≥ 0. In fact,

[...]
where v_1, …, v_n are a set of unitary vectors and m_i = dm(v_i). The v_i can be defined by a grid on the surface of a unit sphere. Each term h(v_i′x, v_i) is a ridge function and is, in fact, obtained by the variable substitution z = v_i′x from a function in C(R).

In Chen et al. (1995), the authors show, by using this argument, that every function in C(R̄^d) can be uniformly approximated by f̂. They prove, by using the step function argument shown in Section 3.1, that every term in eqn (14) can be realized by a FNN with a bounded sigmoid σ, and conclude that S^3_σ is dense in C(R̄^d). Here C(R̄^d) is the set of continuous functions on R^d with finite limit for ||x|| → ∞.

Carroll and Dickinson (1989) estimate the error due to eqn (14) instead of eqn (13) with the following result.

Proposition 6. Let n be the number of terms in eqn (14), B the ball {x | ||x|| ≤ r_B} and f ∈ C^∞(R^d) have compact support included in the ball B. Then for each x in the ball

   |f(x) − f̂(x)| ≤ (v_B p (r_B + 1) d) · (2 / n^{1/(d−1)}) · max_{−1 ≤ a ≤ 1, ||u|| = 1} ||∇h(a, u)||

where v_B is the volume of the ball.

The error is polynomial, of order O(n^{−1/(d−1)}) with respect to the number of terms. The technique is constructive in nature; however, the number of nodes required grows polynomially (with the error decreasing only as O(n^{−1/(d−1)})), which can prevent a direct application of the scheme in practice. Furthermore, the techniques commonly used to build the filtered back-projection functions h should be adapted to this particular case. However, we observe that this approach, together with others surveyed in this paper, allows us to investigate more closely the problem of constructive approximations.

3.7. Fourier Transform and Series

Other approaches use the Fourier distributions and series. Barron (1993) adopts the Fourier distribution to derive bounds on the number of nodes. The Fourier distribution of a function f is a measure F̃(dw) = e^{iθ(w)} F(dw), where e^{iθ(w)} and F(dw) are the phase and the magnitude distributions, respectively, such that

   f(x) = ∫ e^{iwx} F̃(dw)    (15)

If B is a bounded set of R^d over which we wish to approximate f, then we can relax eqn (15), requiring that it is satisfied only for x ∈ B. Barron (1993) defines the constant C_f, which gives an estimate of the complexity of a function on the set B:

   C_f = ∫ |w|_B F(dw)

where |w|_B = sup_{x ∈ B} |wx|. The functions for which C_f is finite will be denoted by G. Barron (1993) proves that

Proposition 7. For every ball B_r = {x | ||x|| ≤ r}, every probability measure μ with support on B_r, every bounded sigmoidal activation function σ, every function f ∈ G and every positive integer n, there is a three-layered FNN with n hidden nodes such that it computes the output function f̂ and

   ||f − f̂||_{L²(μ)} ≤ 2rC_f / √n    (16)

The first part of the proof of the proposition follows the same idea as Section 3.6, where the Radon transform is replaced by the Fourier transform. In fact, since f is real-valued, we obtain

   f(x) − f(0) = ∫ (e^{iwx} − 1) F̃(dw)
              = ∫_{R^d − {0}} (e^{iwx} − 1) e^{iθ(w)} F(dw)
              = ∫_{R^d − {0}} (cos(wx + θ(w)) − cos(θ(w))) F(dw)
              = ∫_{R^d − {0}} cos(wx + θ(w)) F(dw) − ∫_{R^d − {0}} cos(θ(w)) F(dw)

So f is represented by an infinite sum of sinusoids plus a constant. Barron (1993) uses the argument of the step function to prove that each sinusoid can be approximated by bounded sigmoids, and concludes that S^3_σ, for every bounded sigmoid σ, is dense in G. The proof uses the norm L²(μ).

The bound on the number of nodes is derived from a lemma on Hilbert spaces, which states

Proposition 8. Let G_l be a subset of a Hilbert space for which every h ∈ G_l is bounded by a constant l. If f is in the closure of the convex hull of G_l and c > l² − ||f||², then, for every n, there exists a function f_n in the convex hull of n points of G_l such that

   ||f − f_n||² ≤ c/n

A proof of the lemma is found in Barron (1993). The result applies to our case with G_l = {h | h(x) = aσ(v′x + b), a, b ∈ R, |a| < l, v ∈ R^d}.
This lemma is an existence result, so that the whole proof is nonconstructive. Thus an open question, which arises from this, is whether the same results can be reproduced in a constructive manner. On the other hand, Proposition 7 states that, for the class of functions G, the error decreases as O(n^{−1/2}), so that the order of the error does not depend on d. It is also observed that the number of parameters of the network grows as O(n^{−1/2}). This is an important aspect from a computational point of view, because it states that for this class of functions the curse of dimensionality has no effect. Furthermore, as
Barron noticed, the result appears to suggest that approximation by sigmoids has some advantages compared with other common traditional approximation techniques, e.g. polynomials and splines, which, in contrast, require an exponential number of parameters.

Another approach, put forward by Mhaskar and Micchelli (1992, 1994), is based on the Fourier series. The Fourier series of a function f is defined as

   f(x) = Σ_{k ∈ Z^d} f̃(k) e^{ikx}    (17)

where f̃(k), k ∈ Z^d, are the Fourier coefficients. In particular, Mhaskar and Micchelli address the question of periodic functions for which ||f||_{SF} = Σ_{k ∈ Z^d} |f̃(k)| is finite. The basic idea is similar to the one adopted above. Since eqn (17) represents f, the infinite sum is truncated to a finite set of elements and e^{ikx} is rewritten in terms of the activation functions. Mhaskar and Micchelli (1992), however, take this another step further and show that if we reorder the terms in eqn (17) in decreasing order with respect to |f̃(k)|, the error goes to 0 as O((m + 1)^{−1/2}), where m is the total number of terms. Formally, we obtain

   ||f − f̄||_{L²(Q^d)} ≤ a_m (2π)^{d/2} ||f||_{SF} / √(m + 1)

where

   f̄(x) = Σ_{k ∈ Λ} f̃(k) e^{ikx}    (18)

Q is the interval [−π, π], {a_m} is a set of positive reals which converges to 0 as m → ∞, and Λ is a set containing the m vectors k for which the values |f̃(k)| attain the maximum. Then, it is proved that the exponential term e^{ikx} can be substituted by sums of activation functions. The proof is different from the ones we have reviewed so far in this overview. We omit it here and refer the interested readers to the paper (Mhaskar and Micchelli, 1994).

The construction of the network requires a knowledge of the Fourier coefficients. Furthermore, the Fourier coefficients must be sorted according to their magnitude. This could be a problem in the sense that the range of Fourier coefficients may be large, and it is not clear how to order them and what the computational complexity of the problem is.
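A minimal numerical illustration of the truncation in eqn (18) (ours, in one dimension, using the discrete FFT rather than the exact Fourier coefficients): the m coefficients of largest magnitude are retained and the L² error on [−π, π] is reported. Sorting the coefficients is exactly the step whose cost is questioned above.

```python
import numpy as np

N = 1024
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
f = np.abs(np.sin(x)) ** 3 + 0.3 * np.cos(5 * x)        # an arbitrary 2*pi-periodic target

c = np.fft.fft(f) / N                                    # discrete Fourier coefficients f~(k)
order = np.argsort(-np.abs(c))                           # sort by |f~(k)|, largest first

for m in (4, 8, 16, 32, 64):
    ck = np.zeros_like(c)
    ck[order[:m]] = c[order[:m]]                         # keep the m largest coefficients (the set Lambda)
    f_bar = np.real(np.fft.ifft(ck) * N)                 # eqn (18): truncated series
    err = np.sqrt(np.mean((f - f_bar) ** 2) * 2 * np.pi) # ~ L2(Q) norm of the error
    print(m, round(float(err), 5))
```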
With respect to the approach used by Barron (1993), this one has the advantage of being "almost" constructive in nature, in that it assures that the bounds can be effectively obtained by a computational algorithm. Mhaskar and Micchelli conclude that f can be approximated by f̂, which is a sum of 2n² activation functions, with the error bounded as follows:

   ||f − f̂||_{L²(Q^d)} ≤ ( b_n (2π)^{d/2} ||σ̃||_{SF} / ( √(n + 1) |σ(1)| ) ) (1 + ||f||_{SF})    (19)

where {b_n} is a set of positive reals which converges to 0 as n → ∞. So the error is O(n^{−1/4}) with respect to the number of nodes of the network. In fact, looking at the proof, it is observed that in order to make the error decrease linearly, O(n²) terms are required in eqn (18) and, for each term, there are O(n²) nodes. Here, the basic assumptions are:

1. the activation σ is periodic; and
2. ||σ||_{SF} is finite.

The bound is weaker [O(n^{−1/4}) compared with O(n^{−1/2})]; however, the constraints on f and on σ, respectively, are weaker as well. Observe, further, that as in Barron (1993) the order of the error does not depend on the dimension of the input space.

In the end, the bounds found, respectively, by Barron and by Mhaskar and Micchelli are the lowest that we are aware of. It would be interesting to further understand how this result depends on the particular manner Barron and Mhaskar and Micchelli adopt to measure the complexity of the function f. Note that the orders of the error [O(n^{−1/2}) and O(n^{−1/4}), respectively] are independent of the dimension of the input space d, so that the error appears not to suffer from the curse of dimensionality. However, the error also depends on the constants C_f and ||f||_{SF} [eqns (16) and (19)]. Barron notices that C_f could be large and grow exponentially with respect to d for some classes of functions [in Barron (1993), values of C_f are listed for different types of functions]. Note that, according to its definition, C_f measures the complexity of the functions in terms of the Fourier transform and, in turn, of sinusoidal functions. In this sense, these results depend on a particular method of measuring function complexity, and it is a topic for further research to evaluate such a method and to compare it with other solutions.

3.8. Radial Basis Functions

For FNNs which use radial basis functions (RBFs) as the hidden layer activation functions, Park and Sandberg (1991, 1993) adopt the following arguments to show that they constitute a class of universal approximators for a general nonlinear mapping f.

Suppose that g(x, a, b) = k((x − a)/b) is a RBF, k is integrable on R^d and ∫_{R^d} k(x) dx ≠ 0. Assume, without loss of generality, that ∫_{R^d} k(x) dx = 1; if the integral is different from 1, we substitute k with k̄(x) = k(x)/∫_{R^d} k(x) dx. The function k_ε(x) = ε^{−d} k(x/ε), ε > 0, is usually called the ε-mollifier of k. The convolution of k_ε and f ∈ L^p(R^d) is denoted by k_ε * f and is defined by (k_ε * f)(x) = ∫ f(y) k_ε(y − x) dy. The convolution satisfies ||f − k_ε * f||_{L^p(R^d)} → 0 when ε → 0 for 1 ≤ p < ∞ [see Park and Sandberg (1991) for the proof].

The idea is to fix ε so that k_ε * f is sufficiently close to f and, then, to approximate k_ε * f by a finite sum of the form

   f̂(x) = Σ_{i=1}^{n} f(y_i) k_ε(y_i − x) = Σ_{i=1}^{n} f(y_i) g(x, −y_i, −ε)
where the y_i, i = 1, 2, …, n, are selected according to a grid over R^d. If f and k are both continuous, then f(·) k_ε(· − x) is Riemann integrable and the convergence is assured. Park and Sandberg prove that

Proposition 9. If g(x, a, b) = k((x − a)/b) is a RBF and k is integrable, then S^3_g is dense in L¹(R^d) if and only if ∫_{R^d} k(x) dx ≠ 0.

Park and Sandberg give other conditions for the other norms (see Park and Sandberg, 1991, 1993 for details). The proof is "almost" constructive in nature. There are, however, two open questions associated with this work, i.e.

1. how to determine the value of ε; and
2. how to determine the refinement of the grid which allows the required approximation.
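A sketch of this construction (ours) in one dimension, with a Gaussian kernel k: the centres y_i lie on a grid, ε is fixed by hand, and the grid cell width enters as the Riemann-sum weight (in a network it would simply be absorbed into the hidden-to-output weights f(y_i)). Both open questions above, the choice of ε and of the grid refinement, are visible as the two tunable constants.

```python
import numpy as np

def k(x):                        # a kernel with integral 1 (Gaussian)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def rbf_grid_approx(f, eps, n, lo=-3.0, hi=3.0):
    """f_hat(x) = sum_i f(y_i) k_eps(y_i - x) dy : Riemann sum for k_eps * f."""
    y = np.linspace(lo, hi, n)
    dy = y[1] - y[0]
    k_eps = lambda t: k(t / eps) / eps               # the eps-mollifier of k (d = 1)
    return lambda x: np.sum(f(y)[:, None] * k_eps(y[:, None] - x[None, :]), axis=0) * dy

f = lambda x: np.tanh(3 * x) + 0.5 * np.cos(2 * x)   # arbitrary target
x = np.linspace(-2, 2, 801)
for eps, n in ((0.5, 20), (0.2, 50), (0.05, 200)):
    err = np.max(np.abs(f(x) - rbf_grid_approx(f, eps, n)(x)))
    print(eps, n, round(float(err), 4))
```

As the sketch suggests, shrinking ε only helps if the grid is refined at the same time, which is precisely the problem discussed next.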
possible ways to consider the problem.
A problem of this approach is the polynomial growth
rate of the value of n with order d with respect to the
4.1. From d-Variable Functions to Univariate
distance of the points in the grid. This well-know pro-
Functions
blem of RBFs is clarified by the proof in Park and Sand-
berg (1991, 1993). When e → 0, k e becomes a function If we carefully review the discussions in the previous
with a neighborhood of the center as support. A node sections, we will note that the constructive proofs follow
influences the outputs of the network only for inputs a common strategy when the FNN has ridge activation
near to its center, so that we require an exponential functions. The proof consists of two parts.
number of neighborhoods to cover the entire domain.
1. The function f is decomposed as a sum of generic
This suggests that RBFs are well suited for problems
ridge functions with respect to a finite set of
with a small number of inputs, i.e. when d is small.
directions.
Another interesting point is that it is possible to fix the
2. Each ridge function is approximated by a linear com-
weights between the input and the hidden layer neurons,
bination of translations of the activation function.
according to a sufficiently fine grid and a small a. Adjust-
ing only the weights from the hidden layer to the output Different solutions are available for both parts of the
layer neurons, the network can approximate every func- proof and many properties of the proof depend on how
tion to an arbitrarily degree of accuracy. different solutions are combined together.
In the next section, we will prove that it is possible to From an intuitive point of view, the proofs split the
adjust the weights without the usual associated problems domain into a number of partitions. In fact, the weights
of local minima in the training algorithm. This suggests associated with a node identify a hyperplane {xl v i9x þ b i
that RBFs could be particularly suited for those problems ¼ 0}, where v i is the direction and b i the threshold. If the
for which the ridge functions require an exponential activation function is the logistic sigmoid, the hyperplane
number of nodes. divides the domain into two regions, one for which the
outputs of the node are greater than 1/2 and the other one
with values smaller than 1/2. The nodes participating to
4. OUR APPROACH TO POLYNOMIAL
realize the same ridge function form a set of parallel
APPROXIMATION
hyperplanes and all the hyperplanes of a network
In the following, we will consider principally the together split the domain in a particular partition. If
problem of constructive approximation of the class of thresholds are selected using the method of the staircase
algebraic polynomial functions. A solution to the functions as indicated in Section 3.1, the partition is
problem will be given. More precisely, it will be proven uniform and distributed over all domain. For example
that polynomials can be approximated up to any degree Figure 4 represents a possible partition for d ¼ 2 and
of precision by FNNs with a fixed number of hidden layer the set of directions {(1,¹1), (1,1/3), (1,¹1/3), (1,1)}.
neurons. Then, the discussion will turn to the structure of The hyperplanes are straight lines in R 2; the figure
the network which is necessary to realize a prescribed shows seven hyperplanes for each direction, each
degree of approximation. In addition, we will give a hyperplane corresponding to a different value of the
number of remarks and results. threshold b i.
The motivation of this study is that polynomials are The approach by Park and Sandberg for FNNs with
dense in the continuous functions and can give us RBFs (Section 3.8) presents some similarities, since, in
The following theorem, which is an alternative to Theorem 2, further stresses the fact that the grid is not important. It states that each one of the polynomial functions h_i(v_i′x) can be approximated by sums of the form Σ_{i=1}^{r+1} Σ_{l=0}^{r} w_{l,i} σ(a l x + a l b_i), where the b_i can be fixed in advance. Since the hyperplanes do not depend on a and l, they themselves can be fixed in advance, independently of the function to be approximated and of the prescribed degree of precision.

THEOREM 4. Let c_0, …, c_r be real numbers, p(x) = c_0 + c_1 x + … + c_r x^r a univariate polynomial in x, r a nonnegative integer, and b_1, …, b_{r+1} real values such that b_j ≠ b_i for j ≠ i. Furthermore, suppose that σ is C^{r+1} in an open neighborhood of 0 and satisfies σ^{(r)}(0) ≠ 0. Then there exist reals w_{1,1}, …, w_{1,r+1}, …, w_{r,r+1}, depending on a real a, such that

   p̄_a(x) = Σ_{i=1}^{r+1} Σ_{l=0}^{r} w_{l,i}(a) σ(a l x + a l b_i)    (24)

converges uniformly to p on every bounded interval as a → 0.

Proof. Let us consider the homogeneous polynomial t(x, y) = c_0 y^r + c_1 y^{r−1} x + … + c_r x^r. It can be written as a linear combination of the powers (x + y b_1)^r, …, (x + y b_{r+1})^r. This is a consequence of the argument used by Chui and Li (1992) (see Section 3.5). Since p(x) = t(x, 1), it follows that

   p(x) = Σ_{i=1}^{r+1} v_i (x + b_i)^r    (25)

for reals v_1, …, v_{r+1}.

Furthermore, applying the same reasoning as contained in the proof of Theorem 2, we show that

   q_a(y) = Σ_{l=0}^{r} (−1)^{r+l} ( 1 / (a^r σ^{(r)}(0)) ) C(r, l) σ(l a y)

converges to the power y^r. (Note that the hypothesis σ^{(r)}(0) ≠ 0 is sufficient: the other derivatives of σ need not be constrained, because they do not appear in q_a.) Substituting (x + b_i)^r with q_a(x + b_i) for every i, 1 ≤ i ≤ r + 1, in the sum on the right-hand side of eqn (25), we obtain

   Σ_{i=1}^{r+1} v_i Σ_{l=0}^{r} (−1)^{r+l} ( 1 / (a^r σ^{(r)}(0)) ) C(r, l) σ(a l (x + b_i))

This sum converges uniformly to p. Hence, the theorem is proved.
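The convergence of q_a to the power y^r is just the classical finite-difference identity Δ^r_h σ(b) ≈ h^r σ^{(r)}(b) with h = ay. The short check below (ours) uses σ(t) = exp(t), for which every derivative at 0 equals 1, so the hypothesis σ^{(r)}(0) ≠ 0 holds directly (the logistic sigmoid would require the shifted form discussed next, since its even derivatives vanish at 0); the printed values approach y^r as a decreases.

```python
import numpy as np
from math import comb

def q(y, r, a, sigma, d_sigma_r_at_0):
    """q_a(y) = sum_{l=0}^{r} (-1)^(r+l) C(r,l) sigma(l*a*y) / (a^r sigma^{(r)}(0))."""
    s = sum((-1) ** (r + l) * comb(r, l) * sigma(l * a * y) for l in range(r + 1))
    return s / (a ** r * d_sigma_r_at_0)

sigma = np.exp            # analytic, with sigma^{(r)}(0) = 1 for every r
y, r = 1.7, 3
for a in (0.1, 0.02, 0.004):
    print(a, float(q(y, r, a, sigma, 1.0)), y ** r)   # approaches y**3 = 4.913
```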
algorithms might follow rules similar to eqns (21) and
Theorem 4 is easly extended to the case when j is C rþ1 (24), probably more general rules than which had been
in an open neighborhood of a point b and j (r)(b) Þ 0: in observed so far.
fact, it is only required to replace all the terms alx þ alb i
with alx þ alb i þ b. Thus, an immediate consequence of
Theorems 3 and 4 is that if j is C ` and has nonzero 4.5. The Thresholds
derivatives somewhere around b, then S3j is dense in Theorem 2 proves that a single threshold, or more pre-
cisely one common value for all the thresholds, of the
9
Note that the hypothesis j (r)(0) Þ 0 is sufficient: the other deriva- hidden nodes is sufficient to obtain a prescribed degree of
tives of j need not be constrained, because they do not appear in q a. approximation, provided that the activation function has
nonzero derivatives corresponding to the value of the threshold. The same result has been obtained, by different and nonconstructive routes, in Hornik (1993). However, these results do not resolve the problem completely, because they cannot explain whether using a single threshold is a better solution than using more thresholds. The following discussion is informal, since at the present moment we do not know how to formalize it, but notwithstanding this, it discloses interesting aspects of the question.

FIGURE 7. Approximation of x² with one threshold (b = 1). See text for more details.

First of all, observe that eqns (21) and (24) differ because the former uses a single threshold and the latter uses more thresholds. Let us compare the advantages of the two expressions. Eqn (21) provides that, while the thresholds remain constant, the other weights of the hidden layer become very close to 0. So, it admits two cases:

1. the threshold is not zero and the approximation requires a large ratio of threshold to weights;
2. the threshold is 0.

According to the discussion on partitions in previous sections, the two situations correspond to the limiting case when the hyperplanes are far away from the origin (case 1) and to the limiting case when all the hyperplanes pass through 0 (case 2). On the other hand, eqn (24) admits all partitions, including the ones that split and cover the domain by small, similar subsets.

Now, consider the situation depicted in case 1. The commonly used activation functions become constant for large positive and negative inputs (the logistic sigmoid, the gaussian, the step function and so on), or they become linear (the linear function, the sigmoid plus a linear function). (Some applications adopt the function lsig(x) + kx, where k is a small scalar, instead of lsig(x). In fact, the derivative of the former is nonnull for large x, which resolves part of the problems with flat error surfaces.) Intuitively, it should be difficult and error sensitive to approximate a function in the neighborhood of 0 when the hyperplanes are far away from the origin, since every node will contribute to the realization by an almost constant or linear function. A further problem, in this case, is that it is unlikely that the learning algorithms could work effectively when the ratio of the thresholds to the weights becomes very large.

On the other hand, consider the situation depicted in case 2. Most of the commonly used activation functions do not satisfy the constraint on the derivatives of Theorem 2 in the neighborhood of 0. For example, the even derivatives of the logistic sigmoid and the odd derivatives of the gaussian are zero in the neighborhood of 0. This is because the logistic sigmoid is an odd symmetric function, and the gaussian is an even symmetric function. It is easily proved, using the same reasoning as contained in the proof of Theorem 2, that with one null threshold, polynomials which contain only odd degrees can be approximated by logistic sigmoid functions, and polynomials with only even degrees by gaussian functions. Theoretically every nonnull threshold could work, but small ones imply large weights in the network because of the presence of the term 1/σ^{(k)}(b) in eqn (21).

So, according to the above discussion, eqn (24) appears to be more meaningful than eqn (21) and seems to suggest that using more thresholds gives some advantage. The following observation concerning the sigmoid functions further supports this idea.

Let us restrict our attention to the approximation of univariate functions. Regardless of the threshold b, any sum of sigmoids of the form f̂(x) = Σ_{i=0}^{r} w_i lsig(v_i x + b) verifies lim_{x→−∞}(f̂(x) − f̂(0)) = −lim_{x→∞}(f̂(x) − f̂(0)). In fact, the term lsig(v_i x + b) is an odd symmetric function with respect to the point (−b/v_i, f̂(0)). Intuitively, we can explain this situation as follows: while selecting the appropriate parameters, we can constrain f̂ to approximate
every polynomial of order r in a bounded interval; far away from the interval, f̂ will be an odd symmetric function. A further interpretation of this situation is possible through our discussion on the partition in case 1. In fact, the function to be approximated determines the behavior of f̂ in the domain, but outside this domain, in the other sets of the partition which do not contain the domain, the behavior of f̂ is induced by the odd symmetric nature of the logistic sigmoid.

We have no formal proof that this is a problem. However, in some cases it gives rise to unnatural approximations, for example, when we try to approximate symmetric functions. Figure 7 depicts what happens when the approximated function is p(x) = x², the approximation interval is [−1, 1] and only one threshold (b = 1) is used. Figure 8 shows the corresponding case when more thresholds are employed (b_1 = −0.5, b_2 = 0.5). The weights of the first layer are selected by eqns (21) and (24), respectively, where r is 2 and a is 0.2. The continuous line represents the function realized by the network, the dotted line the function to be approximated, p(x) = x².

FIGURE 8. Approximation of x² with two thresholds (b_1 = −0.5, b_2 = 0.5). See text for more details.

4.6. The Weights

Note that eqn (21) implies that the approximation can be carried out such that the weights in the hidden layer are as small as required. In fact, it is sufficient to select small b and a. The threshold b must satisfy the constraint on the derivatives of σ; however, if σ is analytic and nonpolynomial then there exists a b that satisfies the constraint and is as close to 0 as desired. Furthermore, the weights from the hidden layer to the output layer can be made small by creating many instances of the same hidden nodes. Thus, under the constraint of an analytic and nonpolynomial activation function σ, we can approximate f with a FNN that has weights as small as required. This result is similar to the ones given in Hornik (1993) (see also Section 3.3 and Leshno et al., 1993).

Another question which can be addressed in this context concerns the connections in the network. In the literature, and similarly in this paper so far, it is always assumed that the network is completely connected from one layer to another. However, it is known that in some biological structures the number of connections is limited. For example, this is true for neural networks connected in a local fashion, the so-called local receptive field. Secondly, Sontag (1991) has shown that in some cases the FNN gives a better generalization capability if direct feedforward links are allowed between the input and output layer neurons. It would be interesting to understand what this means from the point of view of the approximation properties.

Suppose we restrict the number of input links to a maximum of N for each hidden node. If we wish to compute, in the polynomial computed by the network, a product x_{i_1} ⋯ x_{i_N} of a subset of inputs, then the network must contain a direction v_i which has components v_{i_1}, …, v_{i_N} different from 0. In fact, the products are generated by the powers (v_1′x)^r, …, (v_s′x)^r, and only the products of inputs corresponding to nonzero components of the directions are generated. Thus, three-layered FNNs which can approximate every function must use at least one direction with all components different from 0, i.e. a node connected to every input.

An alternative for networks with a limited number of links is to have more layers. In this case, with the first hidden layer we can generate every polynomial of N variables and, as a consequence, we can approximate every function of N input variables. The second hidden layer can compute every function of N² variables, and so on. In the end, log_N d hidden layers are required.

4.7. Two Constructive Approximation Algorithms

The main aim of this section is to define the kernel of two algorithms that employ, in a practical way, some of the
ideas which emerged from the above discussions. The algorithms allow the construction of FNNs while avoiding some of the convergence problems of the training algorithm due to sub-optimal local minima, which commonly affect the learning process. The purpose is to provide a method to initialize the networks and, at the same time, to suggest directions to be investigated in order to design new training algorithms. In particular, attention is directed towards applications which require a small number of inputs while requiring that the error remains bounded within a certain limit. This situation is easily found, for example, in the applications of FNNs to control and system identification. Practical systems and controllers with a small number of inputs are common. Furthermore, it is often important to have bounds on the error for control applications.

The idea is based on the observation that the last layer of a FNN behaves like a linear combiner (a linear combiner, according to our terminology, is a one-layered FNN with a linear activation function) whose inputs are the outputs of the nodes in the layer immediately prior to the last. Thus, the error surface is a quadratic function of the weights of the last layer (see Widrow (1990)) and has no local minima. More precisely, the computation of the optimal value of the weights of the last layer is polynomial in time, provided that f is defined by a set of m pairs {(x_i, f(x_i)) | 1 ≤ i ≤ m} and the error is measured using the L² norm. For reasons of space we skip over the details; however, it is quite easy to prove that the computation can be fulfilled in O(M³) operations, where M = max(mq, n(d + 1)), n is the number of nodes in the layer immediately prior to the last, and q is the dimension of the codomain of f.

On the other hand, the proofs on approximation of polynomials of Section 4.3 suggest two different ways to fix the weights from the input to the hidden layer in a three-layered FNN, according to whether we consider the proof that deploys staircase functions or the one that uses eqn (24). Both the algorithms we present exploit these facts.

4.7.1. Algorithm 1. The first algorithm is defined below. It fixes the number of hidden nodes and the weights from the input to the hidden layer following the spirit of the proof in the previous section which uses the staircase function and Theorem 1. Then it adjusts the other weights, the ones which connect the hidden to the output nodes, by a common linear quadratic programming algorithm.

1. Select a.
2. Select the directions v_1, …, v_s.
3. Select a constant p and real values b_{1,1}, …, b_{s,L_s} such that b_{i,j} − b_{i,j−1} = p for each 1 ≤ i ≤ s, 2 ≤ j ≤ L_i, and such that the domain is contained in between every pair of hyperplanes {x | v_i′x = b_{i,1}} and {x | v_i′x = b_{i,L_i}}.
4. Construct a three-layered FNN which contains L_i hidden nodes for each direction v_i, 1 ≤ i ≤ s. The weights of the jth node of the ith direction must be a·v_i and the threshold a·b_{i,j}.
5. Adjust the parameters of the output layer so as to minimize the error function

   e(W, b) = ||f − f̂_{W,b}||_{L^p(μ)}

where f̂_{W,b} denotes the function implemented by the neural network, and W and b are the matrix of the weights and the thresholds of the output layer, respectively.
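A compact sketch of steps 1–5 follows (ours, meant only to show the structure: fixed first layer, convex least-squares problem for the output layer). It mimics the worked example discussed later in this section (four directions, L_i = 7, a = 0.8, outermost hyperplanes tangent to the square, target f(x, y) = x·exp(−2x² − 2y²) on [−1, 1]²), but the numbers it produces need not match those reported in the text.

```python
import numpy as np

sig = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda x, y: x * np.exp(-2 * x**2 - 2 * y**2)

# Steps 1-2: scale a and directions (the four directions of the example).
a = 0.8
V = np.array([[1.0, -1.0], [1.0, -1.0/3], [1.0, 1.0/3], [1.0, 1.0]])

# Step 3: equally spaced thresholds per direction covering the domain.
L = 7
proj_max = np.abs(V).sum(axis=1)                               # max of |v'x| over [-1,1]^2
B = np.linspace(-1.0, 1.0, L)[None, :] * proj_max[:, None]     # b_{i,1} ... b_{i,L}

# Step 4: fixed hidden layer, weights a*v_i and thresholds a*b_{i,j}.
W1 = np.repeat(a * V, L, axis=0)                               # (s*L, 2)
b1 = -a * B.reshape(-1)                                        # sign convention: sig(a(v'x - b))

# Step 5: least-squares fit of the output layer on a training grid.
g = np.linspace(-1, 1, 41)
X = np.array(np.meshgrid(g, g)).reshape(2, -1).T
t = f(X[:, 0], X[:, 1])
H = np.c_[sig(X @ W1.T + b1), np.ones(len(X))]
w2, *_ = np.linalg.lstsq(H, t, rcond=None)
err = np.sqrt(np.mean((H @ w2 - t) ** 2))
print("hidden units:", W1.shape[0], " RMS error:", round(float(err), 4))
```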
The value of a in step 1 should be sufficiently large so that the sigmoid can simulate a step function. However, if our idea is to use the algorithm to initialize the network and to use a different algorithm to complete the learning process, then a must be selected in such a way that the ensuing training algorithm does not suffer from problems with a flat error surface (see Saseetharan and Moody, 1992).

There are results in the literature (Chung and Yao, 1977; Nicolaides, 1972; Lee and Philips, 1991) that could suggest how to generate a set of directions that satisfy the unique interpolation property (step 2). For example, in Chung and Yao (1977) it is proven that if p_1, …, p_d are the vertices of a nondegenerate (d − 1)-simplex (a (d − 1)-simplex is a geometrical figure in R^{d−1} with d vertices; it is nondegenerate if no hyperplane contains all the vertices), then the set of points

   U = { u | u = (1/r) Σ_{i=1}^{d} l_i p_i, with Σ_{i=1}^{d} l_i = r, l_i ≥ 0 and l_i integer, 1 ≤ i ≤ d }    (26)

satisfies the unique interpolation property for polynomials of order r. So, the set of directions {v | v = [1′, u′]′, u ∈ U} satisfies the hypothesis of Theorem 1.
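A few lines (ours) generate the set U of eqn (26) from the vertices of a (d − 1)-simplex and the corresponding directions [1′, u′]′; with the segment [−1, 1] (d = 2) and r = 3 it reproduces, up to ordering and component convention, the four directions used in the example discussed below.

```python
import numpy as np
from itertools import product

def chung_yao_points(vertices, r):
    """U of eqn (26): points (1/r) sum_i l_i p_i with l_i >= 0 integers summing to r.
    Brute-force enumeration; fine for small d and r."""
    d = len(vertices)                       # d vertices of a (d-1)-simplex
    P = np.asarray(vertices, dtype=float)   # shape (d, d-1)
    U = []
    for ls in product(range(r + 1), repeat=d):
        if sum(ls) == r:
            U.append(np.asarray(ls) @ P / r)
    return U

# d = 2: the 1-simplex is the segment with vertices -1 and 1.
U = chung_yao_points([[-1.0], [1.0]], r=3)
directions = [np.concatenate(([1.0], u)) for u in U]
print([tuple(v) for v in directions])
# Four directions: (1, +/-1) and (1, +/-1/3), matching the example (up to ordering).
```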
However, in practice, the random generation of the directions may be an efficient alternative. For example, directions could be randomly generated from a uniform distribution on the unit sphere. In fact, the probability that a set of random directions does not satisfy the unique interpolation property is low, and it is sufficient to generate more directions than necessary to overcome the problem. Furthermore, the random generation would also have the advantage of a uniform distribution of the directions, which is intuitively preferable.

Let us consider again the example of Section 3.1 and Figure 2, i.e. we desire to approximate f(x, y) = x·exp(−2x² − 2y²) on the square −1 ≤ x ≤ 1 and −1 ≤ y ≤ 1. In this case the network has two inputs (d = 2) and a one-simplex is just a segment of R. Thus, the set U defined in eqn (26) contains r + 1 equally distant points. With the
segment [−1, 1] and r = 3, the set of directions is V = {(−1,1), (−1/3,1), (1/3,1), (1,1)}.

Figure 4 represents this case, where further L_i = 7 for each i, i.e. seven hyperplanes are used for each direction, and the thresholds are fixed so that the most external hyperplanes are tangent to the domain. We have tested the algorithm with a = 0.8. Figures 2 and 9 represent f and the function realized by the FNN, respectively. We have found that the error, with respect to the L²(K) norm, was approximately 0.0053. Since ||f||_{L²(K)} ≈ 0.3441, the relative error, i.e. the ratio of the error to ||f||_{L²(K)}, was 0.0155.

FIGURE 9. The function realized by the network built with algorithm 1, the one based on the staircase functions.

From a theoretical point of view, Theorem 1 and the discussion on approximation by staircase functions guarantee that, with enough directions and hyperplanes and a large value of a, the network together with the associated training algorithm described above can approximate every polynomial of degree r. On the other hand, this assertion and the previous discussion on the weights from the hidden layer to the output layer suggest that, apart from the function we are approximating (it may be a nonpolynomial), the training in step 5 of algorithm 1 gives a solution that is at least as good as the best approximation given by a polynomial of degree r. In reality, in the previous example, the solution was better than the best approximation by polynomials (of order 3), for which the error, with respect to the L²(K) norm, was 0.0288 and the relative error 0.0838.

Observe that with this approach we form a partition of the domain similar to the one used by CMAC (Albus, 1975). The difference is that instead of using a neuron for every cell in the partition, we include a set of neurons for every direction. The disadvantage is that we cannot approximate every function using a fixed number of directions. The advantage is that if the number of directions is sufficiently large, the error decreases linearly with respect to the number of neurons. We can expect that in many practical situations this approach allows us to decrease the number of required nodes.

4.7.2. Algorithm 2. This algorithm is defined using a second strategy: it is based on eqn (24). In this case, a must be a scalar with a small value.

In fact, eqn (24) suggests how to select the weights and the thresholds in order to guarantee that the network realizes every polynomial of a certain order. The algorithm is equal to the procedure described previously, except that steps 3 and 4 are modified suitably. The new version of steps 3 and 4 appears below.

(3) Select an integer r and real values b_{1,1}, …, b_{s,r+1} such that b_{i,m} ≠ b_{i,j} for m ≠ j.

(4) Construct a three-layered FNN with (r + 1)² hidden nodes for each direction v_i. For each 1 ≤ i ≤ s, 1 ≤ j ≤ r + 1, 0 ≤ l ≤ r there must be a hidden node with weights a·l·v_i and threshold a·l·b_{i,j}.

Observe that the only constraint on the b_{i,j} is that they must be different for different j. Actually, this is the constraint required by Theorem 4. However, we can expect a better behavior if the hyperplanes cover and split the domain into pieces that have similar dimensions. Thus, a good way to select the b_{i,j} is such that b_{i,j} − b_{i,j−1} = p_i for constants p_1, …, p_s and the domain of the function is contained between the pairs of hyperplanes {x | v_i′x = b_{i,1}} and {x | v_i′x = b_{i,r+1}} for each 1 ≤ i ≤ s.
0.0838. We have tested also this algorithm
2 2
with the example of
Observe that with this approach we form a partition of the function f(x,y) ¼ xe ¹ 2x ¹ 2y . The directions were the
the domain similar to the one used by CMAC (Albus, same as used for the first algorithm. We have fixed the b i,j
1975). The difference is that instead of using a neuron for to be {¹3/5, ¹1/5, 1/5, 3/5} for the nodes corresponding
every cell in the partition, we include a set of neurons for to directions (1,¹1), (1,1) and {¹4/5, ¹4/15), 4/15, 4/5}
every direction. The disadvantage is that we cannot otherwise; it is easily verified that, in this way, the set of
approximate every function using a fixed number of hyperplanes for each direction splits the domain into five,
directions. The advantage is that if the number of direc- equally large strips. With a ¼ 0.2 the approximation
tions is sufficiently large, the error decreases linearly error was 0.0042 and the relative error 0.0123. The
with respect to the number of neurons. We can expect result is similar to the one obtained by the first algorithm.
that in many practical situations, this approach allows us An important advantage of this algorithm is the fact
to decrease the number of the required nodes. that the weights of the first layer are selected to be small.
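As a concrete illustration of the modified step 4, the following sketch (again NumPy, not from the paper) enumerates the (r + 1)² hidden nodes per direction. The ordering of the nodes, the array layout and the sample threshold values are incidental choices made for the illustration.

```python
import numpy as np

def algorithm2_hidden_layer(directions, b, a, r):
    """Sketch of steps 3-4 of algorithm 2.

    directions: (s, d) array of v_i; b: (s, r + 1) array of distinct b_ij
    (step 3); a: small positive scalar. Returns one row of input weights
    a*l*v_i and one threshold a*l*b_ij per hidden node, as in step 4.
    """
    weights, thresholds = [], []
    for i in range(directions.shape[0]):
        for j in range(r + 1):             # runs over the thresholds b_ij
            for l in range(r + 1):         # l = 0, ..., r
                weights.append(a * l * directions[i])
                thresholds.append(a * l * b[i, j])
    return np.asarray(weights), np.asarray(thresholds)

# Illustrative instantiation: four directions in the plane, r = 3, a = 0.2,
# with one (hypothetical) set of distinct thresholds reused per direction.
V = np.array([[-1.0, 1.0], [-1/3, 1.0], [1/3, 1.0], [1.0, 1.0]])
B = np.linspace(-3/5, 3/5, 4)[None, :].repeat(4, axis=0)
W, T = algorithm2_hidden_layer(V, B, a=0.2, r=3)
print(W.shape, T.shape)   # (64, 2), (64,): (r+1)^2 = 16 hidden nodes per direction
```

The hidden-to-output weights would then be fitted exactly as in step 5 of the first algorithm.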
An important advantage of this algorithm is that the weights of the first layer are selected to be small. This eliminates part of the problems due to flat error surfaces. If we use the algorithm to initialize the network, the succeeding training algorithm will be much easier to use. At this moment, this is the only algorithm we know of that is capable of initializing the network with small weights and, simultaneously, guaranteeing bounds on the approximation error. The disadvantage is that a small a implies large values for the hidden-to-output layer weights. However, this does not cause flat error surfaces and, actually, it is not clear whether and why this may be a problem.

In the end, we observe that the computational complexity of both algorithms essentially depends on step 5. According to the previous discussion, step 5 can be fulfilled in O(M³) operations, where M = max(mq, n(d + 1)), m is the number of patterns on which the network error is evaluated, and d, n and q are the numbers of input, hidden and output neurons, respectively. So, the complexity is polynomial of order 3 with respect to the number of neurons and the number of patterns. In practice, this is an acceptable result if both m and n can be kept small.

4.8. Open Questions

With our approach, approximation of generic functions is not studied directly, but in terms of the polynomial approximation of the function we wish to realize. It would be interesting to discover to what degree our results depend on this choice. The formulation of a more direct approach to the problem appears difficult; in fact, all the results obtained in the literature have been derived by indirect approaches. On the other hand, a possible way could be to develop strategies alternative to ours, where the polynomials are replaced by other kinds of functions, e.g. sinusoids or RBFs. Each one could provide us with another point of view on the problem.

Possible variations of the training algorithms form further topics for future research: for example, the possibility of extending the training algorithm to deal not only with the weights of the output layer, but also with the thresholds, the directions of the hidden layer neurons, or the value of a, which could be different for each node. Until now, training algorithms have adjusted every parameter of the network simultaneously. This probably gives the best approximations, but it does not give any guarantee that the learning process will be successful, i.e. that it will converge. Other strategies may guarantee the convergence of the algorithm or select a compromise among the quality of the approximation, the robustness of the learning algorithm and the computational cost. For example, we could use networks whose weights are linked by special rules as in eqn (24). This means that in step 5 of the second algorithm we would also adjust the v_i, the b_{i,j} and a, changing, but only indirectly, the parameters of the hidden layer.

Another question concerns how the number of directions s, the number of nodes for each direction L_1,…,L_s and the value of a influence the approximation error. A large error after step 5 may depend on the number of directions, on the parameters L_1,…,L_s, or on a being too small (too large in algorithm 2). Being able to single out the reason would allow us to iterate the algorithm with more directions, more nodes or a different value of a, or to decide when to stop the iteration process.

One possible strategy might be as follows. To understand whether we are using a sufficient number of directions, we could study how the error decreases as we increase the number of nodes which contribute to a direction. If the rate of decrease of the error is too low, it means that we are using too few directions. If the number of directions is sufficient, the error depends only on how well the ridge functions are approximated. In this case, using the step functions and algorithm 1, the error would decrease with an order which should approximately depend on the second derivatives of f. The problem is similar to function integration: "How much would the integration error decrease if we increased the number of points where the function is evaluated?" In this case it becomes: "How much would the approximation error be affected if more nodes were used?"

Both a and the parameters L_1,…,L_s influence the error in the approximation of each ridge function, i.e. h_1,…,h_s. To distinguish these two types of error, we can suppose we have an infinite number of nodes and estimate the minimal error that we can obtain in this way. This is the error due to a. Then we have to compute what happens when we use a finite number of nodes. For example, applying this reasoning to the first algorithm we obtain the following.

Consider the mollifier g_ε(x) = ε^{−1}(lsig(ε^{−1}(x + b)) − lsig(ε^{−1}(x − b))) for some real b. The convolution of g_ε and f, i.e. (g_ε * f)(x) = ∫ g_ε(x − y) f(y) dy, represents an infinite sum of sigmoids, and (f(x) − (g_ε * f)(x))² is a bound on the part of the error that depends on a. The error for using a finite number of nodes then corresponds to the error due to approximating the integral by a finite sum.
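The following toy computation (ours, not the paper's) makes this error split concrete for a one-dimensional ridge function: a fine quadrature plays the role of the "infinite" sum of sigmoids, a coarse one that of a finite hidden layer. The choice b = ε/2, which makes the kernel integrate to approximately one, and all numerical values are assumptions made purely for the illustration.

```python
import numpy as np

def lsig(t):
    return 1.0 / (1.0 + np.exp(-t))

def g(x, eps, b):
    """Mollifier g_eps(x) = eps^-1 (lsig(eps^-1 (x + b)) - lsig(eps^-1 (x - b)))."""
    return (lsig((x + b) / eps) - lsig((x - b) / eps)) / eps

def conv(f, x, nodes, eps, b):
    """Quadrature approximation of (g_eps * f)(x) on the given nodes,
    i.e. a finite sum of (differences of) sigmoids."""
    dy = nodes[1] - nodes[0]
    return np.array([np.sum(g(xi - nodes, eps, b) * f(nodes)) * dy for xi in x])

f = lambda t: t * np.exp(-2.0 * t ** 2)          # a sample ridge function
x = np.linspace(-1.0, 1.0, 201)
eps = 0.2
b = eps / 2.0                                    # kernel then integrates to ~1

smooth = conv(f, x, np.linspace(-2.0, 2.0, 4001), eps, b)  # "infinitely many" nodes
finite = conv(f, x, np.linspace(-2.0, 2.0, 41), eps, b)    # a finite hidden layer

err_eps   = np.max(np.abs(f(x) - smooth))   # part of the error governed by eps (i.e. by a)
err_nodes = np.max(np.abs(smooth - finite)) # part due to the finite number of nodes
print(err_eps, err_nodes)
```

Shrinking eps reduces the first term but, for a fixed node spacing, inflates the second, which is the trade-off discussed above.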
TABLE 1
The main characteristics of the reviewed approaches

| Approach | Network type | Decomposition in ridge functions | Ridge function approximation | Error bound | Constructive approach |
|---|---|---|---|---|---|
| Cybenko (1989), Hornik (1990, 1993) | Three layers, ridge activations | Hahn–Banach theorem | — | None | No |
| Katsuura and Sprecher (1994) | Four layers, ridge activations | Kolmogorov's theorem | — | None | Yes |
| Kurkova (1992) | Four layers, ridge activations | Kolmogorov's theorem | — | O(n^{−1/(d+2)}) | Yes |
| Chui and Li (1992) | Four layers, ridge activations | Sums of powers | Hahn–Banach theorem | None | For polynomials and only the decomposition part |
| Chen et al. (1995), Carroll and Dickinson (1989) | Three layers, ridge activations | Radon transform | Step function argument | O(n^{−1/(d−1)}) | Not completely |
| Barron (1993) | Three layers, ridge activations | Fourier distributions | Step function argument | O(n^{−1/2}) | No |
| Mhaskar and Micchelli (1992, 1994) | Three layers, ridge activations | Fourier series | Fourier series | O(n^{−1/4}) | Not completely |
| Park and Sandberg (1991, 1993) | Three layers, RBFs | — | ε-mollifiers | O(n^{−1/d}) | Yes |
| Our algorithm 1 | Three layers, ridge activations | Sums of powers | Step function argument | O(n^{−1}) | Yes, for polynomials |
| Our algorithm 2 | Three layers, ridge activations | Sums of powers | eqn (24) | Any degree of precision with a finite number of neurons | Yes, for polynomials |
5. CONCLUSIONS

In this paper, we presented a survey of most of the available results in function approximation using feedforward networks. It is shown that, when the FNN has a ridge activation function, proofs with constructive characteristics consist of two parts: the function f is decomposed as a sum of generic univariate ridge functions with respect to a finite set of directions; then each ridge function is approximated by a linear combination of translations of the activation function. Different solutions for the two parts determine different properties of the results.

The computational aspects assumed in those works are further discussed. It is indicated that, while the general results are useful, it is more important to have a constructive algorithm with known computational complexity to realize the functions. It is shown that some of the existing approaches, while giving very general results, are of existence type in nature. Hence, it is rather difficult to find out how to utilize the results to give a computational algorithm with a known computational complexity.

Furthermore, some works give bounds on the number of nodes needed to realize a desired approximation. The works of Barron (1993) and Mhaskar and Micchelli (1992) give the lowest bounds: the number of hidden nodes is polynomial with respect to the error and the order is independent of the input dimension. In general, the bounds depend also on the method used to measure the complexity of the approximated function and on whether the proof is constructive. Thus, a satisfactory comparison is difficult and a topic of future research.

Table 1 summarizes the main characteristics of the approaches discussed in this paper.

We then turn our attention to approximating a special class of functions, viz. the set of algebraic polynomials. We have proved that algebraic polynomials are approximable up to any degree of precision by a finite number of hidden nodes. We have then considered a number of remarks, which can be deduced from the proof, on the structure of the network. In the end, we have discussed two variants of a constructive algorithm, with known computational complexity, which can be used to realize a given set of algebraic polynomials. A number of practical aspects of the algorithms are discussed.

Finally, some open questions concerning the approach are discussed.

REFERENCES

Abramowitz, M., & Stegun, I. (1968). Handbook of mathematical functions. New York: Dover.
Albus, J. (1975). A new approach to manipulator control: the cerebellar model articulation controller (CMAC). Transactions of the ASME, Journal of Dynamic Systems, Measurement and Control, 97, 220–227.
Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.
Blum, E., & Li, K. (1991). Approximation theory and feedforward networks. Neural Networks, 4, 511–515.
Carroll, S., & Dickinson, B. (1989). Construction of neural networks using the Radon transform. In IEEE International Conference on Neural Networks, vol. 1 (pp. 607–611). Washington, DC: IEEE.
Chen, T., Chen, H., & Liu, R. (1995). Approximation capability in C(R̄^n) by multilayer feedforward networks and related problems. IEEE Transactions on Neural Networks, 6(1), 25–30.
Chui, C., & Li, X. (1992). Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory, 70, 131–141.
Chung, K., & Yao, T. (1977). On lattices admitting unique Lagrange interpolations. SIAM Journal on Numerical Analysis, 14, 735–743.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.
Geva, S., & Sitte, J. (1992). A constructive method for multivariate function approximation by multilayered perceptrons. IEEE Transactions on Neural Networks, 3(4), 621–623.
Hecht-Nielsen, R. (1989). Kolmogorov's mapping neural network existence theorem. In International Joint Conference on Neural Networks, vol. 3 (pp. 11–14). Washington, DC: IEEE.
Hornik, K. (1990). Approximation capabilities of multilayer feedforward neural networks. Neural Networks, 4, 251–257.
Hornik, K. (1993). Some results on neural network approximation. Neural Networks, 6, 1069–1072.
Ito, Y. (1991). Approximation of functions on a compact set by finite sums of a sigmoid function without scaling. Neural Networks, 4, 817–826.
Katsuura, H., & Sprecher, D.A. (1994). Computational aspects of Kolmogorov's superposition theorem. Neural Networks, 7(3), 455–461.
Kreinovich, V. (1991). Arbitrary nonlinearity is sufficient to represent all functions by neural networks: a theorem. Neural Networks, 4, 381–383.
Kurkova, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5, 501–506.
Lee, S., & Philips, G. (1991). Construction of lattices for Lagrange interpolation in projective space. Constructive Approximation, 7, 283–297.
Leshno, M., Lin, V., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Mhaskar, H., & Micchelli, C. (1992). Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Mathematics, 13, 350–373.
Mhaskar, H., & Micchelli, C. (1994). Dimension independent bounds on the degree of approximation by neural networks. IBM Journal of Research and Development, 38(3), 277–283.
Nicolaides, R. (1972). On a class of finite elements generated by Lagrange interpolation. SIAM Journal on Numerical Analysis, 9, 435–445.
Park, J., & Sandberg, I.W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3(2), 246–257.
Park, J., & Sandberg, I.W. (1993). Approximation and radial-basis-function networks. Neural Computation, 5, 305–316.
Rudin, W. (1991). Principles of mathematical analysis, Sections 6 and 11.5. New York: McGraw-Hill.
Saseetharan, M., & Moody, M. (1992). A modified neuron model that scales and resolve network paralysis. Network, 3, 101–104.
Sontag, E. (1991). Remarks on interpolation and recognition using neural nets. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3. San Mateo, CA: Morgan Kaufmann.
Sprecher, D.A. (1965). On the structure of continuous functions of several variables. Transactions of the American Mathematical Society, 115, 340–355.
Widrow, B. (1990). 30 years of adaptive neural networks: perceptron, Madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415–1442.

NOMENCLATURE

g       an activation function
R       the set of real numbers
d       the number of inputs of the neural network
R^d     the set of d × 1 vectors of reals
gauss   the Gaussian function
lsig    the logistic sigmoid function
‖·‖     the Euclidean norm
′       the transpose operator