1998-scarselli-NN - Universal Approximation Using Feedforward Neural Networks A Survey of Some Existing Methods, and Some New Results PDF


Neural Networks, Vol. 11, No. 1, pp. 15–37, 1998
© 1998 Elsevier Science Ltd. All rights reserved
Pergamon Printed in Great Britain
0893–6080/98 $19.00+.00
PII: S0893-6080(97)00097-X

CONTRIBUTED ARTICLE

Universal Approximation Using Feedforward Neural Networks: A Survey of Some Existing Methods, and Some New Results

Franco Scarselli¹ and Ah Chung Tsoi²

¹Dipartimento di Ingegneria dei Sistemi e Informatica, Università di Firenze and ²Faculty of Informatics, University of Wollongong
(Received 27 July 1995; accepted 31 May 1997)

Abstract—In this paper, we present a review of some recent works on approximation by feedforward neural networks. A
particular emphasis is placed on the computational aspects of the problem, i.e. we discuss the possibility of realizing a
feedforward neural network which achieves a prescribed degree of accuracy of approximation, and the determination of
the number of hidden layer neurons required to achieve this accuracy. Furthermore, a unifying framework is introduced
to understand existing approaches to investigate the universal approximation problem using feedforward neural
networks. Some new results are also presented. Finally, two training algorithms are introduced which can determine the
weights of feedforward neural networks, with sigmoidal activation neurons, to any degree of prescribed accuracy. These
training algorithms are designed so that they do not suffer from the problems of local minima which commonly affect
neural network learning algorithms. © 1998 Elsevier Science Ltd. All rights reserved
Keywords: Approximation by neural networks, Approximation of polynomials, Constructive approximation, Feedforward neural networks, Multilayer neural networks, Radial basis functions, Universal approximation.

Acknowledgements: The authors wish to acknowledge financial support by the Italian Government in providing a traveling scholarship to FS to make the research reported here possible. In addition, ACT acknowledges partial financial support from the Australian Research Council.

Requests for reprints should be sent to Dr F. Scarselli, Dipartimento di Ingegneria dei Sistemi e Informatica, Università di Firenze, v. S. Marta, n. 350139, Firenze, Italy; Tel.: +39 55 4796 361; Fax: +39 55 4796 363; e-mail: [email protected].

1. INTRODUCTION

Many authors (Cybenko, 1989; Hecht-Nielsen, 1989; Carroll and Dickinson, 1989; Hornik, 1990, 1993; Park and Sandberg, 1991, 1993; Barron, 1993) recently addressed the question of approximation by feedforward neural networks (FNNs).¹ FNNs have been shown to be capable of approximating generic classes of functions, including continuous and integrable ones. These authors defined several classes of FNNs, each one capable of approximating a class of functions according to a set of approximation criteria.

¹ Without any further mention, all network architectures considered in this paper are fully connected. For example, in a three-layered FNN, the input and the hidden layer neurons and the hidden layer and the output layer neurons are, respectively, fully connected. However, there will not be any direct connections between the input and the output neurons.

Attention has been focused in particular on multilayered² FNNs with either a linear or a nonlinear activation function in the output layer, in which all hidden layer neurons have a nonlinear activation function $g(x, a, b)$. Here $x \in \mathbb{R}^d$ is the output of the previous layer and $a \in \mathbb{R}^d$, $b \in \mathbb{R}$ are parameters (weights) of the network.

² In this paper, we will follow the convention that both the input and output layers are counted as layers. In addition, both the input layer neurons and the output layer neurons will be included in the counting of the total number of neurons in the network.

It is well known that a two-layered FNN, i.e. one that does not have any hidden layers, is not capable of approximating generic nonlinear continuous functions (Widrow, 1990). On the other hand, FNNs with four or more layers are rarely used in practice; furthermore, the proof that they are universal approximators is simple. Hence, almost all the work deals with the most challenging issue of the approximation capability of three-layered FNNs.

Almost all the activation functions discussed in this paper, both in theory and used in practice, can be classified as radial basis functions (Park and Sandberg, 1991, 1993) or ridge functions (Cybenko, 1989; Chui and Li, 1992). A radial basis function (RBF) is given


as follows:

$$ g(x, a, b) = k\left(\frac{x - a}{b}\right) \tag{1} $$

where $g$ depends on a center $a$ and a smoothing factor $b$. $k(\cdot)$ is usually assumed to be integrable on $\mathbb{R}^d$, and $\int_{\mathbb{R}^d} k(x)\,dx \neq 0$. The radial basis functions adopted in applications usually depend only on the distance between the current input and the center, i.e. $g(x, a, b) = k(\|x - a\|/b)$, where $\|\cdot\|$ denotes the usual Euclidean norm. The gaussian radial basis function $\mathrm{gauss}(x, a, b) = e^{-(\|x - a\|^2/b)}$ is an example of this type of activation.

A ridge function has the following form:

$$ g(x, a, b) = \sigma(a'x + b) \tag{2} $$

where "$'$" is the transpose operator, $a$ is a $d \times 1$ vector, usually referred to as the direction of the ridge function, and $b$ is a scalar called the threshold. $\sigma(\cdot)$ is a nonlinear function which satisfies a number of conditions (this will be made clear in subsequent sections). The directions will play an important role in our discussions. For the moment, observe that the ridge activation functions are constant for inputs lying in planes orthogonal to the direction. The most common example is the logistic sigmoid function $\mathrm{lsig}(x, a, b) = 1/(1 + e^{-a'x - b})$.

So far, the obtained results consider approximation of scalar functions defined on a subset $K \subset \mathbb{R}^d$ by networks with one output. The extension to the case when the codomain is a subset of $\mathbb{R}^q$, $q > 1$, is usually said to be "straightforward", because it only requires putting together $q$ networks, each one approximating a function which maps the domain into a component of the codomain.³ We adopt the same solution in this paper.

³ Note, however, that this approach considers only the approximation aspects of the problem. It is an open and challenging question to define how different nodes can participate in approximating more than one output and how learning algorithms and the number of necessary nodes in the network can be influenced by this fact.

Let us denote by $\Sigma^l_g$ the class of functions which can be realized by FNNs with $(l - 2)$ hidden layers, where each hidden layer neuron is endowed with an activation function $g$ and there is a neuron with a linear activation in the output layer. For example, for three-layered FNNs, we have

$$ \Sigma^3_g = \left\{ f \;\middle|\; f(x) = \sum_{i=1}^{n} c_i\, g(x, a_i, b_i) + c_0,\; x \in \mathbb{R}^d,\; a_i \in \mathbb{R}^d,\; b_i \in \mathbb{R},\; 1 \le i \le n \right\} \tag{3} $$

From a mathematical point of view, in order to prove that a set of FNNs can approximate every function within a function class, e.g. continuous functions $C(\mathbb{R}^d)$, it is sufficient to show that $\Sigma^l_g$ is dense in $C(\mathbb{R}^d)$. Despite this, it is important to also consider its computational aspects. The methods used in proving that $\Sigma^l_g$ is dense in $C(\mathbb{R}^d)$ often have implicit or explicit consequences on their computational aspects. In some cases, the proof may suggest lower and upper bounds on the number of nodes of the network that realizes the desired approximation and/or give information about the assumed values of the network parameters and/or indicate the admissible structures of the network. Furthermore, the proof may be constructive or existential in nature, according to whether it describes how to obtain the network or it is carried out by an argument of reduction to absurdity. In other words, it is constructive if it is proven that the construction of the network is a computable problem.⁴ If the proof is constructive then it guarantees that the found results are effectively reachable. Moreover, a constructive proof may suggest a practical algorithm to construct the network.

⁴ Here, "computable" or not is defined in the sense used in computational complexity theory. The construction of a network is computable if there exists an algorithm on the Turing machine that takes a representation of the function in input and returns a representation of the approximating network.

The research in this field is in its infancy and most authors have paid little attention to the computational aspects of the approximation process. In this paper, we will review some existing results on approximation by FNNs and will discuss the computational aspects implicitly or explicitly assumed in such work. This kind of approach may not give an answer to all the questions arising from FNN constructions. For example, there may exist algorithms which are very efficient in practical applications, but cannot be used for proofs of approximation capabilities of FNNs. However, the review given in this paper would hopefully provide an interesting point of view on the problem.

The structure of this paper is as follows: in Section 2 we will introduce the notations used in this paper. In Section 3 we will present a review of some recent work on approximation by FNNs with particular emphasis on its computational aspects. In Section 4 we will present an alternative proof in using FNNs for polynomial and/or generic function approximations, and discuss a number of corollaries which can be deduced from the proof. For example, we will show that polynomials are approximable up to any degree of precision by FNNs with a finite number of hidden nodes. Furthermore, we propose two computational algorithms which can realize the desired approximation, whilst avoiding some of the problems of local minima which commonly affect neural network training algorithms.

2. NOTATIONS

According to common usage, a sigmoid is a function $\sigma$ defined in $\mathbb{R}$ with the following properties:

1. $\lim_{x \to -\infty} \sigma(x) = 0$; and
2. $\lim_{x \to \infty} \sigma(x) = 1$.
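As a concrete reading of eqns (1)–(3), the following Python sketch evaluates the gaussian RBF, the logistic sigmoid ridge function and a three-layered FNN with a linear output neuron. It is only an illustration under stated assumptions: the function names `gauss`, `lsig` and `three_layer_fnn`, the representation of vectors as tuples, and the sign-dependent branch used to keep `lsig` numerically stable are choices made here and are not part of the paper.

```python
import math

def gauss(x, a, b):
    """Gaussian RBF exp(-||x - a||^2 / b) with centre a and smoothing factor b > 0."""
    sq_dist = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
    return math.exp(-sq_dist / b)

def lsig(x, a, b):
    """Logistic sigmoid ridge function 1 / (1 + exp(-(a'x + b)))."""
    t = sum(ai * xi for ai, xi in zip(a, x)) + b
    # Split on the sign of t so that exp() never receives a large positive argument.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def three_layer_fnn(x, g, c0, hidden):
    """Evaluate f(x) = c0 + sum_i c_i * g(x, a_i, b_i), the form of eqn (3).

    hidden is a list of (c_i, a_i, b_i) triples, one per hidden neuron.
    """
    return c0 + sum(c * g(x, a, b) for c, a, b in hidden)

if __name__ == "__main__":
    a, b = (1.0, 2.0), 0.5
    # A ridge function is constant along directions orthogonal to a:
    # (2, -1) is orthogonal to (1, 2), so both inputs give the same activation.
    print(lsig((0.3, 0.1), a, b), lsig((2.3, -0.9), a, b))
    # An RBF of this type depends only on the distance to the centre.
    print(gauss((1.0, 0.0), (0.0, 0.0), 2.0), gauss((0.0, -1.0), (0.0, 0.0), 2.0))
    # Hypothetical two-neuron network with illustrative weights.
    net = [(1.5, (1.0, 2.0), 0.5), (-0.5, (0.0, 1.0), 0.0)]
    print(three_layer_fnn((0.3, 0.1), lsig, 0.2, net))
```

Passing the activation `g` as an argument keeps the same evaluation routine usable for both RBF and ridge networks, which matches the way the class $\Sigma^3_g$ is parameterized by $g$.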

The set of continuous functions on $\mathbb{R}^d$ is represented by $C(\mathbb{R}^d)$ and, without further notice, is equipped with the supremum norm. The sets of the functions with continuous derivatives up to order $r$ and the infinitely differentiable functions on $\mathbb{R}^d$ are denoted by $C^r(\mathbb{R}^d)$ and $C^\infty(\mathbb{R}^d)$, respectively. For functions defined almost everywhere in $\mathbb{R}^d$ with respect to the Lebesgue measure, three common norms are used:

$$ \|f\|_{L^\infty(\mathbb{R}^d)} = \operatorname*{ess\,sup}_{x \in \mathbb{R}^d} |f(x)| $$

$$ \|f\|_{L^p(\mathbb{R}^d)} = \left( \int_{\mathbb{R}^d} |f(x)|^p \, dx \right)^{1/p} $$

$$ \|f\|_{L^p(\mu)} = \left( \int_{\mathbb{R}^d} |f(x)|^p \, d\mu(x) \right)^{1/p} $$

where $\mu$ is a finite measure. The sets of functions essentially bounded according to these three norms, i.e. those with a finite norm, are denoted by $L^\infty(\mathbb{R}^d)$, $L^p(\mathbb{R}^d)$ and $L^p(\mu)$, respectively. Similar norms $\|\cdot\|_{L^\infty(K)}$ and $\|\cdot\|_{L^p(K)}$ and corresponding sets of functions are defined in the compact domain $K \subset \mathbb{R}^d$. When we say that a set of functions is dense in, respectively, $L^\infty(\mathbb{R}^d)$, $L^p(\mathbb{R}^d)$ or $L^p(\mu)$, without further notice, we mean that the set is dense with respect to the norm used to define the function class. Furthermore, the sets of locally bounded functions, denoted by $L^\infty_{loc}(\mathbb{R}^d)$ or $L^p_{loc}(\mathbb{R}^d)$, are the functions $f$ for which $\|f\|_{L^\infty(K)}$, $\|f\|_{L^p(K)}$ are finite for every compact subset $K$. To measure the smoothness of a function, we will use the modulus of continuity. For a function $f$ on $K \subset \mathbb{R}$, the modulus of continuity is defined by

$$ \omega(f, \delta) = \max\{ |f(x + \varepsilon) - f(x)| : x + \varepsilon,\, x \in K,\; 0 < \varepsilon \le \delta \} $$

It has the property that $\lim_{\delta \to 0} \omega(f, \delta) = 0$ if and only if $f$ is continuous. Furthermore, if $f$ satisfies the Lipschitz condition then $\omega(f, \delta)$ is $O(\delta)$ as $\delta \to 0$. The definition of the modulus of continuity is easily extended to $d$-variable functions. In fact, let $f$ be a function from $K$ to $\mathbb{R}$, where $K$ is a subset of $\mathbb{R}^d$, and consider the functions $h_{i,x}(y) = f(x_1, \ldots, x_{i-1}, x_i + y, x_{i+1}, \ldots, x_d)$, where $x_1, \ldots, x_d$ are the components of $x$ and $1 \le i \le d$ holds. The modulus of continuity of $f$ is defined in terms of the moduli of continuity of $h_{i,x}$:

$$ \omega(f, \delta) = \max\left\{ \sum_{i=1}^{d} \omega(h_{i,x}, \delta) : x \in K \right\} $$

Finally, $\|v\|$ is the Euclidean norm of the vector $v$ and $|a|$ is the modulus of the scalar $a$.

To avoid clumsy terminology and without further mention, FNNs are three layered with a linear activation function neuron in the output layer. The symbol $f$ denotes the function we wish to approximate, $\hat{f}$ the function computed by the network and $K \subset \mathbb{R}$ is a compact subset where $f$ is defined. According to whether the activation function $g$ is a RBF or a ridge function, the symbol $\Sigma^l_g$ may be replaced by $\Sigma^l_k$ or $\Sigma^l_\sigma$ and the term "activation function" itself may be used to indicate the functions $k$ and $\sigma$ of eqns (1) and (2), respectively.

3. REVIEW OF CURRENT LITERATURE

There does not appear to be any analytical formula which describes generic functions in terms of sigmoids, RBFs or other commonly used activation functions. Approximation properties are studied by indirect strategies, for example, with proofs by reduction to absurdity arguments or by showing that FNNs can approximate a certain set of functions which, in turn, is dense in $C(\mathbb{R}^d)$ or in $C(K)$. In addition, at least as it stands at present, approximation by ridge functions and approximation by RBFs require different approaches. We begin by reviewing works on ridge functions.

Simple and intuitive methods are available to understand the approximation capabilities of the following particular FNN classes:

1. hidden layer neurons with step activation functions and a total of four layers, i.e. two hidden layers;
2. hidden layer neurons with step activation functions, a total of three layers and one input only.

Hence it is worthwhile to commence our review with these particular cases, because the results involved will be referred to later in our discussion of more complicated examples.

3.1. FNNs with Step-Activation Function and Two Hidden Layers

Linear combinations of characteristic functions are commonly used for function approximation in the theory of integration (Rudin, 1991). A characteristic function of a set $A$ equals 1 for points in $A$ and 0 otherwise. Step functions are a type of characteristic function which are often used to discuss approximation properties of FNNs (Blum and Li, 1991; Kurkova, 1992; Geva and Sitte, 1992; Chen et al., 1995).

Assume that the FNN hidden layer neurons have the following step activation function:

$$ \mathrm{step}(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases} $$

Let us consider the hypercube $I = [a_1, b_1) \times [a_2, b_2) \times \ldots \times [a_d, b_d)$, where $[a_i, b_i)$ denotes an interval closed on the left and open on the right boundary. Observe that

$$ \mathrm{hstep}_I(x) = \begin{cases} 1 & x \in I \\ 0 & \text{otherwise} \end{cases} $$

can be re-written, after some simple algebraic manipulations, as

$$ \mathrm{hstep}_I(x) = \mathrm{step}\left( \sum_{i=1}^{d} \bigl( \mathrm{step}(x_i - a_i) - \mathrm{step}(x_i - b_i) \bigr) - d \right) $$
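The identity for $\mathrm{hstep}_I$ can be checked numerically. The Python sketch below builds the indicator of a half-open hypercube from step units only, mirroring the layer structure of the expression (2d step neurons, d linear neurons, one step output), and then sums such indicators over a uniform grid to obtain a piecewise-constant approximation. The names `hstep` and `grid_approx`, the argument layout and the cubic domain $[lo, hi)^d$ are assumptions made for illustration, not definitions from the paper.

```python
import itertools
import math

def step(t):
    """Step activation: 1 for t >= 0, 0 otherwise."""
    return 1.0 if t >= 0 else 0.0

def hstep(x, lower, upper):
    """Indicator of [lower_1, upper_1) x ... x [lower_d, upper_d), using step units only."""
    d = len(x)
    # First hidden layer: 2d step neurons, two per input coordinate.
    first = [(step(xi - ai), step(xi - bi)) for xi, ai, bi in zip(x, lower, upper)]
    # Second layer: d linear neurons; each equals 1 exactly when lower_i <= x_i < upper_i.
    second = [sa - sb for sa, sb in first]
    # Output step neuron fires only if all d coordinates lie inside their interval.
    return step(sum(second) - d)

def grid_approx(f, x, lo, hi, m):
    """Piecewise-constant approximation of f on [lo, hi)^d (illustrative helper).

    The domain is cut into m cells per axis; each cell contributes
    f(cell centre) * hstep(x, cell), and at most one indicator is non-zero.
    """
    d = len(x)
    p = (hi - lo) / m  # edge length of every cell
    total = 0.0
    for idx in itertools.product(range(m), repeat=d):
        lower = [lo + k * p for k in idx]
        upper = [lo + (k + 1) * p for k in idx]
        centre = [l + p / 2 for l in lower]
        total += f(centre) * hstep(x, lower, upper)
    return total

if __name__ == "__main__":
    f = lambda v: v[0] * math.exp(-2 * v[0] ** 2 - 2 * v[1] ** 2)
    x = (0.3, 0.1)
    # The pointwise error at x for progressively finer grids.
    for m in (5, 20, 80):
        print(m, abs(f(x) - grid_approx(f, x, -1.0, 1.0, m)))
```

The loop visits all $m^d$ cells, so the cost grows exponentially with the input dimension; this is acceptable for a two-dimensional demonstration only.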

Here $x_i$, $1 \le i \le d$, denotes the $i$th component of the vector $x$.

If our purpose is to approximate a continuous function $f$ over a compact subset $K$ of $\mathbb{R}^d$ then this task can be easily fulfilled by splitting the domain $K$ into a set of hypercubes $I_1, I_2, \ldots, I_n$ and writing

$$ \hat{f}(x) = \sum_{i=1}^{n} f(z_i)\, \mathrm{hstep}_{I_i}(x) \tag{4} $$

where $z_i$ is the center of $I_i$, the $i$th hypercube.

It is simple to observe that this expression can be interpreted in terms of neural networks as shown in Figure 1. It is observed that the first hidden layer consists of 2d neurons with step activation function, the second layer consists of d linear activation function neurons, and there is one output neuron with step activation function.

FIGURE 1. Interpretation of eqn (4).

For example, let us assume that $f(x, y) = x e^{-2x^2 - 2y^2}$ (Figure 2) and the domain $K$, where we wish to model $f$, is the square $-1 \le x \le 1$ and $-1 \le y \le 1$. Moreover, suppose that $n = 25$ and $I_1, \ldots, I_{25}$ are the 0.2 × 0.2 squares defined by a uniform grid on $K$. Figure 3 shows the output of the network that is defined in eqn (4).

FIGURE 2. The function $f(x, y) = x e^{-2x^2 - 2y^2}$.

In general, if the entire domain is contained in a hypercube $H$ whose edges are $T$ long, the hypercubes $I_1, I_2, \ldots, I_n$ have equal edges $p$ long and are defined according to a grid over $H$. The approximation error is easily bounded in the following manner:

$$ |f(x) - \hat{f}(x)| \le \omega(f, p) $$

and is $O(p)$ when $f$ satisfies the Lipschitz condition. Observe that $p = T/n^{1/d}$ and that 2d nodes are required for every hypercube. So, the number of nodes required to achieve a certain prescribed error accuracy is a polynomial of order $d$ with respect to the length $p$ of the edges and is exponential with respect to the dimension $d$. In both cases, the results are with respect to the uniform norm.

3.2. FNNs with Step Activation Function Hidden Layer Neurons, and One Input

A similar approach is possible for three-layered FNNs with one input, i.e. $d = 1$. In fact, every continuous function in $C(\mathbb{R})$ can be approximated by a staircase function, which in turn can be realized by a sum of step functions.

Let $K = [a, b]$ be the segment where the function $f$ is defined. We split $K$ into $n$ equal segments, i.e.

$I_{i+1} = [a_i, a_{i+1})$, $0 \le i \le n - 1$, where $a_i = pi + a$ and $p = (b - a)/n$. We further add a degenerate segment to deal with the right extreme of the interval,⁵ i.e. $I_{n+1} = \{b\}$. We approximate $f$ by a staircase function. In each interval $I_i$ the staircase function will have a constant value and will represent $f$ in the middle of the interval. The staircase function is easily written as a linear combination of the step functions which, in turn, can be interpreted as representing the output of a three-layered FNN with step activation functions:

$$ \mathrm{stair}_{f,K,n}(x) = \sum_{i=1}^{n+1} \bigl( f(z_i) - f(z_{i-1}) \bigr)\, \mathrm{step}(x - a_i) \tag{5} $$

where $z_i$, $1 \le i \le n + 1$, is the center of the interval $I_i$ and $z_0 = a_0$. Note that this three-layered FNN has $n + 1$ hidden layer step activation neurons, and one output layer neuron with linear activation. The weight connecting the $i$th hidden layer neuron and the output neuron is $f(z_i) - f(z_{i-1})$. The error, with respect to the uniform norm, is bounded by

$$ |f(x) - \mathrm{stair}_{f,K,n}(x)| \le \omega\left(f, \frac{b - a}{2n}\right) $$

This error is $O(1/n)$ when $f$ satisfies the Lipschitz condition.

⁵ This artifact is necessary as we define each segment to be closed on the left, and open on the right. Hence, in order to account for the closed interval $K$, it is necessary to add this degenerate segment at the end which consists of only a single point $\{b\}$.

FIGURE 3. The function $\hat{f}$ realized by the network.

Note that the case of approximating $f: \mathbb{R} \to \mathbb{R}$ is considerably simpler than the corresponding case of approximating $f: \mathbb{R}^d \to \mathbb{R}$. This observation will be explored later in Section 4.

The above reasoning can be extended to FNNs whose activation functions, in the limit, are similar to the step function. For example, the logistic sigmoid function lsig has the property that $\lim_{\alpha \to \infty} \mathrm{lsig}(\alpha x + \alpha b) = \mathrm{step}(x + b)$, for $x \neq -b$. Substituting the step functions by the lsig functions in eqn (5) and passing to the limit we obtain a function

$$ \lim_{\alpha \to \infty} \sum_{i=1}^{n+1} \bigl( f(z_i) - f(z_{i-1}) \bigr)\, \mathrm{lsig}(\alpha x - \alpha a_i) \tag{6} $$

that is equal to $\mathrm{stair}_{f,K,n}(x)$ for every $x \neq a_i$, $0 \le i \le n$. Furthermore, since $a_i$ approaches $z_{i+1}$ for $n \to \infty$ and $\mathrm{stair}_{f,K,n}(a_i) = f(z_{i+1})$, then $\mathrm{stair}_{f,K,n}$ converges to $f$ also in the points $a_i$, $0 \le i \le n$.

This idea was widely deployed in the literature (Blum and Li, 1991; Geva and Sitte, 1992; Kurkova, 1992; Leshno et al., 1993; Chen et al., 1995) where it is used as a starting point to derive more refined results. This approach applies to all sigmoids which have lower and upper bounds (bounded sigmoids).

In Chen et al. (1995), for example, it is shown that for every bounded sigmoid $\sigma$, $\Sigma^3_\sigma$ is dense in $C(\mathbb{R})$ and the proof follows the above-mentioned guideline closely. Geva and Sitte (1992) used a reasoning similar to what was described in the previous section about hstep to prove constructively that four-layered FNNs with logistic sigmoid activation functions are universal approximators. Observe that in this case, the input to hidden node weights, in the resulting network, have large magnitudes, as they are the product of the limiting process for $\alpha \to \infty$ in eqn (6). In Leshno et al. (1993), the authors note that if $\sigma$ has a point of discontinuity, say, in $b$, and is continuous in $[b - \varepsilon, b)$ and $[b, b + \varepsilon)$, where $\varepsilon$ is real, and $\varepsilon > 0$, then

$$ \lim_{\alpha \to 0} \sigma(\alpha x - \alpha b) = \begin{cases} y_l & x < b \\ y_r & x > b \end{cases} \tag{7} $$

where $y_l$ and $y_r$ are, respectively, the left and right limits of $\sigma$. Thus, by employing small input to hidden node weights it is possible to realize the step function by a

point of discontinuity in the activation function and to prove that $\Sigma^3_\sigma$ is dense in $L^p(\mu)$. It is further interesting to note that this approach uses a local property of the activation function. The activation function $\sigma$ can have any behavior far away from $b$ and the function can be realized using only a small part of $\sigma$. Of course, the approximation must be limited to a compact $K$, but this is not a problem in practice. Furthermore, the weights of the network can be bounded in magnitude using a small value of $\alpha$ in eqn (7). This is an important issue since both biological and artificial neurons do not normally have unbounded weights. However, this technique does not resolve the problem completely; the above approach (Leshno et al., 1993) uses a very large number of neurons which is not realistic in practice. In addition, it is sensitive to noise.

3.3. The Hahn–Banach Theorem

The elegant approach employed by Cybenko (1989) is based on a well-known theorem which is often used to establish the denseness of a set of functions, i.e. the Hahn–Banach theorem. Let $I = [0, 1]^d$. A function $\sigma$ is called discriminatory on $I$ if, for a signed Borel regular measure $\mu$,

$$ \int_I \sigma(a'x + b) \, d\mu(x) = 0 $$

for all $a \in \mathbb{R}^d$ implies that $\mu = 0$.

Using the Hahn–Banach theorem, Cybenko (1989) proves that

Proposition 1. If $\sigma$ is discriminatory on $I$ then $\Sigma^3_\sigma$ is dense in $C(I)$. Furthermore, any bounded sigmoid is discriminatory.

The proof is an existence proof, as the Hahn–Banach theorem is existential in nature.

In the work of Hornik (1990, 1993) this result is further extended. In particular, in Hornik (1993) some theorems are presented which encompass almost all recent results on FNNs with ridge functions. The theorems state that three-layered FNNs are universal approximators under very weak assumptions on the activation functions and suggest that nonpolynomiality of the activation function is the key property. He proves also that the approximation can be performed by weights bounded as close to 0 as required and that for some activation functions, a single threshold for the hidden layer is sufficient.⁶ More precisely, Hornik's theorem states that

Proposition 2. If $\sigma$ is a Riemann integrable and nonpolynomial ridge function on some closed interval $B$ and $A$ contains a neighborhood of the origin, then $\Sigma^3_\sigma(A, B)$ is dense in $C(K)$ with respect to the uniform norm for all compacts $K \subset \mathbb{R}$.

⁶ More precisely, by a network with a single threshold (in the hidden layer) we mean that all the thresholds (of the hidden layer neurons) have the same value.

Here $\Sigma^3_\sigma(A, B)$ denotes the class of functions which can be realized by three-layered FNNs [see eqn (3)], with the additional constraints $a_i \in A$, $b_i \in B$, $1 \le i \le n$.

Another theorem in Hornik (1993) states that if $\sigma$ is essentially bounded instead of being Riemann integrable then $\Sigma^3_\sigma(A, B)$ is dense in $L^p(\mu)$ for any compactly supported measure $\mu$.

Similar results about the importance of nonpolynomiality have been developed by Leshno et al. (1993), using a different approach. They assume that $\sigma \in L^p_{loc}(\mathbb{R})$ and that the closure of the points of discontinuity of $\sigma$ has 0 Lebesgue measure. Then they prove that $\Sigma^3_\sigma$ is dense in $L^p(\mu)$ and in $C(\mathbb{R}^d)$ if and only if $\sigma$ is not a polynomial.

Furthermore, Hornik (1993) proves that if $\sigma$ is analytic instead of Riemann integrable then there exists $b \in B$ such that $\Sigma^3_\sigma(A, \{b\})$, where $\{b\}$ denotes a single value, is dense in $C(K)$ and in $L^p(\mu)$.

The fact that a single threshold in the hidden layer is sufficient raises interesting questions. Can one use effectively a single fixed threshold? In this case, the problem becomes one of discovering whether and when the learning algorithms have a better performance than the ones which have a number of threshold units in the hidden layer.

This is a powerful and elegant approach which allows information to be derived on how we can constrain parameters while retaining the universal approximation property. However, it has the shortcoming of being an existence proof. It leaves open questions regarding the possibility of effectively realizing the approximation. The networks whose existence is guaranteed by the theorem could require a prohibitively long time to compute, or worse, they may not be computable at all.

An alternative strategy, which instead can give constructive solutions, consists of reducing the problem of approximation in $\mathbb{R}^d$ to a problem of approximation in $\mathbb{R}$, which, as we have seen in Section 3.2, is considerably simpler. This can be achieved, for example, by using a formula which describes functions in $\mathbb{R}^d$ in terms of a set of functions in $\mathbb{R}$. This is the case of the approaches based on the Fourier transform, the Radon transform and Kolmogorov's theorem.

3.4. Kolmogorov's Theorem

The earliest work where it is shown that FNNs are universal approximators is in Hecht-Nielsen (1989). Hecht-Nielsen uses an improved version of Kolmogorov's theorem due to Sprecher (1965) which states that every continuous function $f: [0, 1]^d \to \mathbb{R}$ can be written as

$$ f(x) = \sum_{h=1}^{2d+1} \Phi_h\left( \sum_{k=1}^{d} \lambda^h \psi(x_k + \varepsilon h) + h \right) \tag{8} $$

where the real $\lambda$ and the continuous monotonic increasing function $\psi$ are independent of $f$, the constant $\varepsilon$ is a positive rational number that can be chosen as close to 0 as desired and the continuous functions $\Phi_h$, $1 \le h \le 2d + 1$, depend on $f$.

Eqn (8) can be interpreted as representing a three-layered FNN where the $h$th hidden node computes the function $z_h = \sum_{k=1}^{d} \lambda^h \psi(x_k + \varepsilon h) + h$, the output node computes the function $\sum_{h=1}^{2d+1} \Phi_h(z_h)$ and $z_h$ is the output of the hidden layer. However, this is not one of the network architectures commonly used in practice and furthermore the proof of the Sprecher–Kolmogorov theorem was not constructive in nature; it only asserted the existence of the functions and parameters in eqn (8), but it did not prescribe how they can be obtained.

Katsuura and Sprecher (1994) resolve part of the problem by defining a method to realize the functions $\Phi_h$ and $\psi$ of eqn (8). They suppose that the function $\sigma: \mathbb{R} \to [0, 1]$ is continuous and that $\sigma(x) = 0$ for $x \le 0$ and $\sigma(x) = 1$ for $x \ge 1$ hold. Roughly speaking, the idea consists of approximating each of the functions $\Phi_h$, $1 \le h \le 2d + 1$, and $\psi$ by three-layered FNNs with activation function $\sigma$ and combining the networks together in a four-layered FNN. More precisely, their results can be stated as follows.

Proposition 3. Given a continuous function $f$, there are three-layered networks $N_{\Phi_{h,i}}$ and $N_{\psi_i}$, where $1 \le h \le 2d + 1$ and $i > 0$, that realize the functions $\Phi_{h,i}$ and $\psi_i$, and there are four-layered networks $N_{\hat{f}_i}$, where $i > 0$, that realize the functions $\hat{f}_i = \sum_{h=1}^{2d+1} \Phi_{h,i}\left( \sum_{k=1}^{d} \lambda^h \psi_i(x_k + \varepsilon h) + h \right)$ such that:

1. $\lim_{i \to \infty} \Phi_{h,i} = \Phi_h$ and $\lim_{i \to \infty} \psi_i = \psi$; moreover, $\lim_{i \to \infty} \hat{f}_i = f$ uniformly with respect to the norm $L^\infty([0, 1]^d)$.
2. For each $i > 0$, the network $N_{\hat{f}_i}$ can be realized by putting together a number of instances of the networks $N_{\Phi_{h,i}}$ and $N_{\psi_i}$, $1 \le h \le 2d + 1$.
3. All the parameters of the networks are independent of $f$ except for hidden layer to output layer weights of networks $N_{\Phi_{h,i}}$ and $N_{\hat{f}_i}$.

An interesting aspect of the approach is that the four-layered network $N_{\hat{f}_i}$ is constructed by putting together instances of simpler networks. Thus, the network $N_{\hat{f}_i}$ can be represented by a number of parameters that is lower than the total number of weights. In fact, the complexity of the network $N_{\hat{f}_i}$ depends on the sum of the complexities of the networks $N_{\Phi_{h,i}}$, $1 \le h \le 2d + 1$, and $N_{\psi_i}$.

The reals $\lambda^h$ and $\varepsilon$ and the networks $N_{\Phi_{h,i}}$ and $N_{\psi_i}$, $1 \le h \le 2d + 1$, are computable, so that the networks $N_{\hat{f}_i}$ are constructible. On the other hand, a bound on the error is not given and it is not known how many neurons must be used to obtain a desired degree of accuracy. Kurkova (1992) defines a bound using an approximated version of Kolmogorov's theorem. Accordingly, every continuous function $f$ is representable by a formula like eqn (8), where the range of the index $h$ of the first summation and the number of functions $\Phi_h$ are not fixed a priori, but depend on $f$. Kurkova's result states the following:

Proposition 4. For each real $\varepsilon$ and integers $n$ and $v$ such that $n \ge 2d + 1$, $d/(n - d) + v < \varepsilon/\|f\|_{L^\infty}$ and $\omega(f, 1/n) < v(n - d)/(2n - d)$, there is a four-layered FNN, with $dn(n + 1)$ nodes in the first hidden layer and $n^2(n + 1)^d$ in the second hidden layer, that approximates $f$ with an accuracy $\varepsilon$ with respect to the uniform norm. Furthermore, only the weights from the second hidden layer to the output depend on $f$ while the other weights are fixed for all functions $h$ for which $\|h\|_{L^\infty} \le \|f\|_{L^\infty}$ and $\omega(h, \delta) \le \omega(f, \delta)$ for each $\delta > 0$.

Since the above inequalities give $\|f\|_{L^\infty} < \varepsilon(n - d)/[d + \omega(f, 1/n)(2n - d)]$, then for a function which satisfies the Lipschitz condition, the error decreases as $O(n^{-1/(d+2)})$. This is a result which prevents use of the method for nonsmall values of $d$. In fact, the proof of Kurkova (1992) provides a partition of the domain and every node in the second layer deals with one element of the partition. For this reason, the number of nodes grows exponentially with respect to the dimension $d$.

On the other hand, an interesting aspect of the works of Kurkova (1992) and Katsuura and Sprecher (1994) is that only the weights from the second hidden layer to the output layer depend on $f$. Even if this was not explicitly mentioned by the authors, this fact, together with the assumption that the output layer nodes use a linear activation function, allows an easy realization of the network. In fact we can realize the first two layers of the network by the proofs of Kurkova (1992) or Katsuura and Sprecher (1994); this part of the network is fixed and can be re-used. Then we can adjust the second hidden layer to the output layer weights depending on $f$. If we adopt the $L^2$ norm, as is common in practice, then the error function is quadratic with respect to the weights and can be adjusted whilst avoiding the problems with local minima. We will come back to this simple property in Section 4, where we will use it to propose two new algorithms.

3.5. Polynomials

Chui and Li (1992) prove that if $\sigma$ is a sigmoid and $\Sigma^3_\sigma(\mathbb{Z}^d, \mathbb{Z})$ is the set of FNNs with integer parameters, where $\mathbb{Z}$ denotes the set of integers, then $\Sigma^3_\sigma(\mathbb{Z}^d, \mathbb{Z})$ is dense in $C(K)$. In fact, polynomials are known to be dense in $C(\mathbb{R}^d)$. Chui and Li construct their proof by showing that it is possible to realize polynomials as a sum of ridge functions. Then they follow the arguments used by Cybenko (1989) for the approximation of ridge functions by sigmoids.

The ridge functions, in this case, are powers of linear combinations of the inputs, i.e. functions in the form of $h(a'x) = (a'x)^r = (a_1 x_1 + \ldots + a_d x_d)^r$, $r \ge 0$.

Chui and Li (1992) prove a theorem that states that every homogeneous polynomial (footnote 7) of d variables and degree r is equal to a linear combination of
$$N = \binom{r+d-1}{d-1}$$
functions in the form of h, say $h_1,\ldots,h_N$ with $h_i(x) = (a_i'x)^r = (a_{i,1}x_1 + \ldots + a_{i,d}x_d)^r$ (footnote 8). Since generic polynomials are expressible by linear combinations of homogeneous polynomials, the theorem is sufficient to prove that any polynomial can be expressed by linear combinations of $h_1,\ldots,h_N$.

Footnote 7: The homogeneous polynomials are polynomials in which every term has the same degree.
Footnote 8: A similar statement is also in the work of Ito (1991).

According to their proof, let $H_r^d$ be the set of all homogeneous polynomials with d variables and degree r and note that
$$\binom{r+d-1}{d-1}$$
is the dimension of the vector space. Thus, the cited theorem is equivalent to proving that $h_1,\ldots,h_N$ are linearly independent, which, in turn, corresponds to proving that
$$\sum_{i=1}^{N} \alpha_i h_i(x) = 0 \qquad (9)$$
is satisfied only by $\alpha_i = 0$, $1 \le i \le N$. Suppose, by using the argument of reduction to absurdity, that nontrivial $\alpha_i$ exist such that eqn (9) is verified. Let us represent by $m_1,\ldots,m_N$ all distinct elements in $Z^d$ with $\sum_{j=1}^{d} m_{i,j} = r$ for each i, where $m_i = (m_{i,1}, m_{i,2},\ldots,m_{i,d})$. For example, if r = 3 and d = 2, then the values of $m_i$ are (0,3), (1,2), (2,1) and (3,0). This is a convenient way to represent all possible partial derivatives of a multivariate polynomial with d variables. Let us denote the partial derivative with respect to the d variables as $D^{m_i} = (\partial/\partial x_1)^{m_{i,1}} \cdots (\partial/\partial x_d)^{m_{i,d}}$, where $\sum_{j=1}^{d} m_{i,j} = r$. Taking all partial derivatives of degree r of both sides of eqn (9), we obtain a set of algebraic equations:
$$\sum_{i=1}^{N} \alpha_i (D^{m_j} h_i)(x) = 0, \quad 1 \le j \le N \qquad (10)$$
Let us denote $[\alpha_1,\ldots,\alpha_N]'$ by $\alpha$ and
$$A = \begin{bmatrix} a_1^{m_1} & \cdots & a_N^{m_1} \\ a_1^{m_2} & \cdots & a_N^{m_2} \\ \vdots & & \vdots \\ a_1^{m_N} & \cdots & a_N^{m_N} \end{bmatrix} \qquad (11)$$
where $a_i^{m_j}$, i = 1,…,N, j = 1,…,N, is a convenient way to express constants which depend on the values of $m_j$ and the elements of $a_i$. For example, let $h_i(x) = (a_{i,1}x_1 + a_{i,2}x_2)^3$. For $m_1 = (0,3)$, then $a_i^{m_1} = D^{m_1}h_i(x) = 6a_{i,2}^3$. For $m_2 = (1,2)$, then $a_i^{m_2} = D^{m_2}h_i(x) = 6a_{i,1}a_{i,2}^2$ and so on. Eqn (10) becomes $A\alpha = 0$. This implies that A must be singular by the assumption $\alpha \ne 0$.
Now, consider the system of equations $A'\beta = 0$, where $\beta = [\beta_1,\ldots,\beta_N]'$ is a vector of reals. It is a system of algebraic equations:
$$\sum_{i=1}^{N} \beta_i a_j^{m_i} = \sum_{i=1}^{N} \beta_i \left(a_{j,1}^{m_{i,1}} \cdots a_{j,d}^{m_{i,d}}\right) = 0 \qquad (12)$$
for j = 1,2,…,N; which is an interpolation problem, where the question is to find a polynomial of order r which has a value of 0 at the points $a_1,\ldots,a_N$. Chui and Li (1992), studying this latter problem, deduce conditions that guarantee the nonsingularity of A.

Proposition 5. Let A be defined as in eqn (11). In order for A to be nonsingular, it is sufficient to select directions so that the set $\{(a_{i,2}/a_{i,1},\ldots,a_{i,d}/a_{i,1}) \mid a_{i,1} \ne 0,\ 1 \le i \le N\}$ has the unique interpolation property, i.e. it admits a unique interpolation polynomial of order r.

Since there exist subsets of $Z^d$ that have the unique interpolation property for any r, this contradicts our previous deduction that A must be singular, and hence the proposition is proved.
The proof, as carried out by Chui and Li (1992), is not completely constructive, because: (1) a part of the proof uses the existence proof argument in Cybenko (1989), and (2) the statement that the polynomials are dense in C(K) is only an existence argument. However, the powers $h_i$ can be approximated by any one of the methods we have discussed previously and the Taylor series or polynomial interpolation can provide a polynomial which approximates f. So the construction of the network can be easily fulfilled. We will return to this question in Section 4 when we will describe our approach.

3.6. The Radon Transform
The Radon transform and, more precisely, its inverse are the cornerstones of computerized axial tomography. The Radon transform $\tilde f$ represents a function $f \in C^\infty(K)$ by the sets of all integrals over hyperplanes in $R^d$. Furthermore, by the inverse Radon transform, f can be written as
$$f(x) = \int_{\|v\| = 1} h(v'x, v)\, d\mu(v) \qquad (13)$$
where h is called the filtered back-projection function and is obtained by differentiation and Hilbert transforms of the Radon transform $\tilde f$.
So, the idea is to transform the integral in eqn (13) into a sum of a finite number of terms and to approximate f by
$$\hat f(x) = \sum_{i=1}^{n} h(v_i'x, v_i)\,\mu_i \qquad (14)$$

where $v_1,\ldots,v_n$ are a set of unitary vectors and $\mu_i = d\mu(v_i)$. The $v_i$ can be defined by a grid on the surface of a unit sphere. Each term $h(v_i'x, v_i)$ is a ridge function and is, in fact, obtained by the variable substitution $z = v_i'x$, a function in C(R).
In Chen et al. (1995), the authors show, by using this argument, that every function in $C(\bar R^d)$ can be uniformly approximated by $\hat f$. They prove, by using the step functions argument shown in Section 3.1, that every term in eqn (14) can be realized by a FNN with a bounded sigmoid $\sigma$ and conclude that $\Sigma^3_\sigma$ is dense in $C(\bar R^d)$. Here $C(\bar R^d)$ is the set of continuous functions on $R^d$ with finite limit for $\|x\| \to \infty$.
Carroll and Dickinson (1989) estimate the error due to eqn (14) instead of eqn (13) with the following result.

Proposition 6. Let n be the number of terms in eqn (14), B the ball $\{x \mid |x| \le r_B\}$ and $f \in C^\infty(R^d)$ have a compact support included in the ball B. Then for each x in the ball
$$|f(x) - \hat f(x)| \le \frac{2\pi}{n^{1/(d-1)}} \left(v_B\, \pi\, (r_B + 1)^d\right) \max_{-1 \le a \le 1,\ \|u\| = 1} \|\nabla h(a, u)\|$$
where $v_B$ is the volume of the ball.

The error is polynomial of order $O(n^{-1/(d-1)})$ with respect to the number of terms.
The technique is constructive in nature; however, the need for a polynomial number of nodes of order $O(n^{-1/(d-1)})$ can prevent a direct application of the scheme in practice. Furthermore, the techniques commonly used to build filtered back-projection functions h should be adapted to this particular case. However, we observe that this approach, with others surveyed in this paper, allows us to investigate more closely the problem of constructive approximations.

3.7. Fourier Transform and Series
Other approaches use the Fourier distributions and series. Barron (1993) adopts the Fourier distributions to derive bounds on the number of nodes. The Fourier distribution of a function f is a measure $\tilde F(dw) = e^{iv(w)}F(dw)$, where $e^{iv(w)}$ and $F(dw)$ are the phase and the magnitude distributions, respectively, such that
$$f(x) = \int e^{iwx}\, \tilde F(dw) \qquad (15)$$
If B is a bounded set of $R^d$ over which we wish to approximate f, then we can relax eqn (15), requiring that it is satisfied only for $x \in B$. Barron (1993) defines the constant $C_f$ which gives an estimate of the complexity of a function in the set B
$$C_f = \int |w|_B\, F(dw)$$
where $|w|_B = \sup_{x \in B} |wx|$. The functions for which $C_f$ is finite will be denoted by G. Barron (1993) proves that

Proposition 7. For every ball $B_r = \{x \mid |x| \le r\}$, every probability measure $\mu$ with support on $B_r$, every bounded sigmoidal activation function $\sigma$, every function $f \in G$ and every positive integer n, there is a three-layered FNN with n hidden nodes such that it computes the output function $\hat f$ and
$$\|f - \hat f\|_{L^2(\mu)} \le \frac{2rC_f}{\sqrt{n}} \qquad (16)$$

The first part of the proof of the proposition follows the same idea of Section 3.6, where the Radon transform is replaced by the Fourier transform. In fact, since f is real-valued, we obtain
$$f(x) - f(0) = \int (e^{iwx} - 1)\, \tilde F(dw)$$
$$= \int_{R^d - \{0\}} (e^{iwx} - 1)\, e^{iv(w)} F(dw)$$
$$= \int_{R^d - \{0\}} \left(\cos(wx + v(w)) - \cos(v(w))\right) F(dw)$$
$$= \int_{R^d - \{0\}} \cos(wx + v(w))\, F(dw) - \int_{R^d - \{0\}} \cos(v(w))\, F(dw)$$
So f is represented by an infinite sum of sinusoids plus a constant. Barron (1993) uses the argument of the step function to prove that each sinusoid can be approximated by bounded sigmoids and concludes that $\Sigma^3_\sigma$, for every bounded sigmoid $\sigma$, is dense in G. The proof uses the norm $L^2(\mu)$.
The bound on nodes is derived by a lemma on Hilbert spaces, which states

Proposition 8. Let $G_\lambda$ be a subset of a Hilbert space for which every $h \in G_\lambda$ is bounded by a constant $\lambda$. If f is in the closure of the convex hull of $G_\lambda$ and $c > \lambda^2 - \|f\|^2$ then, for every n, there exists a function $f_n$ in the convex hull of n points of $G_\lambda$ such that
$$\|f - f_n\|^2 \le \frac{c}{n}$$

A proof of the lemma is found in Barron (1993). The result applies to our case with $G_\lambda = \{h \mid h(x) = a\sigma(v'x + b),\ a, b \in R,\ |a| < \lambda,\ v \in R^m\}$.
This lemma is an existence result, so that the whole proof is nonconstructive. Thus an open question, which arises from this, is whether the same results can be reproduced in a constructive manner. On the other hand, proposition 7 states that, for the class of functions G, the error decreases as $O(n^{-1/2})$, so that the order of the error does not depend on d. It is also observed the number of parameters of the network grows as $O(n^{-1/2})$. This is an important aspect from a computational point of view, because it states that for this class of functions the curse of the dimensionality has no effect. Furthermore, as

Barron noticed, the result appears to suggest that approximation by sigmoids has some advantages compared with other common traditional approximation techniques, e.g. polynomials, splines, which, in contrast, require an exponential number of parameters.
Another approach, put forward by Mhaskar and Micchelli (1992, 1994), is based on the Fourier series. The Fourier series of a function f is defined as
$$f(x) = \sum_{k \in Z^d} \tilde f(k)\, e^{ikx} \qquad (17)$$
where $\tilde f(k)$, $k \in Z^d$, are the Fourier coefficients. In particular, Mhaskar and Micchelli address the question of periodic functions for which $\|f\|_{SF} = \sum_{k \in Z^d} |\tilde f(k)|$ is finite. The basic idea is similar to the one adopted above. Since eqn (17) represents f, we will truncate the infinite sum to a finite set of elements and rewrite $e^{ikx}$ in terms of the activation functions. Mhaskar and Micchelli (1992), however, take this another step further and show that if we reorder the terms in eqn (17) in a decreasing order with respect to $|\tilde f(k)|$ the error goes to 0 as $O((m+1)^{-1/2})$, where m is the number of total terms. Formally, we obtain
$$\|f - \bar f\|_{L^2(Q^d)} \le \frac{a_m (2\pi)^{d/2} \|f\|_{SF}}{\sqrt{m+1}}$$
where
$$\bar f(x) = \sum_{k \in L} \tilde f(k)\, e^{ikx} \qquad (18)$$
Q is the interval $[-\pi, \pi]$, $\{a_m\}$ is a set of positive reals which converges to 0 as $m \to \infty$ and L is a set containing the m vectors for which the values $|\tilde f(k)|$ attain the maximum. Then, it is proved that the exponential term $e^{ikx}$ can be substituted by the sums of activation functions. The proof is different from the ones we have reviewed so far in this overview. We will omit the proof here and refer the interested readers to the paper (Mhaskar and Micchelli, 1994).
Mhaskar and Micchelli conclude that f can be approximated by $\hat f$, which is a sum of $2n^2$ activation functions, with the error bounded to lie in the following range:
$$\|f - \hat f\|_{L^2(Q^d)} \le \frac{b_n (2\pi)^{d/2}}{\sqrt{n+1}} \left(1 + \frac{\|\sigma\|_{SF}}{|\tilde\sigma(1)|}\right) \|f\|_{SF} \qquad (19)$$
where $\{b_n\}$ is a set of positive reals which converges to 0 as $n \to \infty$. So the error is $O(n^{-1/4})$ with respect to the number of nodes of the network. In fact looking at the proof, it is observed that in order to make the error decrease linearly, $O(n^2)$ terms are required in eqn (18) and, for each term, there are $O(n^2)$ nodes. Here, the basic assumptions are

1. the activation $\sigma$ is periodic; and
2. $\|\sigma\|_{SF}$ is finite.

The construction of the network requires a knowledge of the Fourier coefficients. Furthermore, Fourier coefficients must be sorted according to their magnitude. This could be a problem in the sense that the range of Fourier coefficients may be large, and it is not clear how to order them and what is the computational complexity of the problem. With respect to the approach used by Barron (1993), this one has the advantage of being ‘‘almost’’ constructive in nature, in that it assures that the bounds can be effectively obtained by a computational algorithm. The bound is weaker [$O(n^{-1/4})$ compared with $O(n^{-1/2})$]; however, the constraints on f and on $\sigma$, respectively, are weaker as well. Observe, further, as in Barron (1993), the order of the error does not depend on the dimension of the input space.
In the end, the bounds found, respectively, by Barron and Mhaskar and Micchelli are the lowest that we are aware of. It would be interesting to further understand how this result depends on the particular manner Barron and Mhaskar and Micchelli adopt to measure the complexity of the function f. Note that the orders of the error [$O(n^{-1/2})$ and $O(n^{-1/4})$, respectively] are independent of the dimension of the input space d, so that the error appears not to suffer the curse of dimensionality. However, the error depends also on the constants $C_f$ and $\|f\|_{SF}$ [eqns (16) and (19)]. Barron notices that $C_f$ could be large and grow exponentially with respect to d, for some class of functions [in Barron (1993) values of $C_f$ are listed for different types of functions]. Note that, according to its definition, $C_f$ measures the complexity of the functions in terms of the Fourier transform and, in turn, the sinusoidal functions. In this sense, those results depend on a particular method of measuring function complexity and it is a topic for further research to evaluate such a method and to compare it with other solutions.

3.8. Radial Basis Functions
For FNNs which use radial basis functions (RBFs) in the hidden layer activation functions, Park and Sandberg (1991, 1993) adopt the following arguments to show that they constitute a class of universal approximators for a general nonlinear mapping f.
Suppose that $g(x,a,b) = k((x-a)/b)$ is a RBF, k is integrable on $R^d$ and that $\int_{R^d} k(x)\,dx \ne 0$. Assume, without loss of generality, that $\int_{R^d} k(x)\,dx = 1$; if the integral is different from 1, we substitute k with $\bar k(x) = k(x) / \int_{R^d} k(x)\,dx$. The function $k_\varepsilon(x) = \varepsilon^{-d} k(x/\varepsilon)$, $\varepsilon > 0$, is usually called the $\varepsilon$-mollifier of k. The convolution of $k_\varepsilon$ and $f \in L^p(R^d)$ is denoted by $k_\varepsilon * f$ and is defined by $(k_\varepsilon * f)(x) = \int f(y)\, k_\varepsilon(y - x)\, dy$. The convolution satisfies $\|f - k_\varepsilon * f\|_{L^p(R^d)} \to 0$ when $\varepsilon \to 0$ for $1 \le p < \infty$ [see Park and Sandberg (1991) for the proof].
The idea is to fix $\varepsilon$ so that $k_\varepsilon * f$ is sufficiently close to f and, then, to approximate $k_\varepsilon * f$ by a finite sum in the form of
$$\hat f(x) = \sum_{i=1}^{n} f(y_i)\, k_\varepsilon(y_i - x) = \sum_{i=1}^{n} f(y_i)\, g(x, -y_i, -\varepsilon)$$

where the $y_i$, i = 1,2,…,n, are selected according to a grid over $R^d$. If f and k are both continuous, then $f(\cdot)k_\varepsilon(\cdot - x)$ is Riemann integrable and the convergence is assured. Park and Sandberg prove that

Proposition 9. If $g(x,a,b) = k((x-a)/b)$ is a RBF and k is integrable then $\Sigma^3_g$ is dense in $L^1(R^d)$ if and only if $\int_{R^d} k(x)\,dx \ne 0$.

Park and Sandberg give other conditions for the other norms (see Park and Sandberg, 1991, 1993 for details).
The proof is ‘‘almost’’ constructive in nature. There are, however, two open questions associated with this work, i.e.

1. how to determine the value of $\varepsilon$; and
2. how to determine the refinement of the grid which allows the required approximation.

A problem of this approach is the polynomial growth rate of the value of n with order d with respect to the distance of the points in the grid. This well-known problem of RBFs is clarified by the proof in Park and Sandberg (1991, 1993). When $\varepsilon \to 0$, $k_\varepsilon$ becomes a function with a neighborhood of the center as support. A node influences the outputs of the network only for inputs near to its center, so that we require an exponential number of neighborhoods to cover the entire domain. This suggests that RBFs are well suited for problems with a small number of inputs, i.e. when d is small.
Another interesting point is that it is possible to fix the weights between the input and the hidden layer neurons, according to a sufficiently fine grid and a small a. Adjusting only the weights from the hidden layer to the output layer neurons, the network can approximate every function to an arbitrary degree of accuracy.
In the next section, we will prove that it is possible to adjust the weights without the usual associated problems of local minima in the training algorithm. This suggests that RBFs could be particularly suited for those problems for which the ridge functions require an exponential number of nodes.

4. OUR APPROACH TO POLYNOMIAL APPROXIMATION
In the following, we will consider principally the problem of constructive approximation of the class of algebraic polynomial functions. A solution to the problem will be given. More precisely, it will be proven that polynomials can be approximated up to any degree of precision by FNNs with a fixed number of hidden layer neurons. Then, the discussion will turn to the structure of the network which is necessary to realize a prescribed degree of approximation. In addition, we will give a number of remarks and results.
The motivation of this study is that polynomials are dense in the continuous functions and can give us important suggestions about the solution of the general problem of approximation by FNNs. In fact, in some cases we will be able to extend, to continuous function approximation, the results that hold for polynomial function approximation.
In general, the results will give us an intuitive idea of the general solution. Of course, sometimes the intuition might be vague or even misleading. However, from the overview contained in the previous section it is evident that all the existing approaches have this common drawback, because they study the problem of continuous function approximation by properties of a class of functions or a representation formula. For example, results in Section 3.3 depend on properties of discriminatory functions, the ones in Section 3.4 on representation of functions by Kolmogorov’s theorem and so on. In this sense, the approach we propose here provides one of the possible ways to consider the problem.

4.1. From d-Variable Functions to Univariate Functions
If we carefully review the discussions in the previous sections, we will note that the constructive proofs follow a common strategy when the FNN has ridge activation functions. The proof consists of two parts.

1. The function f is decomposed as a sum of generic ridge functions with respect to a finite set of directions.
2. Each ridge function is approximated by a linear combination of translations of the activation function.

Different solutions are available for both parts of the proof and many properties of the proof depend on how different solutions are combined together.
From an intuitive point of view, the proofs split the domain into a number of partitions. In fact, the weights associated with a node identify a hyperplane $\{x \mid v_i'x + b_i = 0\}$, where $v_i$ is the direction and $b_i$ the threshold. If the activation function is the logistic sigmoid, the hyperplane divides the domain into two regions, one for which the outputs of the node are greater than 1/2 and the other one with values smaller than 1/2. The nodes participating to realize the same ridge function form a set of parallel hyperplanes and all the hyperplanes of a network together split the domain in a particular partition. If thresholds are selected using the method of the staircase functions as indicated in Section 3.1, the partition is uniform and distributed over all the domain. For example, Figure 4 represents a possible partition for d = 2 and the set of directions {(1,−1), (1,1/3), (1,−1/3), (1,1)}. The hyperplanes are straight lines in $R^2$; the figure shows seven hyperplanes for each direction, each hyperplane corresponding to a different value of the threshold $b_i$.
The approach by Park and Sandberg for FNNs with RBFs (Section 3.8) presents some similarities, since, in

FIGURE 4. A partition generated by the set of directions {(1,−1), (1,1/3), (1,−1/3), (1,1)}.

that case, the partition is fixed by a grid over the entire domain. However, compared with RBFs, the grid is defined by hyperplanes orthogonal to the axes, so that the number of sets of parallel planes is just equal to the dimension of the input space. On the other hand, ridge activation functions require a number of sets of parallel planes which depends on the approximated function. This fact can be exploited as indicated in the following subsections.

4.2. Directions of the Ridge Activation Functions
A first question to answer is how many directions are required to approximate a polynomial. We have observed (Section 3.5) that every homogeneous polynomial of degree r can be written as a sum of powers $(v_1'x)^r,\ldots,(v_s'x)^r$ provided that $S = \{(v_{i,2}/v_{i,1},\ldots,v_{i,d}/v_{i,1}) \mid v_{i,1} \ne 0,\ 1 \le i \le s\}$ satisfies the unique interpolation property for polynomials of degree r. It is observed that if the property is satisfied for polynomials of order r then there is a subset of S which satisfies the unique interpolation property for polynomials of degree $r'$, where $r' \le r$. Applying the arguments used by Chui and Li (1992) (see Section 3.5) to every $0 \le r' \le r$, we obtain

THEOREM 1. For every polynomial p(x) of d variables and degree less than or equal to r, every set of directions $\{v_1,\ldots,v_s\}$, with
$$s = \binom{r+d-1}{d-1},$$
such that the set $\{(v_{i,2}/v_{i,1},\ldots,v_{i,d}/v_{i,1}) \mid v_{i,1} \ne 0,\ 1 \le i \le s\}$ satisfies the unique interpolation property for polynomials of order r, there exist polynomials $h_1,\ldots,h_s$ of one variable and degree not greater than r such that
$$p(x) = \sum_{i=1}^{s} h_i(v_i'x) \qquad (20)$$

Moreover, the number of directions
$$\binom{r+d-1}{d-1}$$
is minimal provided that the $h_i$ are polynomials. In fact, $h_1,\ldots,h_s$ must span the entire space of polynomials of degree not greater than r and, in particular, the subspace $H_r^d$ of homogeneous polynomials of degree r. Each of $h_1,\ldots,h_s$ spans a subspace of $H_r^d$ of dimension 1 and the dimension of $H_r^d$ is just
$$\binom{r+d-1}{d-1}.$$
Here the question is also to discuss the case when $h_1,\ldots,h_s$ are generic univariate functions: i.e. ‘‘is there any advantage when $h_1,\ldots,h_s$ are generic functions instead of polynomials?’’. Writing $h_1,\ldots,h_s$ in terms of their Taylor series expansion and using the argument with dimension of spaces, we can prove that s must be at least
$$\binom{r+d-1}{d-1}$$
to generate exactly all polynomials of degree r. However, the approximation property is weaker than an exact generation property, so that the reasoning is not completely satisfactory.
Another open problem associated with this approach is the following: are there polynomials of order r which require
$$\binom{r+d-1}{d-1}$$
directions even if the directions are selected with respect to the polynomials instead of being prescribed a priori? That is, does the possibility of selecting the direction allow a lower bound on the required directions? For example, r = 2, $x^2 + 2xy$ always requires three directions.
The theorem says that polynomials of bounded degree and number of variables require a constant number of directions. With respect to r and d, the number of directions is of polynomial order when r and d increase individually, and it is exponential if they increase simultaneously. This result is obtained by observing that
$$\frac{(r+1)^{d-1}}{(d-1)!} < \binom{r+d-1}{d-1} < \frac{(r+d-1)^{d-1}}{(d-1)!}$$

and
$$\frac{d^r}{r!} < \binom{r+d-1}{d-1} < \frac{(r+d-1)^r}{r!}$$
In the previous section, we have observed that Mhaskar and Micchelli (1992) considered the function f such that $\|f\|_{SF}$ is finite. This class of functions is far larger than polynomials and includes a large number of interesting functions. They proved that the approximation error decreases as $O(n^{-1/2})$, with respect to the number of directions and the corresponding associated $L^2(Q^d)$ norm. Furthermore, the work of Mhaskar and Micchelli (1992) suggests that if we could select directions (with integer components) for which the Fourier coefficients are sorted in magnitude, then we could effectively realize that limit.
On the other hand, Theorem 1 deals with exact realization of functions, while Mhaskar and Micchelli discuss approximation of functions. In fact, Theorem 1 defines the number of directions necessary to yield any degree of precision, while Mhaskar and Micchelli provide a bound on the error for increasing number of directions. In this sense, those are two different points of view on the same problem.
Note also that with the approach of Mhaskar and Micchelli the directions depend on the function f to be approximated. They are the ones with the largest Fourier coefficients, while in our approach, the directions are independent of f. This is a weaker assumption; it has the disadvantage of resulting in a weaker bound, but the advantage is that the bound is reachable by an effective algorithm. In fact, in Section 4.7 we will present an algorithm which can obtain such an approximation.

4.3. Approximation of the Ridge Polynomial Functions
Combining Theorem 1 with univariate function approximation by step functions, the error is O(1/n) with respect to the number of nodes and an associated uniform norm. However, for polynomials, there is another solution which requires only a constant number of neurons. The idea extends the one in Kreinovich (1991) and Leshno et al. (1993).

THEOREM 2. Suppose that $\sigma$ is $C^{r+1}$ in an open neighborhood of b and none of the derivatives up to the order r is zero in b, i.e. $\sigma^{(i)}(b) \ne 0$, $0 \le i \le r$. Let $c_0,\ldots,c_r$ be a set of real numbers, $p(x) = c_0 + c_1x + \ldots + c_rx^r$ and
$$p_a(x) = \sum_{l=0}^{r} \left( \sum_{k=l}^{r} (-1)^{k+l} \frac{c_k}{a^k \sigma^{(k)}(b)} \binom{k}{l} \right) \sigma(lax + b) \qquad (21)$$
then the function $p_a$ converges uniformly to p in any bounded interval for $a \to 0$.

Proof. To prove the theorem, we rewrite $p_a$ as
$$p_a(x) = \sum_{k=0}^{r} (-1)^k \frac{c_k}{a^k \sigma^{(k)}(b)} \left( \sum_{l=0}^{k} (-1)^l \binom{k}{l} \sigma(lax + b) \right) \qquad (22)$$
Since $\sigma$ is $C^{r+1}$ in a neighborhood of b, it can be expanded in a Taylor series with Lagrange remainder around b up to any order k, $0 \le k \le r$, and we have
$$\sum_{l=0}^{k} (-1)^l \binom{k}{l} \sigma(lax + b) = \sum_{l=0}^{k} (-1)^l \binom{k}{l} \left( \sum_{i=0}^{k} \frac{\sigma^{(i)}(b)(xal)^i}{i!} + \frac{\sigma^{(k+1)}(y_l)(xal)^{k+1}}{(k+1)!} \right)$$
$$= \sum_{i=0}^{k} \frac{\sigma^{(i)}(b)(xa)^i}{i!} \sum_{l=0}^{k} (-1)^l l^i \binom{k}{l} + \frac{(xa)^{k+1}}{(k+1)!} \sum_{l=0}^{k} (-1)^l l^{k+1} \sigma^{(k+1)}(y_l) \binom{k}{l}$$
$$= \sum_{i=0}^{k} \frac{\sigma^{(i)}(b)(xa)^i}{i!}\, k!\,(-1)^k S_i^{(k)} + \frac{(xa)^{k+1}}{(k+1)!} \sum_{l=0}^{k} (-1)^l l^{k+1} \sigma^{(k+1)}(y_l) \binom{k}{l}$$
where $y_0,\ldots,y_k$ are reals such that $y_l$ lies between b and $b + lax$, $0 \le l \le k$. Note that, by the continuity of $\sigma^{(k+1)}$ around b, the terms $\sigma^{(k+1)}(y_l)$ are bounded provided that a is sufficiently small. Moreover, the value
$$S_i^{(k)} = \frac{1}{k!} \sum_{l=0}^{k} (-1)^{k-l} \binom{k}{l} l^i$$
is called the Stirling number of the second kind and equals the number of ways of partitioning a set of i elements into k nonempty subsets. The Stirling number $S_i^{(k)}$ is 0 for $0 \le i < k$ and 1 for i = k (see Abramovitz and Stegun, 1968). Thus,
$$\sum_{l=0}^{k} (-1)^l \binom{k}{l} \sigma(lax + b) = (-1)^k (xa)^k \sigma^{(k)}(b) + O(a^{k+1}) \qquad (23)$$
Substituting eqn (23) in eqn (22), we have
$$p_a(x) = \sum_{k=0}^{r} \left(c_k x^k + O(a)\right)$$
and passing to the limits, we obtain the theorem as required.
The sum in eqn (21) contains r + 1 terms, one of which is a constant. We conclude that r hidden nodes are sufficient to approximate every univariate polynomial of order r. Moreover, since multivariate polynomials are

realizable as sums of as many univariate polynomials as directions [eqn (20)] and the approximation of each univariate polynomial requires r hidden neurons, we obtain the following theorem.
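
Before stating the theorem, the univariate construction of eqn (21) can be checked numerically. The Python sketch below is only an illustration and is not taken from the paper: it assumes the logistic sigmoid as the activation, a threshold b = 1 (a point where the derivatives of order 0, 1 and 2 are all nonzero), and hard-coded closed forms for those derivatives, so it covers polynomials of degree at most 2. The names `sigma`, `sigma_derivs`, `p_a` and `max_error` are hypothetical.

```python
import math

def sigma(t):
    """Logistic sigmoid, used here as an example activation (an assumption)."""
    return 1.0 / (1.0 + math.exp(-t))

def sigma_derivs(b):
    """Closed-form derivatives of order 0, 1, 2 of the logistic sigmoid at b."""
    s = sigma(b)
    return [s, s * (1.0 - s), s * (1.0 - s) * (1.0 - 2.0 * s)]

def p_a(x, coeffs, a, b=1.0):
    """Evaluate eqn (21) for a polynomial of degree r <= 2 with coefficients c_0..c_r."""
    r = len(coeffs) - 1
    d = sigma_derivs(b)
    total = 0.0
    for l in range(r + 1):
        # Output weight of the node with input weight l*a and threshold b:
        # sum_{k=l}^{r} (-1)^(k+l) * c_k * C(k, l) / (a^k * sigma^(k)(b))
        w_l = sum((-1) ** (k + l) * coeffs[k] * math.comb(k, l) / (a ** k * d[k])
                  for k in range(l, r + 1))
        total += w_l * sigma(l * a * x + b)
    return total

def max_error(coeffs, a, points):
    """Largest deviation between p_a and p over the sample points."""
    def p(x):
        return sum(c * x ** k for k, c in enumerate(coeffs))
    return max(abs(p_a(x, coeffs, a) - p(x)) for x in points)

coeffs = [1.0, 2.0, 3.0]                       # p(x) = 1 + 2x + 3x^2
points = [i / 10.0 for i in range(-10, 11)]    # samples of the interval [-1, 1]
for a in (1e-1, 1e-2, 1e-3):
    # The maximum error shrinks roughly in proportion to a, as eqn (23) suggests.
    print(a, max_error(coeffs, a, points))
```

Only the node with l = 0 has zero input weights, so it contributes a constant; the remaining r nodes carry the dependence on x. In floating point the output weights grow like $a^{-r}$, so a cannot be made arbitrarily small without cancellation errors; the values above stay well within double precision.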

THEOREM 3. Let p(x) be a d-variable algebraic polynomial of degree not greater than r, $\sigma: R \to R$ be a function with continuous derivatives up to the order r + 1 in an open neighborhood of the point b. Moreover, let $\sigma^{(i)}(b) \ne 0$ hold for each $0 \le i \le r$ and let $\varepsilon$ be a positive real number. Then there exists a three-layered FNN with activation functions $\sigma$,
$$r \binom{r+d-1}{d-1}$$
hidden neurons and a single threshold b in the hidden layer such that
$$\|p - \hat f\|_{L^\infty} \le \varepsilon$$
where $\hat f$ represents the function realized by the FNN.

The proof of the theorem follows from the previous discussions and hence will be omitted.
A fixed number of neurons is required in this particular case. Observe that the theorem cannot be deduced by any existing result where the approximation accuracy depends on the number of hidden neurons.
Also note that Theorem 3 (see previous discussion on the number of directions) states that the approximation of a polynomial requires a number of hidden layer nodes that grows as $O(d^r)$ and $O(r^d)$ with respect to r and d, respectively. This is only apparently in contrast with the results discussed in Section 3.7, where the order of the bounds is independent of d. In fact, the bounds also depend on the constants $C_f$ and $\|f\|_{SF}$ [see eqns (16) and (19)] which grow as the degree of f, a polynomial function, increases or the input dimension increases.
Theorem 3, which concerns the approximation of polynomial functions, can be used to study the approximation of more generic functions. Analytic functions can be analyzed in terms of their Taylor series and other functions in terms of approximating polynomials. If a function can be approximated with an error $\varepsilon$ by a polynomial of order r, then with
$$r \binom{r+d-1}{d-1}$$
hidden neurons a FNN can realize the same approximation. In theory, for analytic functions this procedure is computable and allows an effective construction of the FNN. However, the procedure is not useful in practice, because it requires the knowledge of the Taylor series expansion. Notwithstanding this, it guarantees that the discussed results are effectively computable by an algorithm. We will come back to this question in Section 4.7, where a more practical method will be described.

FIGURE 5. A partition that consists of one point.

Theorem 3 and the associated discussions suggest more remarks on the relationships between the structure of the FNN that approximates a function and the function itself. The remarks concerning the partition constructed by the directions, the thresholds and the values of weights are discussed in the following three sections.

4.4. The Partition
From our discussion in the previous section, and particularly from eqn (21), we further deduce that it is not necessary that the partition covers the entire domain. In the limit, it could be reduced to a single point (Figure 5). This happens when all the hyperplanes pass through the same point. For example, if all the nodes have a 0 threshold, all the hyperplanes pass through 0. On the other hand, it could be that the domain is contained in only one set of the partition, see Figure 6. Eqn (21) provides that the thresholds (they are represented by b in the formula) remain constant while the other weights in the hidden layer approach 0. If $\|v_i\| \to 0$ and $b_i \ne 0$, the hyperplane $\{x \mid v_i'x + b_i = 0\}$ becomes far away from the point 0 and, in the limit, it does not intersect the domain. In fact, the output of the network at a point, say x, is a linear combination of the values $h_1(v_1x),\ldots,h_s(v_sx)$, i.e. it depends on the values of the functions $h_1,\ldots,h_s$ obtained on the projections of x over the directions $v_1,\ldots,v_s$. So, the emphasis is not on the grid size, but on the directions.

FIGURE 6. The domain is contained in only one set of the partition.
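
The remark that a hyperplane with a fixed nonzero threshold moves away from the domain as the hidden weights shrink can be illustrated with a short computation. The Python sketch below is an illustration rather than part of the original construction: it assumes a hidden node with weight vector l·a·v and fixed threshold b, in the spirit of eqn (21), and a domain given by a ball of a chosen radius centered at 0. The function names, the direction and the numerical values are hypothetical.

```python
import math

def hyperplane_distance(w, b):
    """Distance from the origin to the hyperplane {x : w.x + b = 0} (w must be nonzero)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(b) / norm

def misses_ball(v, b, a, l, radius):
    """True if the hyperplane of a node with weights l*a*v and threshold b
    does not intersect the ball of the given radius centered at 0."""
    w = [l * a * vi for vi in v]
    return hyperplane_distance(w, b) > radius

# Example direction, threshold and domain radius (all assumed for illustration).
v, b, radius = (1.0, 1.0 / 3.0), 1.0, 5.0
for a in (1.0, 0.1, 0.01):
    # Nodes with l >= 1 only; the node with l = 0 has zero weights and no hyperplane.
    print(a, [misses_ball(v, b, a, l, radius) for l in (1, 2, 3)])
```

For a small enough value of a every hyperplane lies outside the ball, so the whole domain falls in a single set of the partition, which is the situation depicted in Figure 6.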
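
Returning to the neuron count of Theorem 3, the number of directions and the corresponding number of hidden neurons are easy to tabulate. The Python sketch below is illustrative only; the function names are hypothetical. It also checks the bounds on the number of directions quoted in Section 4.2, using non-strict inequalities because equality occurs in edge cases such as d = 2.

```python
from math import comb, factorial

def num_directions(r, d):
    """Number of directions s = C(r + d - 1, d - 1) used in Theorem 1."""
    return comb(r + d - 1, d - 1)

def num_hidden(r, d):
    """Number of hidden neurons r * C(r + d - 1, d - 1) that suffices in Theorem 3."""
    return r * num_directions(r, d)

def bounds_hold(r, d):
    """Check (r+1)^(d-1)/(d-1)! <= s <= (r+d-1)^(d-1)/(d-1)! in exact integer arithmetic."""
    s_scaled = num_directions(r, d) * factorial(d - 1)
    return (r + 1) ** (d - 1) <= s_scaled <= (r + d - 1) ** (d - 1)

for r, d in [(2, 2), (3, 2), (3, 5), (5, 10)]:
    print(r, d, num_directions(r, d), num_hidden(r, d), bounds_hold(r, d))
```

For instance, r = 3 and d = 2 give four directions, matching the four multi-indices (0,3), (1,2), (2,1), (3,0) listed in Section 3.5, while increasing r and d together makes the count grow quickly.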

The following theorem, which is an alternative to Theorem 2, further stresses the fact that the grid is not important. It states that each one of the polynomial functions $h_i(v_i x)$ can be approximated by sums in the form of $\sum_{i=1}^{r+1} \sum_{l=0}^{r} w_{l,i}\, \sigma(alx + alb_i)$, where the $b_i$ can be fixed in advance. Since the hyperplanes do not depend on $a$ and $l$, they themselves can be fixed in advance independently of the function to be approximated and the prescribed degree of precision.

THEOREM 4. Let $c_0,\ldots,c_r$ be real numbers, $p(x) = c_0 + c_1 x + \ldots + c_r x^r$ a univariate polynomial in $x$, $r$ a non-negative integer, and $b_1,\ldots,b_{r+1}$ real values such that $b_j \neq b_i$ for $j \neq i$. Furthermore, suppose that $\sigma$ is $C^{r+1}$ in an open neighborhood of 0 and satisfies $\sigma^{(r)}(0) \neq 0$. Then there exist reals $w_{1,1},\ldots,w_{1,r+1},\ldots,w_{r,r+1}$, which depend on a real $a$, such that $\bar{p}_a$,

$$\bar{p}_a(x) = \sum_{i=1}^{r+1} \sum_{l=0}^{r} w_{l,i}(a)\, \sigma(alx + alb_i) \tag{24}$$

converges uniformly to $p$ on every bounded interval.

Proof. Let us consider the homogeneous polynomial $t(x,y) = c_0 y^r + c_1 y^{r-1} x + \ldots + c_r x^r$. It can be written as a linear combination of the powers $(x + y b_1)^r,\ldots,(x + y b_{r+1})^r$. This is a consequence of the argument used by Chui and Li (1992) (see Section 3.5). Since $p(x) = t(x,1)$, it follows that

$$p(x) = \sum_{i=1}^{r+1} v_i (x + b_i)^r \tag{25}$$

for reals $v_1,\ldots,v_{r+1}$.

Furthermore, applying the same reasoning as contained in the proof of Theorem 2, we show that

$$q_a(y) = \sum_{l=0}^{r} (-1)^{r+l} \frac{1}{a^r \sigma^{(r)}(0)} \binom{r}{l} \sigma(lay)$$

converges to the power $y^r$ (see footnote 9). Substituting $(x + b_i)^r$ with $q_a(x + b_i)$ for every $i$, $1 \le i \le r + 1$, in the sum on the right-hand side of eqn (25) we obtain

$$\sum_{i=1}^{r+1} \sum_{l=0}^{r} v_i (-1)^{r+l} \frac{1}{a^r \sigma^{(r)}(0)} \binom{r}{l} \sigma(al(x + b_i))$$

This sum converges uniformly to $p$. Hence, the theorem is proved.

Footnote 9: Note that the hypothesis $\sigma^{(r)}(0) \neq 0$ is sufficient: the other derivatives of $\sigma$ need not be constrained, because they do not appear in $q_a$.

Theorem 4 is easily extended to the case when $\sigma$ is $C^{r+1}$ in an open neighborhood of a point $b$ and $\sigma^{(r)}(b) \neq 0$: in fact, it is only required to replace all the terms $alx + alb_i$ with $alx + alb_i + b$. Thus, an immediate consequence of Theorems 3 and 4 is that if $\sigma$ is $C^\infty$ and has nonzero derivatives somewhere around $b$, then $S^3_\sigma$ is dense in continuous functions over compact sets. More precisely, Theorem 2 suggests that it is sufficient that there exist an infinite number of indexes $i$ for which $\sigma^{(i)}(b) \neq 0$. On the other hand, Theorem 2 provides that if $\sigma^{(i)}(b) \neq 0$ for all $i$, then it is possible to employ one common value for all the thresholds [see eqn (21)]. Also note that eqns (21) and (24) imply that the approximation can be carried out such that the weights in the hidden layer are as small as required. In fact, it is sufficient to select small $a$. The following theorem sums up such reasoning.

THEOREM 5. If $\sigma$ is infinitely differentiable in an open neighborhood of a point $b$, $A$ is an open subset of $R^n$ that includes 0, $B$ an open subset of $R$ that includes $b$ and $\sigma^{(i)}(b) \neq 0$ holds for an infinite number of indexes $i$, then $S^3_\sigma(A,B)$ is dense in $C(K)$ for all compacts $K$. Moreover, if $\sigma^{(i)}(b) \neq 0$ for all $i$ then also $S^3_\sigma(A,\{b\})$ is dense in $C(K)$.

This result is similar to the ones given in Hornik (1993) and Leshno et al. (1993) (see also Section 3.3). It also extends some aspects of those results, because Theorem 5 defines the properties that a point $b$ must satisfy in order that the FNN can employ $b$ as the unique value for all the thresholds. Moreover, Theorem 5 requires that $\sigma$ is $C^\infty$ and has nonnull derivatives, instead of being analytic and nonpolynomial. Note that if $\sigma$ is analytic and nonpolynomial in an interval, then there is at least a point $b$ where $\sigma^{(i)}(b) \neq 0$ for all $i$, while the converse is not true.

Theorems 2 and 4 have constructive proofs that suggest a way to realize FNNs. A further and more challenging question is to establish relationships between the FNNs of the theorems and the FNNs realized by common training algorithms of the gradient descent type. It is unlikely that common learning algorithms exactly deploy the results of the theorems. The directions of a node should have weights fixed according to some precise relationships; in our case, they differ by integer multiples, the $l$ in eqns (21) and (24). The training algorithms do not make use of this fact and simply adjust the weights according to some gradient descent techniques. When the constant $a$ approaches 0, a small change of the weights of the network may cause significant corresponding changes of the error; this could only be maintained by using training algorithms with a very small learning rate. However, notwithstanding this, the FNNs found by the algorithms might follow rules similar to eqns (21) and (24), probably more general rules than those which have been observed so far.

4.5. The Thresholds

Theorem 2 proves that a single threshold, or more precisely one common value for all the thresholds, of the hidden nodes is sufficient to obtain a prescribed degree of approximation, provided that the activation function has
nonzero derivatives corresponding to the value of the threshold. The same result has been obtained, by using different and nonconstructive routes, in Hornik (1993). However, these results do not resolve the problem completely, because they cannot explain whether using a single threshold is a better solution than using more thresholds. The following discussion is informal, since at the present moment, we do not know how to formalize it, but notwithstanding this, it discloses interesting aspects of the question.

FIGURE 7. Approximation of $x^2$ with one threshold ($b = 1$). See text for more details.

First of all, observe that eqns (21) and (24) differ because the former uses a single threshold and the latter uses more thresholds. Let us compare the advantages of the two expressions. Eqn (21) provides that, while thresholds remain constant, the other weights of the hidden layer become very close to 0. So, it admits two cases:

1. the threshold is not zero and the approximation requires a large ratio of threshold to weights;
2. the threshold is 0.

According to the discussion on partition in previous sections, the two situations correspond to the limiting case when the hyperplanes are far away from the origin (case 1) and to the limiting case when all the hyperplanes pass through 0 (case 2). On the other hand, eqn (24) admits all partitions including the ones that split and cover the domain by small similar subsets.

Now, consider the situations as depicted in case 1. The commonly used activation functions become constant for large positive and negative inputs (the logistic sigmoid, the gaussian, the step function and so on) or they become linear (the linear function, the sigmoid plus a linear function; see footnote 10). Intuitively, it should be difficult and error sensitive to approximate a function in the neighborhood of 0 when the hyperplanes are far away from the origin, since every node will contribute to the realization by an almost constant or linear function. A further problem, in this case, is that it is unlikely that the learning algorithms could effectively work when the ratio of the thresholds to the weights becomes very large.

Footnote 10: Some applications adopt the function lsig(x) + kx, where k is a small scalar, instead of lsig(x). In fact, the derivative of the former is nonnull for large x, which resolves part of the problems with flat error surfaces.

On the other hand, consider the situations depicted in case 2. Most of the commonly used activation functions do not satisfy the constraint on the derivatives of Theorem 2 in the neighborhood of 0. For example, the even derivatives of the logistic sigmoid and the odd derivatives of the gaussian are zero at 0. This is because of the fact that the logistic sigmoid is an odd symmetric function, and the gaussian function is an even symmetric function. It is easily proved, using the same reasoning as contained in the proof of Theorem 2, that with one null threshold, polynomials which contain only odd degrees can be approximated by logistic sigmoid functions and polynomials with only the even ones by the gaussian functions. Theoretically every nonnull threshold could work, but the small ones imply large weights in the network because of the presence of the term $1/\sigma^{(k)}(b)$ in eqn (21).

So, according to the above discussion, eqn (24) appears to be more meaningful with respect to eqn (21) and seems to suggest that using more thresholds gives some advantage. The following observation concerning the sigmoid functions further supports this idea.

Let us restrict our attention to the approximation of univariate functions. Regardless of the threshold $b$, any sum of sigmoids in the form of $\hat{f}(x) = \sum_{i=0}^{r} w_i\, \mathrm{lsig}(v_i x + b)$ verifies $\lim_{x \to -\infty} (\hat{f}(x) - \hat{f}(0)) = -\lim_{x \to \infty} (\hat{f}(x) - \hat{f}(0))$. In fact, the term $\mathrm{lsig}(iax + b)$ is an odd symmetric function with respect to the point $(-b/v_i, \hat{f}(0))$. Intuitively we can explain this situation as follows: while selecting the appropriate parameters, we can constrain $\hat{f}$ to approximate
every polynomial of order $r$ in a bounded interval; far away from the interval, $\hat{f}$ will be an odd symmetric function. A further interpretation of this situation is possible by our discussion on the partition in case 1. In fact, the function to be approximated determines the behavior of $\hat{f}$ in the domain, but outside this domain, in the other sets of the partitions which do not contain the domain, the behavior of $\hat{f}$ is induced by the odd symmetric nature of the logistic sigmoid.

FIGURE 8. Approximation of $x^2$ with two thresholds ($b_1 = -0.5$, $b_2 = 0.5$). See text for more details.

We have no formal proof that this is a problem. However, in some cases, it gives rise to nonnatural approximations, for example, when we try to approximate symmetric functions. Figure 7 depicts what happens when the approximated function is $p(x) = x^2$, the approximation interval is $[-1,1]$ and only one threshold ($b = 1$) is used. Figure 8 shows the corresponding case when more thresholds are employed ($b_1 = -0.5$, $b_2 = 0.5$). The weights of the first layer are selected by eqns (21) and (24), respectively, where $r$ is 2 and $a$ is 0.2. The continuous line represents the function realized by the network, the dotted line the function to be approximated, $p(x) = x^2$.

4.6. The Weights

Note that eqn (21) implies that the approximation can be carried out such that the weights in the hidden layer are as small as required. In fact, it is sufficient to select small $b$ and $a$. The threshold $b$ must satisfy the constraint on the derivatives of $\sigma$; however, if $\sigma$ is analytic and nonpolynomial then there exists a $b$ that satisfies the constraint and is as close to 0 as desired. Furthermore, the weights from the hidden layer to the output layer can be made small by creating many instances of the same hidden nodes. Thus, under the constraint of an analytic and nonpolynomial activation function $\sigma$, we can approximate $f$ with a FNN that has weights as small as required. This result is similar to the ones given in Hornik (1993) (see also Section 3.3 and Leshno et al., 1993).

Another question which can be addressed in this context concerns the connections in the network. In the literature, and similarly in this paper so far, it is always assumed that the network is completely connected from one layer to another. However, it is known that in some biological structures the number of connections is limited. For example, this is true for neural networks connected in a local fashion, the so called local receptive field. Secondly, Sontag (1991) has shown that in some cases the FNN gives a better generalization capability if direct feedforward links are allowed between the input and output layer neurons. It would be interesting to understand what this means from the point of view of the approximation properties.

Suppose we restrict the number of input links to a maximum of $N$ for each hidden node. If we wish to compute, in the polynomial computed by the network, a product $x_{i_1} \cdots x_{i_N}$ of a subset of inputs, then the network must contain a direction $v_i$ which has components $v_{i_1},\ldots,v_{i_N}$ different from 0. In fact, the products are generated by the powers $(v_1' x)^r,\ldots,(v_s' x)^r$ and only the products that correspond to nonzero components of the directions are generated.

Thus, three-layered FNNs which can approximate every function must use at least one direction with all components different from 0, i.e. a node connected to every input.

An alternative for networks with a limited number of links is to have more layers. In this case, with the first hidden layer, we can generate every polynomial of $N$ variables and, as a consequence, we can approximate every function of $N$ input variables. The second hidden layer can compute every function of $N^2$ variables and so on. In the end, $\log_N d$ hidden layers are required.

4.7. Two Constructive Approximation Algorithms

The main aim of this section is to define the kernel of two algorithms that employ, in a practical way, some of the
ideas which emerged from the above discussions. The algorithms allow the construction of FNNs, avoiding some of the problems of convergence of the training algorithm due to sub-optimal local minima which commonly affect the learning process. The purpose is to provide a method to initialize the networks and at the same time to suggest directions to be investigated in order to design new training algorithms. In particular, attention is directed towards applications which require a small number of inputs while requiring that the error remains bounded within a certain limit. This situation is easily found, for example, in the applications of FNNs to control and system identification. Practical systems and controllers with a small number of inputs are common. Furthermore, it is often important to have bounds on the error for control applications.

The idea is based on the observation that the last layer of a FNN behaves like a linear combiner (see footnote 11) whose inputs are the outputs of the nodes in the layer immediately prior to the last. Thus, the error surface is a quadratic function of the weights of the last layer (see Widrow (1990)) and has no local minima. More precisely, the computation of the optimal value of the weights of the last layer is polynomial in time, provided that $f$ is defined by a set of $m$ pairs $\{(x_i, f(x_i)) \mid 1 \le i \le m\}$ and the error is measured using the $L_2$ norm. For reasons of space we skip over the details; however, it is quite easy to prove that the computation can be fulfilled in $O(M^3)$ operations, where $M = \max(mq, n(d + 1))$, $n$ is the number of nodes in the layer immediately prior to the last, and $q$ is the dimension of the codomain of $f$.

Footnote 11: A linear combiner, according to our terminology, is a one-layered FNN with a linear activation function.

On the other hand, the proofs on approximation of polynomials of Section 4.3 suggest two different ways to fix the weights of the input to hidden layer in a three-layered FNN, according to whether we are considering the proof that deploys staircase functions or the one that uses eqn (24). Both the algorithms we present exploit these facts.

4.7.1. Algorithm 1. The first algorithm is defined below. It fixes the number of hidden nodes and the weights from the input to hidden layer following the spirit of the proof in the previous section which uses the staircase function and Theorem 1. Then it adjusts the other weights, the ones which connect the hidden to the output nodes, by a common linear quadratic programming algorithm.

1. Select $a$.
2. Select the directions $v_1,\ldots,v_s$.
3. Select a constant $p$ and real values $b_{1,1},\ldots,b_{s,L_s}$ such that $b_{i,j} - b_{i,j-1} = p$ for each $1 \le i \le s$, $2 \le j \le L_i$, and such that the domain is contained in between every pair of hyperplanes $\{x \mid v_i' x = b_{i,1}\}$ and $\{x \mid v_i' x = b_{i,L_i}\}$.
4. Construct a three-layered FNN, which contains $L_i$ hidden nodes for each direction $v_i$, $1 \le i \le s$. The weights of the $j$th node of the $i$th direction must be $a v_i$ and the threshold $a b_{i,j}$.
5. Adjust the parameters of the output layer so as to minimize the error function

$$e(W, b) = \| f - \hat{f}_{W,b} \|_{L_p(\mu)}$$

where $\hat{f}_{W,b}$ denotes the function implemented by the neural network, $W$ and $b$ are the matrix of the weights and the thresholds of the output layer, respectively.

The value of $a$ in step 1 should be sufficiently large so that the sigmoid can simulate a step function. However, if our idea is to use the algorithm to initialize the network and to use a different algorithm to complete the learning process, then $a$ must be selected in a way such that the ensuing training algorithm does not suffer from problems with a flat error surface (see Saseetharan and Moody, 1992).

There are results in the literature (Chung and Yao, 1977; Nicolaides, 1972; Lee and Philips, 1991) that could suggest how to generate a set of directions that satisfy the unique interpolation property (step 2). For example, in Chung and Yao (1977) it is proven that if $p_1,\ldots,p_d$ are the vertices of a nondegenerate $(d - 1)$-simplex (see footnote 12) then the set of points

$$U = \left\{ u \;\middle|\; u = \frac{1}{r} \sum_{i=1}^{d-1} l_i p_i, \text{ with } \sum_{i=1}^{d-1} l_i = r \text{ and } l_i \ge 0,\ l_i \text{ integer},\ 1 \le i \le d - 1 \right\} \tag{26}$$

satisfies the unique interpolation property for polynomials of order $r$. So, the set of directions $\{v \mid v = [1', u']',\ u \in U\}$ satisfies the hypothesis of Theorem 1.

Footnote 12: A $(d - 1)$-simplex is a geometrical figure in $R^{d-1}$ with $d$ vertices. The $(d - 1)$-simplex is nondegenerate if no hyperplane contains all the vertices.

However, in practice, the random generation of the directions may be an efficient alternative. For example, directions could be randomly generated by a uniform distribution on the unit sphere. In fact, the probability that a set of random directions does not satisfy the unique interpolation property is low and it is sufficient to generate more directions than necessary to overcome the problem. Furthermore, the random generation also would have the advantage of a uniform distribution of the directions, which is intuitively preferable.

Let us consider again the example of Section 3.1 and Figure 2, i.e. we desire to approximate $f(x,y) = x e^{-2x^2 - 2y^2}$ on the square $-1 \le x \le 1$ and $-1 \le y \le 1$. In this case the network has two inputs ($d = 2$) and a one-simplex is just a segment of $R$. Thus, the set $U$ defined in eqn (26) contains $r + 1$ equally distant points. With the
segment $[-1,1]$ and $r = 3$ the set of the directions is $V = \{(-1,1), (-1/3,1), (1/3,1), (1,1)\}$.

FIGURE 9. The function that is realized by the network built with algorithm 1: the one based on the staircase functions.

Figure 4 represents this case, where further $L_i = 7$ for each $i$, i.e. seven hyperplanes are used for each direction, and the thresholds are fixed so that the most external hyperplanes are tangent to the domain. We have tested the algorithm with $a = 0.8$. Figures 2 and 9 represent $f$ and the function realized by the FNN, respectively. We have found that the error, with respect to the $L_2(K)$ norm, was approximately 0.0053. Since $\|f\|_{L_2(K)} \approx 0.3441$, the relative error, i.e. the ratio of the error to $\|f\|_{L_2(K)}$, was 0.0155.

From a theoretical point of view, Theorem 1 and the discussion on approximation by staircase functions guarantee that with enough directions and hyperplanes, and a large value of $a$, the network together with the associated training algorithm as described can approximate every polynomial of degree $r$. On the other hand, this assertion and the previous discussion on the weights from the hidden layer to the output layer suggest that, whatever the function we are approximating (it may be a nonpolynomial), the training in step 5 in algorithm 1 gives a solution that is at least as good as the best approximation given by a polynomial of degree $r$. In reality, in the previous example, the solution was better than the best approximation by polynomials (of order 3) for which the error, with respect to the $L_2(K)$ norm, was 0.0288 and the relative error 0.0838.

Observe that with this approach we form a partition of the domain similar to the one used by CMAC (Albus, 1975). The difference is that instead of using a neuron for every cell in the partition, we include a set of neurons for every direction. The disadvantage is that we cannot approximate every function using a fixed number of directions. The advantage is that if the number of directions is sufficiently large, the error decreases linearly with respect to the number of neurons. We can expect that in many practical situations, this approach allows us to decrease the number of the required nodes.

4.7.2. Algorithm 2. This algorithm is defined by using a second strategy: it is based on eqn (24). In this case, $a$ must be a scalar with a small value.

In fact, eqn (24) suggests how to select the weights and the thresholds in order to guarantee that the network realizes every polynomial of a certain order. The algorithm is equal to the procedure described previously, except that steps 3 and 4 are modified suitably. The new version of steps 3 and 4 appears below.

(3) Select an integer $r$ and real values $b_{1,1},\ldots,b_{s,r+1}$ such that $b_{i,m} \neq b_{i,j}$ for $m \neq j$.
(4) Construct a three-layered FNN with $(r + 1)^2$ hidden nodes for each direction $v_i$. For each $1 \le i \le s$, $1 \le j \le r + 1$, $0 \le l \le r$ there must be a hidden node with weights $a l v_i$ and threshold $a l b_{i,j}$.

Observe that the only constraint for the $b_{i,j}$ is that they must be different for different $j$. Actually, this is the constraint required by Theorem 4. However, we can expect a better behavior if the hyperplanes cover and split the domain into pieces that have similar dimensions. Thus, a good way to select the $b_{i,j}$ is such that $b_{i,j} - b_{i,j-1} = p_i$ for constants $p_1,\ldots,p_s$ and the domain of the function is contained between the pairs of hyperplanes $\{x \mid v_i' x = b_{i,1}\}$ and $\{x \mid v_i' x = b_{i,r+1}\}$ for each $1 \le i \le s$.

We have tested also this algorithm with the example of the function $f(x,y) = x e^{-2x^2 - 2y^2}$. The directions were the same as used for the first algorithm. We have fixed the $b_{i,j}$ to be $\{-3/5, -1/5, 1/5, 3/5\}$ for the nodes corresponding to directions $(1,-1)$, $(1,1)$ and $\{-4/5, -4/15, 4/15, 4/5\}$ otherwise; it is easily verified that, in this way, the set of hyperplanes for each direction splits the domain into five, equally large strips. With $a = 0.2$ the approximation error was 0.0042 and the relative error 0.0123. The result is similar to the one obtained by the first algorithm.

An important advantage of this algorithm is the fact that the weights of the first layer are selected to be small.
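The construction of algorithm 2 can be sketched in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, the L2 norm in step 5 (so that the output layer reduces to linear least squares), a 41 x 41 sample grid, an output bias term, and a particular pairing of the quoted thresholds with the quoted directions. The function names are hypothetical, and the error it prints is not expected to reproduce the figures reported in the text.

```python
import numpy as np


def lsig(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-t))


def target(x, y):
    """Example function f(x, y) = x * exp(-2x^2 - 2y^2) used in the text."""
    return x * np.exp(-2.0 * x**2 - 2.0 * y**2)


def hidden_layer(points, directions, thresholds, a, r):
    """Steps (3)-(4): one node lsig(a*l*(v_i'x + b_ij)) for every direction i,
    every threshold b_ij and every integer multiplier l = 0..r."""
    columns = []
    for v, b_list in zip(directions, thresholds):
        proj = points @ v  # v_i'x for every sample point
        for b in b_list:
            for l in range(r + 1):
                columns.append(lsig(a * l * (proj + b)))
    return np.column_stack(columns)


def fit_output_layer(H, y):
    """Step 5 under the L2 norm: least squares for output weights and bias."""
    design = np.column_stack([H, np.ones(len(H))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef


# Values quoted in the text: r = 3, a = 0.2, four directions, four thresholds each.
r, a = 3, 0.2
directions = [np.array(v) for v in [(-1.0, 1.0), (-1 / 3, 1.0), (1 / 3, 1.0), (1.0, 1.0)]]
diagonal = [-3 / 5, -1 / 5, 1 / 5, 3 / 5]
other = [-4 / 5, -4 / 15, 4 / 15, 4 / 5]
# Assumption: the first threshold set goes with the two diagonal directions.
thresholds = [diagonal, other, other, diagonal]

# Sample grid on the square [-1, 1]^2 (the grid size is an arbitrary choice).
g = np.linspace(-1.0, 1.0, 41)
X, Y = np.meshgrid(g, g)
points = np.column_stack([X.ravel(), Y.ravel()])
y = target(points[:, 0], points[:, 1])

H = hidden_layer(points, directions, thresholds, a, r)
y_hat = fit_output_layer(H, y)
relative_error = np.linalg.norm(y - y_hat) / np.linalg.norm(y)
print(f"hidden nodes: {H.shape[1]}, relative L2 error on the grid: {relative_error:.4f}")
```

The hidden layer is never trained: only the linear output layer is fitted, which is why the procedure has no local minima. The nodes with l = 0 are constant, so the design matrix is rank deficient; `np.linalg.lstsq` handles this by returning a minimum-norm solution.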
This eliminates part of the problems due to the flat error surfaces. If we use the algorithm to initialize the network, the succeeding training algorithm will be much easier to use. At this moment, this is the only algorithm we know that is capable of initializing the network with small weights and, simultaneously, guaranteeing bounds on the approximation error. The disadvantage is that a small $a$ implies large values for the hidden to output layer weights. However, this does not cause flat error surfaces and, actually, it is not clear whether and why this may be a problem. In the end, we observe that the computational complexity of both algorithms essentially depends on step 5. According to the previous discussion, step 5 can be fulfilled in $O(M^3)$ operations, where $M = \max(mq, n(d + 1))$, $m$ is the number of patterns when the network error is evaluated, $d$, $n$ and $q$ are the numbers of the input, hidden and output neurons, respectively. So, the complexity is polynomial of order 3 with respect to the number of neurons and the number of patterns. In practice, this is an acceptable result if both $m$ and $n$ can be selected to be small.

4.8. Open Questions

With our approach, approximation of generic functions is not studied directly, but in terms of the polynomial approximation of the function we wish to realize. It should be interesting to discover to what degree our results depend on this choice. The formulation of a more direct approach to the problem appears difficult; in fact, all the results obtained in the literature have been derived by indirect approaches. On the other hand, a possible way could be to develop strategies alternative to ours, where the polynomials are replaced by other kinds of functions, e.g. sinusoidals, RBFs. Each one could provide us with another point of view of the problem.

Possible variations of the training algorithms form more topics of future research. For example, the possibility of extending the training algorithm to deal with not only the weights of the output layer, but also the thresholds, the directions of the hidden layer neurons, or the value of $a$, which could be different for each node. Until now, training algorithms have adjusted every parameter of the network simultaneously. This probably gives the best approximations, but it does not give any guarantee that the learning process will be successful, i.e. that it will converge. Other strategies may guarantee the convergence of the algorithm or select a compromise among the quality of the approximation, the robustness of the learning algorithm and the computational cost. For example, we could use networks whose weights are linked by special rules as in eqn (24). This means that in step 5 in the second algorithm, we would adjust also the $v_i$, the $b_{i,j}$ and $a$, changing, but only indirectly, the parameters of the hidden layer.

Another question concerns how the number of directions $s$, the number of nodes for each direction $L_1,\ldots,L_s$ and the value of $a$ influence the approximation error. A large error after step 5 may depend on the number of directions, the parameters $L_1,\ldots,L_s$, or $a$ being too small (too large in algorithm 2). To be able to single out the reason would allow us to iterate the algorithm with more directions, more nodes, a different value for $a$ or to decide on when to stop the iteration process.

One possible strategy might be as follows: to understand if we are using a sufficient number of directions we could study how the error decreases as we increase the number of nodes which contribute to a direction. If the rate of decrease of the error is too low, it means that we are using too few directions. If the number of directions is sufficient, the error depends only on how well the ridge functions are approximated. In this case, using the step functions and algorithm 1, the error would decrease with an order which should approximately depend on the second derivatives of $f$. The problem is similar to function integration: "How much decrease in the integration error would result if we increase the number of points where the function is evaluated?" In this case it becomes: "How much would the approximation error decrease if more nodes are used?"

Both $a$ and the parameters $L_1,\ldots,L_s$ influence the error in the approximation of each ridge function, i.e. $h_1,\ldots,h_s$. To distinguish those two types of errors, we can suppose to have an infinite number of nodes and estimate the minimal error that we can obtain in this way. This is the error due to $a$. Then we have to compute what happens when we use a finite number of nodes. For example, applying this reasoning to the first algorithm we obtain the following.

Consider the mollifier $g_\epsilon(x) = \epsilon^{-1}(\mathrm{lsig}(\epsilon^{-1}(x + b)) - \mathrm{lsig}(\epsilon^{-1}(x - b)))$ for some real $b$. The convolution of $g_\epsilon$ and $f$, i.e. $(g_\epsilon * f)(x) = \int g_\epsilon(x - y) f(y)\, dy$, represents an infinite sum of sigmoids and $(f(x) - (g_\epsilon * f)(x))^2$ bounds the part of the error that depends on $a$. Then the error for using a finite number of nodes corresponds to the error due to approximating the integral by a finite sum.

5. CONCLUSIONS

In this paper, we presented a survey of most of the available results in function approximation using feedforward networks. It is shown that, when the FNN has a ridge activation function, proofs with constructive characteristics consist of two parts: the function $f$ is decomposed as a sum of generic univariate ridge functions with respect to a finite set of directions; then each ridge function is approximated by a linear combination of translations of the activation function. Different solutions for both parts determine different properties for the results.

The computational aspects assumed in those works are further discussed. It is indicated that while the general results are useful, it is more important to have

TABLE 1
The main characteristics of the reviewed approaches

Approach | Network type | Decomposition in ridge functions | Ridge function approximation | Error bound | Constructive approach
--- | --- | --- | --- | --- | ---
Cybenko (1989), Hornik (1990, 1993) | Three layers, ridge activations | Hahn–Banach theorem | (same entry) | None | No
Katsuura and Sprecher (1994) | Four layers, ridge activations | Kolmogorov's theorem | (same entry) | None | Yes
Kurkova (1992) | Four layers, ridge activations | Kolmogorov's theorem | (same entry) | $O(n^{-1/(d+2)})$ | Yes
Chui and Li (1992) | Four layers, ridge activations | Sums of powers | Hahn–Banach theorem | None | For polynomials and only the decomposition part
Chen et al. (1995), Carroll and Dickinson (1989) | Three layers, ridge activations | Radon transform | Step function argument | $O(n^{-1/(d-1)})$ | Not completely
Barron (1993) | Three layers, ridge activations | Fourier distributions | Step function argument | $O(n^{-1/2})$ | No
Mhaskar and Micchelli (1992, 1994) | Three layers, ridge activations | Fourier series | Fourier series | $O(n^{-1/4})$ | Not completely
Park and Sandberg (1991, 1993) | Three layers, RBFs | $\epsilon$-mollifiers | (same entry) | $O(n^{-1/d})$ | Yes
Our algorithm 1 | Three layers, ridge activations | Sums of powers | Step function argument | $O(n^{-1})$ | Yes, for polynomials
Our algorithm 2 | Three layers, ridge activations | Sums of powers | Eqn (24) | Any degree of precision by a finite number of neurons | Yes, for polynomials

a constructive algorithm with known computational complexity to realize the functions. It is shown that some of the existing approaches, while giving very general results, are of existence type in nature. Hence, it is rather difficult to find out how to utilize the results to give a computational algorithm with a known computational complexity.

Furthermore, some works give bounds on the number of nodes needed to realize a desired approximation. The works of Barron (1993) and Mhaskar and Micchelli (1992) give the lowest bounds: the number of hidden nodes is polynomial with respect to the error and the order is independent of the input dimension. In general, the bounds depend also on the method used to measure the complexity of the approximated function and on whether the proof is constructive. Thus, a satisfactory comparison is difficult and a topic of future research. Table 1 summarizes the main characteristics of the approaches discussed in this paper.

We then turn our attention to approximating a special class of functions, viz. the set of algebraic polynomials. We have proved that algebraic polynomials are approximable up to any degree of precision by a finite number of hidden nodes. Then we have considered a number of remarks, which can be deduced from the proof, on the structure of the network. In the end, we have discussed two variants of a constructive algorithm, with known computational complexity, which can be used to realize the set of given algebraic polynomials. A number of practical aspects of the algorithm are discussed.

Finally, some open questions concerning the approach are discussed.

REFERENCES

Abramovitz, M., & Stegun, I. (1968). Handbook of mathematical functions. New York: Dover.
Albus, J. (1975). A new approach to manipulator control: the cerebellar model articulation controller (CMAC). Transactions of ASME. Journal of Dynamic Systems, Measurement and Control, 97, 220–227.
Barron, A. (1993). Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory, 3, 930–945.
Blum, E., & Li, K. (1991). Approximation theory and feedforward networks. Neural Networks, 4, 511–515.
Carroll, S., & Dickinson, B. (1989). Construction of neural networks using the Radon transform. IEEE International Conference on Neural Networks, vol. 1. Washington, DC: IEEE, pp. 607–611.
Chen, T., Chen, H., & Liu, R. (1995). Approximation capability in $C(\bar{R}^n)$ by multilayer feedforward networks and related problems. IEEE Transactions on Neural Networks, 6(1), 25–30.
function approximation by multilayered perceptrons. IEEE Transactions on Neural Networks, 3(4), 621–623.
Hecht-Nielsen, R. (1989). Kolmogorov's mapping neural network existence theorem. In International Joint Conference on Neural Networks, vol. 3. Washington, DC: IEEE, pp. 11–14.
Hornik, K. (1990). Approximation capabilities of multilayer feedforward neural networks. Neural Networks, 4, 251–257.
Hornik, K. (1993). Some results on neural network approximation. Neural Networks, 6, 1069–1072.
Ito, Y. (1991). Approximation of functions on a compact set by finite sums of sigmoid function without scaling. Neural Networks, 4, 817–826.
Katsuura, H., & Sprecher, D.A. (1994). Computational aspects of Kolmogorov's superposition theorem. Neural Networks, 7(3), 455–461.
Kreinovich, V. (1991). Arbitrary nonlinearity is sufficient to represent all functions by neural networks: a theorem. Neural Networks, 4, 381–383.
Kurkova, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5, 501–506.
Lee, S., & Philips, G. (1991). Construction of lattices for Lagrange interpolation in projective space. Constructive Approximation, 7, 283–297.
Leshno, M., Lin, V., Pinkus, A., & Shocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Mhaskar, H., & Micchelli, C. (1992). Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Mathematics, 13, 350–373.
Mhaskar, H., & Micchelli, C. (1994). Dimension independent bounds on the degree of approximation by neural networks. IBM Journal of Research and Development, 38(3), 277–283.
Nicolaides, R. (1972). On a class of finite elements generated by Lagrange interpolation. SIAM Journal of Numerical Analysis, 9, 435–445.
Park, J., & Sandberg, I.W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3(2), 246–257.
Park, J., & Sandberg, I.W. (1993). Approximation and radial-basis-functions networks. Neural Computation, 5, 305–316.
Rudin, W. (1991). Principle of mathematical analysis, Sections 6 and 11.5. New York: McGraw-Hill.
Saseetharan, M., & Moody, M. (1992). A modified neuron model that scales and resolve network paralysis. Network, 3, 101–104.
Sontag, E. (1991). Remarks on interpolation and recognition using neural nets. In Lippmann, R., Moody, J., & Touretzky, D. (Eds.), Advances in neural information processing system 3. San Mateo, CA: Morgan Kaufmann.
Sprecher, D.A. (1965). On the structure of continuous functions of several variables. Transactions of the American Math Society, 115, 340–355.
Widrow, B. (1990). 30 years of adaptive neural networks: Perceptron, Madaline, and Backpropagation. IEEE Transactions on Neural Networks, 78(9), 1415–1442.

NOMENCLATURE
Chui, C., & Li, X. (1992). Approximation by ridge functions and neural
g an activation function
networks with one hidden layer. Journal of Approximation Theory, R the set of real numbers
70, 131–141. d the number of inputs of the neural network
Chung, K., & Yao, T. (1977). On lattices admitting a unique lagrange Rd the set of d ¤/ 1 vectors of reals
interpolations. SIAM Journal of Numerical Analysis, 14, 735–743. gauss the gaussian function
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal
function. Mathematics of Control, Signals, and Systems, 3, 303–
lsig the logistic sigmoid function
314. k·k the Euclidean norm
Geva, S., & Sitte, J. (1992). A constructive method for multivariate 9 the transpose operator

$S^l_g$  the functions realizable by $l$-layered networks with $g$ activation function
$C(\mathbb{R}^d)$  the set of continuous functions on $\mathbb{R}^d$
$C^r(\mathbb{R}^d)$, $C^r$  the set of functions differentiable on $\mathbb{R}^d$ up to the order $r$
$C^\infty(\mathbb{R}^d)$, $C^\infty$  the set of infinitely differentiable functions on $\mathbb{R}^d$
$|a|$  the absolute value of $a$
$K$  a bounded subset of $\mathbb{R}^d$
$\mu$  a measure
$\|\cdot\|_{L^\infty(\mathbb{R}^d)}$, $\|\cdot\|_{L^p(\mathbb{R}^d)}$, $\|\cdot\|_{L^p(\mu)}$  norms for functions on $\mathbb{R}^d$ (see text)
$\|\cdot\|_{L^p(K)}$, $\|\cdot\|_{L^\infty(K)}$  norms for functions on $K$ (see text)
$L^\infty(\mathbb{R}^d)$, $L^p(\mathbb{R}^d)$, $L^p(\mu)$, $L^\infty(K)$, $L^p(K)$  sets of essentially bounded functions (see text)
$\omega(f,\delta)$  the modulus of continuity of $f$
$\hat{f}$  the function computed by the network
$O(f(x))$  the order of $f$
step  the step function
$I$  a hyper-cube
$\mathrm{hstep}_I$  a hyper-step function (see text)
$\mathrm{stair}_{f,K,n}$  a staircase function
$S^l_g(A,B)$  the functions realizable by $l$-layered networks with constrained weights
$\mathbb{Z}$  the set of integers
$H^r_d$  the set of homogeneous polynomials
$D^m$  the partial differential operator
$C(\bar{\mathbb{R}}^d)$  the set of continuous functions with finite limit for $\|x\| \to \infty$
$\tilde{F}(d\omega)$  a Fourier distribution
$e^{i\theta(\omega)}$  a phase distribution
$F(d\omega)$  a magnitude distribution
$B$  a bounded subset of $\mathbb{R}^d$
$B_r$  a ball of radius $r$
$|\omega|_B$  the upper limit of $|\omega x|$ for $x \in B$
$G$  a set of functions (see text)
$C_f$  a constant measuring the complexity of $f$
$\tilde{f}(k)$  a Fourier coefficient
$a_n$, $b_n$  sequences of positive reals
$Q$  the interval $[-\pi, \pi]$
$\|\cdot\|_{SF}$  a norm on functions (see text)
$k_\varepsilon$  the $\varepsilon$-mollifier of $k$
$*$  the convolution operator
$\binom{a}{b}$  a binomial coefficient
$p(x)$  a polynomial
$!$  the factorial function
$j^{(i)}$  the $i$th derivative of $j$
