DHSCH 6


Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Chapter 6: Multilayer Neural Networks
(Sections 6.1-6.3)

• Introduction
• Feedforward Operation and Classification
• Backpropagation Algorithm
Introduction

• Goal: classify objects by learning a nonlinearity

• There are many problems for which linear discriminants are insufficient for minimum error

• In previous methods, the central difficulty was the choice of the appropriate nonlinear functions

• A "brute force" approach might be to select a complete basis set such as all polynomials; such a classifier would require too many parameters to be determined from a limited number of training samples

• There is no automatic method for determining the nonlinearities when no information is provided to the classifier

• In multilayer neural networks, the form of the nonlinearity is learned from the training data


Feedforward Operation and Classification

• A three-layer neural network consists of an input layer, a hidden layer and an output layer, interconnected by modifiable weights represented by links between the layers

• A single "bias unit" is connected to each unit other than the input units

• Net activation:

  net_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} \equiv \mathbf{w}_j^t \mathbf{x},

  where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_{ji} denotes the input-to-hidden layer weights at hidden unit j. (In neurobiology, such weights or connections are called "synapses")

• Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j)
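For concreteness, a minimal sketch (not from the text; the function name, the use of numpy, and the tanh activation are my own choices) of how a single hidden unit's net activation and output could be computed, with the bias carried by an augmented input component x_0 = 1:

```python
import numpy as np

def hidden_unit_output(x, w_j, f=np.tanh):
    """y_j = f(net_j) with net_j = w_j^t . x (x augmented by x_0 = 1 for the bias).

    x   : input vector of length d
    w_j : weight vector of length d+1; w_j[0] is the bias weight w_j0
    f   : activation function (tanh chosen here only as an example)
    """
    net_j = w_j @ np.concatenate(([1.0], x))   # net_j = sum_i x_i w_ji + w_j0
    return f(net_j)                            # y_j = f(net_j)
```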
Figure 6.1 shows a simple threshold function:

  f(net) = \mathrm{sgn}(net) = \begin{cases} \phantom{-}1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases}

• The function f(.) is also called the activation function or "nonlinearity" of a unit. There are more general activation functions with desirable properties

• Each output unit similarly computes its net activation based on the hidden unit signals as:

  net_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} \equiv \mathbf{w}_k^t \mathbf{y},

  where the subscript k indexes units in the output layer and n_H denotes the number of hidden units
• When there is more than one output, the outputs are denoted z_k. An output unit computes the nonlinear function of its net activation, emitting z_k = f(net_k)

• In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x), k = 1, …, c, and classify the input x according to the largest discriminant function g_k(x)

• The three-layer network with the weights listed in fig. 6.1 solves the XOR problem

• The hidden unit y_1 computes the boundary x_1 + x_2 + 0.5 = 0:
  x_1 + x_2 + 0.5 \ge 0 \Rightarrow y_1 = +1, otherwise y_1 = -1

• The hidden unit y_2 computes the boundary x_1 + x_2 - 1.5 = 0:
  x_1 + x_2 - 1.5 \ge 0 \Rightarrow y_2 = +1, otherwise y_2 = -1

• The final output unit emits z_1 = +1 if and only if y_1 = +1 and y_2 = -1, i.e.
  z_1 = y_1 AND NOT y_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2) = x_1 XOR x_2,
  which provides the nonlinear decision of fig. 6.1
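A quick numeric check of this construction, as a sketch: it assumes the ±1 input encoding of fig. 6.1 and codes the output unit directly as y_1 AND NOT y_2 rather than with the figure's specific output weights:

```python
def sgn(v):
    return 1 if v >= 0 else -1

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)     # hidden unit 1: fires for x1 OR x2
    y2 = sgn(x1 + x2 - 1.5)     # hidden unit 2: fires for x1 AND x2
    # output: +1 iff y1 = +1 and y2 = -1, i.e. y1 AND NOT y2
    return 1 if (y1 == 1 and y2 == -1) else -1

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))   # -1, +1, +1, -1  =  x1 XOR x2
```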

• General Feedforward Operation – case of c output units

  g_k(\mathbf{x}) \equiv z_k = f\!\left( \sum_{j=1}^{n_H} w_{kj}\, f\!\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \quad k = 1, \dots, c \qquad (1)
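An illustrative sketch of equation (1); numpy, the tanh activation, and the weight-matrix names W_ji / W_kj are my own assumptions, with the biases stored in column 0 of each matrix:

```python
import numpy as np

def forward(x, W_ji, W_kj, f=np.tanh):
    """Equation (1): g_k(x) = f( sum_j w_kj f( sum_i w_ji x_i + w_j0 ) + w_k0 ).

    W_ji : (n_H, d+1) input-to-hidden weights, column 0 = biases w_j0
    W_kj : (c, n_H+1) hidden-to-output weights, column 0 = biases w_k0
    """
    y = f(W_ji @ np.concatenate(([1.0], x)))   # hidden outputs y_j
    z = f(W_kj @ np.concatenate(([1.0], y)))   # outputs z_k = g_k(x)
    return z

# classify x by the largest discriminant function:
# label = np.argmax(forward(x, W_ji, W_kj))
```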
• Hidden units enable us to express more complicated nonlinear functions and thus extend the classification capability

• The activation function does not have to be a sign function; it is often required to be continuous and differentiable

• We can allow the activation function in the output layer to differ from the activation function in the hidden layer, or use a different activation function for each individual unit

• We assume for now that all activation functions are identical


• Expressive Power of Multilayer Networks

Question: Can every decision be implemented by a three-layer network described by equation (1)?

Answer: Yes (due to A. Kolmogorov)

"Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights."

2 n 1
g( x )    j   ij ( xi )
j 1
x  I n ( I  [ 0 ,1 ]; n  2 )

for properly chosen functions j and ij


• Each of the 2n+1 hidden units \Xi_j takes as input a sum of d nonlinear functions, one for each input feature x_i

• Each hidden unit emits a nonlinear function \Xi_j of its total input

• The output unit emits the sum of the contributions of the hidden units

Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition
Backpropagation Algorithm

• Any function from input to output can be implemented as a three-layer neural network

• These results are of greater theoretical interest than practical value, since the construction of such a network requires the nonlinear functions and the weight values, which are unknown!


• Our goal now is to set the interconnection weights based on the training patterns and the desired outputs

• In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights

• The power of backpropagation is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights; this is known as the credit assignment problem


• Networks have two modes of operation:

• Feedforward
  The feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to obtain the values of the output units (no cycles!)

• Learning
  Supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output
• Network Learning

• Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, …, c, and let w represent all the weights of the network

• The training error:

  J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \lVert \mathbf{t} - \mathbf{z} \rVert^2

• The backpropagation learning rule is based on gradient descent

• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

  \Delta \mathbf{w} = -\eta \frac{\partial J}{\partial \mathbf{w}}
where \eta is the learning rate, which indicates the relative size of the change in weights:

  \mathbf{w}(m+1) = \mathbf{w}(m) + \Delta \mathbf{w}(m)

where m indexes the m-th pattern presented

• Error on the hidden-to-output weights

  \frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \frac{\partial net_k}{\partial w_{kj}}

where the sensitivity of unit k is defined as

  \delta_k \equiv -\frac{\partial J}{\partial net_k}

and describes how the overall error changes with the unit's net activation:

  \delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k)
Since net_k = \mathbf{w}_k^t \mathbf{y}, we have

  \frac{\partial net_k}{\partial w_{kj}} = y_j

Conclusion: the weight update (or learning rule) for the hidden-to-output weights is

  \Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k)\, y_j

• Error on the input-to-hidden weights

  \frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}

However,

  \frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j}
  = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) f'(net_k)\, w_{kj}

As in the preceding case, we define the sensitivity of a hidden unit:

  \delta_j \equiv f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k

which means that: "The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_{kj}, all multiplied by f'(net_j)"

Conclusion: the learning rule for the input-to-hidden weights is

  \Delta w_{ji} = \eta\, x_i \delta_j = \eta \left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] f'(net_j)\, x_i
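Putting the two learning rules together, a minimal numpy sketch of one stochastic backpropagation update for a single pattern; the function name, the tanh activation, and the convention of storing biases in column 0 of each weight matrix are assumptions of mine, not the book's notation:

```python
import numpy as np

def backprop_step(x, t, W_ji, W_kj, eta=0.1,
                  f=np.tanh, df=lambda a: 1.0 - np.tanh(a) ** 2):
    """One stochastic backpropagation update; t is the length-c target vector."""
    x_aug = np.concatenate(([1.0], x))
    net_j = W_ji @ x_aug                       # hidden net activations
    y = f(net_j)
    y_aug = np.concatenate(([1.0], y))
    net_k = W_kj @ y_aug                       # output net activations
    z = f(net_k)

    delta_k = (t - z) * df(net_k)              # output sensitivities (t_k - z_k) f'(net_k)
    delta_j = df(net_j) * (W_kj[:, 1:].T @ delta_k)   # hidden sensitivities

    W_kj = W_kj + eta * np.outer(delta_k, y_aug)      # hidden-to-output update
    W_ji = W_ji + eta * np.outer(delta_j, x_aug)      # input-to-hidden update
    return W_ji, W_kj
```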

• Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as:

Begin initialize n_H; \mathbf{w}, criterion \theta, \eta, m \leftarrow 0
    do m \leftarrow m + 1
        \mathbf{x}^m \leftarrow randomly chosen pattern
        w_{ji} \leftarrow w_{ji} + \eta \delta_j x_i;  w_{kj} \leftarrow w_{kj} + \eta \delta_k y_j
    until \lVert \nabla J(\mathbf{w}) \rVert < \theta
    return \mathbf{w}
End
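A runnable sketch of this loop, reusing the hypothetical forward and backprop_step functions from the earlier snippets; the stopping test on the full-set error J is a stand-in of mine for the gradient-norm criterion in the pseudocode:

```python
import numpy as np

def train(X, T, n_H, eta=0.1, theta=1e-3, max_iters=10000,
          rng=np.random.default_rng(0)):
    """Stochastic backpropagation: single-pattern updates until the error is small."""
    d, c = X.shape[1], T.shape[1]
    W_ji = rng.normal(scale=0.5, size=(n_H, d + 1))   # pseudo-random initial weights
    W_kj = rng.normal(scale=0.5, size=(c, n_H + 1))
    for m in range(max_iters):
        i = rng.integers(len(X))                       # x^m: randomly chosen pattern
        W_ji, W_kj = backprop_step(X[i], T[i], W_ji, W_kj, eta)
        J = 0.5 * sum(np.sum((t - forward(x, W_ji, W_kj)) ** 2) for x, t in zip(X, T))
        if J < theta:                                  # stand-in for ||grad J(w)|| < theta
            break
    return W_ji, W_kj
```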

Pattern Classification, Chapter 6


2
6
• Stopping criterion

• The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value \theta

• There are other stopping criteria that lead to better performance than this one

• So far, we have considered the error on a single pattern, but we want to consider an error defined over the entire set of training patterns

• The total training error is the sum over the errors of the n individual patterns:

  J = \sum_{p=1}^{n} J_p \qquad (2)

• Stopping criterion (cont.)

• A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set

• However, given a large number of such individual updates, the total error of equation (2) decreases

• Learning Curves

• Before training starts, the error on the training set is high; through
the learning process, the error becomes smaller

• The error per pattern depends on the amount of training data and
the expressive power (such as the number of weights) in the
network

• The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase

• A validation set is used in order to decide when to stop training; we do not want to overfit the network and decrease the generalization power of the classifier:
  "We stop training at a minimum of the error on the validation set"
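A sketch of this early-stopping rule, again built on the hypothetical backprop_step, forward, and training_error helpers from the earlier snippets:

```python
import numpy as np

def train_early_stopping(X_tr, T_tr, X_val, T_val, n_H, eta=0.1, max_epochs=200,
                         rng=np.random.default_rng(0)):
    """Train with backpropagation but keep the weights at the validation-error minimum."""
    d, c = X_tr.shape[1], T_tr.shape[1]
    W_ji = rng.normal(scale=0.5, size=(n_H, d + 1))
    W_kj = rng.normal(scale=0.5, size=(c, n_H + 1))
    best = (np.inf, W_ji, W_kj)
    for epoch in range(max_epochs):
        for i in rng.permutation(len(X_tr)):           # one pass over the training set
            W_ji, W_kj = backprop_step(X_tr[i], T_tr[i], W_ji, W_kj, eta)
        val_err = sum(training_error(t, forward(x, W_ji, W_kj))
                      for x, t in zip(X_val, T_val))   # error on the validation set
        if val_err < best[0]:                          # new minimum of the validation error
            best = (val_err, W_ji.copy(), W_kj.copy())
    return best[1], best[2]                            # weights at the validation minimum
```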


EXERCISES

• Exercise #1.
  Explain why an MLP (multilayer perceptron) does not learn if the initial weights and biases are all zeros

• Exercise #2. (#2 p. 344)

