DHSCH 6


Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Chapter 6: Multilayer Neural Networks
(Sections 6.1-6.3)

• Introduction
• Feedforward Operation and Classification
• Backpropagation Algorithm
Introduction

• Goal: classify objects by learning a nonlinearity

• There are many problems for which linear discriminants are insufficient for minimum error

• In previous methods, the central difficulty was the choice of the appropriate nonlinear functions

• A "brute force" approach might be to select a complete basis set such as all polynomials; such a classifier would require too many parameters to be determined from a limited number of training samples

• There is no automatic method for determining the nonlinearities when no information is provided to the classifier

• In multilayer neural networks, the form of the nonlinearity is learned from the training data


Feedforward Operation and Classification

• A three-layer neural network consists of an input layer, a hidden layer and an output layer, interconnected by modifiable weights represented by links between the layers

• A single "bias unit" is connected to each unit other than the input units

• Net activation:

  net_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} \equiv \mathbf{w}_j^t \mathbf{x},

  where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_{ji} denotes the input-to-hidden layer weights at hidden unit j. (In neurobiology, such weights or connections are called "synapses")

• Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j)
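For concreteness, a minimal sketch (not from the text; the function name, the use of numpy, and the tanh activation are my own choices) of how a single hidden unit's net activation and output could be computed, with the bias carried by an augmented input component x_0 = 1:

```python
import numpy as np

def hidden_unit_output(x, w_j, f=np.tanh):
    """y_j = f(net_j) with net_j = w_j^t . x (x augmented by x_0 = 1 for the bias).

    x   : input vector of length d
    w_j : weight vector of length d+1; w_j[0] is the bias weight w_j0
    f   : activation function (tanh chosen here only as an example)
    """
    net_j = w_j @ np.concatenate(([1.0], x))   # net_j = sum_i x_i w_ji + w_j0
    return f(net_j)                            # y_j = f(net_j)
```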
Figure 6.1 shows a simple threshold function:

  f(net) = \mathrm{sgn}(net) = \begin{cases} \phantom{-}1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases}

• The function f(.) is also called the activation function or "nonlinearity" of a unit. There are more general activation functions with desirable properties

• Each output unit similarly computes its net activation based on the hidden unit signals as:

  net_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} \equiv \mathbf{w}_k^t \mathbf{y},

  where the subscript k indexes units in the output layer and n_H denotes the number of hidden units
• When there is more than one output, the outputs are denoted z_k. An output unit computes the nonlinear function of its net activation, emitting z_k = f(net_k)

• In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x), k = 1, …, c, and classify the input x according to the largest discriminant function g_k(x)

• The three-layer network with the weights listed in fig. 6.1 solves the XOR problem

• The hidden unit y_1 computes the boundary x_1 + x_2 + 0.5 = 0:
  x_1 + x_2 + 0.5 \ge 0 \Rightarrow y_1 = +1, otherwise y_1 = -1

• The hidden unit y_2 computes the boundary x_1 + x_2 - 1.5 = 0:
  x_1 + x_2 - 1.5 \ge 0 \Rightarrow y_2 = +1, otherwise y_2 = -1

• The final output unit emits z_1 = +1 if and only if y_1 = +1 and y_2 = -1, i.e.
  z_1 = y_1 AND NOT y_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2) = x_1 XOR x_2,
  which provides the nonlinear decision of fig. 6.1
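A quick numeric check of this construction, as a sketch: it assumes the ±1 input encoding of fig. 6.1 and codes the output unit directly as y_1 AND NOT y_2 rather than with the figure's specific output weights:

```python
def sgn(v):
    return 1 if v >= 0 else -1

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)     # hidden unit 1: fires for x1 OR x2
    y2 = sgn(x1 + x2 - 1.5)     # hidden unit 2: fires for x1 AND x2
    # output: +1 iff y1 = +1 and y2 = -1, i.e. y1 AND NOT y2
    return 1 if (y1 == 1 and y2 == -1) else -1

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))   # -1, +1, +1, -1  =  x1 XOR x2
```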

• General Feedforward Operation – case of c output units

  g_k(\mathbf{x}) \equiv z_k = f\!\left( \sum_{j=1}^{n_H} w_{kj}\, f\!\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \quad k = 1, \dots, c \qquad (1)
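An illustrative sketch of equation (1); numpy, the tanh activation, and the weight-matrix names W_ji / W_kj are my own assumptions, with the biases stored in column 0 of each matrix:

```python
import numpy as np

def forward(x, W_ji, W_kj, f=np.tanh):
    """Equation (1): g_k(x) = f( sum_j w_kj f( sum_i w_ji x_i + w_j0 ) + w_k0 ).

    W_ji : (n_H, d+1) input-to-hidden weights, column 0 = biases w_j0
    W_kj : (c, n_H+1) hidden-to-output weights, column 0 = biases w_k0
    """
    y = f(W_ji @ np.concatenate(([1.0], x)))   # hidden outputs y_j
    z = f(W_kj @ np.concatenate(([1.0], y)))   # outputs z_k = g_k(x)
    return z

# classify x by the largest discriminant function:
# label = np.argmax(forward(x, W_ji, W_kj))
```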
• Hidden units enable us to express more complicated nonlinear functions and thus extend the classification capability

• The activation function does not have to be a sign function; it is often required to be continuous and differentiable

• We can allow the activation function in the output layer to differ from the activation function in the hidden layer, or use a different activation function for each individual unit

• We assume for now that all activation functions are identical


• Expressive Power of Multilayer Networks

Question: Can every decision be implemented by a three-layer network described by equation (1)?

Answer: Yes (due to A. Kolmogorov)

"Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights."

2 n 1
g( x )    j   ij ( xi )
j 1
x  I n ( I  [ 0 ,1 ]; n  2 )

for properly chosen functions j and ij


• Each of the 2n+1 hidden units \Xi_j takes as input a sum of d nonlinear functions, one for each input feature x_i

• Each hidden unit emits a nonlinear function \Xi_j of its total input

• The output unit emits the sum of the contributions of the hidden units

Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition
Backpropagation Algorithm

• Any function from input to output can be implemented as a three-layer neural network

• These results are of greater theoretical interest than practical value, since the construction of such a network requires the nonlinear functions and the weight values, which are unknown!


• Our goal now is to set the interconnection weights based on the training patterns and the desired outputs

• In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights

• The power of backpropagation is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights; this is known as the credit assignment problem


• Networks have two modes of operation:

• Feedforward
  The feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to obtain the values of the output units (no cycles!)

• Learning
  Supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output
• Network Learning

• Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, …, c, and let w represent all the weights of the network

• The training error:

  J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \lVert \mathbf{t} - \mathbf{z} \rVert^2

• The backpropagation learning rule is based on gradient descent

• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

  \Delta \mathbf{w} = -\eta \frac{\partial J}{\partial \mathbf{w}}
where \eta is the learning rate, which indicates the relative size of the change in weights:

  \mathbf{w}(m+1) = \mathbf{w}(m) + \Delta \mathbf{w}(m)

where m indexes the m-th pattern presented

• Error on the hidden-to-output weights

  \frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \frac{\partial net_k}{\partial w_{kj}}

where the sensitivity of unit k is defined as

  \delta_k \equiv -\frac{\partial J}{\partial net_k}

and describes how the overall error changes with the unit's net activation:

  \delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k)
Since net_k = \mathbf{w}_k^t \mathbf{y}, we have

  \frac{\partial net_k}{\partial w_{kj}} = y_j

Conclusion: the weight update (or learning rule) for the hidden-to-output weights is

  \Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k)\, y_j

• Error on the input-to-hidden weights

  \frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}

However,

  \frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j}
  = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) f'(net_k)\, w_{kj}

As in the preceding case, we define the sensitivity of a hidden unit:

  \delta_j \equiv f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k

which means that: "The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_{kj}, all multiplied by f'(net_j)"

Conclusion: the learning rule for the input-to-hidden weights is

  \Delta w_{ji} = \eta\, x_i \delta_j = \eta \left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] f'(net_j)\, x_i
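Putting the two learning rules together, a minimal numpy sketch of one stochastic backpropagation update for a single pattern; the function name, the tanh activation, and the convention of storing biases in column 0 of each weight matrix are assumptions of mine, not the book's notation:

```python
import numpy as np

def backprop_step(x, t, W_ji, W_kj, eta=0.1,
                  f=np.tanh, df=lambda a: 1.0 - np.tanh(a) ** 2):
    """One stochastic backpropagation update; t is the length-c target vector."""
    x_aug = np.concatenate(([1.0], x))
    net_j = W_ji @ x_aug                       # hidden net activations
    y = f(net_j)
    y_aug = np.concatenate(([1.0], y))
    net_k = W_kj @ y_aug                       # output net activations
    z = f(net_k)

    delta_k = (t - z) * df(net_k)              # output sensitivities (t_k - z_k) f'(net_k)
    delta_j = df(net_j) * (W_kj[:, 1:].T @ delta_k)   # hidden sensitivities

    W_kj = W_kj + eta * np.outer(delta_k, y_aug)      # hidden-to-output update
    W_ji = W_ji + eta * np.outer(delta_j, x_aug)      # input-to-hidden update
    return W_ji, W_kj
```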

• Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as:

Begin initialize n_H; \mathbf{w}, criterion \theta, \eta, m \leftarrow 0
    do m \leftarrow m + 1
        \mathbf{x}^m \leftarrow randomly chosen pattern
        w_{ji} \leftarrow w_{ji} + \eta \delta_j x_i;  w_{kj} \leftarrow w_{kj} + \eta \delta_k y_j
    until \lVert \nabla J(\mathbf{w}) \rVert < \theta
    return \mathbf{w}
End
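A runnable sketch of this loop, reusing the hypothetical forward and backprop_step functions from the earlier snippets; the stopping test on the full-set error J is a stand-in of mine for the gradient-norm criterion in the pseudocode:

```python
import numpy as np

def train(X, T, n_H, eta=0.1, theta=1e-3, max_iters=10000,
          rng=np.random.default_rng(0)):
    """Stochastic backpropagation: single-pattern updates until the error is small."""
    d, c = X.shape[1], T.shape[1]
    W_ji = rng.normal(scale=0.5, size=(n_H, d + 1))   # pseudo-random initial weights
    W_kj = rng.normal(scale=0.5, size=(c, n_H + 1))
    for m in range(max_iters):
        i = rng.integers(len(X))                       # x^m: randomly chosen pattern
        W_ji, W_kj = backprop_step(X[i], T[i], W_ji, W_kj, eta)
        J = 0.5 * sum(np.sum((t - forward(x, W_ji, W_kj)) ** 2) for x, t in zip(X, T))
        if J < theta:                                  # stand-in for ||grad J(w)|| < theta
            break
    return W_ji, W_kj
```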

Pattern Classification, Chapter 6


2
6
• Stopping criterion

• The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value \theta

• There are other stopping criteria that lead to better performance than this one

• So far, we have considered the error on a single pattern, but we want to consider an error defined over the entire set of training patterns

• The total training error is the sum over the errors of the n individual patterns:

  J = \sum_{p=1}^{n} J_p \qquad (2)

• Stopping criterion (cont.)

• A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set

• However, given a large number of such individual updates, the total error of equation (2) decreases

• Learning Curves

• Before training starts, the error on the training set is high; through
the learning process, the error becomes smaller

• The error per pattern depends on the amount of training data and
the expressive power (such as the number of weights) in the
network

• The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase

• A validation set is used in order to decide when to stop training; we do not want to overfit the network and decrease the generalization power of the classifier:
  "We stop training at a minimum of the error on the validation set"
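A sketch of this early-stopping rule, again built on the hypothetical backprop_step, forward, and training_error helpers from the earlier snippets:

```python
import numpy as np

def train_early_stopping(X_tr, T_tr, X_val, T_val, n_H, eta=0.1, max_epochs=200,
                         rng=np.random.default_rng(0)):
    """Train with backpropagation but keep the weights at the validation-error minimum."""
    d, c = X_tr.shape[1], T_tr.shape[1]
    W_ji = rng.normal(scale=0.5, size=(n_H, d + 1))
    W_kj = rng.normal(scale=0.5, size=(c, n_H + 1))
    best = (np.inf, W_ji, W_kj)
    for epoch in range(max_epochs):
        for i in rng.permutation(len(X_tr)):           # one pass over the training set
            W_ji, W_kj = backprop_step(X_tr[i], T_tr[i], W_ji, W_kj, eta)
        val_err = sum(training_error(t, forward(x, W_ji, W_kj))
                      for x, t in zip(X_val, T_val))   # error on the validation set
        if val_err < best[0]:                          # new minimum of the validation error
            best = (val_err, W_ji.copy(), W_kj.copy())
    return best[1], best[2]                            # weights at the validation minimum
```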


EXERCISES

• Exercise #1.
  Explain why an MLP (multilayer perceptron) does not learn if the initial weights and biases are all zeros

• Exercise #2. (#2 p. 344)

