
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 22, NO. 8, AUGUST 2000

A Bayesian Computer Vision System for Modeling Human Interactions

Nuria M. Oliver, Barbara Rosario, and Alex P. Pentland, Senior Member, IEEE

Abstract: We describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in a visual surveillance task [1]. The system is particularly concerned with detecting when interactions between people occur and classifying the type of interaction. Examples of interesting interaction behaviors include following another person, altering one's path to meet another, and so forth. Our system combines top-down with bottom-up information in a closed feedback loop, with both components employing a statistical Bayesian approach [2]. We propose and compare two different state-based learning architectures, namely, HMMs and CHMMs, for modeling behaviors and interactions. The CHMM model is shown to work much more efficiently and accurately. Finally, to deal with the problem of limited training data, a synthetic "Alife-style" training system is used to develop flexible prior models for recognizing human interactions. We demonstrate the ability to use these a priori models to accurately classify real human behaviors and interactions with no additional tuning or training.

Index Terms: Visual surveillance, people detection, tracking, human behavior recognition, Hidden Markov Models.

1 INTRODUCTION

We describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in a visual surveillance task [1]. The system is particularly concerned with detecting when interactions between people occur and classifying the type of interaction.

Over the last decade there has been growing interest within the computer vision and machine learning communities in the problem of analyzing human behavior in video ([3], [4], [5], [6], [7], [8], [9], [10]). Such systems typically consist of a low- or mid-level computer vision system to detect and segment a moving object (a human or a car, for example) and a higher level interpretation module that classifies the motion into "atomic" behaviors such as, for example, a pointing gesture or a car turning left.

However, there have been relatively few efforts to understand human behaviors that have substantial extent in time, particularly when they involve interactions between people. This level of interpretation is the goal of this paper, with the intention of building systems that can deal with the complexity of multiperson pedestrian and highway scenes [2].

This computational task combines elements of AI/machine learning and computer vision and presents challenging problems in both domains: from a Computer Vision viewpoint, it requires real-time, accurate, and robust detection and tracking of the objects of interest in an unconstrained environment; from a Machine Learning and Artificial Intelligence perspective, behavior models for interacting agents are needed to interpret the set of perceived actions and detect eventual anomalous behaviors or potentially dangerous situations. Moreover, all the processing modules need to be integrated in a consistent way.

Our approach to modeling person-to-person interactions is to use supervised statistical machine learning techniques to teach the system to recognize normal single-person behaviors and common person-to-person interactions. A major problem with a data-driven statistical approach, especially when modeling rare or anomalous behaviors, is the limited number of examples of those behaviors for training the models. A major emphasis of our work, therefore, is on efficient Bayesian integration of prior knowledge (by the use of synthetic prior models) with evidence from data (by situation-specific parameter tuning). Our goal is to be able to successfully apply the system to any normal multiperson interaction situation without additional training.

Another potential problem arises when a completely new pattern of behavior is presented to the system. After the system has been trained at a few different sites, previously unobserved behaviors will be (by definition) rare and unusual. To account for such novel behaviors, the system should be able to recognize new behaviors and to build models of them from as little as a single example.

We have pursued a Bayesian approach to modeling that includes both prior knowledge and evidence from data, believing that the Bayesian approach provides the best framework for coping with small data sets and novel behaviors. Graphical models [11], such as Hidden Markov Models (HMMs) [12] and Coupled Hidden Markov Models (CHMMs) [13], [14], [15], seem most appropriate for modeling and classifying human behaviors because they offer dynamic time warping, a well-understood training algorithm, and a clear Bayesian semantics for both individual (HMMs) and interacting or coupled (CHMMs) generative processes.

N.M. Oliver is with the Adaptive Systems and Interaction Group, Microsoft Research, One Microsoft Way, Redmond, WA 98052. E-mail: [email protected].
B. Rosario is with the School of Information and Management Systems (SIMS), University of California, Berkeley, 100 Academic Hall #4600, Berkeley, CA 94720-4600. E-mail: rosario.sims.berkeley.edu.
A.P. Pentland is with the Vision and Modeling Media Laboratory, MIT, Cambridge, MA 02139. E-mail: [email protected].

Manuscript received 21 Apr. 1999; revised 10 Feb. 2000; accepted 28 Mar. 2000.
Recommended for acceptance by R. Collins.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 109636.

0162-8828/00/$10.00 © 2000 IEEE
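The state-based machinery invoked above (prior, transition, and output probabilities, with sequence likelihoods obtained by summing over state paths) can be made concrete with a small sketch. This is not the authors' implementation; it is a generic discrete-observation HMM forward pass with made-up toy parameters:

```python
import numpy as np

# Toy 2-state HMM over 3 discrete observation symbols (illustrative numbers only).
pi = np.array([0.6, 0.4])            # prior P(s_1)
A = np.array([[0.7, 0.3],            # transition P(s_t | s_{t-1}); rows sum to 1
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # emission p(o_t | s_t); rows sum to 1
              [0.1, 0.3, 0.6]])

def forward_likelihood(obs):
    """Return P(O) by the forward algorithm (sum over all hidden state paths)."""
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = P(s_1 = i) p(o_1 | i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # alpha_t(j) = sum_i alpha_{t-1}(i) A[i, j] p(o_t | j)
    return alpha.sum()

print(forward_likelihood([0, 1, 2]))          # likelihood of a short observation sequence
```

The three arrays play the roles of the prior, transition, and output probabilities used in the state-estimation formulas of Section 4; classification then amounts to comparing such likelihoods across competing behavior models.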

Fig. 1. Top-down and bottom-up processing loop.
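As a schematic, the closed loop of Fig. 1 could be organized as follows. All class and method names here are hypothetical illustrations, not the paper's actual software; the sketch only shows the bottom-up flow of feature vectors into the behavior models and the top-down feedback of expectations:

```python
# Illustrative skeleton of the Fig. 1 processing loop; every name is hypothetical.
class VisionModule:
    def detect_and_track(self, frame):
        """Bottom-up: return one feature vector per moving object in the frame."""
        return []  # placeholder: segmentation + Kalman tracking would go here

    def apply_expectations(self, expectations):
        """Top-down: bias detection/tracking toward the models' predictions."""
        self.expectations = expectations

class BehaviorModels:
    def classify(self, feature_vectors):
        """Score feature vectors under the state-based behavior models."""
        return {"labels": [], "expectations": []}  # placeholder

def processing_loop(frames):
    vision, models = VisionModule(), BehaviorModels()
    for frame in frames:
        features = vision.detect_and_track(frame)          # bottom-up stream
        result = models.classify(features)                 # behavior inference
        vision.apply_expectations(result["expectations"])  # top-down feedback
        yield result["labels"]
```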

To specify the priors in our system, we have developed a framework for building and training models of the behaviors of interest using synthetic agents [16], [17]. Simulation with the agents yields synthetic data that is used to train prior models. These prior models are then used recursively in a Bayesian framework to fit real behavioral data. This approach provides a rather straightforward and flexible technique for the design of priors, one that does not require strong analytical assumptions to be made about the form of the priors.1 In our experiments, we have found that by combining such synthetic priors with limited real data we can easily achieve very high accuracies of recognition of different human-to-human interactions. Thus, our system is robust to cases in which there are only a few examples of a certain behavior (such as in interaction type 2 described in Section 5) or even no examples except synthetically generated ones.

The paper is structured as follows: Section 2 presents an overview of the system, Section 3 describes the computer vision techniques used for segmentation and tracking of the pedestrians, and the statistical models used for behavior modeling and recognition are described in Section 4. A brief description of the synthetic agent environment that we have created is given in Section 5. Section 6 contains experimental results with both synthetic agent data and real video data, and Section 7 summarizes the main conclusions and sketches our future directions of research. Finally, a summary of the CHMM formulation is presented in the Appendix.

2 SYSTEM OVERVIEW

Our system employs a static camera with wide field-of-view watching a dynamic outdoor scene (the extension to an active camera [18] is straightforward and planned for the next version). A real-time computer vision system segments moving objects from the learned scene. The scene description method allows variations in lighting, weather, etc., to be learned and accurately discounted.

For each moving object an appearance-based description is generated, allowing it to be tracked through temporary occlusions and multiobject meetings. A Kalman filter tracks the object's location, coarse shape, color pattern, and velocity. This temporally ordered stream of data is then used to obtain a behavioral description of each object and to detect interactions between objects.

Fig. 1 depicts the processing loop and main functional units of our ultimate system.

1. The real-time computer vision input module detects and tracks moving objects in the scene and, for each moving object, outputs a feature vector describing its motion and heading, and its spatial relationship to all nearby moving objects.
2. These feature vectors constitute the input to stochastic state-based behavior models. Both HMMs and CHMMs, with varying structures depending on the complexity of the behavior, are then used for classifying the perceived behaviors.

Note that both top-down and bottom-up streams of information are continuously managed and combined for each moving object within the scene. Consequently, our Bayesian approach offers a mathematical framework for combining the observations (bottom-up) with complex behavioral priors (top-down) to provide expectations that are fed back to the perceptual system.

1. Note that our priors have the same form as our posteriors, namely, they are Markov models.

3 SEGMENTATION AND TRACKING

The first step in the system is to reliably and robustly detect and track the pedestrians in the scene. We use 2D blob features for modeling each pedestrian. The notion of "blobs" as a representation for image features has a long history in computer vision [19], [20], [21], [22], [23] and has had many different mathematical definitions. In our usage, a blob is a compact set of pixels that share some visual properties that are not shared by the surrounding pixels. These properties could be color, texture, brightness, motion, shading, a combination of these, or any other salient spatio-temporal property derived from the signal (the image sequence).
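As an illustration, a blob in this sense can be recovered from a binary motion mask by connected-component grouping. The following is a minimal pure-Python sketch (a 4-connected flood fill with a speckle threshold; the actual system's algorithm and thresholds are not specified here):

```python
from collections import deque

def blobs_from_mask(mask, min_pixels=4):
    """Group 4-connected foreground pixels into blobs; return (x0, y0, x1, y1) boxes."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # BFS flood fill over one connected component
            q, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while q:
                cy, cx = q.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        q.append((ny, nx))
            if len(pixels) < min_pixels:          # discard speckle noise
                continue
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

mask = [[0] * 8 for _ in range(8)]
for y in range(1, 4):
    for x in range(1, 4):
        mask[y][x] = 1          # one compact 3x3 blob
mask[6][6] = 1                  # isolated pixel, filtered as noise
print(blobs_from_mask(mask))    # [(1, 1, 3, 3)]
```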

Fig. 2. Background mean image, blob segmentation image, and input image with blob bounding boxes.

3.1 Segmentation by Eigenbackground Subtraction

In our system, the main cue for clustering the pixels into blobs is motion, because we have a static background with moving objects. To detect these moving objects, we adaptively build an eigenspace that models the background. This eigenspace model describes the range of appearances (e.g., lighting variations over the day, weather variations, etc.) that have been observed. The eigenspace could also be generated from a site model using standard computer graphics techniques.

The eigenspace model is formed by taking a sample of $N$ images and computing both the mean background image $\mu_b$ and its covariance matrix $C_b$. This covariance matrix can be diagonalized via an eigenvalue decomposition $L_b = \Phi_b C_b \Phi_b^T$, where $\Phi_b$ is the eigenvector matrix of the covariance of the data and $L_b$ is the corresponding diagonal matrix of its eigenvalues. In order to reduce the dimensionality of the space, in principal component analysis (PCA) only $M$ eigenvectors (eigenbackgrounds) are kept, corresponding to the $M$ largest eigenvalues, to give a matrix $\Phi_{M_b}$. A principal component feature vector $\Phi_{M_b}^T X_i$ is then formed, where $X_i = I_i - \mu_b$ is the mean normalized image vector.

Note that moving objects, because they don't appear in the same location in the $N$ sample images and they are typically small, do not have a significant contribution to this model. Consequently, the portions of an image containing a moving object cannot be well-described by this eigenspace model (except in very unusual cases), whereas the static portions of the image can be accurately described as a sum of the various eigenbasis vectors. That is, the eigenspace provides a robust model of the probability distribution function of the background, but not for the moving objects.

Once the eigenbackground images (stored in a matrix called $\Phi_{M_b}$ hereafter) are obtained, as well as their mean $\mu_b$, we can project each input image $I_i$ onto the space expanded by the eigenbackground images, $B_i = \Phi_{M_b} X_i$, to model the static parts of the scene, pertaining to the background. Therefore, by computing and thresholding the Euclidean distance (distance from feature space, DFFS [24]) between the input image and the projected image, we can detect the moving objects present in the scene: $D_i = |I_i - B_i| > t$, where $t$ is a given threshold. Note that it is easy to adaptively perform the eigenbackground subtraction in order to compensate for changes such as big shadows. This motion mask is the input to a connected component algorithm that produces blob descriptions that characterize each person's shape. We have also experimented with modeling the background by using a mixture of Gaussian distributions at each pixel, as in Pfinder [25]. However, we finally opted for the eigenbackground method because it offered good results and less computational load.

3.2 Tracking

The trajectories of each blob are computed and saved into a dynamic track memory. Each trajectory has associated with it a first order Kalman filter that predicts the blob's position and velocity in the next frame. Recall that the Kalman filter is the "best linear unbiased estimator" in a mean squared sense and that, for Gaussian processes, the Kalman filter equations correspond to the optimal Bayes estimate.

In order to handle occlusions as well as to solve the correspondence between blobs over time, the appearance of each blob is also modeled by a Gaussian PDF in RGB color space. When a new blob appears in the scene, a new trajectory is associated with it. Thus, for each blob, the Kalman-filter-generated spatial PDF and the Gaussian color PDF are combined to form a joint $(x, y)$ image space and color space PDF. In subsequent frames, the Mahalanobis distance is used to determine the blob that is most likely to have the same identity (see Fig. 2).

4 BEHAVIOR MODELS

In this section, we develop our framework for building and applying models of individual behaviors and person-to-person interactions. In order to build effective computer models of human behaviors, we need to address the question of how knowledge can be mapped onto computation to dynamically deliver consistent interpretations.

From a strict computational viewpoint, there are two key problems when processing the continuous flow of feature data coming from a stream of input video: 1) Managing the computational load imposed by frame-by-frame examination of all of the agents and their interactions. For example, the number of possible interactions between any two agents of a set of $N$ agents is $N(N-1)/2$. If naively managed, this load can easily become large for even moderate $N$. 2) Even when the frame-by-frame load is small and the representation of each agent's instantaneous behavior is compact, there is still the problem of managing all this information over time.

Statistical directed acyclic graphs (DAGs) or probabilistic inference networks (PINs) [26], [27] can provide a computationally efficient solution to these problems. HMMs and their extensions, such as CHMMs, can be viewed as a particular, simple case of temporal PIN or DAG. PINs consist of a set of random variables represented as nodes as well as directed edges or links between them. They define a mathematical form of the joint or conditional PDF between the random variables. They constitute a simple graphical way of representing causal dependencies between variables. The absence of directed links between nodes implies a conditional independence. Moreover, there is a family of transformations performed on the graphical structure that
834 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 22, NO. 8, AUGUST 2000

has a direct translation in terms of mathematical operations applied to the underlying PDF. Finally, they are modular, i.e., one can express the joint global PDF as the product of local conditional PDFs.

PINs present several important advantages that are relevant to our problem: They can handle incomplete data as well as uncertainty; they are trainable and it is easy to avoid overfitting; they encode causality in a natural way; there are algorithms for both doing prediction and probabilistic inference; they offer a framework for combining prior knowledge and data; and, finally, they are modular and parallelizable.

In this paper, the behaviors we examine are generated by pedestrians walking in an open outdoor environment. Our goal is to develop a generic, compositional analysis of the observed behaviors in terms of states and transitions between states over time in such a manner that 1) the states correspond to our common sense notions of human behaviors and 2) they are immediately applicable to a wide range of sites and viewing situations. Fig. 3 shows a typical image for our pedestrian scenario.

Fig. 3. A typical image of a pedestrian plaza.

4.1 Visual Understanding via Graphical Models: HMMs and CHMMs

Hidden Markov models (HMMs) are a popular probabilistic framework for modeling processes that have structure in time. They have a clear Bayesian semantics, efficient algorithms for state and parameter estimation, and they automatically perform dynamic time warping. An HMM is essentially a quantization of a system's configuration space into a small number of discrete states, together with probabilities for transitions between states. A single finite discrete variable indexes the current state of the system. Any information about the history of the process needed for future inferences must be reflected in the current value of this state variable. Graphically, HMMs are often depicted "rolled-out in time" as PINs, such as in Fig. 4.

However, many interesting systems are composed of multiple interacting processes and, thus, merit a compositional representation of two or more variables. This is typically the case for systems that have structure both in time and space. Even with the correct number of states and vast amounts of data, large HMMs generally train poorly because the data is partitioned among states early (and incorrectly) during training: the Markov independence structure then ensures that the data is not shared by states, thus reinforcing any mistakes in the initial partitioning. Systems with multiple processes have states that share properties and, thus, emit similar signals. With a single state variable, Markov models are ill-suited to these problems. Even though an HMM can model any system in principle, in practice, the simple independence structure is a liability for large systems and for systems with compositional state. In order to model these interactions, a more complex architecture is needed.

4.1.1 Varieties of Couplings

Extensions to the basic Markov model generally increase the memory of the system (durational modeling), providing it with compositional state in time. We are interested in systems that have compositional state in space, e.g., more than one simultaneous state variable. Models with compositional state would offer conceptual advantages of parsimony and clarity, with consequent computational benefits in efficiency and accuracy. Using graphical model notation, we can construct various architectures for multi-HMM couplings offering compositional state under various assumptions of independence. It is well-known that the exact solution of extensions of the basic HMM to three or more chains is intractable. In those cases, approximation techniques are needed ([28], [29], [30], [31]). However, it is also known that there exists an exact solution for the case of two interacting chains, as in our case [28], [14].

In particular, one can think of extending the basic HMM framework at two different levels:

1. Coupling the outputs. The weakest coupling is when two independent processes are coupled at the output, superimposing their outputs in a single observed signal (Fig. 5). This is known as a source separation problem: signals with zero mutual information are overlaid in a single channel. In true couplings, however, the processes are dependent

Fig. 4. Graphical representation of HMM and CHMM rolled-out in time.



Fig. 5. Graphical representation of FHMM, LHMM, and HMDT rolled-out in time.

and interact by influencing each other's states. One example is the sensor fusion problem: Multiple channels carry complementary information about different components of a system, e.g., acoustical signals from speech and visual features from lip tracking [32]. In [29], a generalization of HMMs with coupling at the outputs is presented. These are Factorial HMMs (FHMMs), where the state variable is factored into multiple state variables. They have a clear representational advantage over HMMs: to model $C$ processes, each with $N$ states, one would require an HMM with $N^C$ joint states, which is typically intractable in both space and time. FHMMs are tractable in space, taking $NC$ states, but present an inference problem equivalent to that of a combinatoric HMM. Therefore, exact solutions are intractable in time. The authors present tractable approximations using Gibbs sampling, mean field theory, or structured mean field.

2. Coupling the states. In [28], a statistical mechanical framework for modeling discrete time series is presented. The authors couple two HMMs to exploit the correlation between feature sets. Two parallel Boltzmann chains are coupled by weights that connect their hidden units, shown in Fig. 5 as Linked HMMs (LHMMs). Like the transition and emission weights within each chain, the coupling weights are tied across the length of the network. The independence structure of such an architecture is suitable for expressing symmetrical synchronous constraints, long-term dependencies between hidden states, or processes that are coupled at different time scales. Their algorithm is based on decimation, a method from statistical mechanics in which the marginal distributions of singly or doubly connected nodes are integrated out. A limited class of graphs can be recursively decimated, obtaining correlations for any connected pair of nodes.

Finally, Hidden Markov Decision Trees (HMDTs) [33] are a decision tree with Markov temporal structure (see Fig. 5). The model is intractable for exact calculations. Thus, the authors use variational approximations. They consider three distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. The underlying independence structure is suitable for representing hierarchical structure in a signal; for example, the baseline of a song constrains the melody and both constrain the harmony.

We use CHMMs to model two interacting processes, which in our case correspond to individual humans. In this architecture, state chains are coupled via matrices of conditional probabilities modeling causal (temporal) influences between their hidden state variables. The graphical representation of CHMMs is shown in Fig. 4. Exact maximum a posteriori (MAP) inference is an $O(TN^4)$ computation [34], [30]. We have developed a deterministic $O(TN^2)$ algorithm for maximum entropy approximations to state and parameter values in CHMMs. From the graph, it can be seen that, for each chain, the state at time $t$ depends on the state at time $t-1$ in both chains. The influence of one chain on the other is through a causal link. The Appendix contains a summary of the CHMM formulation.

In this paper, we compare the performance of HMMs and CHMMs for maximum a posteriori (MAP) state estimation. We compute the most likely sequence of states $\hat{S}$ within a model given the observation sequence $O = \{o_1, \ldots, o_n\}$. This most likely sequence is obtained by $\hat{S} = \arg\max_S P(S|O)$.

In the case of HMMs, the posterior state sequence probability $P(S|O)$ is given by

$$P(S|O) = \frac{P_{s_1} p_{s_1}(o_1) \prod_{t=2}^{T} p_{s_t}(o_t) P_{s_t|s_{t-1}}}{P(O)}, \qquad (1)$$

where $S = \{a_1, \ldots, a_N\}$ is the set of discrete states and $s_t \in S$ corresponds to the state at time $t$. $P_{i|j} \doteq P_{s_t = a_i | s_{t-1} = a_j}$ is the state-to-state transition probability (i.e., the probability of being in state $a_i$ at time $t$ given that the system was in state $a_j$ at time $t-1$). In the following, we will write them as $P_{s_t|s_{t-1}}$. The prior probabilities for the initial state are $P_i \doteq P_{s_1 = a_i} = P_{s_1}$. And, finally, $p_i(o_t) \doteq p_{s_t = a_i}(o_t) = p_{s_t}(o_t)$ are the output probabilities for each state (i.e., the probability of observing $o_t$ given state $a_i$ at time $t$).

In the case of CHMMs, we introduce another set of probabilities, $P_{s_t|s'_{t-1}}$, which correspond to the probability of state $s_t$ at time $t$ in one chain given that the other chain (denoted hereafter by the superscript $'$) was in state $s'_{t-1}$ at time $t-1$. These new probabilities express the causal influence (coupling) of one chain on the other. The posterior state probability for CHMMs is given by

$$P(S|O) = \frac{P_{s_1} p_{s_1}(o_1) P_{s'_1} p_{s'_1}(o'_1)}{P(O)} \prod_{t=2}^{T} P_{s_t|s_{t-1}} P_{s'_t|s'_{t-1}} P_{s'_t|s_{t-1}} P_{s_t|s'_{t-1}} p_{s_t}(o_t) p_{s'_t}(o'_t), \qquad (2)$$

where $s_t, s'_t, o_t, o'_t$ denote the states and observations for each of the Markov chains that compose the CHMM. A coupled HMM of $C$ chains has a joint state trellis that is in principle $N^C$ states wide; the associated dynamic programming problem is $O(TN^{2C})$. In [14], an approximation is developed using N-heads dynamic programming such that an $O(T(CN)^2)$ algorithm is obtained that closely approximates the full combinatoric result.

Coming back to our problem of modeling human behaviors, two persons (each modeled as a generative process) may interact without wholly determining each other's behavior. Instead, each of them has its own internal dynamics and is influenced (either weakly or strongly) by others. The probabilities $P_{s_t|s'_{t-1}}$ and $P_{s'_t|s_{t-1}}$ describe this kind of interaction, and CHMMs are intended to model it in as efficient a manner as possible.

5 SYNTHETIC BEHAVIORAL AGENTS

We have developed a framework for creating synthetic agents that mimic human behavior in a virtual environment [16], [17]. The agents can be assigned different behaviors and they can interact with each other as well. Currently, they can generate five different interacting behaviors and various kinds of individual behaviors (with no interaction). The parameters of this virtual environment are modeled on the basis of a real pedestrian scene from which we obtained measurements of typical pedestrian movement.

One of the main motivations for constructing such synthetic agents is the ability to generate synthetic data which allows us to determine which Markov model architecture will be best for recognizing a new behavior (since it is difficult to collect real examples of rare behaviors). By designing the synthetic agent models such that they have the best generalization and invariance properties possible, we can obtain flexible prior models that are transferable to real human behaviors with little or no need of additional training. The use of synthetic agents to generate robust behavior models from very few real behavior examples is of special importance in a visual surveillance task, where typically the behaviors of greatest interest are also the most rare.

5.1 Agent Architecture

Our dynamic multiagent system consists of some number of agents that perform some specific behavior from a set of possible behaviors. The system starts at time zero, moving discretely forward to time $T$ or until the agents disappear from the scene.

The agents can follow three different paths with two possible directions, as illustrated in Figs. 6 and 7 by the yellow paths.2 They walk with random speeds within an interval; they appear at random instances of time. They can slow down, speed up, stop, or change direction independently from the other agents on the scene. Their velocity is normally distributed around a mean that increases or decreases when they slow down or speed up. When certain preconditions are satisfied, a specific interaction between two agents takes place. Each agent has perfect knowledge of the world, including the position of the other agents.

2. The three paths were obtained by statistical analysis of the most frequent paths that the pedestrians in the observed plaza followed. Note, however, that the performance of neither the computer vision nor the tracking modules is limited to these three paths.

In the following, we will describe, without loss of generality, the two-agent system that we used for generating prior models and synthetic data of agent interactions. Each agent makes its own decisions depending on the type of interaction, its location, and the location of the other agent on the scene. There is no scripted behavior or a priori knowledge of what kind of interaction, if any, is going to take place. The agents' behavior is determined by the perceived contextual information: current position, relative position of the other agent, speeds, the paths they are on, directions of walk, etc., as well as by each agent's own repertoire of possible behaviors and triggering events. For example, if one agent decides to "follow" the other agent, it will proceed on its own path, increasing its speed progressively until reaching the other agent, which will also be walking on the same path. Once the agent has been reached, they adapt their mutual speeds in order to keep together and continue advancing together until exiting the scene.

For each agent the position, orientation, and velocity is measured, and from this data a feature vector is constructed which consists of: $\dot{d}_{12}$, the derivative of the relative distance between the two agents; $\alpha_{1,2} = \mathrm{sign}(\langle v_1, v_2 \rangle)$, the degree of alignment of the agents; and $v_i = \sqrt{\dot{x}^2 + \dot{y}^2}$, $i = 1, 2$, the magnitude of their velocities. Note that such a feature vector is invariant to the absolute position and direction of the agents and the particular environment they are in.

5.2 Agent Behaviors

The agent behavioral system is structured in a hierarchical way. There are primitive or simple behaviors and complex interactive behaviors to simulate the human interactions.

In the experiments reported in Section 4, we considered five different interacting behaviors, illustrated in Figs. 6 and 7:

1. Follow, reach, and walk together (inter1): The two agents happen to be on the same path, walking in the same direction. The agent behind decides that it wants to reach the other. Therefore, it speeds up in order to reach the other agent. When this happens, it slows down such that they keep walking together at the same speed.
2. Approach, meet, and go on separately (inter2): The agents are on the same path, but in opposite directions. When they are close enough, if they realize that they "know" each other, they slow down and finally stop to chat. After talking, they go on separately, becoming independent again.
3. Approach, meet, and go on together (inter3): In this case, the agents behave as in "inter2," but now, after talking, they decide to continue together. One agent, therefore, changes its direction to follow the other.
4. Change direction in order to meet, approach, meet, and continue together (inter4): The agents start on different paths. When they are close enough, they can see each other and decide to interact. One agent waits

Fig. 6. Example trajectories and feature vector for the interactions: follow, approach+meet+continue separately, and approach+meet+continue
together.

for the other to reach it. The other changes direction in individual behaviors activated in each of the agents. Fig. 8
order to go toward the waiting agent. Then they meet, illustrates the timeline and synchronization of the simple
chat for some time, and decide to go on together. behaviors and events that constitute the interactions.
5. Change direction in order to meet, approach, meet, These interactions can happen at any moment in time and
and go on separately (inter5): This interaction is the at any location, provided only that the precondititions for the
same as ªinter4º except that when they decide to go on interactions are satisfied. The speeds they walk at, the
after talking, they separate, becoming independent. duration of their chats, the changes of direction, the starting
Proper design of the interactive behaviors requires the and ending of the actions vary highly. This high variance in
agents to have knowledge about the position of each the quantitative aspects of the interactions confers robustness
other as well as synchronization between the successive to the learned models that tend to capture only the invariant
parts of the interactions. The invariance reflects the nature of their interactions and the environment.

Fig. 7. Example trajectories and feature vector for the interactions: change direction+approach+meet+continue separately, change direction+approach+meet+continue together, and no interacting behavior.
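As a concrete illustration of how such synthetic training data can be generated, the "approach, meet, and go on separately" (inter2) behavior can be sketched as a toy one-dimensional simulation. This is a hypothetical simplification: the function name and parameters such as `meet_radius` and `chat_time` are ours, not part of the paper's agent package.

```python
import numpy as np

def simulate_inter2(n_steps=100, speed_range=(0.8, 1.2), meet_radius=2.0,
                    chat_time=10, rng=None):
    """Toy sketch of inter2: two agents walk toward each other on the
    same straight path, stop to chat when close, then go on separately."""
    rng = np.random.default_rng(rng)
    # Agents start at opposite ends of the path, heading inward,
    # with random speeds drawn from the allowed interval.
    pos = np.array([0.0, 50.0])
    vel = np.array([rng.uniform(*speed_range), -rng.uniform(*speed_range)])
    chat_left = chat_time
    trajectory = []
    for _ in range(n_steps):
        close = abs(pos[1] - pos[0]) < meet_radius
        if close and chat_left > 0:
            chat_left -= 1               # stopped, chatting
        else:
            pos = pos + vel              # walking (before or after the chat)
        trajectory.append(pos.copy())
    return np.array(trajectory)          # shape (n_steps, 2): 1D positions

trajectory = simulate_inter2(rng=0)
```

Randomizing the speeds, starting times, and chat durations across runs produces exactly the kind of high variance in quantitative detail that the learned models must abstract away.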
6 EXPERIMENTAL RESULTS

Our goal is to have a system that will accurately interpret behaviors and interactions within almost any pedestrian scene with little or no training. One critical problem, therefore, is the generation of models that capture our prior knowledge about human behavior. The selection of priors is one of the most controversial and open issues in Bayesian inference. As we have already described, we solve this problem by using a synthetic agents modeling package, which allows us to build flexible prior behavior models.

6.1 Comparison of CHMM and HMM Architectures with Synthetic Agent Data

We built models of the five previously described synthetic agent interactions with both CHMMs and HMMs. We used two or three states per chain in the case of CHMMs and three to five states in the case of HMMs (according to the complexity of the various interactions). The particular number of states for each architecture was determined using 10 percent cross-validation. Because we used the same amount of data for training both architectures, we tried to keep the number of parameters to estimate roughly the same. For example, a three-state (N = 3) per-chain CHMM with three-dimensional (d = 3) Gaussian observations has (C·N)² + N·(d + d!) = (2·3)² + 3·(3 + 6) = 36 + 27 = 63 parameters. A five-state (N = 5) HMM with six-dimensional (d = 6) Gaussian observations has N² + N·(d + d!) = 5² + 5·(3 + 6) = 25 + 45 = 70 parameters to estimate.

Fig. 8. Timeline of the five complex behaviors in terms of events and simple behaviors.
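The parameter-economy argument can also be seen from the transition structure alone (a simplified sketch; the helper names are ours, and observation-model parameters are deliberately ignored). A single HMM built on the cross-product state space of C chains squares an exponentially larger state count, whereas a CHMM only keeps one conditional transition table per pair of chains:

```python
def joint_hmm_transitions(n_states, n_chains):
    """Transition entries of one HMM on the cross-product state space:
    N**C states, each with a row over N**C successor states."""
    return (n_states ** n_chains) ** 2

def chmm_transitions(n_states, n_chains):
    """A CHMM keeps, for each chain, one N x N conditional transition
    table per chain it is coupled to (including itself)."""
    return n_chains * n_chains * n_states ** 2
```

For two chains of five states each, the cross-product HMM already needs 625 transition entries against the CHMM's 100, which is consistent with the larger training-data appetite of the cross-product models noted in the text.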
Each of these architectures corresponds to a different physical hypothesis: CHMMs encode a spatial coupling in time between two agents (e.g., a nonstationary process), whereas HMMs model the data as an isolated, stationary process. We used from 11 to 75 sequences for training each of the models, depending on their complexity, such that we avoided overfitting. The optimal number of training examples and of states for each interaction, as well as the optimal model parameters, were obtained by a 10 percent cross-validation process. In all cases, the models were set up with a full state-to-state connection topology, so that the training algorithm was responsible for determining an appropriate state structure for the training data. The feature vector was six-dimensional in the case of HMMs, whereas in the case of CHMMs, each agent was modeled by a different chain, each of them with a three-dimensional feature vector. The feature vector was the same as the one described for the synthetic agents, namely: ḋ_12, the derivative of the relative distance between the two persons; α_{1,2} = sign(⟨v_1, v_2⟩), or degree of alignment of the people; and |v_i| = √(ẋ_i² + ẏ_i²), i = 1, 2, the magnitude of their velocities.

TABLE 1
Accuracy for HMMs and CHMMs on Synthetic Data

Accuracy at recognizing when no interaction occurs ("No inter"), and accuracy at classifying each type of interaction: "Inter1" is follow, reach, and walk together; "Inter2" is approach, meet, and go on; "Inter3" is approach, meet, and continue together; "Inter4" is change direction to meet, approach, meet, and go on together; and "Inter5" is change direction to meet, approach, meet, and go on separately.
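A minimal sketch of computing these interaction features from a pair of tracked trajectories follows (assumptions: finite-difference velocities, and the function name and the exact stacking of features are ours):

```python
import numpy as np

def pairwise_features(p1, p2):
    """Per-frame interaction features for two trajectories of shape (T, 2):
    derivative of the relative distance, sign of the velocity alignment,
    and each person's speed."""
    v1 = np.gradient(p1, axis=0)                 # finite-difference velocities
    v2 = np.gradient(p2, axis=0)
    dist = np.linalg.norm(p1 - p2, axis=1)       # relative distance per frame
    d_dot = np.gradient(dist)                    # derivative of the distance
    align = np.sign(np.sum(v1 * v2, axis=1))     # sign of <v1, v2>
    speed1 = np.linalg.norm(v1, axis=1)
    speed2 = np.linalg.norm(v2, axis=1)
    return np.stack([d_dot, align, speed1, speed2], axis=1)
```

In the CHMM case, each chain would see the three-dimensional slice for its own agent (distance derivative, alignment, own speed), while the single HMM sees the features of both agents at once.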
To compare the performance of the two previously described architectures, we used the best trained models to classify 20 unseen new sequences. In order to find the most likely model, the Viterbi algorithm was used for HMMs and the N-heads dynamic programming forward-backward propagation algorithm for CHMMs.

Table 1 illustrates the accuracy for each of the two different architectures and interactions. Note the superiority of CHMMs versus HMMs for classifying the different interactions and, more significantly, for identifying the case in which there were no interactions present in the testing data.

Complexity in time and space is an important issue when modeling dynamic time series. The number of degrees of freedom (state-to-state probabilities + output means + output covariances) in the largest best-scoring model was 85 for HMMs and 54 for CHMMs. We also performed an analysis of the accuracies of the models and architectures with respect to the number of sequences used for training. Efficiency in terms of training data is especially important in the case of online real-time learning systems (such as ours would ultimately be) and/or in domains in which collecting clean labeled data may be difficult.

The cross-product HMMs that result from incorporating both generative processes into the same joint-product state space usually require many more sequences for training because of the larger number of parameters. In our case, this appears to result in an accuracy ceiling of around 80 percent for any amount of training that was evaluated, whereas for CHMMs we were able to reach approximately 100 percent accuracy with only a small amount of training. From this result, it seems that the CHMM architecture, with two coupled generative processes, is more suited to the problem of modeling the behavior of interacting agents than a generative process encoded by a single HMM.

In a visual surveillance system, the false alarm rate is often as important as the classification accuracy. In an ideal automatic surveillance system, all the targeted behaviors should be detected with a close-to-zero false alarm rate, so that we can reasonably alert a human operator to examine them further. To analyze this aspect of our system's performance, we calculated the system's ROC curve. Fig. 9 shows that it is quite possible to achieve very low false alarm rates while still maintaining good classification accuracy.

Fig. 9. ROC curve on synthetic data.

6.2 Pedestrian Behaviors

Our goal is to develop a framework for detecting, classifying, and learning generic models of behavior in a visual surveillance situation. It is important that the models be generic, applicable to many different situations, rather than being tuned to the particular viewing angle or site. This was one of our main motivations for developing a virtual agent environment for modeling behaviors. If the synthetic agents are "similar" enough in their behavior to humans, then the same models that were trained with synthetic data should be directly applicable to human data. This section describes the experiments we have performed analyzing real pedestrian data using both synthetic and site-specific models (models trained on data from the site being monitored).

6.2.1 Data Collection and Preprocessing

Using the person detection and tracking system described in Section 3, we obtained 2D blob features for each person in several hours of video. Up to 20 examples of following and various types of meeting behaviors were detected and processed.
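The ROC analysis described above amounts to sweeping a detection threshold over the models' scores. A minimal sketch (the function name is ours, and the scores stand in for per-sequence model likelihoods):

```python
import numpy as np

def roc_points(scores_interaction, scores_no_interaction, thresholds):
    """An interaction is flagged whenever a sequence's score exceeds the
    threshold; sweeping it trades detection rate against false alarms."""
    points = []
    for th in thresholds:
        tpr = np.mean(scores_interaction >= th)     # detection rate
        fpr = np.mean(scores_no_interaction >= th)  # false alarm rate
        points.append((fpr, tpr))
    return points
```

Plotting the (false alarm rate, detection rate) pairs over a fine grid of thresholds yields curves like those in Figs. 9 and 11.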
The feature vector x coming from the computer vision processing module consisted of the 2D (x, y) centroid (mean position) of each person's blob; the Kalman filter state for each instant of time, consisting of (x̂, ẋ̂, ŷ, ẏ̂), where ˆ represents the filter estimate; and the (r, g, b) components of the mean of the Gaussian fitted to each blob in color space. The frame rate of the vision system was about 20-30 Hz on an SGI R10000 O2 computer. We low-pass filtered the data with a 3 Hz cutoff filter and computed, for every pair of nearby persons, a feature vector consisting of: ḋ_12, the derivative of the relative distance between the two persons; |v_i|, i = 1, 2, the norm of the velocity vector of each person; and α = sign(⟨v_1, v_2⟩), the degree of alignment of the trajectories of each person. Typical trajectories and feature vectors for an "approach, meet, and continue separately" behavior (interaction 2) are shown in Fig. 10. This is the same type of behavior as "inter2" displayed in Fig. 6 for the synthetic agents. Note the similarity of the feature vectors in both cases.

Even though multiple pairwise interactions could potentially be detected and recognized, we only had examples of one interaction taking place at a time. Therefore, all our results refer to single pairwise interaction detection.
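The preprocessing step can be sketched as follows. The paper does not specify the filter design, so this stand-in uses a simple one-pole smoother run forward and backward (for zero phase), matched to a 3 Hz cutoff at a 30 Hz frame rate:

```python
import numpy as np

def lowpass(signal, cutoff_hz=3.0, fs_hz=30.0):
    """Crude zero-phase low-pass smoothing of a tracked feature sequence."""
    x = np.asarray(signal, dtype=float)
    # One-pole smoothing coefficient matched to the cutoff frequency.
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / fs_hz)

    def smooth(seq):
        out = np.empty_like(seq)
        acc = seq[0]
        for i, v in enumerate(seq):
            acc += alpha * (v - acc)
            out[i] = acc
        return out

    # Forward pass, then backward pass to cancel the phase delay.
    return smooth(smooth(x)[::-1])[::-1]
```

Applied to the raw centroid tracks, this suppresses frame-to-frame tracking jitter while preserving the slow position changes that the interaction features are computed from.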
6.2.2 Behavior Models and Results

CHMMs were used for modeling three different behaviors: meet and continue together (interaction 3), meet and split (interaction 2), and follow (interaction 1). In addition, an interaction versus no-interaction detection test was also performed. HMMs performed much worse than CHMMs and, therefore, we omit reporting their results.

We used models trained with two types of data:

1. Prior-only (synthetic data) models: that is, the behavior models learned in our synthetic agent environment and then directly applied to the real data with no additional training or tuning of the parameters.

2. Posterior (synthetic-plus-real data) models: new behavior models trained by using as starting points the synthetic best models. We used eight examples of each interaction from the specific site.

Recognition accuracies for both these "prior" and "posterior" CHMMs are summarized in Table 2. It is noteworthy that, with only eight training examples, the recognition accuracy on the real data could be raised to 100 percent. This result demonstrates the ability to accomplish extremely rapid refinement of our behavior models from the initial prior models.

Finally, the ROC curve for the posterior CHMMs is displayed in Fig. 11.

One of the most interesting results from these experiments is the high accuracy obtained when testing the a priori models obtained from synthetic agent simulations. The fact that the a priori models transfer so well to real data demonstrates the robustness of the approach. It shows that, with our synthetic agent training system, we can develop models of many different types of behavior, thus avoiding the problem of a limited amount of training data, and apply these models to real human behaviors without additional parameter tuning or training.

6.2.3 Parameter Sensitivity

In order to evaluate the sensitivity of our classification accuracy to variations in the model parameters, we trained a set of models where we changed different parameters of the agents' dynamics by factors of 2.5 and 5. The performance of these altered models turned out to be virtually the same in every case except for the "inter1" (follow) interaction, which seems to be sensitive to people's velocities: "inter1" was correctly recognized only when the agents' speeds were within the range of normal (average) pedestrian walking speeds.

Fig. 10. Example trajectories and feature vector for interaction 2, or approach, meet, and continue separately behavior.
TABLE 2
Accuracy for Both Untuned, a Priori Models and Site-Specific CHMMs Tested on Real Pedestrian Data

The first entry in each column is the interaction versus no-interaction accuracy; the remaining entries are classification accuracies between the different interacting behaviors. Interactions are: "Inter1" follow, reach, and walk together; "Inter2" approach, meet, and go on; "Inter3" approach, meet, and continue together.

Fig. 11. ROC curve for real pedestrian data.

7 SUMMARY AND CONCLUSIONS

In this paper, we have described a computer vision system and a mathematical modeling framework for recognizing different human behaviors and interactions in a visual surveillance task. Our system combines top-down with bottom-up information in a closed feedback loop, with both components employing a statistical Bayesian approach.

Two different state-based statistical learning architectures, namely, HMMs and CHMMs, have been proposed and compared for modeling behaviors and interactions. The superiority of the CHMM formulation has been demonstrated in terms of both training efficiency and classification accuracy. A synthetic agent training system has been created in order to develop flexible and interpretable prior behavior models, and we have demonstrated the ability to use these a priori models to accurately classify real behaviors with no additional tuning or training. This fact is especially important, given the limited amount of training data available.

The presented CHMM framework is not limited to only two interacting processes. Interactions between more than two people could potentially be modeled and recognized.

APPENDIX
FORWARD (α) AND BACKWARD (β) EXPRESSIONS FOR CHMMS

In [14], a deterministic approximation for maximum a posteriori (MAP) state estimation is introduced. It enables fast classification and parameter estimation via expectation maximization and also obtains an upper bound on the cross entropy with the full (combinatoric) posterior, which can be minimized using a subspace that is linear in the number of state variables. An "N-heads" dynamic programming algorithm samples from the O(N) highest probability paths through a compacted state trellis, with complexity O(T(CN)²) for C chains of N states apiece observing T data points. For interesting cases with limited couplings, the complexity falls further to O(TCN²).

For HMMs, the forward-backward or Baum-Welch algorithm provides expressions for the α and β variables, whose product leads to the likelihood of a sequence at each instant of time. In the case of CHMMs, two state paths have to be followed over time for each chain: one path corresponds to the "head" (represented with subscript "h") and another corresponds to the "sidekick" (indicated with subscript "k") of this head. Therefore, in the new forward-backward algorithm, the expressions for computing the α and β variables will incorporate the probabilities of the head and sidekick for each chain (the second chain is indicated with ′). As an illustration of the effect of maintaining multiple paths per chain, the traditional expression for the α variable in a single HMM,

    α_{i,t} = p_i(o_t) Σ_{j=1}^{N} P_{i|j} α_{j,t-1},    (3)

will be transformed into a pair of equations, one for the full posterior α* and another for the marginalized posterior α:

    α*_{i,t} = p_i(o_t) p_{k′_{i,t}}(o′_t) Σ_j P_{i|h_{j,t-1}} P_{i|k′_{j,t-1}} P_{k′_{i,t}|h_{j,t-1}} P_{k′_{i,t}|k′_{j,t-1}} α*_{j,t-1}    (4)

    α_{i,t} = p_i(o_t) Σ_j P_{i|h_{j,t-1}} P_{i|k′_{j,t-1}} Σ_g p_{k′_{g,t}}(o′_t) P_{k′_{g,t}|h_{j,t-1}} P_{k′_{g,t}|k′_{j,t-1}} α_{j,t-1}.    (5)

The β variable can be computed in a similar way by tracing back through the paths selected by the forward analysis. After collecting statistics using N-heads dynamic programming, the transition matrices within chains are reestimated according to the conventional HMM expression. The coupling matrices are given by:

    P(s′_t = i, s_{t-1} = j | O) = α_{j,t-1} P_{i′|j} p_{s′_t=i}(o′_t) β_{i′,t} / P(O)    (6)

    P̂_{i′|j} = [ Σ_{t=2}^{T} P(s′_t = i, s_{t-1} = j | O) ] / [ Σ_{t=2}^{T} α_{j,t-1} β_{j,t-1} ].    (7)
instant of time. In the case of CHMMs, two state-paths have
ACKNOWLEDGMENTS
to be followed over time for each chain: one path
corresponds to the ªheadº (represented with subscript The authors would like to thank Michael Jordan, Tony
ªhº) and another corresponds to the ªsidekickº (indicated Jebara, and Matthew Brand for their inestimable help and
with subscript ªkº) of this head. Therefore, in the new insightful comments.
OLIVER ET AL.: A BAYESIAN COMPUTER VISION SYSTEM FOR MODELING HUMAN INTERACTIONS 843

REFERENCES

[1] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian Computer Vision System for Modeling Human Interactions," Proc. Int'l Conf. Vision Systems, 1999.
[2] N. Oliver, "Towards Perceptual Intelligence: Statistical Modeling of Human Individual and Interactive Behaviors," PhD thesis, Massachusetts Institute of Technology (MIT), Media Lab, Cambridge, Mass., 2000.
[3] T. Darrell and A. Pentland, "Active Gesture Recognition Using Partially Observable Markov Decision Processes," Int'l Conf. Pattern Recognition, vol. 5, p. C9E, 1996.
[4] A.F. Bobick, "Computers Seeing Action," Proc. British Machine Vision Conf., vol. 1, pp. 13-22, 1996.
[5] A. Pentland and A. Liu, "Modeling and Prediction of Human Behavior," Defense Advanced Research Projects Agency, pp. 201-206, 1997.
[6] H. Buxton and S. Gong, "Advanced Visual Surveillance Using Bayesian Networks," Int'l Conf. Computer Vision, June 1995.
[7] H.H. Nagel, "From Image Sequences Toward Conceptual Descriptions," IVC, vol. 6, no. 2, pp. 59-74, May 1988.
[8] T. Huang, D. Koller, J. Malik, G. Ogasawara, B. Rao, S. Russel, and J. Weber, "Automatic Symbolic Traffic Scene Analysis Using Belief Networks," Proc. 12th Nat'l Conf. Artificial Intelligence, pp. 966-972, 1994.
[9] C. Castel, L. Chaudron, and C. Tessier, "What is Going On? A High Level Interpretation of Sequences of Images," Proc. Workshop on Conceptual Descriptions from Images, European Conf. Computer Vision, pp. 13-27, 1996.
[10] J.H. Fernyhough, A.G. Cohn, and D.C. Hogg, "Building Qualitative Event Models Automatically from Visual Input," Proc. Int'l Conf. Computer Vision, pp. 350-355, 1998.
[11] W.L. Buntine, "Operations for Learning with Graphical Models," J. Artificial Intelligence Research, 1994.
[12] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-285, 1989.
[13] M. Brand, N. Oliver, and A. Pentland, "Coupled Hidden Markov Models for Complex Action Recognition," Proc. IEEE Computer Vision and Pattern Recognition, 1996.
[14] M. Brand, "Coupled Hidden Markov Models for Modeling Interacting Processes," Neural Computation, Nov. 1996.
[15] N. Oliver, B. Rosario, and A. Pentland, "Graphical Models for Recognizing Human Interactions," Proc. Neural Information Processing Systems, Nov. 1998.
[16] N. Oliver, B. Rosario, and A. Pentland, "A Synthetic Agent System for Modeling Human Interactions," Technical Report, Vision and Modeling Media Lab, MIT, Cambridge, Mass., 1998. http://whitechapel.media.mit.edu/pub/tech-reports.
[17] B. Rosario, N. Oliver, and A. Pentland, "A Synthetic Agent System for Modeling Human Interactions," Proc. AA, 1999.
[18] R.K. Bajcsy, "Active Perception vs. Passive Perception," Proc. CASE Vendor's Workshop, pp. 55-62, 1985.
[19] A. Pentland, "Classification by Clustering," Proc. IEEE Symp. Machine Processing and Remotely Sensed Data, 1976.
[20] R. Kauth, A. Pentland, and G. Thomas, "Blob: An Unsupervised Clustering Approach to Spatial Preprocessing of MSS Imagery," 11th Int'l Symp. Remote Sensing of the Environment, 1977.
[21] A. Bobick and R. Bolles, "The Representation Space Paradigm of Concurrent Evolving Object Descriptions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 146-156, Feb. 1992.
[22] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-Time Tracking of the Human Body," Photonics East, SPIE, vol. 2615, 1995.
[23] N. Oliver, F. Bérard, and A. Pentland, "Lafter: Lips and Face Tracking," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition (CVPR '97), June 1997.
[24] B. Moghaddam and A. Pentland, "Probabilistic Visual Learning for Object Detection," Proc. Int'l Conf. Computer Vision, pp. 786-793, 1995.
[25] C.R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-Time Tracking of the Human Body," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780-785, July 1997.
[26] W.L. Buntine, "A Guide to the Literature on Learning Probabilistic Networks from Data," IEEE Trans. Knowledge and Data Engineering, 1996.
[27] D. Heckerman, "A Tutorial on Learning with Bayesian Networks," Technical Report MSR-TR-95-06, Microsoft Research, Redmond, Wash., 1995, revised June 1996.
[28] L.K. Saul and M.I. Jordan, "Boltzmann Chains and Hidden Markov Models," Proc. Neural Information Processing Systems, G. Tesauro, D.S. Touretzky, and T.K. Leen, eds., vol. 7, 1995.
[29] Z. Ghahramani and M.I. Jordan, "Factorial Hidden Markov Models," Proc. Neural Information Processing Systems, D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, eds., vol. 8, 1996.
[30] P. Smyth, D. Heckerman, and M. Jordan, "Probabilistic Independence Networks for Hidden Markov Probability Models," AI memo 1565, MIT, Cambridge, Mass., Feb. 1996.
[31] C. Williams and G.E. Hinton, "Mean Field Networks That Learn to Discriminate Temporally Distorted Strings," Proc. Connectionist Models Summer School, pp. 18-22, 1990.
[32] D. Stork and M. Hennecke, "Speechreading: An Overview of Image Processing, Feature Extraction, Sensory Integration and Pattern Recognition Techniques," Proc. Int'l Conf. Automatic Face and Gesture Recognition, 1996.
[33] M.I. Jordan, Z. Ghahramani, and L.K. Saul, "Hidden Markov Decision Trees," Proc. Neural Information Processing Systems, D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, eds., vol. 8, 1996.
[34] F.V. Jensen, S.L. Lauritzen, and K.G. Olesen, "Bayesian Updating in Recursive Graphical Models by Local Computations," Computational Statistical Quarterly, vol. 4, pp. 269-282, 1990.

Nuria M. Oliver received the BSc (honors) and MSc degrees in electrical engineering and computer science from ETSIT at the Universidad Politecnica of Madrid (UPM), Spain, 1994. She received the PhD degree in media arts and sciences from the Massachusetts Institute of Technology (MIT), Cambridge, in June 2000. Currently, she is a researcher at Microsoft Research, working in the Adaptive Systems and Interfaces Group. Previous to that, she was a researcher in the Vision and Modeling Group at the Media Laboratory of MIT, where she worked with Professor Alex Pentland. Before starting her PhD at MIT, she worked as a research engineer at Telefonica I+D. Her research interests are computer vision, statistical machine learning, artificial intelligence, and human computer interaction. Currently, she is working in these disciplines in order to build computational models of human behavior via perceptually intelligent systems.

Barbara Rosario was a visiting researcher in the Vision and Modeling Group at the Media Laboratory of the Massachusetts Institute of Technology. Currently, she is a graduate student in the School of Information and Management Systems (SIMS) at the University of California, Berkeley.

Alex P. Pentland is the academic head of the MIT Media Laboratory. He is also the Toshiba professor of media arts and sciences, an endowed chair last held by Marvin Minsky. His recent research focus includes understanding human behavior in video, including face, expression, gesture, and intention recognition, as described in the April 1996 issue of Scientific American. He is also one of the pioneers of wearable computing, a founder of the IEEE wearable computer technical area, and general chair of the upcoming IEEE International Symposium on Wearable Computing. He has won awards from the AAAI, IEEE, and Ars Electronica. He is a senior member of the IEEE.