A Bayesian Computer Vision System for Modeling Human Interactions
Abstract: We describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in a visual surveillance task [1]. The system is particularly concerned with detecting when interactions between people occur and classifying the type of interaction. Examples of interesting interaction behaviors include following another person, altering one's path to meet another, and so forth. Our system combines top-down with bottom-up information in a closed feedback loop, with both components employing a statistical Bayesian approach [2]. We propose and compare two different state-based learning architectures, namely, HMMs and CHMMs, for modeling behaviors and interactions. The CHMM model is shown to work much more efficiently and accurately. Finally, to deal with the problem of limited training data, a synthetic "Alife-style" training system is used to develop flexible prior models for recognizing human interactions. We demonstrate the ability to use these a priori models to accurately classify real human behaviors and interactions with no additional tuning or training.

Index Terms: Visual surveillance, people detection, tracking, human behavior recognition, Hidden Markov Models.
1 INTRODUCTION
To specify the priors in our system, we have developed a framework for building and training models of the behaviors of interest using synthetic agents [16], [17]. Simulation with the agents yields synthetic data that is used to train prior models. These prior models are then used recursively in a Bayesian framework to fit real behavioral data. This approach provides a rather straightforward and flexible technique for the design of priors, one that does not require strong analytical assumptions to be made about the form of the priors.1 In our experiments, we have found that by combining such synthetic priors with limited real data we can easily achieve very high accuracies of recognition of different human-to-human interactions. Thus, our system is robust to cases in which there are only a few examples of a certain behavior (such as in interaction type 2 described in Section 5) or even no examples except synthetically generated ones.

1. Note that our priors have the same form as our posteriors, namely, they are Markov models.

The paper is structured as follows: Section 2 presents an overview of the system, Section 3 describes the computer vision techniques used for segmentation and tracking of the pedestrians, and the statistical models used for behavior modeling and recognition are described in Section 4. A brief description of the synthetic agent environment that we have created is given in Section 5. Section 6 contains experimental results with both synthetic agent data and real video data, and Section 7 summarizes the main conclusions and sketches our future directions of research. Finally, a summary of the CHMM formulation is presented in the Appendix.

2 SYSTEM OVERVIEW

Our system employs a static camera with wide field-of-view watching a dynamic outdoor scene (the extension to an active camera [18] is straightforward and planned for the next version). A real-time computer vision system segments moving objects from the learned scene. The scene description method allows variations in lighting, weather, etc., to be learned and accurately discounted. For each moving object an appearance-based description is generated, allowing it to be tracked through temporary occlusions and multiobject meetings. A Kalman filter tracks the objects' location, coarse shape, color pattern, and velocity. This temporally ordered stream of data is then used to obtain a behavioral description of each object and to detect interactions between objects.

Fig. 1 depicts the processing loop and main functional units of our ultimate system.

1. The real-time computer vision input module detects and tracks moving objects in the scene and, for each moving object, outputs a feature vector describing its motion and heading, and its spatial relationship to all nearby moving objects.
2. These feature vectors constitute the input to stochastic state-based behavior models. Both HMMs and CHMMs, with varying structures depending on the complexity of the behavior, are then used for classifying the perceived behaviors.

Note that both top-down and bottom-up streams of information are continuously managed and combined for each moving object within the scene. Consequently, our Bayesian approach offers a mathematical framework for combining the observations (bottom-up) with complex behavioral priors (top-down) to provide expectations that will be fed back to the perceptual system.
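To make the Kalman-filter tracking mentioned above concrete, the sketch below shows one plausible way to maintain a per-object track with a constant-velocity filter over the blob centroid. The state layout, noise values, and class name are illustrative assumptions, not the paper's implementation, which also tracks coarse shape and color pattern.

    import numpy as np

    class BlobTrack:
        """Constant-velocity Kalman filter over a blob centroid (x, y, vx, vy)."""

        def __init__(self, x, y, dt=1.0):
            self.state = np.array([x, y, 0.0, 0.0])            # position and velocity
            self.P = np.eye(4) * 10.0                          # state covariance
            self.F = np.array([[1, 0, dt, 0],                  # constant-velocity dynamics
                               [0, 1, 0, dt],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]], dtype=float)
            self.H = np.array([[1, 0, 0, 0],                   # only the centroid is observed
                               [0, 1, 0, 0]], dtype=float)
            self.Q = np.eye(4) * 0.1                           # process noise (assumed)
            self.R = np.eye(2) * 1.0                           # measurement noise (assumed)

        def predict(self):
            self.state = self.F @ self.state
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.state[:2]                              # predicted centroid

        def update(self, measured_xy):
            z = np.asarray(measured_xy, dtype=float)
            y = z - self.H @ self.state                        # innovation
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
            self.state = self.state + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P

A higher-dimensional state would be needed to also track the coarse shape and color pattern described above.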
3 SEGMENTATION AND TRACKING

The first step in the system is to reliably and robustly detect and track the pedestrians in the scene. We use 2D blob features for modeling each pedestrian. The notion of "blobs" as a representation for image features has a long history in computer vision [19], [20], [21], [22], [23] and has had many different mathematical definitions. In our usage, it is a compact set of pixels that share some visual properties that are not shared by the surrounding pixels. These properties could be color, texture, brightness, motion, shading, a combination of these, or any other salient spatio-temporal property derived from the signal (the image sequence).
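As an illustration of how such blobs might be extracted (a minimal sketch, not the paper's actual segmentation code, which learns a richer scene model), the following compares each grayscale frame against a running background mean and groups the changed pixels into connected components:

    import numpy as np
    from scipy import ndimage

    def update_background(background, frame, alpha=0.05):
        """Running mean of the scene; slow adaptation absorbs lighting changes."""
        return (1.0 - alpha) * background + alpha * frame

    def extract_blobs(frame, background, threshold=25.0, min_pixels=50):
        """Return a list of (centroid, bounding_box, mask) for each moving blob."""
        diff = np.abs(frame.astype(float) - background)
        foreground = diff > threshold                      # pixels that differ from the scene model
        labels, n = ndimage.label(foreground)              # connected components = candidate blobs
        blobs = []
        for k in range(1, n + 1):
            mask = labels == k
            if mask.sum() < min_pixels:                    # discard noise blobs
                continue
            ys, xs = np.nonzero(mask)
            centroid = (xs.mean(), ys.mean())
            bbox = (xs.min(), ys.min(), xs.max(), ys.max())
            blobs.append((centroid, bbox, mask))
        return blobs

The threshold and minimum-size values are placeholders; the paper's blobs additionally carry color and shape statistics.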
Fig. 2. Background mean image, blob segmentation image, and input image with blob bounding boxes.
and interact by influencing each other's states. One example is the sensor fusion problem: Multiple channels carry complementary information about different components of a system, e.g., acoustical signals from speech and visual features from lip tracking [32]. In [29], a generalization of HMMs with coupling at the outputs is presented. These are Factorial HMMs (FHMMs), where the state variable is factored into multiple state variables. They have a clear representational advantage over HMMs: to model C processes, each with N states, a single HMM would require N^C joint states, typically intractable in both space and time. FHMMs are tractable in space, taking NC states, but present an inference problem equivalent to that of a combinatoric HMM. Therefore, exact solutions are intractable in time. The authors present tractable approximations using Gibbs sampling, mean field theory, or structured mean field.

2. Coupling the states. In [28], a statistical mechanical framework for modeling discrete time series is presented. The authors couple two HMMs to exploit the correlation between feature sets. Two parallel Boltzmann chains are coupled by weights that connect their hidden units, shown in Fig. 5 as Linked HMMs (LHMMs). Like the transition and emission weights within each chain, the coupling weights are tied across the length of the network. The independence structure of such an architecture is suitable for expressing symmetrical synchronous constraints, long-term dependencies between hidden states, or processes that are coupled at different time scales. Their algorithm is based on decimation, a method from statistical mechanics in which the marginal distributions of singly or doubly connected nodes are integrated out. A limited class of graphs can be recursively decimated, obtaining correlations for any connected pair of nodes.

Finally, Hidden Markov Decision Trees (HMDTs) [33] are decision trees with Markov temporal structure (see Fig. 5). The model is intractable for exact calculations. Thus, the authors use variational approximations. They consider three distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. The underlying independence structure is suitable for representing hierarchical structure in a signal; for example, the baseline of a song constrains the melody and both constrain the harmony.
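As a concrete illustration of the state-space argument above (an example of ours, with numbers not taken from the paper):

    # Joint vs. factored state spaces for C processes with N states each.
    C, N = 3, 10
    joint_states = N ** C          # a single HMM over the product space: 1,000 joint states
    factored_state_vars = N * C    # a factorial/coupled representation keeps only 30 state values
    print(joint_states, factored_state_vars)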
We use two CHMMs for modeling two interacting processes; in our case, they correspond to individual humans. In this architecture, state chains are coupled via matrices of conditional probabilities modeling causal (temporal) influences between their hidden state variables. The graphical representation of CHMMs is shown in Fig. 4. Exact maximum a posteriori (MAP) inference is an O(TN^4) computation [34], [30]. We have developed a deterministic O(TN^2) algorithm for maximum entropy approximations to state and parameter values in CHMMs. From the graph it can be seen that, for each chain, the state at time t depends on the state at time t-1 in both chains. The influence of one chain on the other is through a causal link. The Appendix contains a summary of the CHMM formulation.

In this paper, we compare the performance of HMMs and CHMMs for maximum a posteriori (MAP) state estimation. We compute the most likely sequence of states Ŝ within a model given the observation sequence O = {o_1, ..., o_n}. This most likely sequence is obtained by Ŝ = argmax_S P(S|O).

In the case of HMMs, the posterior state sequence probability P(S|O) is given by

P(S|O) = \frac{P_{s_1}\, p_{s_1}(o_1) \prod_{t=2}^{T} p_{s_t}(o_t)\, P_{s_t|s_{t-1}}}{P(O)},    (1)

where S = {a_1, ..., a_N} is the set of discrete states and s_t ∈ S corresponds to the state at time t. P_{i|j} = P(s_t = a_i | s_{t-1} = a_j) is the state-to-state transition probability (i.e., the probability of being in state a_i at time t given that the system was in state a_j at time t-1). In the following, we will write them as P_{s_t|s_{t-1}}. The prior probabilities for the initial state are P_i = P(s_1 = a_i) = P_{s_1}. Finally, p_i(o_t) = p_{s_t = a_i}(o_t) = p_{s_t}(o_t) are the output probabilities for each state (i.e., the probability of observing o_t given state a_i at time t).
In the case of CHMMs, we introduce another set of probabilities, P_{s_t|s'_{t-1}}, which correspond to the probability of state s_t at time t in one chain given that the other chain (denoted hereafter by the superscript ') was in state s'_{t-1} at time t-1. These new probabilities express the causal influence (coupling) of one chain to the other. The posterior state probability for CHMMs is given by
P(S|O) = \frac{P_{s_1}\, p_{s_1}(o_1)\, P_{s'_1}\, p_{s'_1}(o'_1) \prod_{t=2}^{T} P_{s_t|s_{t-1}}\, P_{s'_t|s'_{t-1}}\, P_{s'_t|s_{t-1}}\, P_{s_t|s'_{t-1}}\, p_{s_t}(o_t)\, p_{s'_t}(o'_t)}{P(O)},    (2)

where s_t, s'_t, o_t, o'_t denote the states and observations for each of the Markov chains that compose the CHMMs. A coupled HMM of C chains has a joint state trellis that is in principle N^C states wide; the associated dynamic programming problem is O(TN^{2C}). In [14], an approximation is developed using N-heads dynamic programming such that an O(T(CN)^2) algorithm is obtained that closely approximates the full combinatoric result.

Coming back to our problem of modeling human behaviors, two persons (each modeled as a generative process) may interact without wholly determining each other's behavior. Instead, each of them has its own internal dynamics and is influenced (either weakly or strongly) by others. The probabilities P_{s_t|s'_{t-1}} and P_{s'_t|s_{t-1}} describe this kind of interaction, and CHMMs are intended to model it in as efficient a manner as possible.
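The numerator of (2) is straightforward to evaluate for a candidate pair of state sequences; the sketch below does so in log space. Array names, shapes, and the matrix conventions are illustrative assumptions rather than the paper's notation.

    import numpy as np

    def chmm_log_score(s, s2, log_pi, log_pi2, log_A, log_A2, log_C12, log_C21, log_B, log_B2):
        """Unnormalized log posterior (numerator of Eq. (2)) of a pair of state paths.

        s, s2:         (T,) state sequences for chain 1 and chain 2.
        log_pi*:       (N,) log initial-state priors per chain.
        log_A[j, i]:   log P(s_t = i | s_{t-1} = j), within-chain transitions (log_A2 for chain 2).
        log_C12[j, i]: log P(s'_t = i | s_{t-1} = j), coupling from chain 1 into chain 2.
        log_C21[j, i]: log P(s_t = i | s'_{t-1} = j), coupling from chain 2 into chain 1.
        log_B[t, i]:   log p_i(o_t) for chain 1 (log_B2 likewise for chain 2).
        """
        T = len(s)
        score = log_pi[s[0]] + log_B[0, s[0]] + log_pi2[s2[0]] + log_B2[0, s2[0]]
        for t in range(1, T):
            score += log_A[s[t - 1], s[t]] + log_A2[s2[t - 1], s2[t]]      # within-chain terms
            score += log_C12[s[t - 1], s2[t]] + log_C21[s2[t - 1], s[t]]   # coupling terms
            score += log_B[t, s[t]] + log_B2[t, s2[t]]                     # emission terms
        return score

Dividing by P(O), or normalizing over all state-pair paths, would give the posterior of (2) itself.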
5 SYNTHETIC BEHAVIORAL AGENTS

We have developed a framework for creating synthetic agents that mimic human behavior in a virtual environment [16], [17]. The agents can be assigned different behaviors and they can interact with each other as well. Currently, they can generate five different interacting behaviors and various kinds of individual behaviors (with no interaction). The parameters of this virtual environment are modeled on the basis of a real pedestrian scene from which we obtained measurements of typical pedestrian movement.

One of the main motivations for constructing such synthetic agents is the ability to generate synthetic data which allows us to determine which Markov model architecture will be best for recognizing a new behavior (since it is difficult to collect real examples of rare behaviors). By designing the synthetic agent models such that they have the best generalization and invariance properties possible, we can obtain flexible prior models that are transferable to real human behaviors with little or no need of additional training. The use of synthetic agents to generate robust behavior models from very few real behavior examples is of special importance in a visual surveillance task, where typically the behaviors of greatest interest are also the most rare.

5.1 Agent Architecture

Our dynamic multiagent system consists of some number of agents that perform some specific behavior from a set of possible behaviors. The system starts at time zero, moving discretely forward to time T or until the agents disappear from the scene.

The agents can follow three different paths with two possible directions, as illustrated in Figs. 6 and 7 by the yellow paths.2 They walk with random speeds within an interval; they appear at random instants of time. They can slow down, speed up, stop, or change direction independently from the other agents on the scene. Their velocity is normally distributed around a mean that increases or decreases when they slow down or speed up. When certain preconditions are satisfied, a specific interaction between two agents takes place. Each agent has perfect knowledge of the world, including the position of the other agents.

2. The three paths were obtained by statistical analysis of the most frequent paths that the pedestrians in the observed plaza followed. Note, however, that the performance of neither the computer vision nor the tracking modules is limited to these three paths.

In the following, we will describe, without loss of generality, the two-agent system that we used for generating prior models and synthetic data of agent interactions. Each agent makes its own decisions depending on the type of interaction, its location, and the location of the other agent on the scene. There is no scripted behavior or a priori knowledge of what kind of interaction, if any, is going to take place. The agents' behavior is determined by the perceived contextual information: current position, relative position of the other agent, speeds, paths they are in, directions of walk, etc., as well as by their own repertoire of possible behaviors and triggering events. For example, if one agent decides to "follow" the other agent, it will proceed on its own path, increasing its speed progressively until reaching the other agent, which will also be walking on the same path. Once the agent has been reached, they will adapt their mutual speeds in order to keep together and continue advancing together until exiting the scene.

For each agent, the position, orientation, and velocity are measured, and from this data a feature vector is constructed which consists of: ḋ_12, the derivative of the relative distance between the two agents; α_{1,2} = sign(⟨v_1, v_2⟩), or degree of alignment of the agents; and v_i = \sqrt{ẋ^2 + ẏ^2}, i = 1, 2, the magnitude of their velocities. Note that such a feature vector is invariant to the absolute position and direction of the agents and the particular environment they are in.
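A minimal computation of that feature vector from two sampled trajectories might look as follows; the function and variable names are assumptions, since the paper does not give code.

    import numpy as np

    def interaction_features(p1, p2, dt=1.0):
        """Per-frame features [d_dot_12, alignment, |v1|, |v2|] for two trajectories.

        p1, p2: (T, 2) arrays of (x, y) positions sampled every dt seconds.
        """
        v1 = np.gradient(p1, dt, axis=0)                 # velocities of agent 1
        v2 = np.gradient(p2, dt, axis=0)                 # velocities of agent 2
        d12 = np.linalg.norm(p1 - p2, axis=1)            # relative distance
        d_dot_12 = np.gradient(d12, dt)                  # its derivative
        alignment = np.sign(np.sum(v1 * v2, axis=1))     # sign of <v1, v2>
        speed1 = np.linalg.norm(v1, axis=1)              # |v_1| = sqrt(x_dot^2 + y_dot^2)
        speed2 = np.linalg.norm(v2, axis=1)
        return np.stack([d_dot_12, alignment, speed1, speed2], axis=1)

How these quantities are grouped into per-chain observation vectors is left open by this sketch.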
5.2 Agent Behaviors

The agent behavioral system is structured in a hierarchical way. There are primitive or simple behaviors and complex interactive behaviors to simulate the human interactions.

In the experiments reported in Section 6, we considered five different interacting behaviors, which are illustrated in Figs. 6 and 7:

1. Follow, reach, and walk together (inter1): The two agents happen to be on the same path walking in the same direction. The agent behind decides that it wants to reach the other. Therefore, it speeds up in order to reach the other agent. When this happens, it slows down such that they keep walking together with the same speed.
2. Approach, meet, and go on separately (inter2): The agents are on the same path, but in opposite directions. When they are close enough, if they realize that they "know" each other, they slow down and finally stop to chat. After talking, they go on separately, becoming independent again.
3. Approach, meet, and go on together (inter3): In this case, the agents behave like in "inter2," but now, after talking, they decide to continue together. One agent, therefore, changes its direction to follow the other.
4. Change direction in order to meet, approach, meet, and continue together (inter4): The agents start on different paths. When they are close enough, they can see each other and decide to interact. One agent waits for the other to reach it. The other changes direction in order to go toward the waiting agent. Then they meet, chat for some time, and decide to go on together.
5. Change direction in order to meet, approach, meet, and go on separately (inter5): This interaction is the same as "inter4" except that, when they decide to go on after talking, they separate, becoming independent.
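To make the simulation protocol of Sections 5.1 and 5.2 concrete, here is a highly simplified sketch of one time step of a two-agent world implementing only the "follow" part of inter1; the class names, thresholds, and path representation are invented for illustration and are not the paper's agent package.

    import numpy as np

    class Agent:
        def __init__(self, position, speed, path_direction):
            self.position = np.asarray(position, dtype=float)
            self.speed = float(speed)                                  # mean walking speed
            self.direction = np.asarray(path_direction, dtype=float)   # unit vector along the path

        def step(self, dt=1.0):
            # Speed fluctuates around its mean (normally distributed, as in Section 5.1).
            noisy_speed = max(0.0, np.random.normal(self.speed, 0.1 * self.speed))
            self.position = self.position + noisy_speed * dt * self.direction

    def follow_step(follower, leader, catch_up=1.3, close_enough=1.0, dt=1.0):
        """One simulation step of the 'follow, reach, and walk together' behavior."""
        gap = np.linalg.norm(leader.position - follower.position)
        if gap > close_enough:
            follower.speed = catch_up * leader.speed    # speed up until the other agent is reached
        else:
            follower.speed = leader.speed               # then match speeds and walk together
        follower.step(dt)
        leader.step(dt)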
Fig. 6. Example trajectories and feature vector for the interactions: follow, approach+meet+continue separately, and approach+meet+continue
together.
Proper design of the interactive behaviors requires the agents to have knowledge about the position of each other as well as synchronization between the successive individual behaviors activated in each of the agents. Fig. 8 illustrates the timeline and synchronization of the simple behaviors and events that constitute the interactions.

These interactions can happen at any moment in time and at any location, provided only that the preconditions for the interactions are satisfied. The speeds they walk at, the duration of their chats, the changes of direction, and the starting and ending of the actions vary highly. This high variance in the quantitative aspects of the interactions confers robustness on the learned models, which tend to capture only the invariant parts of the interactions.
Fig. 7. Example trajectories and feature vector for the interactions: change direction+approach+meet+continue separately, change
direction+approach+meet+continue together, and no interacting behavior.
The invariance reflects the nature of their interactions and the environment.

6 EXPERIMENTAL RESULTS

Our goal is to have a system that will accurately interpret behaviors and interactions within almost any pedestrian scene with little or no training. One critical problem, therefore, is the generation of models that capture our prior knowledge about human behavior. The selection of priors is one of the most controversial and open issues in Bayesian inference. As we have already described, we solve this problem by using a synthetic agent modeling package, which allows us to build flexible prior behavior models.

6.1 Comparison of CHMM and HMM Architectures with Synthetic Agent Data

We built models of the five previously described synthetic agent interactions with both CHMMs and HMMs. We used two or three states per chain in the case of CHMMs and three to five states in the case of HMMs (according to the complexity of the various interactions).
Fig. 8. Timeline of the five complex behaviors in terms of events and simple behaviors.
The particular number of states for each architecture was determined using 10 percent cross-validation. Because we used the same amount of data for training both architectures, we tried to keep the number of parameters to estimate roughly the same. For example, a three-state (N = 3) per-chain CHMM with three-dimensional (d = 3) Gaussian observations has (CN)^2 + N(d + d(d+1)/2) = (2·3)^2 + 3(3+6) = 36 + 27 = 63 parameters. A five-state (N = 5) HMM with six-dimensional (d = 6) Gaussian observations has N^2 + N(d + d(d+1)/2) = 5^2 + 5(3+6) = 25 + 45 = 70 parameters to estimate.

Each of these architectures corresponds to a different physical hypothesis: CHMMs encode a spatial coupling in time between two agents (e.g., a nonstationary process), whereas HMMs model the data as an isolated, stationary process.
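The state-count selection described above can be sketched as a simple held-out-likelihood loop; train_model and log_likelihood below are hypothetical stand-ins for whatever HMM or CHMM training and scoring routines are in use, and the 10 percent split mirrors the cross-validation mentioned in the text.

    import numpy as np

    def select_num_states(sequences, candidate_sizes, train_model, log_likelihood,
                          held_out_frac=0.10, seed=0):
        """Pick the number of states maximizing likelihood on a held-out 10 percent split."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(sequences))
        n_held_out = max(1, int(held_out_frac * len(sequences)))
        held_out = [sequences[i] for i in idx[:n_held_out]]
        training = [sequences[i] for i in idx[n_held_out:]]

        best_size, best_score = None, -np.inf
        for n_states in candidate_sizes:
            model = train_model(training, n_states)                      # e.g., EM training
            score = sum(log_likelihood(model, seq) for seq in held_out)  # held-out log likelihood
            if score > best_score:
                best_size, best_score = n_states, score
        return best_size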
Fig. 10. Example trajectories and feature vector for interaction 2, or approach, meet, and continue separately behavior.
TABLE 2
Accuracy for Both Untuned, a Priori Models, and Site-Specific
CHMMs Tested on Real Pedestrian Data
behaviors with no additional tuning or training. This fact is especially important, given the limited amount of training data available.

The presented CHMM framework is not limited to only two interacting processes. Interactions between more than

APPENDIX
FORWARD (α) AND BACKWARD (β) EXPRESSIONS FOR CHMMS

In [14], a deterministic approximation for maximum a posteriori (MAP) state estimation is introduced. It enables fast classification and parameter estimation via expectation maximization and also obtains an upper bound on the cross entropy with the full (combinatoric) posterior, which can be minimized using a subspace that is linear in the number of state variables. An "N-heads" dynamic programming algorithm samples from the O(N) highest probability paths through a compacted state trellis, with complexity O(T(CN)^2) for C chains of N states apiece observing T data points. For interesting cases with limited couplings, the complexity falls further to O(TCN^2).

For HMMs, the forward-backward or Baum-Welch algorithm provides expressions for the α and β variables, whose product leads to the likelihood of a sequence at each instant of time. In the case of CHMMs, two state paths have to be followed over time for each chain: one path corresponds to the "head" (represented with subscript "h") and another corresponds to the "sidekick" (indicated with subscript "k") of this head. Therefore, in the new formulation, the α variable will be transformed into a pair of equations, one for the full posterior and another for the marginalized posterior:

\alpha_{i,t} = p_i(o_t)\, p_{k'_{i,t}}(o_t) \sum_j P_{i|h_{j,t-1}}\, P_{i|k'_{j,t-1}}\, P_{k'_{i,t}|h_{j,t-1}}\, P_{k'_{i,t}|k_{j,t-1}}\, \alpha_{j,t-1}

\alpha_{i,t} = \sum_j p_i(o_t)\, P_{i|h_{j,t-1}}\, P_{i|k'_{j,t-1}} \left[\sum_g p_{k'_{g,t}}(o_t)\, P_{k'_{g,t}|h_{j,t-1}}\, P_{k'_{g,t}|k'_{j,t-1}}\right] \alpha_{j,t-1}.    (5)

The β variable can be computed in a similar way by tracing back through the paths selected by the forward analysis. After collecting statistics using N-heads dynamic programming, transition matrices within chains are reestimated according to the conventional HMM expression. The coupling matrices are given by

P(s'_t = i, s_{t-1} = j | O) = \frac{\alpha_{j,t-1}\, P_{i'|j}\, p_{s'_t = i}(o'_t)\, \beta_{i',t}}{P(O)}    (6)

\hat{P}_{i'|j} = \frac{\sum_{t=2}^{T} P(s'_t = i, s_{t-1} = j | O)}{\sum_{t=2}^{T} \alpha_{j,t-1}\, \beta_{j,t-1}}.    (7)

ACKNOWLEDGMENTS

The authors would like to thank Michael Jordan, Tony Jebara, and Matthew Brand for their inestimable help and insightful comments.