Real Time Face and Object Tracking as a Component of a Perceptual User Interface

Gary R. Bradski, Intel Corporation, Microcomputer Research Lab. [email protected]

Abstract
As a step towards a perceptual user interface, an object tracking algorithm is developed and demonstrated tracking human faces. Computer vision algorithms that are intended to form part of a perceptual user interface must be fast and efficient: they must be able to track in real time and yet not absorb a major share of computational resources. An efficient new algorithm, based on the mean shift algorithm, is described here. The mean shift algorithm robustly finds the mode (peak) of probability distributions. We first describe histogram-based methods of producing object probability distributions. In our case, we want to track the mode of an object's probability distribution within a video scene. Since the probability distribution of the object can change and move dynamically in time, the mean shift algorithm is modified to deal with dynamically changing probability distributions. The modified algorithm is called the Continuously Adaptive Mean Shift (CAMSHIFT) algorithm. CAMSHIFT is then used as an interface for games and graphics.

1. Introduction

This paper describes part of a larger program to develop a real-time Perceptual User Interface. We want to give computers the ability to segment, track, and understand the pose, gestures, and emotional expressions of humans and the tools they might be using in front of a computer or set-top box. In this paper we describe the development of some core modules in this effort: histogram-based methods of producing appearance-based object probability distributions; a statistically robust, dynamic probability distribution mode tracking algorithm; and a 4-degree-of-freedom color object tracker and its application to fleshtone-based face tracking.

CAMSHIFT tracks dynamically changing probability distributions. Many approaches for using histograms to identify visual objects have been suggested (for example, [7][10][14]). A general, histogram-based approach to probabilistic object recognition is described in Schiele and Crowley [11]. In Section 2 we describe how we use this approach to produce the probability distribution of a learned object in a visual scene.

The mean shift algorithm [6][2] is a statistically robust method for finding the mode (peak) of a probability distribution. In vision, mean shift has previously been used to segment image regions [3]. These applications of the mean shift algorithm have worked on fixed, static distributions. Distributions derived from objects in video frame sequences, however, can change and move from frame to frame. In a previous paper, we described a new algorithm, CAMSHIFT [1], that modifies the mean shift algorithm to deal with probability distributions that change and move dynamically in time. This turns the mean shift algorithm from a statistically robust mode-seeking algorithm into a dynamic mode tracker that is used to track objects in video frame sequences. Section 3 describes the CAMSHIFT algorithm. We want to develop tracking modules that can serve as part of a user interface that is in turn part of the computational tasks a computer might routinely be expected to carry out. This tracker also needs to run on inexpensive consumer cameras without calibrated lenses. In this paper, to address these speed and consumer-camera considerations, we focus on color-based tracking [4][8][13][14]. In Section 4, CAMSHIFT is analyzed for performance. Section 5 discusses considerations for, and examples of, the use of CAMSHIFT as part of a computer interface for controlling games and 3D graphics. We end with discussion and conclusions.

2. Object Probability Distributions


Our method for generating object probability distributions takes after that of Schiele and Crowley [11], who used histograms as a basis for Bayesian object identification. Our interest, however, is in generating probability distributions of the location of given objects, not just in their identification. We decide on a local measure set M (for example, local color correlations, image gradients, derivatives, etc.) to use on objects o_i. Taking a histogram of the object yields the distribution of measures over the object (equation 1), where R and T are the rotation and translation of the object. In this paper we use color histograms, which are relatively insensitive to rotation and translation, so we drop consideration of R and T. Normalizing by the input count then gives us the conditional probability of a measure given the object, p(m_k | o_i) (equation 2).

Equation 2 above is just the opposite of what we want. We want the probability of an object given its measure set, which we get by applying Bayes' rule. For a single measurement vector m_k at a point in the image we have

    p(o_i | m_k) = p(m_k | o_i) p(o_i) / Σ_j p(m_k | o_j) p(o_j).    (3)

Equation 3 is in fact what we use for developing a color probability distribution of an object at each pixel, with all object priors p(o_i) set equal to one another. Note here that a probability of 0.5 implies complete uncertainty. Also note that the denominator in equation 3 may be precalculated for additional speed after all objects are learned. In this paper, we create the color histogram by sampling hue in a Hue Saturation Value (HSV) color space [12]. Hue is ill-defined at low saturation and, for discrete systems, at low brightness, so we use a threshold to disregard hue pixels under these conditions. White colors can take on ambiguous hues at high brightness, so these are also thresholded out.

For more complicated measure sets and/or for more object discrimination, we can calculate the probability of many local measurement vectors over a region of the image (equation 4). Unfortunately, we do not know the joint probabilities p(m_1, ..., m_n | o_i) over an image region. But since our measure set is presumed to be made up of local measures, by spacing our measure points at least the distance of a local measure apart, we can assume our measures over a region are independent, yielding

    p(o_i | m_1, ..., m_n) = [Π_k p(m_k | o_i)] p(o_i) / Σ_j [Π_k p(m_k | o_j)] p(o_j).    (5)

Equation 5 is used to yield the probability of an object at a point in the image given a region around that point. Unless we have good knowledge of the operating environment, we generally assume the object priors p(o_i) to be equal to one another.

Once we calculate the probability at a point given a region, we can translate that region by a local measure distance, divide out the uncovered points, multiply in the newly covered points, and repeat in order to generate an object location probability distribution image. Experiments show that using measurement regions that cover as little as 13.5% of an object still yields above 90% object recognition rates [11]. Thus, small regions can be used to sweep out the object location probability image as described above. CAMSHIFT can then be applied to the resulting object probability distribution image derived from equation 3 or equation 5 to track the object as it moves in the video image frame. CAMSHIFT also returns an estimate of the object scale, which can be used to refine the image region to calculate over, as described in Section 3 below.
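To make the above concrete, the following is a minimal sketch of how a learned hue histogram can be back-projected into a per-pixel object probability image of the kind CAMSHIFT tracks. It is an illustration only, assuming the OpenCV and NumPy libraries are available; the function names and the saturation/brightness thresholds are ours, not values from the paper.

    import cv2

    # Assumed thresholds below which hue is considered unreliable (see text above).
    S_MIN, V_MIN, V_MAX = 32, 32, 240

    def learn_hue_histogram(bgr_patch, bins=32):
        """Histogram of hue over a sample of the object (for example, a face patch)."""
        hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, (0, S_MIN, V_MIN), (180, 255, V_MAX))
        hist = cv2.calcHist([hsv], [0], mask, [bins], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)  # back-projection will be 0..255
        return hist

    def hue_backprojection(bgr_frame, hist):
        """Per-pixel object probability image (scaled 0..255) for a new frame."""
        hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
        prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        mask = cv2.inRange(hsv, (0, S_MIN, V_MIN), (180, 255, V_MAX))
        return cv2.bitwise_and(prob, prob, mask=mask)  # zero out unreliable-hue pixels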

3. CAMSHIFT Derivation
CAMSHIFT tracks objects using a probability distribution of the object in a video scene. The closest existing algorithm to CAMSHIFT is the mean shift algorithm [6][2][3]. The mean shift algorithm is a non-parametric technique that climbs the gradient of a probability distribution to find the nearest dominant mode (peak).

To calculate the mean shift algorithm:
1. Choose a search window size.
2. Choose the initial location of the search window.
3. Compute the mean location in the search window.
4. Center the search window at the mean location computed in Step 3.
5. Repeat Steps 3 and 4 until convergence.
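The five steps above translate directly into a short loop over a 2D probability image. The following is a minimal sketch in Python/NumPy, our own illustration rather than code from the paper; prob is the object probability image and (cx, cy) is the initial window center.

    import numpy as np

    def mean_shift(prob, cx, cy, w, h, max_iter=20, eps=1.0):
        """Steps 1-5: shift a w-by-h window over probability image `prob`
        until its center converges on the local mode."""
        rows, cols = prob.shape
        for _ in range(max_iter):
            x0, x1 = max(0, int(cx - w // 2)), min(cols, int(cx + w // 2) + 1)
            y0, y1 = max(0, int(cy - h // 2)), min(rows, int(cy + h // 2) + 1)
            window = prob[y0:y1, x0:x1]
            m00 = window.sum()
            if m00 <= 0:
                break                          # empty window: nothing to track
            ys, xs = np.mgrid[y0:y1, x0:x1]
            new_cx = (xs * window).sum() / m00  # Step 3: mean location in the window
            new_cy = (ys * window).sum() / m00
            if abs(new_cx - cx) < eps and abs(new_cy - cy) < eps:
                cx, cy = new_cx, new_cy
                break                          # Step 5: converged
            cx, cy = new_cx, new_cy            # Step 4: re-center the window
        return cx, cy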

3.1. Proof of Convergence [1]


Assuming a Euclidean distribution space containing the distribution f, the proof is as follows, reflecting the steps above:
1. A window W is chosen at size s.
2. The initial search window is centered at data point p_k.
3. Compute the mean position within the search window,

    p_s(W) = Σ_{p_i in W} p_i f(p_i) / Σ_{p_i in W} f(p_i).

The mean shift climbs the gradient of f(p).
4. Center the window at the point p_s(W).
5. Repeat Steps 3 and 4. Near the mode, f'(p) ≈ 0, so the mean shift algorithm converges there.


The mean shift algorithm above is designed for static distributions and so fails as a tracker. Object distributions in video scenes can change with time as the object moves with respect to the camera. Small fixed-size windows may lose the object entirely for large object translations in the scene. Large fixed-size windows may include distractors and too much noise. CAMSHIFT is designed to handle dynamically changing distributions by adjusting its search window size according to the distribution mass, or area, it finds under its window. This can save computation, since we needn't calculate the object's probability distribution over the whole image, but can instead restrict the calculation of the distribution to a smaller image region surrounding the current CAMSHIFT window.
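One way to realize this saving is sketched below: the probability image is computed only inside a padded region of interest around the current search window, clamped to the image. This is our own illustration; the padding factor is an assumed value, not one given in the paper.

    def calc_region(search_x, search_y, search_w, search_h, img_w, img_h, pad=1.5):
        """Return an ROI centered on the search window but `pad` times larger,
        clamped to the image, over which the probability image is computed."""
        rw, rh = int(search_w * pad), int(search_h * pad)
        x0 = max(0, int(search_x - rw // 2))
        y0 = max(0, int(search_y - rh // 2))
        x1 = min(img_w, x0 + rw)
        y1 = min(img_h, y0 + rh)
        return x0, y0, x1, y1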

3.2. CAMSHIFT Equations

For 2D image probability distributions, the mean location (the centroid) within the search window (Steps 3 and 4 above) is found as follows. Find the zeroth moment (distribution area)

    M00 = Σ_x Σ_y I(x,y),    (6)

and the first moments for x and y

    M10 = Σ_x Σ_y x I(x,y);  M01 = Σ_x Σ_y y I(x,y).    (7)

Then the mean search window location (the centroid) is

    x_c = M10 / M00;  y_c = M01 / M00,    (8)

where I(x,y) is the pixel (probability) value at position (x,y) in the image, and x and y range over the search window.

The 2D orientation of the probability distribution is also easy to obtain by using the second moments during the course of CAMSHIFT's operation. The second moments are

    M20 = Σ_x Σ_y x^2 I(x,y);  M11 = Σ_x Σ_y x y I(x,y);  M02 = Σ_x Σ_y y^2 I(x,y).    (9)

Then the object orientation (major axis) is

    θ = (1/2) arctan( 2 (M11/M00 - x_c y_c) / ((M20/M00 - x_c^2) - (M02/M00 - y_c^2)) ).    (10)

The length and width of the probability distribution blob found by CAMSHIFT may be calculated in closed form as follows [5]. Let

    a = M20/M00 - x_c^2,
    b = 2 (M11/M00 - x_c y_c),
    c = M02/M00 - y_c^2.

Then length l and width w are

    l = sqrt( ((a + c) + sqrt(b^2 + (a - c)^2)) / 2 ),    (11)
    w = sqrt( ((a + c) - sqrt(b^2 + (a - c)^2)) / 2 ).    (12)

When used in face tracking, the above equations give us head roll, length, and width as shown in Figure 1.

Figure 1: Orientation of the flesh probability distribution marked on the source video image.

3.3. CAMSHIFT Flow Chart

Figure 2 shows the CAMSHIFT algorithm. For 2D color probability distributions where the maximum pixel value is 255, we adapt the search window size using the zeroth moment: we divide the area M00 by 256 to convert it to units of pixels, and take the square root to convert the resulting 2D region to a 1D window side length s. In practice, for tracking faces, we set the window width to 2s and the window length to 2.4s, since faces are elliptical.
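The moment computations of Section 3.2 and the window-sizing rule above can be gathered into a single routine. The following NumPy sketch is our own illustration (variable and function names are ours, not from the paper): it computes the moments over the probability pixels under the current search window and returns the centroid, orientation, length, width, and the next window width and length.

    import numpy as np

    def window_statistics(prob_window, x0, y0):
        """Moments of the probability pixels in the search window whose top-left
        corner is at (x0, y0); returns centroid, orientation, length, width, size."""
        h, w = prob_window.shape
        ys, xs = np.mgrid[y0:y0 + h, x0:x0 + w]
        m00 = prob_window.sum()
        if m00 <= 0:
            return None
        xc = (xs * prob_window).sum() / m00             # centroid, eq. (8)
        yc = (ys * prob_window).sum() / m00
        a = (xs**2 * prob_window).sum() / m00 - xc**2   # intermediate terms a, b, c
        b = 2 * ((xs * ys * prob_window).sum() / m00 - xc * yc)
        c = (ys**2 * prob_window).sum() / m00 - yc**2
        theta = 0.5 * np.arctan2(b, a - c)              # orientation, eq. (10)
        root = np.sqrt(b**2 + (a - c)**2)
        length = np.sqrt(((a + c) + root) / 2)          # eq. (11)
        width = np.sqrt(((a + c) - root) / 2)           # eq. (12)
        s = np.sqrt(m00 / 256.0)                        # window side from area (Section 3.3)
        return xc, yc, theta, length, width, 2 * s, 2.4 * s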


Figure 2: CAMSHIFT flow chart: set the calculation region of the object probability distribution image at the search window center, but larger in size than the search window; calculate the color probability distribution within that region; find the center of mass within the search window; center the search window at the center of mass; and use (X, Y) and the distribution area found to set the search window center and size for the next iteration.

Figure 4: Example of CAMSHIFT tracking from the converged search location in Figure 3, bottom right. Since our probability distributions can yield object identification [7][11][14] as well as location, and since CAMSHIFT tends to stick to the mode of the dominant distribution, CAMSHIFT is not easily distracted from tracking its object, as shown in Figure 5.

Figure 5: Left: Tracking a face with distractor faces. Right: Tracking with occlusion. (Sequence: down the left column, then the right column.)

4. CAMSHIFT Performance Analysis

Figure 3: CAMSHIFT in operation, down the left column then the right column. In Figure 3, CAMSHIFT is shown beginning the search process at the top left and proceeding down the left then right columns until convergence at the bottom right. In this figure, the dark graph is a 1D cross-section of a flesh color probability distribution of an image of a face and a nearby hand. The light shade is the CAMSHIFT search window and the gray peak is the shift point (new window center). In Figure 3, CAMSHIFT finds the center of the face but ignores the nearby distractor (the hand). Figure 4 shows frame-to-frame tracking. At the left in Figure 4, CAMSHIFT's search window starts at its previous location from the bottom right of Figure 3. In one iteration it converges to the new face center. CAMSHIFT is shown to have low jitter, high noise tolerance, and high-precision tracking behavior in [1]. Example videos may be seen at the same site.

Figure 6: Since CAMSHIFT scales its calculation region with tracked object size, performance scales inversely with tracked object size (the ordinate is the percentage of CPU used). Here, CPU usage covers the entire computer vision thread, from image acquisition through HSV conversion, distribution calculation, and tracking.

The order of complexity of CAMSHIFT is O(αN^2), where the image is taken to be N×N. α is most influenced by the moment calculations and the average number of mean shift iterations until convergence. The biggest computational savings come from scaling the region of calculation to an area around the search window size, as previously discussed. CAMSHIFT was run on a 300 MHz Pentium® II processor, using an image size of 160x120 at 30 frames per second (see Figure 6). CAMSHIFT's performance scales with tracked object size. Figure 6 shows the CPU load from the entire computer vision thread, including image acquisition, HSV conversion, color probability distribution calculation, and CAMSHIFT tracking. In Figure 6, tracking consumes between 10% and 55% of the CPU. In an actual control task of flying over a 3D model of Hawaii using tracked head movements, average computer vision CPU usage was 29%. VTune™ analysis showed that CAMSHIFT itself consumed under 12% of the CPU. CAMSHIFT benefits from Intel's MMX™ technology optimized Image Processing Library, available on the Web [9], for RGB-to-HSV image conversion, image allocation, and moment calculation.

5. CAMSHIFT's Use in an Interface

Piecewise Linear Transformation of Control Variables
For game and graphics control, the X, Y, Z, and Roll values returned by CAMSHIFT face tracking often require a neutral position; that is, a position relative to which further face movement is measured. Each variable's movement away from neutral is scaled in N different ranges. The formula for mapping a captured video head position P to a screen movement factor F is

    F = min(b_1, P) s_1 + [min(b_2 - b_1, P - b_1)] s_2 + ... + [min(b_{i+1} - b_i, P - b_i)] s_{i+1} + ... + [P - b_{N-1}] s_N,    (control equation 1)

where [#] equals # if # > 0 and zero otherwise; min(A, B) returns the minimum of A and B; b_1 through b_N are the bounds of the ranges; s_1 through s_N are the corresponding scale factors for each range; and P is the magnitude of the variable's distance from neutral. This allows differential movement away from neutral.
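Control equation 1 can be transcribed directly into code. The sketch below is our own illustration (names such as bounds and scales are ours); bounds holds the interior boundaries b_1 through b_{N-1}, since the last range is unbounded, and scales holds s_1 through s_N.

    def screen_movement_factor(P, bounds, scales):
        """Piecewise linear mapping of distance-from-neutral P (control equation 1).
        Each range past b_i contributes its overlap with P times that range's scale."""
        pos = lambda x: max(x, 0.0)                     # [#] operator: # if # > 0, else 0
        F = min(bounds[0], P) * scales[0]
        for i in range(1, len(bounds)):
            F += pos(min(bounds[i] - bounds[i - 1], P - bounds[i - 1])) * scales[i]
        F += pos(P - bounds[-1]) * scales[-1]
        return F

For example, bounds of [1.0, 3.0] with scales [0.0, 0.5, 1.0] would give a dead zone for small head movements, a damped response for moderate ones, and a full-scale response beyond.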

Frame Rate Adjustment
Computer graphics and game movement commands S can be issued on each rendered graphics frame, but rendering speed depends on scene complexity. Thus, for smooth movement, S should be sensitive to frame rate. If the graphics rendering rate R can be determined, the final screen movement S is

    S = F / R.    (control equation 2)

Y Variable Special Case for Seated User

Figure 7: Lean changes Y and head roll.

If a user sits facing the camera and then makes an X move by leaning, Y will decrease, as in Figure 7. To compensate, we use the empirical observation that w (face half-width) is proportional to face size (see Figure 1), which is, on average, often proportional to body size. Empirically, the ratio of w to torso length from the face centroid is 1 to 13.75 (2 inches to 27.5 inches) (control equation 3). Given lean distance x (in units of w) and a seated size of 13.75, the user leans as in Figure 7 so that sin(A) = x / 13.75. Then

    A = arcsin(x / 13.75),    (control equation 4)
    b = 13.75 (1 - cos(A)),    (control equation 5)

in units of w. Control equations 4 and 5 give the Y distance to correct for (add back) when leaning.
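Control equations 4 and 5 reduce to a small trigonometric correction. The sketch below is our own illustration (the function name is ours; the 13.75 ratio is the empirical value from the text); it returns the lean angle A and the Y distance, in units of w, to add back.

    import math

    TORSO_RATIO = 13.75   # torso length from face centroid, in units of face half-width w

    def lean_compensation(x):
        """Given lean distance x in units of w, return (lean angle A in radians,
        Y distance to add back), per control equations 4 and 5."""
        A = math.asin(max(-1.0, min(1.0, x / TORSO_RATIO)))   # control equation 4
        b = TORSO_RATIO * (1.0 - math.cos(A))                 # control equation 5
        return A, b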

Roll Considerations
As can be seen in Figure 7, for seated users lean also induces a change in head roll by the angle A. We correct for this by subtracting the measured roll from the lean-induced roll A calculated in control equation 4 above.

Another problem can result when the user looks down too much and the forehead dominates the camera view. This causes the face flesh-color blob to look like it is oriented horizontally. To correct for this problem, we define a new variable Q, called Roll Quality. Q is the ratio of the length l to the width w of the distribution color blob in the CAMSHIFT search window:

    Q = l / w.    (control equation 6)

For forehead views, Roll Quality is nearly 1.0, so we ignore roll for quality measures less than 1.25. Roll is also ignored for Q greater than 2.0, since this is un-facelike and likely the result of noise or occlusion.
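The roll gate just described amounts to two threshold comparisons on Q. A minimal sketch follows (our own illustration; the 1.25 and 2.0 thresholds are the values from the text, the function name is ours).

    def usable_roll(length, width, roll):
        """Return the measured roll only when Roll Quality Q = l / w looks face-like;
        otherwise return None so the previous roll can be held."""
        if width <= 0:
            return None
        Q = length / width            # control equation 6
        if Q < 1.25 or Q > 2.0:       # forehead-only view or un-facelike blob
            return None
        return roll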

5.1. CAMSHIFT's Use as an Interface


CAMSHIFT is being used as a face tracker to control games and 3D graphics. By inserting face control variables into the mouse queue, we can control unmodified commercial games such as Quake 2, shown in Figure 8. We used left and right head movements to slide the user left and right, back and forth head movements to move the user back and forth, up or down movements to shoot, and roll left or right to turn left or right. This methodology has been tested extensively in a series of demos with over 30 different users. Head tracking via CAMSHIFT has also been used to experiment with immersive 3D graphics control, in which natural head movements are translated into movements of the corresponding 3D graphics camera viewpoint. This has been tested extensively using a 3D graphics model of the Forbidden City in China, as well as in exploring a 3D graphics model of the big island of Hawaii. Most users find it an enjoyable experience in which they naturally pick up how to control the graphics viewpoint movement.

Figure 8: CAMSHIFT-based face tracker used to play Quake 2 hands-free by inserting control variables into the mouse queue.

6. Discussion and Conclusion

CAMSHIFT is a simple, computationally efficient, probability-distribution-based object tracker that is fast enough to be used as part of a perceptual user interface. Despite its simplicity, CAMSHIFT handles several basic computer vision tracking problems:
- Irregular object motion: CAMSHIFT scales its search window to object size and so scales its potential tracking speed with object distance from the camera.
- Image noise: CAMSHIFT's search window helps it ignore outliers.
- Distractors: CAMSHIFT ignores objects outside its search window, so objects such as nearby faces and hands do not affect its tracking.
- Occlusion: As long as occlusion isn't 100%, CAMSHIFT will still tend to follow what is left of the object's probability distribution.
In this paper color distribution tracking was discussed but, as noted in Section 2, CAMSHIFT can track objects based on any valid distribution derived from a measurement set.

References
[1] G. Bradski, "Computer Vision Face Tracking For Use in a Perceptual User Interface," Intel Technology Journal, http://developer.intel.com/technology/itj/q21998/articles/art_2.htm, Q2, 1998.
[2] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Trans. Pattern Anal. Machine Intell., 17:790-799, 1995.
[3] D. Comaniciu and P. Meer, "Robust Analysis of Feature Spaces: Color Image Segmentation," CVPR '97, pp. 750-755.
[4] P. Fieguth and D. Terzopoulos, "Color-based tracking of heads and other mobile objects at video frame rates," in Proc. of IEEE CVPR, pp. 21-27, 1997.
[5] W.T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma, "Computer Vision for Computer Games," Int. Conf. on Automatic Face and Gesture Recognition, pp. 100-105, 1996.
[6] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Boston, 1990.
[7] B.V. Funt and G.D. Finlayson, "Color constant color indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(5):522-529, 1995.
[8] M. Hunke and A. Waibel, "Face locating and tracking for human-computer interaction," Proc. of the 28th Asilomar Conf. on Signals, Systems and Computers, pp. 1277-1281, 1994.
[9] MMX™ technology optimized library for image processing and pattern recognition, available from http://developer.intel.com/design/perftool/perflibst/index.htm
[10] K. Nagao, "Recognizing 3D objects using photometric invariants," ICCV '95, pp. 480-487, 1995.
[11] B. Schiele and J.L. Crowley, "Recognition without correspondence using multidimensional receptive field histograms," MIT Media Lab Perceptual Computing Technical Report No. 453.
[12] A.R. Smith, "Color Gamut Transform Pairs," SIGGRAPH '78, pp. 12-19, 1978.
[13] K. Sobottka and I. Pitas, "Segmentation and tracking of faces in color images," Proc. of the Second Intl. Conf. on Automatic Face and Gesture Recognition, pp. 236-241, 1996.
[14] M. Swain and D. Ballard, "Color indexing," Intl. J. of Computer Vision, 7(1), pp. 11-32, 1991.
