Tracking Humans Using Multi-modal Fusion


Tracking Humans using Multi-modal Fusion

Xiaotao Zou, Bir Bhanu
Center for Research in Intelligent Systems, University of California, Riverside, CA 92521
{xzou,bhanu}@vislab.ee.ucr.edu

Abstract

Human motion detection plays an important role in automated surveillance systems. However, it is challenging to detect non-rigid moving objects (e.g., humans) robustly in a cluttered environment. In this paper, we compare two approaches for detecting walking humans using multi-modal measurements: video and audio sequences. The first approach is based on the Time-Delay Neural Network (TDNN), which fuses the audio and visual data at the feature level to detect the walking human. The second approach employs a Bayesian Network (BN) to jointly model the video and audio signals. Parameter estimation of the graphical model is carried out with the Expectation-Maximization (EM) algorithm, and the location of the target is tracked by Bayesian inference. Experiments are performed in several indoor and outdoor scenarios: in the lab, with more than one person walking, with occlusion by bushes, etc. A comparison of the performance and efficiency of the two approaches is also presented.

1. Introduction

Automated surveillance addresses the real-time observation of people, vehicles and other moving objects within a complicated environment, leading to a description of their actions and interactions. The technical issues include moving object detection and tracking, object classification, human motion analysis, and activity understanding. The sensors most commonly used for surveillance are imaging sensors, e.g., video cameras and thermal imaging systems.

There are a number of video surveillance systems [3], which consist of anywhere from a single camera to hundreds of cameras. To achieve continuous monitoring, infrared (IR) cameras are used along with the optical cameras under low illumination [12]. Besides these video or IR surveillance systems, there also exist "detection and tracking" systems based on non-imaging measurements. A project named "Smart Floor" [6] aims to identify and track a user around the space with force-measuring load cells installed under the floor. However, along with the relatively high performance come the high cost and the careful design of the instrumented space.

Although video or IR sensors provide a detailed and friendly description of the environment, their volume and cost restrict their use in a Wireless Sensor Network (WSN). In contrast, the microphone is a well-qualified candidate for a surveillance WSN in several respects: compact size, low cost, small data volume for transmission, low power consumption, and ease of integration on a chip. In most surveillance scenarios, a simple sensor network composed of several off-the-shelf cameras and dozens, or even hundreds, of microphones may be adequate to cover an area of up to a few acres. Since the audio and video sequences cover overlapping areas, a sensor fusion mechanism should be developed to obtain a more accurate and efficient solution.

Figure 1. Experiment setup for the multi-modal surveillance system. lx is the horizontal position of the target.

Figure 2. Audio-video data correspondence in the sequence "walking-in-the-lab". (Top) Sound pressure waveform received at the microphone. (Bottom) Corresponding frames for beats in the step sound.

In the multi-modal tracking system shown in Fig. 1, the audio waveforms are captured by two microphones, and the video sequences are recorded by an off-the-shelf camera. The frames contain a person walking in front of a cluttered background that may include other people. The audio waveform contains the object's step sounds corrupted by background noise.

The audio and visual signals are highly correlated, as shown in Fig. 2. Also, the time delay between the signals arriving at the two microphones is correlated with the position of the walking person in the frames. In principle, tasks such as tracking may be performed better by taking advantage of these correlations. However, the relevant features are not directly observable. The audio signal propagating from the walker is usually corrupted by reverberation, multi-path effects and background noise, making it difficult to measure the time delay. Moreover, the video sequence is cluttered by moving objects other than the walking person.

In this paper, we compare two multi-modal fusion approaches for walking-human detection and tracking: the Time-Delay Neural Network (TDNN) and the Bayesian Network (BN). First, we discuss related work and the motivation of our approaches in Section 2. In the next section, the two approaches used for our problem are outlined; the network architecture, feature selection and parameter estimation are presented in detail. Then, we compare the performance and efficiency of the two approaches on various test scenarios: indoors, outdoors, more than one moving human, and occlusion by bushes.

2. Related work, motivation and contributions

2.1. Related work

Several studies on fusing audio-video data for object detection and tracking have been reported in the literature. Vermaak et al. [14] proposed a particle-filter-based approach for audio-visual speaker tracking. Fisher and Darrell [7] presented an information-theoretic approach to the fusion of multiple modalities, which can detect where a speaker is within a scene and whether he or she is producing specific words. A variation of the neural network, the Time-Delay Neural Network (TDNN), has been proposed to handle sequential data for multi-modal fusion. Stork et al. [13] proposed a modified TDNN that performs visual lip reading to improve the accuracy of acoustic speech recognition. Cutler and Davis [4] also used a TDNN to learn the audio-visual correlation for locating the speaking person in a scene.

However, all of the works mentioned above assume that the objects within the scene (e.g., the speaking faces) do not move dramatically. Consequently, they cannot address dynamic changes of the objects.

The Bayesian Network (BN) [10], a graphical statistical model, is widely used in multi-modal fusion. Garg et al. [9] developed a supervised learning framework based on dynamic Bayesian Networks and applied it to the problem of audio-visual speaker detection for a smart kiosk. A graphical model for audio-visual object tracking was proposed by Beal et al. [1], which extended the concept of the Transformed Mixture of Gaussians (TMG) [8] to audio and video data modeling and used the Expectation-Maximization algorithm to estimate the model parameters.

2.2. Motivation

The visual motion of a walking person (i.e., gait) is periodic and highly correlated with the corresponding step sound (Fig. 2). A similar fact, the correlation between the motion of the mouth during speaking and the speech sounds, has been exploited for lip reading [13] and speaker detection [4]. Fig. 3 shows the recurrence matrix of the extracted human motion and the similarity of the step sounds in the sequence "walking-in-the-lab". The recurrence matrix is a quantitative tool for time-series analysis of non-linear dynamic systems; in our case it is defined by the correlation function R(I_t1, I_t2) for frames I_t1 and I_t2. The highest correlation values are found on the diagonal in Fig. 3(a). Similarly, we use the Euclidean distance between the amplitudes to define the similarity of the corresponding audio signals at times t1 and t2; this similarity is denoted by the brightness in Fig. 3(b). From Fig. 3 we find that the change in the audio data is highly correlated with the visual change in the gait. This prompts us to detect walking persons in the scene by fusing these two different modalities of signals, i.e., visual and audio.

Figure 3. (a) Recurrence matrix RM(t1, t2) of the extracted object in video and (b) similarity of the step sounds.

2.3. Contributions

This paper first explores the relation between visual motion and step sounds, and discusses the application of the Time-Delay Neural Network (TDNN) to multi-modal fusion for walking-human detection. The audio-visual correlation is first learned by a time-delay neural network, which then performs a spatio-temporal search over the audio-visual sequences for the walking person.

For comparison, the paper also employs a Bayesian Network for jointly modeling audio-visual data and exploiting the correlations between the two modalities. Statistical models have several important advantages that make them suitable for our purpose. First, since we explicitly model the actual sources of variability in the problem, the resulting algorithm turns out to be robust. Second, using statistical models leads to an optimal solution via Bayesian inference. Third, parameter estimation can be performed efficiently using the Expectation-Maximization algorithm.

3. Technical approaches

3.1. Time-Delay Neural Network approach

The Time-Delay Neural Network approach is illustrated in Fig. 4.

3.1.1. TDNN dataflow diagram

Figure 4. Dataflow diagram of TDNN-based object detection.
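As an illustration of the recurrence analysis of Section 2.2, the following is a minimal sketch, under our own simplifying assumptions, of how matrices like those in Fig. 3 could be computed. The whole-frame correlation and the function names are our simplifications; the paper computes the correlation on background-subtracted object windows (Section 3.1.3).

```python
import numpy as np

def recurrence_matrix(frames):
    """Correlation R(I_t1, I_t2) between every pair of grayscale frames.

    frames: array of shape (T, H, W). Returns a (T, T) matrix whose brightest
    entries indicate visually similar gait phases, as in Fig. 3(a).
    """
    T = frames.shape[0]
    X = frames.reshape(T, -1).astype(np.float64)
    X -= X.mean(axis=1, keepdims=True)             # zero-mean each frame
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    return X @ X.T                                  # normalized correlation

def audio_similarity(amplitudes):
    """Similarity of step-sound amplitudes at times t1 and t2, defined through
    the Euclidean distance between short amplitude windows, as in Fig. 3(b).

    amplitudes: array of shape (T, L), one short audio window per video frame.
    """
    d = np.linalg.norm(amplitudes[:, None, :] - amplitudes[None, :, :], axis=-1)
    return -d                                       # larger (brighter) = more similar
```

Both matrices exhibit the same periodic banding in the walking sequence, which is the audio-visual correlation exploited by the approaches below.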

3.1.2. Time-Delay Neural Network architecture

Figure 5. A typical Time-Delay Neural Network.

Fig. 5 shows a typical Time-Delay Neural Network architecture. While the architecture consists of input, hidden and output layers, much as in classical neural nets, there is a crucial difference: each hidden unit accepts input only from a restricted (spatial) range of positions in the input layer. Hidden units at "delayed" locations (i.e., shifted to the right) accept inputs from the input layer that are similarly shifted. Training proceeds as in standard back-propagation, but with the added constraint that corresponding weights are forced to have the same value, an example of weight sharing. Thus, the learned weights do not depend on the position of the pattern, as long as the full pattern lies within the domain of the input layer.

3.1.3. Feature selection

As the visual input feature for the TDNN, we choose a simple measure of change between two images I_t and I_{t-i}, the normalized cross-correlation:

$$R_{t,t-i} = \frac{\sum_{(x,y)\in W} \bigl(I_t(x,y)-\bar I_t\bigr)\bigl(I_{t-i}(x,y)-\bar I_{t-i}\bigr)}{\Bigl\{\sum_{(x,y)\in W} \bigl|I_t(x,y)-\bar I_t\bigr|^2 \; \sum_{(x,y)\in W} \bigl|I_{t-i}(x,y)-\bar I_{t-i}\bigr|^2\Bigr\}^{1/2}} \qquad (1)$$

In the moving-object detection application, this measure would fail if the whole image were used to compute the cross-correlation. To solve this problem, our approach is to track the center of the person: all moving objects in the scene (walking or non-walking) are first detected by a modified background subtraction algorithm [2]. By computing correlations between the subtracted objects in consecutive frames, we obtain the desired visual feature (the frame cross-correlation R_{t,t-1}), which lies on the line immediately below the diagonal in Fig. 3(a).

The sound spectrogram (also called sonogram) coefficients of the step sounds are used as the audio input features. The sound spectrogram, like a musical score, is a visual representation of the sound. It is calculated by the short-time Fourier transform (STFT) of the one-dimensional audio signal (shown in Fig. 6). Its horizontal dimension corresponds to time and its vertical dimension to frequency; the relative intensity of the sound at a particular time and frequency is indicated by the brightness at that point.

Figure 6. Step sound spectrogram.
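The following is a minimal sketch, under stated assumptions, of the visual feature of Eq. (1): the normalized cross-correlation between the background-subtracted object windows of consecutive frames. The window extraction and the background-subtraction step of [2] are abstracted behind a hypothetical `object_window` helper.

```python
import numpy as np

def frame_cross_correlation(win_t, win_t1):
    """Normalized cross-correlation R_{t,t-1} of Eq. (1) between the object
    windows W extracted from frames I_t and I_{t-1}.

    win_t, win_t1: grayscale arrays of identical shape (the window centered on
    the tracked person, obtained by background subtraction).
    """
    a = win_t.astype(np.float64) - win_t.mean()
    b = win_t1.astype(np.float64) - win_t1.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    return float((a * b).sum() / denom)

# Usage sketch: one visual feature per frame pair along the sequence.
# windows = [object_window(frame) for frame in video]          # hypothetical helper
# visual_feature = [frame_cross_correlation(windows[t], windows[t - 1])
#                   for t in range(1, len(windows))]
```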
3.2. Bayesian Network approach

The Bayesian Network (BN) is an attractive framework for statistical modeling, as it combines an intuitive graphical representation with efficient algorithms for inference and learning. A BN encodes the conditional dependences among a set of random variables in the form of a graph (e.g., Fig. 7). An arc between two nodes denotes a conditional dependence relationship, which is parameterized by a conditional probability model. The structure of the graph encodes domain knowledge, such as the relationship between sensor outputs and hidden states, while the parameters of the conditional probability models can be learned from data. Another advantage of BN models is that they are easily extended to handle time-series data by means of the dynamic Bayesian Network (DBN) framework.

3.2.1. Video component

Video frames are modeled using a statistical model called the Transformed Mixture of Gaussians (TMG) [8] (Fig. 7). This is a simple generative model that describes the observed image y in terms of an original image v that has been shifted by l_x pixels and further contaminated by additive noise with covariance matrix Ψ. To account for the variability in the original image, v is modeled by a mixture model with components s; each component s consists of a template with mean μ_s and covariance matrix φ_s, and has prior probability π_s.

Figure 7. Graphical model of video signals.

The parent-child relationships of the nodes are described by the conditional probabilities

$$\Pr(v \mid s) = \mathcal N(v;\, \mu_s, \phi_s), \quad \Pr(s) = \pi_s, \quad \Pr(y \mid v, l_x) = \mathcal N(y;\, G_{l_x} v, \Psi) \qquad (2)$$

where both conditional probabilities are assumed Gaussian; N(v; μ_s, φ_s) denotes a Gaussian distribution over the variable v with mean μ_s and covariance matrix φ_s, and G_{l_x} denotes the horizontal shift operator. The prior probability of the shift l_x is assumed flat: Pr(l_x) = π_l (constant). The model parameters, including the image templates μ_s, their covariance matrices φ_s, and the noise covariance Ψ, are learned from sequential data using the EM algorithm.
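As a concrete, much-simplified reading of Eq. (2), the sketch below evaluates the video evidence Pr(y | l_x) of a TMG by shifting each template horizontally by l_x and marginalizing over the mixture component s. It assumes diagonal covariances and a circular shift, which are our simplifications for illustration; the actual system learns μ_s, φ_s and Ψ with EM.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian, summed over all pixels."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def video_log_evidence(y, templates, template_vars, priors, noise_var, l_x):
    """log Pr(y | l_x) for a Transformed Mixture of Gaussians (cf. Eq. 2).

    y             : observed image, shape (H, W)
    templates     : mixture means mu_s, shape (S, H, W)
    template_vars : diagonal covariances phi_s, shape (S, H, W)
    priors        : prior probabilities pi_s, shape (S,)
    noise_var     : scalar diagonal observation noise (Psi)
    l_x           : candidate horizontal shift in pixels
    """
    log_terms = []
    for mu, phi, pi in zip(templates, template_vars, priors):
        shifted_mean = np.roll(mu, l_x, axis=1)            # G_{l_x}: horizontal shift
        shifted_var = np.roll(phi, l_x, axis=1) + noise_var  # template + observation noise
        log_terms.append(np.log(pi) + log_gaussian_diag(y, shifted_mean, shifted_var))
    return np.logaddexp.reduce(log_terms)                   # marginalize over s
```

Evaluating this quantity over all candidate shifts gives the per-frame video contribution to the posterior over l_x used in Section 3.2.4.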

3.2.2. Audio component

Figure 8. Graphical model of audio signals received by two microphones.

Similarly, the audio model (Fig. 8) describes the observed audio signals x1 and x2 in terms of an original signal a that is attenuated by a factor λ_i on its way to microphones 1 and 2, and arrives at microphone 2 with a time delay τ relative to microphone 1. To account for variability in the original signal, a is modeled by a mixture model with components r; each component r has zero mean, covariance matrix η_r, and prior probability π_r. Then we have

$$\Pr(a \mid r) = \mathcal N(a;\, 0, \eta_r), \quad \Pr(r) = \pi_r, \quad \Pr(x_1 \mid a) = \mathcal N(x_1;\, \lambda_1 a, \upsilon_1), \quad \Pr(x_2 \mid a, \tau) = \mathcal N(x_2;\, \lambda_2 L_\tau a, \upsilon_2) \qquad (3)$$

where all conditional probabilities are assumed Gaussian and L_τ denotes the temporal shift operator applied to the original signal a.

3.2.3. Link between audio and video signals

The dependence of the time delay τ on the object location l_x in the frames is modeled by a noisy linear mapping:

$$\Pr(\tau \mid l_x) = \mathcal N(\tau;\, \alpha l_x + \beta, \upsilon_\tau) \qquad (4)$$

In our experiment, the mapping involves only the horizontal position, as the vertical movement of the object has a significantly smaller effect on the arrival time than the horizontal motion. It can be shown that the linear approximation is fairly accurate for a pinhole camera and a large microphone baseline. To account for deviations from linearity and other inaccuracies of the simplified model, such as reverberation, we allow the mapping to be noisy, with noise covariance υ_τ. The full audio-visual generative model is illustrated in Fig. 9.

Figure 9. Graphical model representation of the full Bayesian Network modeling the audio and video signals jointly. The arrows and parameters of the audio-visual link are highlighted.

To handle the sequential audio-visual data, we extend the static BN (Fig. 9) to a dynamic BN that is independently and identically distributed over time, which means the model parameters are shared between different time steps.
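For intuition about the link in Eq. (4), here is a minimal, deterministic sketch: estimate the inter-microphone delay τ by maximizing the cross-correlation and invert the linear mapping to get a rough horizontal position. In the actual model τ is never estimated explicitly; it is a hidden variable inferred jointly with l_x, and the calibration values (α, β) and the lag range below are hypothetical placeholders.

```python
import numpy as np

def estimate_delay(x1, x2, max_lag):
    """Delay (in samples) that best aligns x2 with x1, found by maximizing the
    cross-correlation over integer lags in [-max_lag, max_lag].
    Assumes x1 and x2 have equal length greater than 2 * max_lag."""
    lags = np.arange(-max_lag, max_lag + 1)
    ref = x1[max_lag:-max_lag]
    scores = [np.dot(ref, x2[max_lag + lag: len(x2) - max_lag + lag]) for lag in lags]
    return int(lags[int(np.argmax(scores))])

def delay_to_position(tau, alpha, beta):
    """Invert the linear link of Eq. (4), tau ~ alpha * l_x + beta, ignoring the
    noise term, to obtain a rough point estimate of the horizontal position."""
    return (tau - beta) / alpha

# Usage sketch with a hypothetical calibration and 32 kHz audio:
# tau = estimate_delay(x1, x2, max_lag=64)
# l_x_guess = delay_to_position(tau, alpha=0.05, beta=-10.0)
```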

3.2.4. Parameter estimation and Bayesian inference

In the graphical model described above, the joint distribution of the observed signals, the unobserved variables and the component parameters is given by

$$\Pr(x_1, x_2, y, \tau, l_x, r, s, a, v) = \underbrace{\Pr(x_1 \mid a)\Pr(x_2 \mid a, \tau)\Pr(a \mid r)\Pr(r)}_{\text{audio signal}}\; \underbrace{\Pr(y \mid v, l_x)\Pr(v \mid s)\Pr(s)}_{\text{video signal}}\; \underbrace{\Pr(\tau \mid l_x)\Pr(l_x)}_{\text{audio-visual link}} \qquad (5)$$

which is the product of the distributions defined by the audio and video models and the audio-visual link. The model parameters Θ = {λ1, υ1, λ2, υ2, η_r, π_r, π_s, π_l, μ_s, φ_s, Ψ, α, β, υ_τ} are estimated from the data sequence using the Expectation-Maximization (EM) algorithm [1].

After estimating the parameters, the a posteriori probability Pr(l_x | x1, x2, y) of the location variable l_x is calculated using Bayes' rule:

$$\Pr(l_x \mid x_1, x_2, y) = \frac{\Pr(l_x, x_1, x_2, y \mid \Theta)}{\Pr(x_1, x_2, y \mid \Theta)} \qquad (6)$$

The estimate of the object's location in each frame is the most likely value given the observed data:

$$\hat l_x = \arg\max_{l_x} \Pr(l_x \mid x_1, x_2, y) \qquad (7)$$

4. Experimental results

4.1. Experiment setup and devices

The video sequences are recorded with a SONY DCR-VX1000 Digital Handycam camcorder, which has three 1/3-inch CCDs with 410,000 pixels each. The frame resolution is 720x480 pixels and the video capture rate is 30 frames per second. The sound is recorded with the built-in microphone of the DCR-VX1000 camcorder; a LABTEC VERSE 303 microphone is also used as an auxiliary audio recording device when necessary. The sampling rate of both microphones is 32 kHz at 12-bit resolution. The video camera and microphones are mounted on a fixed platform.

The Time-Delay Neural Network is implemented with the Neural Network Toolbox in MATLAB 6.1, and the Bayesian Network uses the Bayes Net Toolbox for MATLAB [11]. The testing platform is a Pentium IV 1.7 GHz PC with 256 MB (PC2100) memory.

4.2. Experiment scenarios

To test the performance of our approaches, we designed several experimental scenarios (described in Table 1 and shown in Fig. 11). The basic assumption in our experiments is that there is only one human (the "object") walking in the indoor or outdoor environment, which is monitored by an off-the-shelf video camera and several microphones. The environment can be complicated and cluttered, and other humans may be moving or walking in it. Our ultimate goal is to detect and track the object that is moving and producing the recorded step sound.

Table 1: Description of test scenarios (scenario name, indoor/outdoor, number of humans in the scene, and test sequence size).

To train the TDNN and the BN, the training sequences are approximately three times the size of the test sequences. For the TDNN, we choose one visual feature (R_{t,t-1}) and four audio features (spectrogram coefficients at 1, 1K, 10K and 100K Hz) as input features; following the TDNN input specification, the audio features are normalized to the range [0, 1]. After obtaining the visual and audio features from the training sequence, we feed them into the TDNN described earlier to train the weights {W1, W2, ..., W84}. Then, a spatio-temporal search with the learned TDNN is performed in each frame to locate the walking human. Fig. 11(a) shows the location of the detected target (marked with a cross and rectangle) in the test sequence "Lab".
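As a sketch of the feature preparation just described, the snippet below computes a short-time Fourier transform of the step-sound track, samples a few spectrogram coefficients per video frame, rescales them to [0, 1], and stacks them with the visual feature R_{t,t-1}. The STFT parameters and the chosen frequency bins are illustrative assumptions, not the paper's exact settings; a 30 fps video rate and 32 kHz audio are assumed, as in Section 4.1.

```python
import numpy as np
from scipy.signal import stft

def tdnn_inputs(audio, fs, visual_feature, n_frames, target_freqs_hz=(1.0, 1e3, 1e4)):
    """Build an (n_frames, 1 + len(target_freqs_hz)) TDNN input matrix:
    one visual feature plus a few spectrogram magnitudes per video frame."""
    # Roughly one STFT column per video frame (30 fps assumed for the hop length).
    hop = fs // 30
    f, t, Z = stft(audio, fs=fs, nperseg=2 * hop, noverlap=hop)
    mag = np.abs(Z)

    # Pick the spectrogram rows closest to the requested frequencies.
    rows = [int(np.argmin(np.abs(f - fq))) for fq in target_freqs_hz]
    audio_feats = mag[rows, :n_frames].T                      # (n_frames, n_freqs)

    # Normalize each audio feature to [0, 1], as required by the TDNN input spec.
    lo, hi = audio_feats.min(axis=0), audio_feats.max(axis=0)
    audio_feats = (audio_feats - lo) / np.maximum(hi - lo, 1e-12)

    return np.column_stack([np.asarray(visual_feature)[:n_frames], audio_feats])
```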
In the Bayesian Network approach, since there is only one object in the scene, we assume a single visual template s and a single audio component r. The EM algorithm is used to train the graphical model (Fig. 9) on the observed multi-modal signals. The maximum number of iterations is set to 10, and the iterations are terminated early when a small stopping criterion (10^-10) is met. If the estimated target location l̂_x is within 30 pixels of the ground truth l_x, we count it as a correct detection of the walking human. The ground truth, the tracking results and the confidence plot for the test sequence "Two-person-test2" are shown in Fig. 10.
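The evaluation rule above is simple enough to state in a few lines. The sketch below takes per-frame posteriors over candidate horizontal positions (Eq. 6), forms the MAP estimate of Eq. (7), and scores it against the ground truth with the 30-pixel criterion; the array shapes are assumptions for illustration.

```python
import numpy as np

def detection_accuracy(posteriors, candidate_lx, ground_truth, tol=30):
    """Score MAP location estimates against ground truth.

    posteriors   : (n_frames, n_candidates) array of Pr(l_x | x1, x2, y) per frame
    candidate_lx : (n_candidates,) candidate horizontal positions in pixels
    ground_truth : (n_frames,) ground-truth horizontal positions in pixels
    Returns the fraction of frames within `tol` pixels and the per-frame estimates.
    """
    map_idx = np.argmax(posteriors, axis=1)        # arg max of Eq. (7)
    estimates = candidate_lx[map_idx]
    correct = np.abs(estimates - ground_truth) <= tol
    return float(np.mean(correct)), estimates
```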

Figure 10. Tracking results for the test sequence "Two-person-test2": comparison of the detected track and the ground truth (left), and the confidence (a posteriori probability) plot (right).

Figure 11. Sample frames from all test sequences: (a) "Lab" (tracking results are marked in each frame), (b) "Two-person-test1", (c) "Two-person-test2", (d) "Behind-bush-test1", (e) "Behind-bush-test2". The tracked object centers are highlighted as white blobs in each frame.

4.3. Performance comparison of the TDNN and BN approaches

Tables 2 and 3 summarize the detection accuracy and the training time of the two approaches on the five test sequences.

Table 2: Comparison of detection accuracy rates

                 "Lab"   "Two-person-test1"   "Two-person-test2"   "Behind-bush-test1"   "Behind-bush-test2"
TDNN approach     89%           91%                  52%                   48%                   39%
BN approach       93%           95%                  86%                   83%                   72%

Table 3: Efficiency (training time, in seconds) comparison

                 "Lab"   "Two-person-test1"   "Two-person-test2"   "Behind-bush-test1"   "Behind-bush-test2"
TDNN approach     327           228                  600                   453                   508
BN approach       348           276                  528                   411                   472

5. Conclusion

In this paper, we discussed the use of multi-modal fusion for human motion detection. Specifically, we presented two approaches for detecting walking humans in a cluttered environment from video sequences and step sounds: the Time-Delay Neural Network (TDNN) and the Bayesian Network (BN) approaches.

The comparison of these two approaches illustrates the advantages of statistical models (i.e., the Bayesian Network) over the Time-Delay Neural Network. First, the TDNN approach requires the object to be initialized in the video signal (i.e., pre-detection), whereas the BN approach needs only random parameter initialization, and the choice of initial parameters does not affect its performance. Second, the BN approach uses a microphone array (microphones 1 and 2) and models the correlation between the time delay in the audio signals and the object position in the video with a noisy linear mapping; the Bayesian Network encodes this property in its structure and uses the Transformed Mixture of Gaussians to model both the video and audio data. Third, the explicit and easily accessible structure of graphical models is clearly an advantage, while the inner structure and parameters of the TDNN are not directly available to designers. Finally, besides the better tracking accuracy of the BN approach in all experimental scenarios, the confidence (the a posteriori probability of the estimates) provides a quantitative measure of support for each decision.

The work presented in this paper can be extended in several ways. The microphone array could be employed in a modified TDNN approach, and the periodicity of the step sounds could be detected with power-spectrum estimation techniques for the detection of walking humans in audio sequences. Audio signal separation and processing techniques could be included for multiple-object tracking. Furthermore, when heterogeneous sound sources (e.g., vehicles) are also present, we may include seismic sensors in our scheme.

6. References

[1] M. Beal, N. Jojic, H. Attias, "A graphical model for audiovisual object tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 25, No. 7, pp. 828-836, July 2003.
[2] B. Bhanu and X. Zou, "Moving humans detection based on multi-modal sensory fusion," Proc. IEEE Workshop on Object Tracking and Classification Beyond the Visible Spectrum (OTCBVS'04), pp. 101-108, July 2004.
[3] R. T. Collins, A. J. Lipton, H. Fujiyoshi and T. Kanade, "Algorithms for cooperative multisensor surveillance," Proceedings of the IEEE, Vol. 89, No. 10, pp. 1456-1477, October 2001.
[4] R. Cutler, L. Davis, "Look who's talking: Speaker detection using video and audio correlation," Proc. IEEE Intl. Conf. Multimedia and Expo (ICME'00), pp. 1589-1592, 2000.
[5] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, John Wiley & Sons, New York, 2001.
[6] I. A. Essa, "Ubiquitous sensing for smart and aware environments: technology towards the building of an aware home," IEEE Personal Communications, pp. 47-49, October 2000.
[7] J. W. Fisher III, T. Darrell, "Signal level fusion for multimodal perceptual user interface," Proc. Workshop on Perceptive User Interfaces (PUI'01), Nov. 2001.
[8] B. J. Frey, N. Jojic, "Transformation-invariant clustering using the EM algorithm," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 25, No. 1, pp. 1-17, January 2003.
[9] A. Garg, V. Pavlovic and J. Rehg, "Boosted learning in Dynamic Bayesian Networks for multimodal speaker detection," Proceedings of the IEEE, Vol. 91, No. 9, pp. 1355-1369, September 2003.
[10] F. V. Jensen, An Introduction to Bayesian Networks, Springer-Verlag, New York, 1996.
[11] K. Murphy, Bayes Net Toolbox for MATLAB, www.ai.mit.edu/~murphyk/Software/BNT/bnt.html
[12] S. Nadimi, B. Bhanu, "Physics-based models of color and IR video for sensor fusion," Proc. IEEE Intl. Conf. Multisensor Fusion and Integration for Intelligent Systems (MFI'03), pp. 161-166, July 2003.
[13] D. G. Stork, G. Wolff, E. Levine, "Neural network lipreading system for improved speech recognition," Proc. Intl. Conf. Neural Networks (IJCNN'92), Vol. 2, pp. 289-295, 1992.
[14] J. Vermaak, M. Gangnet, A. Blake, P. Perez, "Sequential Monte Carlo fusion of sound and vision for speaker tracking," Proc. IEEE Intl. Conf. Computer Vision (ICCV'01), Vol. 1, pp. 741-746, July 2001.
