Multi-sensor Detection and Tracking of Humans for Safe Operations with Unmanned Ground Vehicles

Susan M. Thornton, Member, IEEE, Mike Hoffelder, and Daniel D. Morris, Member, IEEE

Abstract— This paper details an approach for the automatic detection and tracking of humans using multi-sensor modalities including 3D Ladar and long wave infrared (LWIR) video. By combining data from these sensors, we can detect individuals regardless of whether they are erect, crouched, prone, or partially occluded by other obstacles. Such algorithms are integral to the development and fielding of future "intelligent" unmanned ground vehicles (UGVs). In order for robots to be integrated effectively into small combat teams in the operational environment, the autonomous vehicles must maneuver safely among our troops and therefore must be capable of detecting stationary and moving people in cluttered scenes.

I. INTRODUCTION

The goal of this work is to robustly detect and track both stationary and moving humans from a moving unmanned ground vehicle (UGV) using data from both 3D Ladar and a long wave infrared (LWIR) camera. Detecting humans in any type of sensor data is a challenging problem due to the wide variety of positions and appearances which humans can assume. Stationary humans further complicate the task since motion cues cannot be relied on to eliminate false alarms. In order to reduce false alarm rates, which can be significant in cluttered urban environments, algorithms are usually limited to detecting moving humans [2]-[3], [14] or to detecting upright humans in more ideal postures [4]-[8], [12].
These limiting assumptions are acceptable in the case of pedestrian detection for the automotive industry, where the goal is to have an alert system that aids a human driver, or in surveillance applications in which the camera is in a fixed location.

In the military context, in which we are striving for a fully autonomous UGV that has 360 degree situational awareness, it is critical that we reliably detect not only upright people, but also those that are prone on the ground (e.g. a hurt soldier) and those not moving. Since stationary humans have the potential to move at any time, distinguishing them from other objects such as barrels and crates is important for effective and robust UGV path planning. To achieve this goal, we assert that it is necessary to fuse information from various sensor modalities. Unlike the fusion approach in [12], we have chosen Ladar and LWIR sensors, which both provide day/night capability and operate under extensive environmental conditions.

In this paper, we first review a joint spatial-temporal solution to the Ladar data association problem [1] in which Ladar returns are considered as samples on an underlying world surface. This surface is explicitly modeled and then locally matched over time intervals. The result is a natural categorization of the world into stationary and moving objects.

[Manuscript received January 15, 2008. This work was supported through collaborative participation in the Robotics Consortium, which is sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0012. S. M. Thornton is with General Dynamics Robotic Systems, Pittsburgh, PA 15221 USA (phone: 412-473-2167; fax: 412-473-2190; e-mail: sthornton@gdrs.com). M. Hoffelder and D. D. Morris are also with General Dynamics Robotic Systems (e-mail: mhoffelder@gdrs.com, dmorris@gdrs.com).]
With a focus strictly on moving obstacles, the algorithm in [1] achieves a high rate of performance with the use of simple size features to distinguish between humans and other types of objects. We present improvements to the algorithm in [1] that focus on extracting more advanced shape-based features from the Ladar data and enable the detection of stationary humans.

Even with these advances, it is a challenge to differentiate a prone human in the Ladar from the massive amount of ground returns that are received, or even to detect upright humans when they are standing against walls or other large structures. In order to achieve these tasks, we use LWIR video to complement the capability of the Ladar. LWIR video provides information out to much longer ranges than the Ladar, and has the ability to highlight humans in a variety of non-ideal postures due to emissivity differences with their surroundings. We present a statistical and morphological approach to human detection in LWIR imagery. Many algorithms make the simplifying assumption that humans will always be hot compared to the surrounding environment [5], [14], but this is often not the case, so we discuss our approach to making the algorithm robust to such challenges.

In Section II, we summarize our method of object extraction and data association in Ladar data, and detail new features that have been implemented to improve performance. In addition, we present metrics and performance results on a baseline set of Ladar data. Section III contains a detailed presentation of our LWIR algorithm that extracts human regions-of-interest (ROIs). In Section IV, we briefly discuss our approach for fusing Ladar and LWIR at both the feature and detection levels. Section V summarizes our work.

II. LADAR ALGORITHM

A. Ladar Sensor Data

Our 3D data are the result of a pair of scanning Ladars that have been configured at fixed pan and tilt positions on the top of a sport utility vehicle (SUV). This dual configuration allows for nearly 180 degree field-of-view (FOV) in the direction the vehicle travels. In addition to the dual Ladars, the SUV has been equipped with an aluminum beam that holds a set of stereo LWIR video cameras, amongst a number of other sensor pairs, see Fig. 1.

Frames of data from the right and left Ladars are synchronized in time in order to account for either Ladar dropping frames or for unexpected differences in the frame rates of the two sensors. Each Ladar scans at approximately 10 Hz, providing a 2D grid based depth map for each scan. Although there is a small angular overlap of the two Ladars, we have initially chosen to do a simple concatenation of the depth maps from each sensor.

B. Object Extraction

In [1], an algorithm for detecting moving vehicles and people using a single scanning Ladar on a moving vehicle was presented. Since the data from the dual Ladar configuration appear to the algorithm as a single depth map, the approach is applied to the new dual data without modification. There are two key components to the approach: find objects in the scene and analyze their motion. Object detection is achieved through a two-step process of eliminating ground returns and then implementing a contiguous region building technique that leverages the adjacency information in the angle-depth map created by the Ladars. This simple clustering approach has very low computational requirements, and works well for both large and small objects.

The ability to effectively isolate human objects depends on the performance of the ground removal algorithm. We found that modeling the ground with roughly horizontal planes [1] worked well at close range, but was unreliable at points far from the sensor and added undesirable computational complexity.
The ground is now labeled by computing the elevation angle between neighboring points in the angle-depth map. Using the ground map in conjunction with the angle-depth map, a height-above-ground for each Ladar return is estimated. We use this height value to eliminate all points less than 0.25 meters above ground. This approach has the advantage of removing clutter due to tall grass and vegetation, as well as curbs in an urban setting, all of which pose difficulties to our algorithm if not eliminated.

Fig. 1: Sensor configuration on a modified sport utility vehicle. The vehicle has 2 Ladars and a stereo pair of LWIR cameras.

C. Data Association and Tracking

We use a surface probability density model for 3D object registration [1]. The registration approach relies on explicitly modeling the object surface as a mixture of 3D Gaussians, ρ_S(x), centered at each sampled point:

    ρ_S(x) = (1/n) Σ_i N(x_i, σ_i²).    (1)

The covariances σ_i² are proportional to the sampling density, and hence to the distance from the Ladar. This models a wide variety of surfaces, including coarsely sampled natural objects such as trees. Models are registered and scored by optimizing the Bhattacharyya similarity measure, which compares two density functions and gives an absolute similarity estimate enabling the goodness of a match to be assessed. The similarity measure is also useful for resolving matching ambiguities and detecting occlusions. A discrete implementation using convolution filtering enables real-time registration without being trapped by local minima.

Most of the work in moving object detection is achieved by clustering and registration. However, there are a number of sources of clutter, as well as objects appearing and disappearing due to occlusions. These effects can lead to spurious motion estimates and hence false positives. Use of a Kalman filter tracker minimizes these effects by enforcing motion consistency.

D. Classification Features

In order to detect stationary, as well as moving, humans, we cannot rely on simple size constraints for classification. The Ladar data provide significant shape information about an object, particularly at close range. We have implemented a new feature that quantifies this shape information and eliminates false alarms due to random clutter in the data. Random clutter refers to natural things in the environment such as thin vegetation, tall grass, or small branches which pose little danger to the UGV if in its path.

1) Shape-based feature extraction

Once points have been clustered into objects using our depth map region growing technique, the original 3D points from each human-sized object are projected onto a 2D plane, see Fig. 2. The projected cluster points are then used to create a 32 x 16 binary template which is aligned with the major axis of the cluster. As a measure of how uniformly distributed the returns are across the 2D grid, we compute a feature that we refer to as the fill-factor,

    ff = 1 − (# empty bins)/(total # bins).    (2)

Empirical analysis indicates that 2D binary templates of true humans, as well as other man-made objects, will be

roughly uniformly distributed, while a template resulting from random clutter which happens to be human-sized will be oddly shaped, with points concentrated in small sections of the binary map. Random clutter clusters often meet human size constraints when several small clusters are erroneously grouped together (e.g. small clumps of grass and vegetation in close proximity to one another). The fill-factor value for these non-uniformly distributed clusters is much smaller.

Fig. 2: (left) 3D Ladar points from one frame for a walking human; (center) projection onto a 2D plane; (right) binary 32 x 16 map (ff = 0.2246).

Since the number of Ladar points associated with a cluster decreases dramatically with distance, the fill-factor value for humans at long range also decreases. Fig. 3 illustrates the returns for a human at x meters from the sensor compared to those at 2x meters. As a result of the reduced number of returns at the longer distance, the fill-factor value is likely to drop below the specified threshold. To compensate for the reduced number of returns, the fill-factor value for objects at longer distances can be calculated by accumulating returns over multiple frames. Allowing points to accumulate over even a short period of time can greatly improve the shape detail in the projected cluster, and increase the fill-factor value appropriately, see Fig. 4. In many instances, false alarms due to random clutter do not persist for more than a frame; therefore their 2D density does not accumulate and their fill-factor value remains low.

Although the fill-factor feature helps to eliminate false alarms due to random clutter, it is not able to distinguish humans from other human-sized objects such as barrels and posts. To make this distinction, we are developing a more advanced shape-based technique that involves a 2D comparison of the Ladar returns with a pre-determined ideal human template.

Fig. 3: Projection onto the yz-plane of the Ladar returns from a human at (left) x meters compared to (right) 2x meters.

Fig. 4: Improvement in shape information by accumulating Ladar points for clusters at longer range: (left) 1 frame; (center) 2 frames; (right) 5 frames.

2) Strength-of-Detection

We have developed and tested an effective strength-of-detection (SoD) value to associate with each cluster in each frame of the Ladar data. Initially, we used the fill-factor value as this measure, but this, by itself, does not reflect the increased confidence in the classification that results from repeatedly detecting and labeling a cluster as human over several frames of data. Consequently, we implemented an improved SoD measure that takes into account not only the fill-factor feature, but whether the object is of the proper size to be human, whether it is moving at a realistic human speed (or is stationary), and whether the cluster persists over time.

Confidence is defined in terms of four components: size (c_s), shape (c_f), speed (c_v), and life (c_l),

    C = c_s · c_f · c_v · c_l,    (3)

where each component ranges in value from 0 to 1.

The size component, c_s, indicates how closely the height (h), width (w), and depth (d) of each cluster match nominal human dimensions (w_max, d_max, h_min, h_max):

    c_s = (t_w + t_d + t_h)/3,    (4)

where t_w and t_d are defined as

    t_i = 1.0 if i ≤ i_max; i_max/i otherwise, for i = w or d,    (5)

and t_h is

    t_h = 1.0 if h_min ≤ h ≤ h_max; h_max/h if h > h_max; h/h_min if h < h_min.    (6)

We use a sigmoid to define shape in terms of a weighted version of the fill-factor,

    c_f = 1.0 / (1.0 + exp(−ω_f · ff · l_h)),    (7)

where ff is the fill-factor computed from the 2D binned density, l_h is the number of frames the cluster has been human-sized, and ω_f is an experimentally determined constant.
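As an illustrative sketch, the fill-factor of (2) and the shape confidence of (7) might be computed as below. The grid normalization, the ω_f value, and the omission of the major-axis alignment step are simplifying assumptions of this sketch, not details taken from the paper.

```python
import math

def fill_factor(points_2d, rows=32, cols=16):
    """Bin projected cluster points into a rows x cols binary template
    and return ff = 1 - (# empty bins)/(total # bins), as in Eq. (2)."""
    ys = [p[0] for p in points_2d]
    zs = [p[1] for p in points_2d]
    y0, y1 = min(ys), max(ys)
    z0, z1 = min(zs), max(zs)
    occupied = set()
    for y, z in points_2d:
        # Normalize into the template; the epsilon guards a degenerate extent.
        r = min(rows - 1, int((y - y0) / (y1 - y0 + 1e-9) * rows))
        c = min(cols - 1, int((z - z0) / (z1 - z0 + 1e-9) * cols))
        occupied.add((r, c))
    total = rows * cols
    return 1.0 - (total - len(occupied)) / total

def shape_confidence(ff, l_h, omega_f=2.0):
    """Sigmoid shape confidence c_f of Eq. (7); omega_f = 2.0 is a
    placeholder for the experimentally determined constant."""
    return 1.0 / (1.0 + math.exp(-omega_f * ff * l_h))
```

A densely sampled cluster fills most bins (ff near 1), while a single stray return occupies one bin of 512, so its fill-factor is near zero.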
Scaling ff by l_h increases c_f for clusters that are consistently human-sized.

Since humans are limited in the speed at which they can travel, speed is another feature contributing to our confidence measure. A cluster moving faster than a practical human pace is more likely to be part of a vehicle than a human, so, as speed increases over the maximum allowable (v_max), confidence decreases,

    c_v = 1.0 if v ≤ v_max; v_max/v otherwise.    (8)

The final component of C captures the confidence associated with tracking an object and its features over time. The longer a cluster is tracked and is human-sized, the more confident we are that we have correctly identified a human. This persistence is characterized by

    c_l = 1.0 / (1.0 + exp(−ω_l · l_h(l_h + 1)/l)),    (9)

where l is the life of the cluster, which is the total number of frames that the cluster has been tracked, l_h is the number of frames the cluster has been human-sized, and ω_l is an experimentally determined constant.

The SoD characterizes the algorithm's confidence that a cluster is human (higher values equal more confidence) and provides a means for doing a thorough receiver operating characteristic (ROC) curve analysis of algorithm performance.

E. ROC Analysis

We use a baseline set of Ladar data for algorithm analysis that consists of 18 scenarios, each of which is 60-90 seconds in duration. In these scenarios, eight humans move along straight line tracks at various orientations to the sensor vehicle. Each particular human always traverses the same basic track. GPS ground truth for each human track, as well as for the sensor vehicle, was recorded at approximately 0.1 second intervals. In addition to the moving humans, four mannequins were used to represent stationary humans, resulting in a total of twelve possible human targets. Factors that are varied include the speed of the humans (1.5 or 3.0 meters per second) and the speed of the sensor vehicle (15 or 30 kilometers per hour). As an added challenge, most of the scenarios contain other moving vehicles which occlude the humans from the sensor at certain times.

We define two metrics to characterize the overall performance of our algorithm: the probability of detecting a true track and the number of false tracks generated per second. A track is defined as any set of M points that are labeled with the same ID by the algorithm, whether stationary or moving.
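Gathering the components of (3)-(9), a minimal strength-of-detection computation for scoring each tracked cluster might look like the following. The dimension limits, v_max, and the ω constants here are placeholder values assumed for the sketch, not the paper's tuned numbers.

```python
import math

def size_conf(w, d, h, w_max=1.0, d_max=1.0, h_min=1.0, h_max=2.2):
    """c_s of Eqs. (4)-(6); the nominal human dimensions are placeholders."""
    def t_wd(v, v_lim):                      # Eq. (5): penalize oversize only
        return 1.0 if v <= v_lim else v_lim / v
    if h_min <= h <= h_max:                  # Eq. (6): two-sided height term
        t_h = 1.0
    elif h > h_max:
        t_h = h_max / h
    else:
        t_h = h / h_min
    return (t_wd(w, w_max) + t_wd(d, d_max) + t_h) / 3.0

def speed_conf(v, v_max=3.5):
    """c_v of Eq. (8): full confidence up to v_max, decaying above it."""
    return 1.0 if v <= v_max else v_max / v

def life_conf(l, l_h, omega_l=0.5):
    """c_l of Eq. (9): persistence of a human-sized, tracked cluster."""
    return 1.0 / (1.0 + math.exp(-omega_l * l_h * (l_h + 1) / l))

def strength_of_detection(c_s, c_f, c_v, c_l):
    """C = c_s * c_f * c_v * c_l, Eq. (3); each factor lies in (0, 1]."""
    return c_s * c_f * c_v * c_l
```

Because the combination in (3) is multiplicative, any single weak component (wrong size, implausible speed, short life) pulls the overall SoD down.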
As a result, the maximum number of tracks that can be detected for any given scenario is twelve. We typically set the number of points, M, equal to one. Using the SoD defined in (3) through (9), a threshold, C_T, is defined such that each object with a SoD greater than C_T is classified as human. ROC curves, which plot the probability of detecting true tracks versus the number of false tracks detected per second, are generated for different scenario conditions by varying C_T. Fig. 5 shows the overall performance of our algorithm on the baseline data. Separate curves for the moving and stationary targets reveal the high level of performance that the algorithm achieves in both cases. On average, the algorithm achieves 99% detection with less than one false alarm per second.

Fig. 5: Average performance results for a set of 18 Ladar files.

III. LWIR ALGORITHM

A. Cluster Extraction and Classification

Our LWIR algorithm generates human regions-of-interest (ROIs) using a two-stage process that first extracts clusters and then classifies the clusters based on a small number of geometrical features. We use a combination of statistical and edge features along with morphological techniques for cluster extraction. Global and local normalized intensity deviation images are computed. Normalized intensity deviation is defined as n_ij = (x_ij − m)/σ, where x_ij is pixel intensity and m and σ are the mean and standard deviation, which have been computed either globally or within a small window around the pixel. For local processing, an integral image implementation has been used so that varying the window size does not impact the computational load of the algorithm.

Edge information is obtained using the gradient. Using empirically determined thresholds, binary images are created from the global and local deviation images, as well as the edge map. Future work is aimed at using statistical processing to automate the choice of the thresholds in each case.
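The local normalized-deviation image built on integral images can be sketched as follows; the window half-size and the ε guard against zero variance are illustrative choices, not values from the paper.

```python
def integral(img):
    """Summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1][c0:c1] in O(1) from the summed-area table."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]

def local_norm_deviation(img, half=1, eps=1e-9):
    """n_ij = (x_ij - m)/sigma, with m and sigma taken over a window
    around each pixel; with integral images the cost per pixel does
    not depend on the window size."""
    h, w = len(img), len(img[0])
    ii = integral(img)
    ii2 = integral([[v * v for v in row] for row in img])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            r0, r1 = max(0, r - half), min(h, r + half + 1)
            c0, c1 = max(0, c - half), min(w, c + half + 1)
            n = (r1 - r0) * (c1 - c0)
            m = box_sum(ii, r0, c0, r1, c1) / n
            var = box_sum(ii2, r0, c0, r1, c1) / n - m * m
            out[r][c] = (img[r][c] - m) / (max(var, 0.0) ** 0.5 + eps)
    return out
```

A pixel much hotter than its neighborhood yields a large positive deviation, and a uniform region yields values near zero, which is what the subsequent thresholding exploits.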
Morphological dilation and cleaning are used prior to nearest neighbor clustering.

The second stage of the algorithm computes simple geometrical features for each cluster, which are used to retain only those clusters that are human-like in nature. The two features are an axis ratio and an edge ratio. The first is the ratio of the major axis of the cluster to the minor axis of the cluster, and the second is the ratio of the number of perimeter pixels in the cluster to the number of edge pixels. We define two thresholds, E_T and A_T, such that all clusters with an axis ratio less than A_T and an edge ratio greater than E_T are classified as human. The features and thresholds have been chosen so as not to eliminate the detection of prone and other non-ideally postured humans.
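The second-stage gate on axis ratio and edge ratio could be expressed as below; the threshold values A_T and E_T used here are placeholders, since the paper does not report its tuned values.

```python
def classify_cluster(major_axis, minor_axis, perimeter_px, edge_px,
                     a_t=4.0, e_t=0.5):
    """Retain a cluster as human-like when its axis ratio is below A_T
    and its edge ratio (perimeter pixels / edge pixels) is above E_T.
    A_T = 4.0 and E_T = 0.5 are placeholder thresholds."""
    axis_ratio = major_axis / max(minor_axis, 1e-9)
    edge_ratio = perimeter_px / max(edge_px, 1)
    return axis_ratio < a_t and edge_ratio > e_t
```

Keeping A_T loose deliberately admits elongated clusters, so that prone or stretched-out humans are not rejected at this stage.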

B. Processing Results

To establish a baseline performance for the algorithm, we first focused on analyzing data sets that are ideal in the sense that the humans tend to be hotter than their surroundings and there is little thermal clutter in the scenes. Fig. 6 shows the results of the algorithm on a series of frames from one of the baseline data sets. Under these ideal conditions, the algorithm provides a high detection rate in conjunction with few false alarms. The algorithm detects upright humans and humans in non-ideal positions (squatting, crouched), as well as humans visibly occluded by vegetation and bushes. In the latter case, the person would not be detectable in EO imagery.

Fig. 6: The use of LWIR data enables the detection of humans in a variety of positions (upright, squatting, visibly occluded).

A challenge to our LWIR algorithm performance is the variation in resolution of the data with range. It can actually be more challenging to detect a person at close range than at far range. The lower resolution at far range leads to a homogeneous thermal signature for the entire subject, whereas, at close range, the better resolution reveals more variation in a person's thermal signature as a result of things such as clothing and hair, see Fig. 7. At close range, a person often gets separated into smaller features such as the face, arms, hands and legs. In the absence of additional information, it can be challenging to put these pieces back together. However, by incorporating knowledge about the range of the object from the sensor, merging of object pieces can be achieved.

Fig. 7: Variation in thermal resolution with range. At close range, better resolution results in over segmentation, whereas the lower resolution at long range leads to a homogeneous signature across the body.

C. Thermal Thresholding Analysis

Like many IR algorithms, our initial algorithm relied on the human having an emissivity greater than that of its surroundings, which may be an acceptable assumption at night, in the early morning, or in cooler climates. However, in general this is not the case, and the algorithm needs to be made robust to various environmental conditions. We are evaluating the ability to thermally threshold the data by leveraging the fact that the sensor is thermally calibrated. However, preliminary analysis has shown that humans generate a wide range of thermal emissions across their body due to variation in clothing material as well as thermal reflections from the surrounding environment. Fig. 8 highlights that a simple windowing threshold based on the minimum and maximum intensity of the human eliminates only a small portion of the scene. In this case, all pixels with intensities higher than the maximum human intensity are shown in white and those with intensities less than the minimum human intensity are shown in black. Only the road and sky are eliminated from the scene. In addition, one can see that the legs of the human appear much hotter than the rest of the body due to thermal reflections from the road (the road is approximately 120 degrees). We are in the process of developing a more advanced technique which uses statistics of the thermal variation to dynamically threshold the scene and reduce the amount of clutter that remains.

IV. APPROACH TO FUSION

Analysis of our Ladar and LWIR algorithms has revealed the strengths and weaknesses of each approach. Ladar sensors provide a forum for fast and reliable detection and tracking of objects that are well separated in angle and depth, but the data are limited in range and not appropriate for more difficult scenarios (e.g. prone humans). In contrast, LWIR imagery is sensitive to much longer ranges and has the ability to discriminate objects in close proximity as a consequence of emissivity variations.
However, due to better resolution at close range, this same emissivity feature leads to over-segmentation of many objects. Robust and reliable human detection and tracking can be achieved by merging these complementary modalities. We have developed a multi-level approach to fusion that combines information from the Ladar and LWIR sensors at both the feature and detection levels. Fig. 9 shows an overview of our approach. We are currently addressing the first stage of feature-level fusion, in which we incorporate range information with the LWIR imagery.

Fig. 8: Thermal emission variation across the human body. Regions hotter than the maximum human intensity are shown in white, cooler in black, and regions which fall between the minimum and maximum are in grayscale with dark gray being the hottest.

Fig. 9: Approach to fusion includes merging information at both the feature and detection levels.

The most common way of associating depth information with each cluster is through stereo techniques [10], [11]. However, dense stereo methods tend to be computationally intense and may require special processing boards in order to achieve the real-time rates that are required for this task. As a result, we take advantage of registered Ladar and LWIR cameras to use depth information from the Ladar to reduce the computational load on the stereo by eliminating all close range regions in the LWIR imagery from the stereo processing. Stereo techniques provide the long range mapping.

Any approach to fusion is dependent on the involved sensors being accurately registered, both intrinsically and extrinsically. Under the sponsorship of the Robotics Consortium, a method for achieving the desired registration has been developed [13]. Using the results of this newly developed registration process, we will be moving forward with the fusion approach which we have designed.

V. CONCLUSION

We have developed an algorithm for 3D Ladar data that consists of object detection, data association, classification, and tracking. In-depth analysis of the algorithm using a set of ground truthed data shows that the algorithm provides good performance on humans in more ideal positions that are spatially distinct from other objects. A second algorithm developed for LWIR imagery complements our Ladar approach by enabling the detection of individuals at much longer range and in more difficult positions. Future efforts will be focused on finalizing the fusion of the two methods.

ACKNOWLEDGMENT

Susan M. Thornton thanks the U.S. AMRMC OHRP and Dr. Sue Hill at ARL for their guidance in ascertaining approval to include human subjects in data collections. In addition, the authors thank NIST for the use of their ground truth equipment and the expertise of the personnel running it. As a result, we have a baseline set of fully ground truthed Ladar, LWIR, and visible wavelength video for algorithm assessment.

REFERENCES

[1] D. D. Morris, B. Colonna, and P. Haley, "Ladar-Based Mover Detection from Moving Vehicles," in Proceedings of the 25th Army Science Conference, Nov. 27-30, 2006.
[2] L. Navarro-Serment, C. Mertz, and M. Hebert, "Predictive Mover Detection and Tracking in Cluttered Environments," in Proceedings of the 25th Army Science Conference, Nov. 27-30, 2006.
[3] L. Brown, "View Independent Vehicle/Person Classification," in ACM International Workshop on Video Surveillance and Sensor Networks 2004, October 15, 2004, New York, NY.
[4] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886-893, June 20-25, 2005.
[5] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, "Pedestrian Detection using Infrared Images and Histograms of Oriented Gradients," in Intelligent Vehicles Symposium 2006, June 13-15, 2006, Tokyo, Japan.
[6] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "A Trainable System for People Detection," in International Journal of Computer Vision, Vol. 38, No. 1, pp. 15-33, June 2000.
[7] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian Detection in Crowded Scenes," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 878-885, June 20-25, 2005.
[8] D. M. Gavrila and S. Munder, "Multi-cue Pedestrian Detection and Tracking from a Moving Vehicle," in International Journal of Computer Vision, Vol. 73, No. 1, pp. 41-59, June 2007.
[9] J. W. Davis and M. A. Keck, "A Two-Stage Approach to Person Detection in Thermal Imagery," in Workshop on Applications of Computer Vision, Breckenridge, CO, January 5-7, 2005.
[10] M. Bertozzi, E. Binelli, A. Broggi, and M. Del Rose, "Stereo Vision-based Approaches for Pedestrian Detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 3, June 20-26, 2005.
[11] A. Talukder and L. Matthies, "Real-time Detection of Moving Objects from Moving Vehicles using Dense Stereo and Optical Flow," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 4, pp. 3718-3725, Sept. 28 - Oct. 2, 2004.
[12] G. Monteiro, C. Premebida, P. Peixoto, and U. Nunes, "Tracking and Classification of Dynamic Obstacles Using Laser Range Finder and Vision," in IEEE IROS Workshop on Safe Navigation in Open Environments, October 10, 2006, Beijing, China.
[13] G. W. Sherwin, P. Haley, and M. Hoffelder, "Multi-Sensor Calibration and Registration in Support of Sensor Fusion for Human Detection," AUVSI, June 2008, submitted for publication.
[14] J. Zeng, A. Sayedelah, M. Chouikha, T. Gilmore, and P. Frazier, "Infrared Detection of Humans in Stretching Poses Using Heat Flow," submitted to ICASSP 2008.
