
International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 3 Issue 11, November-2014 (IJERTV3IS110458)

A Survey on Object Detection, Classification and Tracking Methods

Kirubaraj Ragland, Post Graduate Scholar, Department of ECE, Christian College of Engineering and Technology, Oddanchatram, India.
P. Tharcis, Assistant Professor, Department of ECE, Christian College of Engineering and Technology, Oddanchatram, India.

(This work is licensed under a Creative Commons Attribution 4.0 International License.)

Abstract— Video tracking is one of the fields of recent development in computer vision. A great deal of research is going on in this area, and new algorithms are continually being proposed to detect and track objects in video. This field of study has experienced sudden growth, especially after robotics started gaining importance. The objective of this paper is to present the different steps involved in tracking objects in a video sequence, namely object detection, object classification and object tracking. We survey in detail the different methods available for detecting, classifying and tracking objects, and discuss the pros and cons of each. Object detection methods are frame differencing, optical flow and background subtraction. Objects may then be classified based on shape, motion, colour and texture. Tracking methods involve point based tracking, kernel based tracking and silhouette based tracking.

Keywords— object detection, object classification, object tracking, video processing.

I. INTRODUCTION
Tracking objects in a video sequence is an area of constant development, with applications ranging from surveillance monitoring systems to wildlife monitoring and tracking without human intervention. This paper explains the different steps involved in tracking objects, be they humans, wildlife or cars, in a video, and the different methods available to perform these steps. Different algorithms are then studied and compared: we take nine algorithms for our study and analyze the different methods used in each. The remainder of this paper is arranged as follows. Section II covers an extensive review of literature. Section III gives the basic steps involved in tracking an object. Sections IV, V and VI elaborate on the methods for object detection, classification and tracking respectively. Finally, we conclude in Section VII.

II. REVIEW OF LITERATURE
A number of algorithms have been developed over time, and the different methods used in each of them are outlined below.

Dong Kwon Park et al. (2000) present a semi-automatic object tracking algorithm in [1]. The two steps involved are intra-frame object extraction and inter-frame object tracking. Human intervention is decreased by using homogeneous region segmentation in intra-frame object extraction, while the processing time of inter-frame object tracking is reduced by the use of 1-D projected motion estimation. An improved flooding method for the conventional watershed algorithm is also proposed.

Yining Deng and B.S. Manjunath (2001) propose a method for unsupervised segmentation of both images and video in [2]. The algorithm, called JSEG, works on colour-texture regions in images as well as video. The two steps in the proposed algorithm are colour quantization and spatial segmentation. In the first step, colours in the image are quantized to several representative classes, which are used to differentiate regions in the image. The pixels are then replaced by their corresponding colour class labels, leaving a class map of the image. Applying the proposed criterion for "good" segmentation to local windows of the class map yields the 'J-image', where high values correspond to possible boundaries of colour-texture regions and low values to their interiors. A region growing method is then employed to segment the image based on multi-scale J-images. For video, an additional region tracking scheme is used along with the above process to get consistent results even with non-rigid object motion. The limitation of this method is that when a smooth transition of colour occurs (e.g., from red to orange), the algorithm oversegments each of the colours. However, if this problem is overcome by checking for smooth transitions, a new problem arises, because a smooth transition may not indicate one homogeneous region. In the case of video, an error generated in one frame is carried over to the subsequent frames.

Yaakov Tsaig and Amir Averbuch (2002) present an algorithm for automatic segmentation of moving objects in MPEG-4 videos in [3]. MPEG-4 relies on decomposition of each frame of an image sequence into video object planes (VOPs), each corresponding to a single moving object in the scene. The basic process is to classify regions as foreground or background based on motion information. The segmentation problem is formulated as detection of moving objects over a static background; camera motion is compensated by an eight-parameter perspective motion model. An initial spatial partition is obtained by means of the watershed algorithm, with Canny's gradient used to estimate the spatial gradient in the colour space, followed by the optimized rainfall watershed algorithm. Based on the initial partitioning, regions are classified as either foreground or background.
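The colour quantization step of JSEG [2] can be illustrated with a minimal sketch. This is not the authors' implementation: the k-means quantizer, the fixed number of classes and the deterministic initialisation below are all simplifying assumptions.

```python
import numpy as np

def quantize_colours(image, n_classes=2, n_iter=10):
    """Quantize an RGB image to a few representative colour classes
    (a simplified stand-in for the colour quantization step of JSEG)
    and return the resulting class map plus the class colours."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)
    # Deterministic init: evenly spaced pixels as initial centres.
    centres = pixels[np.linspace(0, len(pixels) - 1, n_classes).astype(int)]
    for _ in range(n_iter):
        # Assign each pixel to its nearest colour centre.
        dists = np.linalg.norm(pixels[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the mean colour of its class.
        for k in range(n_classes):
            if np.any(labels == k):
                centres[k] = pixels[labels == k].mean(axis=0)
    # The class map replaces every pixel by its colour class label.
    return labels.reshape(h, w), centres
```

The resulting class map is what JSEG then scans with local windows to produce the J-image.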

The motion of each foreground region is estimated by region matching in a hierarchical framework. The estimated motion vectors are incorporated into a Markov random field (MRF) model, which is optimized using highest confidence first (HCF), leading to an initial classification of the regions; the MRF includes information from the previous frame. The final step includes a dynamic memory to ensure temporal coherency of the segmentation process.

R. Venkatesh Babu et al. (2004) present an approach for automatically estimating the number of objects and extracting independently moving video objects from MPEG-4 videos using motion vectors in [4]. Since motion vectors are sparse (one motion vector per macro-block) in compressed MPEG videos, a method to enrich the motion information from a few frames on either side of the current frame is proposed. Interpolation is performed using a median filter so that a motion vector is assigned to each pixel in the frame; then segmentation is done. Since sufficient data is not available to estimate motion parameters, the Expectation Maximization (EM) algorithm is used, and an algorithm for estimating the number of motion models is proposed. Once initially segmented, video object planes (VOPs) are generated by tracking. Finally, the VOs are subjected to an edge refinement phase, where the pixels at the edges are assigned to the correct VO.

The aim of Vasileios Mezaris et al. (2004) is to segment a video sequence into objects in [5]. There are three stages in this algorithm: initial segmentation of the first frame using colour, motion and position information, temporal tracking, and finally a region-merging procedure. The segmentation step uses the K-means-with-connectivity-constraint algorithm. Tracking is done by means of a Bayes classifier, and rule-based processing reassigns changed pixels to existing regions and handles new regions introduced in the sequence. Region merging is done with a trajectory-based approach rather than motion at the frame level. One advantage is that it can effectively track new objects appearing in the scene or fast moving objects.

Weiming Hu et al. (2012) propose an incremental log-Euclidean Riemannian subspace learning algorithm in [6]. The covariance matrices of image features are first mapped into a vector space using the log-Euclidean Riemannian metric. Both global and local spatial layout information are captured by a log-Euclidean block-division appearance model. Bayesian state inference based on particle filtering is used for single-object as well as multi-object tracking with occlusion reasoning. Changes in object appearance are captured by incrementally updating the log-Euclidean block-division appearance model.

In [7], Chang Huang et al. (2013) propose a method whose input is frame-by-frame target detection results. A first set of target tracklets (tracking fragments) is generated by a conservative dual-threshold strategy, so that only reliable detection responses are linked; doubtful associations are postponed until more evidence is collected. Hierarchical association then uses multiple passes to solve a Maximum A Posteriori (MAP) problem, which hypothesises a trajectory as being a false alarm besides initializing, tracking and terminating it; the Hungarian algorithm is used to solve this. Ranking these associations is then seen as a bag ranking problem, which is overcome by a bag-ranking boosting algorithm. Finally, the paper introduces a soft max/min to facilitate the optimization of the relaxed objective loss function.

Rana Farah et al. (2013) propose a robust tracking method to extract a rodent from a frame under uncontrolled normal laboratory conditions in [8]. It works in two steps: first, three weak features are combined to roughly track the target; then, the boundaries of the tracker are adjusted to extract the rodent. The newly introduced techniques include Overlapped Histograms of Intensity (OHI) and a new segmentation method which uses an online edge background subtraction and edglet-based constructed pulses. Edglets are discontinuous pieces of edges. A sliding window technique is used to coarsely localise the target.

Shao-Yi Chien et al. (2013) make two major contributions in [9]. First, a threshold decision algorithm for video object segmentation with a multi-background model is proposed. Then, a video object tracking framework is built on a particle filter whose likelihood function combines a diffusion distance measuring colour histogram similarity with a motion clue from the video object segmentation. This framework can handle drastic changes in illumination and background clutter, and it can also track non-rigid moving objects. The threshold decision algorithm determines an appropriate optimal threshold value for segmentation. A colour-based histogram is included for better tracking of non-rigid objects; a 1-D colour histogram is used instead of a 3-D colour histogram to reduce computational complexity.

III. BASIC STEPS IN OBJECT TRACKING
This part of the paper expounds on the different steps involved in tracking an object or multiple objects in a video sequence. The tracking process is preceded by two steps which play a vital role in improving the accuracy of the tracking, namely object detection and object classification. Though the difference between the three is subtle and may be missed at first glance, it is crucial to understand it, because all three are different and each is an area of study by itself. The simple flow diagram is shown in Fig. 1.
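The detect, classify, track pipeline outlined above can be sketched in code. Every function body here is an illustrative stand-in (frame differencing, a size-based label, centroid tracking) for whichever concrete methods from Sections IV to VI are actually chosen; none of them is prescribed by the paper.

```python
import numpy as np

def detect_objects(prev_frame, frame, threshold=25):
    """Detect moving pixels by frame differencing (one of the
    detection methods of Section IV)."""
    return np.abs(frame.astype(int) - prev_frame.astype(int)) > threshold

def classify_object(mask):
    """Placeholder classifier: label by blob size, standing in for the
    shape/motion/colour/texture classifiers of Section V."""
    return "large" if mask.sum() > 50 else "small"

def track_centroid(mask):
    """Point tracking in its simplest form: the centroid (x, y) of the
    detected pixels in each frame."""
    ys, xs = np.nonzero(mask)
    return (xs.mean(), ys.mean()) if len(xs) else None

def pipeline(frames):
    """Run detection, classification and tracking over a frame list."""
    trajectory = []
    for prev, cur in zip(frames, frames[1:]):
        mask = detect_objects(prev, cur)
        if mask.any():
            trajectory.append((classify_object(mask), track_centroid(mask)))
    return trajectory
```

Note that frame differencing flags both the old and the new position of a moving object, which is exactly the incomplete-outline weakness discussed in Section IV.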
The first step in the process is to detect what objects are present in the video frame; these objects are then classified depending on what we want to track; finally, the actual tracking takes place. The three steps are defined below, and the different techniques used are listed under each type.

Fig. 1. Flow diagram of basic steps in object tracking.

A. Object Detection
Object detection is a computer technology that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos [18]. Object detection may be done using basic techniques such as frame differencing, optical flow and background subtraction.
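Of these three techniques, optical flow is the most involved. A single-point Lucas-Kanade estimate, one classical method (chosen here as an assumption, not prescribed by this paper), can be sketched as follows; real implementations add image pyramids and iterative refinement.

```python
import numpy as np

def lucas_kanade_point(f1, f2, y, x, win=7):
    """Estimate the optical flow (u, v) at one point by solving the
    Lucas-Kanade least-squares system over a small window around it."""
    f1 = f1.astype(float)
    f2 = f2.astype(float)
    Iy, Ix = np.gradient(f1)   # spatial gradients of the first frame
    It = f2 - f1               # temporal gradient between the frames
    h = win // 2
    sl = (slice(y - h, y + h + 1), slice(x - h, x + h + 1))
    # Brightness constancy per pixel: Ix*u + Iy*v = -It,
    # solved in the least-squares sense over the window.
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

The window size trades off the aperture problem against the assumption that motion is constant within the window.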

B. Object Classification
Object classification is the process by which the objects detected in the frame are identified as objects of interest; it is essentially identifying what object it is. Classification may be done based on different parameters such as shape, motion, colour and texture. So, based on the parameter used, we can perform shape-based, motion-based, colour-based or texture-based classification.

C. Object Tracking
Object tracking is a method of following an object through successive image frames to determine how it is moving relative to other objects. This is most commonly done by measuring the position of the centroid of the object in (x, y) in successive frames [19]. Object tracking may be classified as point-based tracking, kernel-based tracking or silhouette-based tracking.

IV. METHODS OF OBJECT DETECTION
Object detection is the first step in tracking an object in the video. In detection, "objects" are groups of pixels clustered together; the basic idea is to identify these pixel clusters as objects which are moving not only in the x and y directions but also in time. Each of the methods of object detection is examined in detail in the following sections.

Fig. 2. Methods of object detection.

A. Frame Differencing
Frame differencing is basically used to detect moving objects against a static background. Because objects move with time, the position of an object in one frame differs from its position in the consecutive frame. By finding the difference between the two frames, we can get the position of the object in the frame. The computational complexity is very low, but a complete outline of the moving object is difficult to obtain, which also leads to lower accuracy.

B. Optical Flow
Optical flow (or optic flow) is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene [18]. Optical flow calculates a velocity for points within the images, and provides an estimation of where those points could be in the next image of the sequence [20]. Optic flow is a vast area of study, and [21] provides a summary of the different methods available for the estimation of optical flow. Though this method can achieve better accuracy, it is computationally very costly and its ability to deal with noise is limited.

C. Background Subtraction
A video sequence may be separated into background and foreground. The foreground usually consists of the objects of interest, whereas the background data is not important for tracking. If we remove the background data from the video frame, we are left with just the necessary data in the foreground, which contains the object of interest.

Better accuracy is obtained if the background is already known. For example, with stationary surveillance cameras, as in road traffic monitoring, the background is always constant: the road remains in the same position with respect to the camera. This gives us the advantage of having the background already "modelled" for us. If this is not the case, then background modelling has to be performed before background subtraction. The objective of background modelling is to generate a reference model; the video sequence is compared with the reference model, and the object is detected by computing the variation between the two. Fig. 3 shows how a background is modelled. There are two types of algorithms for background subtraction: non-recursive and recursive.

Fig. 3. Background modelling. (a): Input video frame, (b): Background model, (c): Object detected after background subtraction.

1) Non-recursive algorithm: A non-recursive technique uses a sliding-window approach for background estimation. A select number of previous video frames are stored in a buffer, and the background image is estimated based on the temporal variation of each pixel within the buffer. Since only a select number of frames are stored in the buffer, errors caused by frames outside the buffer limit are not taken into consideration. Storage requirements for non-recursive techniques may be very large because of the large buffer; this problem is overcome by storing the video at lower frame rates. Some commonly used non-recursive techniques include frame differencing, median filtering, linear predictive filtering and non-parametric modelling [10].

2) Recursive algorithm: No buffer is used in a recursive technique. A single background model is updated based on each input frame. This means that even frames from the distant past could cause an error in the current model, and an error, once introduced, can linger for a long time. On the other hand, this reduces the storage space, as no memory is needed to buffer the data. Some recursive techniques include approximated median filtering, Kalman filtering and Mixture of Gaussians (MoG).
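Approximated median filtering is perhaps the simplest of the recursive techniques just listed: the background estimate is nudged one step towards each new frame, so it converges to the running median of each pixel without any frame buffer. A sketch, with an illustrative subtraction threshold:

```python
import numpy as np

def update_background(background, frame):
    """Approximated median filtering: move each background pixel one
    intensity step towards the current frame. Over time the model
    converges to the per-pixel running median, with no buffer needed."""
    background = background + (frame > background).astype(int)
    background = background - (frame < background).astype(int)
    return background

def subtract_background(background, frame, threshold=20):
    """Foreground mask: pixels differing from the model by more than
    a threshold (threshold value is an illustrative assumption)."""
    return np.abs(frame.astype(int) - background.astype(int)) > threshold
```

Because the update is bounded to one step per frame, a sudden outlier corrupts the model only slowly, but genuine scene changes are also absorbed slowly, which is the lingering-error behaviour described above.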

V. METHODS OF OBJECT CLASSIFICATION
After the objects have been detected in a video sequence, the next step is to identify these objects and classify them according to our requirement. Classification is done based on the parameter we select. Depending on this parameter, the methods are defined as follows: shape-based, motion-based, colour-based and texture-based classification. Each of the methods is explained below.

Fig. 4. Methods of object classification.

A. Shape-based classification
Shape-based classification is done through shape analysis: the automatic analysis of geometric shapes by a computer to detect similarly shaped objects by comparing against entries in a database. Mostly a boundary-based representation is used, although volume-based or point-based representations of shapes are also possible. The simplified representation is called a shape descriptor; a complete shape descriptor consists of all the information required to reconstruct the shape. Shape descriptors may be invariant with respect to congruency or isometry (intrinsic shape descriptors); graph-based descriptors are another class [18].

Shapes may also be classified based on part structure, but capturing part structure is not a trivial task considering the non-linearity of shapes. Part structure capturing can basically be classified into three categories. The first builds part models from sample images, which requires some prior knowledge of the part. The other two categories capture part structures from only one image: in the second category, individual parts are compared with each other and a similarity measure is obtained; in the third category, the part structure is captured considering the interior of shape boundaries [16].

B. Motion-based classification
Motion-based classification works on the periodicity of motion: a system can be made to learn how an object moves and then classify it better. Motion-based classification has to be addressed for both rigid and non-rigid objects. Though it is easier to track objects which are rigid and show periodicity in motion, a limited amount of periodicity has been known to exist in non-rigid objects as well. Optical flow is also used for motion-based classification [10].

C. Colour-based classification
The colour-based approach is based on studying the colour features in an image. The two main colour features used to classify based on colour are the spectral power distribution of the illuminant and the object's surface reflectance property. Colour information is usually represented in the most commonly used RGB colour space. The problem with RGB is that it is not a perceptually uniform colour space; therefore, we have to consider the use of other colour spaces such as L*a*b* and L*u*v*, which are perceptually uniform, and HSV (Hue, Saturation, Value), which is a relatively uniform colour space. In all the above mentioned spaces, features to define an object are not efficient; therefore, in recent years, colour descriptors have been classified into histogram-based colour descriptors and SIFT (Scale Invariant Feature Transform) based colour descriptors [12].

D. Texture-based classification
Texture is an innate property of virtually all surfaces: the grain of wood, the weave of fabric, the pattern of crops in fields, etc. It contains important information about the structural arrangement of surfaces and their relationship to the surrounding environment. Since the textural properties of images appear to carry useful information, texture features have long been calculated for discriminating purposes [17]. Although it is quite easy for a human observer to recognize and describe texture in empirical terms, it has proved extremely resistant to precise definition and analysis by computer. Texture is represented by means of texture descriptors, which observe region homogeneity and histograms of region borders. Different texture descriptors include the homogeneous texture descriptor (HTD), texture browsing descriptor (TBD) and edge histogram descriptor (EHD) [18].

VI. METHODS OF OBJECT TRACKING
Having detected the objects and classified them, the next step is the actual tracking process. According to [21], tracking can be defined as the problem of estimating the trajectory of an object in the image plane as it moves around a scene. There are three methods of object tracking, discussed in detail below: point tracking, kernel tracking and silhouette tracking [18].

A. Point Tracking
In the point tracking approach, objects are represented as points and are generally tracked across frames by evolving their state (object position and motion). Point tracking may use Kalman filtering, particle filtering or Multiple Hypothesis Tracking (MHT).

1) Kalman Filtering: Kalman filtering is an algorithm that uses a series of measurements observed over time, containing noise (random variations) and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those based on a single measurement alone. More formally, the Kalman filter operates recursively on streams of noisy input data to produce a statistically optimal estimate of the underlying system state [14]. There are two steps in the algorithm: the prediction step produces estimates of the current state variables along with their uncertainties; then the outcome of the next measurement is observed, and the estimates are updated.

The estimates are updated using a weighted average, with more weight given to estimates with higher certainty. Since it is a recursive algorithm, only the present value, the previous value and the uncertainty matrix are needed to calculate in real time.

2) Particle Filtering: Particle filters are a set of on-line posterior density estimation algorithms that estimate the posterior density of the state-space by directly implementing the Bayesian recursion equations. The posterior density is represented by a set of particles. No assumptions about the dynamics of the state-space or the density function are made, but they provide a well-established method for generating samples from the required distribution. Each particle has a weight assigned to it, representing the probability of that particle being sampled from the probability density function. Resampling is done to avoid weight disparity, which leads to weight collapse. When the state variables are not distributed normally (Gaussian), the Kalman filter provides a poor approximation; particle filters are used to overcome this problem. This algorithm uses contours, colour features or texture mapping [18].

3) Multiple Hypothesis Tracking (MHT): The MHT algorithm begins with the set of hypotheses from the previous iteration, also called the parent hypothesis set, and the set of measurements from the beginning until that iteration. Each hypothesis represents a different set of assignments of the measurements to the different tracks. Taking into account the new set of measurements and one of the previous hypotheses, a new hypothesis is generated, making a specific assignment of the current measurements. The set of plausible assignments that can be made for a parent hypothesis is named the ambiguity matrix (sometimes also called the hypothesis matrix). Each element of the matrix, aij, can take a value of 1 or 0, representing whether or not measurement i is associated to a previous track, a new track, is considered noise, etc. Associated to the ambiguity matrix, a cost matrix must also be defined; each of its elements, cij, represents the probability that measurement i originated from track j. MHT is capable of tracking multiple targets, handling occlusions and calculating optimal solutions [15].

Fig. 5. Methods of object tracking.

B. Kernel Based Tracking
In the kernel based approach, an object is tracked by computing the motion of a rectangular or elliptical kernel in consecutive frames. Motion may include translation, rotation and affine transformations. The problem with this method is that a part of the object may be left outside the kernel, while a part of the background, which is not needed, may be included in the kernel. Tracking may be based on the geometric shape of the object, object features or appearance. Different kernel based tracking approaches include simple template matching, the mean shift method, support vector machines (SVM) and layering-based techniques. These are explained below.

1) Simple Template Matching: Template matching is a technique in digital image processing for finding small parts of an image or video that match a template image. Since there is no specific criterion of right or wrong, it is termed a brute force method of examining regions of interest in a video. The matching procedure calculates a numerical index for how well the image in the frame matches the image in the template.

2) Mean Shift Method: The mean shift algorithm is a non-parametric feature space analysis technique for locating the maxima of a density function [18]. It is an iterative algorithm based on predicting future values from past values. One of the simplest forms is to calculate the confidence map of the new image based on the colour histogram of the previous image.

International Journal of Engineering Research & Technology (IJERT)ISSN: 2278-0181Vol. 3 Issue 11, November-2014the contour of the next frame. Only if this requirement issatisfied, proper tracking results can be achieved.There are two ways to perform contour tracking. In thefirst approach, the contour shape and motion are modeledusing state space models. The second method is more directand direct minimization techniques such as gradient descentfor minimizing the contour energy, thereby evolving thecontour. The most significant advantage is that this can beused to track objects of irregular shapes.2) Shape matching: Shape matching is similar to theshape-based classification technique. But, instead ofclassifying the object, we use shapes to track the particularshape. It is also similar to template matching, because theshapes are stored in a database and then the shape on theframe is compared with the shape in the database and thustracking is done. Silhouettes from two successive frames canalso be matched to obtain the required results. A single objectcan be tracked and occlusions are handled by means of usingHough transform [11].VII. CONCLUSIONWe have surveyed the various steps involved in trackingan object from a video sequence and also the differentmethods to perform object detection, object classification andobject tracking were discussed. The advantages anddisadvantages of each of these methods were discussed. Inthe future, we plan to propose an algorithm that overcomesthe disadvantages of the existing object detection methodsand track objects in a video with capability to handle multipleobjects and occlusions.IJERTof the previous images. The peak of the confidence map willoccur at the next predicted position of the object. Thus, theconfidence map is a probability density function and the peakof the function may be calculated using mean shift method[15].The region of interest (ROI) is selected by a bounding boxfrom the first frame of the video. 
The probability densityfunction is calculated based on colour information. Then, thepresent probability density function is compared with theprobability density functions of the consecutive frames.Degree of similarity between the frames is represented byusing Battacharya coeffecient. The same procedure will befollowed till the last frame of the video is reached [11].This method has a lot of drawbacks. Only one object canbe tracked. The ROI has to be initialized manually. It cannottrack an object if the object is moving with high speed withinthe frame.3) Support Vector Machine (SVM): Support VectorMachine uses algorithms which are supervised learningmodels and their associated learning algorithms whichanalyze data and recognize patterns based on previouslylearnt data. Support vector machines are used mostly forclassification and regression [18].The basic idea behind a SVM algorithm is that it classifiesthe points into two sets of hyperplanes
