Fast and accurate object detection in high resolution 4K and 8K video using GPUs

Vít Růžička and Franz Franchetti
Department of Electrical and Computer Engineering, Carnegie Mellon University
Email: previtus@gmail.com, franzf@cmu.edu

Abstract—Machine learning has celebrated a lot of achievements on computer vision tasks such as object detection, but the traditionally used models work with relatively low resolution images. The resolution of recording devices is gradually increasing and there is a rising need for new methods of processing high resolution data. We propose an attention pipeline method which uses a two-staged evaluation of each image or video frame under rough and refined resolution to limit the total number of necessary evaluations. For both stages, we make use of the fast object detection model YOLO v2. We have implemented our model in code which distributes the work across GPUs. We maintain high accuracy while reaching an average performance of 3-6 fps on 4K video and 2 fps on 8K video.

1. Introduction

Machine learning is a fast moving field which has experienced a revolution in working with imagery data for the tasks of object classification and detection. Typical uses of these computer vision tasks include security initiatives like facial recognition and city planning efforts like traffic density estimation [1].

The current state of the art in object detection uses deep convolutional neural network models trained on the ImageNet dataset [2], such as Faster R-CNN [3] and YOLO [4].

The majority of these models are focused on working with low-resolution images for the following three reasons. First, in certain scenarios low-resolution images are sufficient for the task, such as in the case of object classification, where most models use images up to 299x299 pixels [5], [6], [7], [8]. Secondly, processing low-resolution images is more time efficient. Lastly, many publicly available datasets used to train these models, such as ImageNet, CIFAR-100 [9], Caltech 256 [10] and LFW [11], are themselves made up of low-resolution images. There are no large scale datasets with more than a hundred images or videos at resolutions as high as 4K (3840x2160 pixels).

However, in low resolution images one can lose a lot of detail that is not forfeited when using high resolution capture devices. Today's high resolution data sources introduce 4K and 8K cameras, bringing a need for new models or methods to analyze them. There are also advantages in how much information we can extract from higher resolution images. For example, in Figure 1 we can detect more human figures in the original resolution as compared to resizing the image to the lower resolution of the models.

Figure 1: Example of a crowded 4K video frame annotated with our method.

Given the limitations of current models, we came up with two baseline approaches: first, downscaling the image before evaluation and sacrificing accuracy; or secondly, cutting up the whole image into overlapping crops and evaluating every single crop while sacrificing speed.

Contributions

In this paper we propose a method for accurate and fast object detection which we call the attention pipeline. Our method uses the first approach by downscaling the original image into a low-resolution space. Object detection in low resolution guides the attention of the model to the important areas of the original image. In the second stage, the model is directed back to the high resolution, reviewing only the highlighted areas.

Specifically, in this paper we make the following contributions:

- We propose a novel method which processes high resolution video data while balancing the trade-off between accuracy and performance.
- We show that our method reduces the number of inspected crops as compared to a baseline method of processing all crops in each image, and as a result increases performance by up to 27%.
- We increase the PASCAL VOC Average Precision score on our dataset from 33.6 AP50 to 75.4 AP50 as compared to using YOLO v2 in the baseline approach of downsampling images to the model's resolution.
- We implement efficient code which distributes the work across a GPU cluster, and we measure the performance of each individual operation of the proposed method.

1.1. Related Work

In this section, we trace the important advancements in the field of machine learning relevant for the task of object detection, and the efforts to speed up existing models by using the concept of attention.

Revolution of deep convolutional neural networks. The success of deep convolutional neural networks (CNNs) on imagery data was initiated by the large annotated ImageNet dataset and by the ILSVRC competition. In the ILSVRC 2012 competition, the AlexNet model [5] became the new state of the art in object classification, while in PASCAL VOC 2012 and the ILSVRC 2013 challenge the R-CNN model [12] extended the usage of CNNs to the object detection task.

Efforts to speed up object detection. One family of approaches to object detection depends on a hierarchy of region proposal methods [13]. The initial work of R-CNN uses CNNs to classify objects in the proposed regions, as do the subsequent Fast R-CNN [14] and Faster R-CNN [3]. The comparative study of [15] explores the performance gains of these models.

In our paper we use the YOLO model, first introduced in [16] and later improved as YOLO v2 in [4]. These models achieve real-time performance by unifying the whole architecture into one single neural network which directly predicts the location of regions and their classes during inference.

Closest to our approach is the work of [17], which uses a region proposal hierarchy for the task of real-time face detection in 4K videos. Our method differs in that it generalizes to other classes of objects, given the generality of the YOLO v2 model.

Attention. The concept of attention, focusing on just a few areas of the original image, can be used for two goals. The first goal is to increase the model's accuracy, as in the works of [18], [19], which try to select the few parts of the image relevant for the task. The second goal is to limit the computational costs. Inspired by human foveal vision, the work of [20] examines videos under multiple resolutions, using only the center of the image in its original resolution. In our proposed pipeline we also use subsections of the original image, however our focus is guided by an initial fast yet imprecise attention evaluation.

2. Method

In this section we describe our proposed method, which we call the attention pipeline.

2.1. Problem definition

In the context of this paper, we work with a high resolution dataset of videos, which can be seen as a stream of consecutive frames. In addition to the spatial information within each frame, there is temporal information [20] carried across neighboring frames (the same tracked object might be present in the next frame at a similar location).

The task of object detection consists of finding objects of interest in the image and marking them with bounding boxes. A bounding box is a four coordinate rectangle with a class label, which should contain the corresponding object as tightly as possible. One image can contain many, possibly overlapping, bounding boxes of multiple classes (such as "person", "car", etc.).

Figure 2: Illustration of the object detection task and terms used in this paper. By cutting out and resizing a crop of the original image, we can use the YOLO v2 model for detection of objects such as people.

We use the term "crop" for any subregion of the image. We can overlay any image with overlapping regions and cut out crops of smaller size, see Figure 2. We will be using the YOLO v2 model, which is by design limited to square-ratio input with a fixed resolution of 608x608 pixels. We comply with these limitations for the fast performance the YOLO model offers.

Baseline approaches. There are two baseline solutions for this task, which provide inspiration for our proposed pipeline and which we will use as points of comparison when measuring accuracy and performance. The first approach is to downscale the whole original image to the resolution of the evaluating model. This approach offers fast evaluation, but loses a large amount of information potentially hidden in the image, which especially applies to high resolutions. The second approach is to overlay the original image with a slightly overlapping grid of fixed resolution and to evaluate each cut-out crop separately. In this second approach we pay the full computational cost, as we evaluate every single crop of the image.
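
As a rough illustration of these two baselines (a sketch under stated assumptions, not the paper's released code), the snippet below contrasts them. It assumes a hypothetical detect(image) callable standing in for one 608x608 YOLO v2 forward pass that returns boxes as (x1, y1, x2, y2, class, score) tuples in the coordinates of the image it receives, and an iterable crops of (x, y, side) squares covering the frame.

```python
import cv2  # assumed dependency; any image library with a resize would do

MODEL_SIZE = 608  # YOLO v2 input resolution used throughout the paper

def downscale_baseline(frame, detect):
    """Baseline 1: one detection pass over the whole frame resized to 608x608.
    Fast, but small or distant objects are usually lost in the resize."""
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (MODEL_SIZE, MODEL_SIZE))
    sx, sy = w / MODEL_SIZE, h / MODEL_SIZE
    # rescale detections from model coordinates back to frame coordinates
    return [(x1 * sx, y1 * sy, x2 * sx, y2 * sy, cls, score)
            for x1, y1, x2, y2, cls, score in detect(small)]

def all_crops_baseline(frame, detect, crops):
    """Baseline 2: evaluate every overlapping square crop of the frame.
    Accurate, but pays the full computational cost for every crop."""
    detections = []
    for x, y, side in crops:
        patch = cv2.resize(frame[y:y + side, x:x + side],
                           (MODEL_SIZE, MODEL_SIZE))
        scale = side / MODEL_SIZE
        for x1, y1, x2, y2, cls, score in detect(patch):
            detections.append((x + x1 * scale, y + y1 * scale,
                               x + x2 * scale, y + y2 * scale, cls, score))
    return detections
```

The cost of the second variant grows with the number of crops, which is exactly what the attention pipeline described next tries to avoid.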

2.2. Attention pipeline

We propose an attention pipeline model which leverages these two basic approaches, striving for both precision and performance. We evaluate the image in a staged manner. The first "attention evaluation" stage looks at roughly sampled areas of the image to find areas suspected of containing the objects we are localizing. The second "final evaluation" stage then looks at these selected areas under higher resolution.

Figure 3: Resolution handling on the example of 4K video frame processing. During the attention step we process the image under a rough resolution, which allows us to decide which regions of the image should be active in the finer final evaluation.

Attention evaluation. The original image first enters the attention evaluation stage. As a simple way to balance accuracy and speed, we chose to parameterize the cropping by the number of rows of a grid imposed over the original image and by the overlap in pixels between neighboring cells. For example, we can choose the attention evaluation to crop the image with a grid of one row and 0 pixel overlap. The image will then be downscaled to a height of 608 px, with the width scaled correspondingly to keep the aspect ratio. The width of the image is subdivided into square crops of 608x608 pixels, such that the original image is fully covered with the minimal number of squares. Note that these crops can be overlapping. See examples of this grid in stages one and two of Figure 3.

We evaluate these initial attention crops with the YOLO v2 model and obtain bounding boxes of detected objects. Note that this initial evaluation might miss some of the small or occluded objects in the image, however it will still pick up on rough areas of interest. In the practical setting of video analysis, we use the temporal aspect of the video and merge and reuse attention across a few neighboring frames.

Active crop selection. Secondly, we subdivide the original image into a finer grid. Each cell of the grid is then checked for intersections with the bounding boxes detected in the attention evaluation. Intersecting cells are marked as active crops for the final evaluation. The number of active crops is usually lower than the number of all possible crops, depending on the density of the video.

Final evaluation. Lastly, in the final evaluation, we use the same YOLO v2 model to locate objects in these higher resolution square crops. See stage three in Figure 3 and note the difference in the resolution of the crops (1060 px scaled to 608 px instead of the full height of 2160 px scaled to 608 px).

Postprocessing. Upon evaluation of multiple overlapping crop regions of the image, we obtain a list of bounding boxes for each of these regions. To limit the number of bounding boxes, we run a non-maximum suppression algorithm to keep only the best predictions. As we do not know a priori the size of an object or its position in the image, we might detect the same object in multiple neighboring regions. This occurs either if the object is larger than a region of our grid, or if it resides on the border of two neighboring regions, effectively being cut in half, as illustrated in Figure 5.

An object being cut by the overlaid grid can be detected if we look along the splitting borders, where we try to detect and merge nearby bounding boxes. This problem is data specific: if we are trying to localize objects such as humans, they tend to be tall rather than wide in the image. Empirically, we have set several distance thresholds under which we merge nearby bounding boxes of the same object class. For human detection we consider merging only vertically neighboring regions.
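
The grid construction and the active crop selection can be summarized in a short sketch. This is a simplified reconstruction of the logic described above rather than the released implementation; attention boxes are assumed to be plain (x1, y1, x2, y2) tuples in original-image coordinates.

```python
import math

def square_grid(width, height, rows, overlap=20):
    """Cover a width x height image with overlapping square crops.
    The crop side is chosen so that `rows` overlapping crops span the image
    height; the width is then covered with the minimal number of squares."""
    side = math.ceil((height + overlap * (rows - 1)) / rows)
    cols = max(1, math.ceil((width - overlap) / (side - overlap)))
    crops = []
    for r in range(rows):
        y = min(r * (side - overlap), height - side)
        for c in range(cols):
            x = min(c * (side - overlap), width - side)
            crops.append((x, y, side))
    return crops

def intersects(crop, box):
    """Axis-aligned overlap test between an (x, y, side) crop and an
    (x1, y1, x2, y2) bounding box."""
    x, y, side = crop
    x1, y1, x2, y2 = box
    return x < x2 and x + side > x1 and y < y2 and y + side > y1

def active_crops(width, height, attention_boxes, rows, overlap=20):
    """Keep only the final-evaluation crops that intersect at least one
    bounding box produced by the attention evaluation."""
    return [crop for crop in square_grid(width, height, rows, overlap)
            if any(intersects(crop, box) for box in attention_boxes)]
```

For a 3840x2160 frame, rows=1 yields a 1x2 grid and rows=2 a 2x4 grid of roughly 1090 px squares, matching the grid shapes used above; the exact crop sizes reported later in Table 1 differ by a few pixels, as this sketch does not reproduce the implementation's rounding.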

2.3. Implementation of the client-server version

The motivation of the proposed attention pipeline (Section 2.2) is to make real-time evaluation of 4K videos feasible. The following two specific properties of our problem can be leveraged to achieve fast processing.

The first is that the image crop evaluation is an embarrassingly parallel problem, and as such it lends itself to parallel distribution across multiple workers.

The second is a specific property of our pipeline: the final evaluation always depends on the previous attention evaluation step. We cannot bypass this dependency, however we can compute the next frame's attention evaluation step concurrently with the current frame's final evaluation. With enough resources and workers, we effectively minimize the waiting time for each attention evaluation step. See Figure 6.

To leverage these properties, we used a client-server implementation as illustrated in Figure 7. Note that with a strong client machine we can also move the input/output operations and image processing into multiple threads, further speeding up the per-frame performance.
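
The pipelining idea from Figure 6 can be sketched with Python's concurrent.futures. This is an illustrative reconstruction, not the authors' client-server code; run_attention and run_final are assumed callables that hide the requests to the attention servers and the final-evaluation servers.

```python
from concurrent.futures import ThreadPoolExecutor

def process_video(frames, run_attention, run_final):
    """Overlap the stages: while frame t undergoes its final evaluation,
    the attention pass for frame t+1 is already running on its own worker."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as attention_pool:
        pending = attention_pool.submit(run_attention, frames[0])
        for i, frame in enumerate(frames):
            boxes = pending.result()             # attention result for this frame
            if i + 1 < len(frames):              # immediately start the next frame's attention
                pending = attention_pool.submit(run_attention, frames[i + 1])
            results.append(run_final(frame, boxes))  # final pass on the active crops
    return results
```

With enough dedicated attention servers behind run_attention, the result() call rarely blocks, which is why the attention stage contributes almost nothing to the per-frame times reported in Section 3.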

Figure 4: The attention pipeline. Stepwise breakdown of the original image under different effective resolutions.

Figure 5: Illustration of an object residing on the border of two crop regions. Nearby bounding boxes can be merged in the postprocessing step if they are closer than several thresholds.

Figure 6: The final evaluation step's dependency on the previous attention evaluation step can be decomposed in a pipelining manner similar to a CPU instruction evaluation pipeline.

Figure 7: Client-server implementation scheme. The client processes the captured video frames and sends only the list of crops for evaluation to the servers. Note that we have dedicated NA servers for precomputing the next frame's attention. The NF servers will have a uniformly distributed load of crops to process.

3. Results

In this section we first present the system details, datasets and metrics used, and then show the measured accuracy and performance results.

3.1. Methodology

System details. We run our implementation on nodes of the PSC's Bridges cluster (https://www.psc.edu/bridges/). Each node is equipped with 2 Intel Broadwell E5-2683 v4 CPUs running at 2.1 to 3.0 GHz and 2 NVIDIA Tesla P100 Pascal GPUs. Each node was used by two running instances of the code, each utilizing one of the CPUs and one of the GPUs. The peak performance of a P100 GPU operating on 32-bit floats is 9.3 TFLOPs. The peak performance of the Intel Broadwell CPU is 768 GFLOPs. The peak bandwidth of transferring data between CPU and GPU was 32 GB/s (PCIe3 x16), while transferring data between nodes reached 12.5 GB/s per direction. The worker nodes do not communicate with each other.

Datasets. We work with the PEViD-UHD dataset [21] of 26 short videos of security surveillance scenarios. These contain a small number of participants performing labeled actions such as exchanging bags, stealing and others. Some of the 13-second-long scenes contained unannotated human figures, which is why we have chosen to make a subselection of the PEViD dataset, referred to as "PEViD clean" in our measurements. We have removed frames where an unannotated figure was present in the video. For comparability of the results we have kept "PEViD full" untouched. In both cases we chose videos marked as "Exchange" and "Stealing".

We note that the density of human figures across the PEViD dataset is relatively low (usually just two individuals), which is why we recorded our own dataset using a 4K camera. These include scenes with a variety of lighting conditions, distances from the subjects and densities of the present human figures. We have manually labeled a representative sample of ten frames per video with bounding boxes around humans. Full videos are approximately one minute long.

Finally, we have also included an unannotated, publicly available 8K video from YouTube (https://youtu.be/gdnHLE_HCX0).

Accuracy metrics. For accuracy measurement we chose the traditionally used Intersection over Union (IoU) metric with thresholds 0.5, 0.25 and 0.75. We then analyze the detected true and false positives with the PASCAL VOC average precision score [22], which we refer to as AP50, AP25 and AP75 depending on the threshold used.
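
For completeness, the IoU criterion behind these thresholds can be written out explicitly; the function below is a generic implementation of the standard metric (not code from the paper) for boxes given as (x1, y1, x2, y2).

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2).
    A detection counts as a true positive at AP50 / AP25 / AP75 when its IoU
    with a ground-truth box reaches 0.5 / 0.25 / 0.75 respectively."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```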

Performance metrics. For performance measurements we use the full versions of our own 4K videos, two full videos from the PEViD dataset and the publicly available 8K video.

In our results, we differentiate between the stages of evaluation described in Section 2.2. We have separated time measurements for I/O loading and saving, the attention evaluation stage as one value, and the final evaluation stage in more detail. The final evaluation stage is divided into client-side image processing, the time to transfer images between the client and the workers, and finally the evaluation of the object detection model itself. As the work is distributed across multiple servers and the processing speed is limited by the slowest worker, we chose to show the measurements from the slowest worker for each frame.
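
One minimal way to collect such per-stage timings, assuming hypothetical stage functions for I/O, attention, client-side image operations, transfer and final evaluation, is a small helper like the following; the slowest-worker convention above corresponds to taking the maximum over the per-worker durations of the distributed final evaluation.

```python
import time

def timed(stage, timings, fn, *args, **kwargs):
    """Run fn and accumulate its wall-clock duration (seconds) under `stage`."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start
    return result

def slowest_worker(per_worker_seconds):
    """Per-frame cost of a distributed stage, attributed to the slowest worker."""
    return max(per_worker_seconds)
```

A frame could then be processed as, for example, frame = timed("I/O", t, load_frame, path) followed by the remaining stages, where load_frame and the other stage functions are placeholders for the actual pipeline steps.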

TABLE 1: Crop setting table, sizes of each crop in px

          1x2 grid   2x4 grid   3x6 grid   4x8 grid   6x11 grid
    4K    2160 px    1098 px    736 px     554 px     370 px
    8K    4320 px    2196 px    1472 px    1107 px    480 px

Crop settings. The choice of crop settings determines by how many rows we subdivide the original image. We cover the area of the image with the minimal number of square crops. Each square is scaled to the model's resolution of 608x608 pixels. Depending on the setting, we influence how much the original image is downscaled. While originally large objects will likely be detected even after the downscaling, smaller objects in the background might be lost. Table 1 contains the crop sizes in pixels for each crop setting. These values include the default 20 pixel overlap between crops.

3.2. Accuracy analysis

We report the results of several settings used with each video set in Table 2. We use a naming scheme that combines the crop settings used during the attention and the final evaluation.

Upon inspecting the results in Table 2, we note that our method achieves an accuracy of 91.7 AP50 on the PEViD dataset and 74.3 AP50 on our own recorded densely populated scenes.

We compare our results with the two baseline approaches introduced in Section 2.1. We refer to the baseline approach which downscales the original 4K image to the resolution of the YOLO v2 model as the "downscale baseline". The "all crops baseline" denotes the second baseline approach, which cuts up the original image and evaluates all resulting crops. Notice that with the correct crop setting, our method vastly outperforms the downscale method, while it achieves results close to the best possible accuracy of the all crops baseline.

While the PEViD dataset presents the relatively simple challenge of low density videos, our 4K dataset of densely populated scenes presents a more complicated task. We can see that in the case of the downscale baseline the accuracy is only 33.6 AP50, while our method achieves 74.3 AP50, which is very near the 75.4 AP50 of the all crops baseline. Our method achieves accuracy as if it were actually inspecting all crops of the original image.

Figure 8 shows the distance between the number of detected objects and the ground truth count. We can see that there is a variable density of human figures in each video sequence. We note that in dense scenes we benefit from a more detailed setting, such as in the case of video "S10", which contains more than 130 human figures. The main factor behind the different numbers of detected objects across settings are the human figures present in the very distant background of the image.

Figure 8: Number of detected objects under different settings as compared with the annotated ground truth. Note that video S10 has been plotted separately as it contains a vastly larger number of human figures.

3.3. Performance analysis

Figure 9: Comparison of the influence of the number of active crops and the setting of the resolution of each crop on the speed performance of one frame. A sample of 80 frames is shown for each video sequence.

In Figure 9 we see the visualization of the number of active and total possible crops and its influence on processing speed. We also present a comparison between our method and the all crops baseline approach.

In Figure 10 we compare the FPS performance of our attention pipeline model with the all crops baseline approach. We note that on an average video from the PEViD dataset our method achieves an average performance of 5-6 fps. Scenes which require a higher level of detail range between 3-4 fps, depending on the specific density. On the bigger and more complex scene of the 8K video we achieve 2 fps. Except for the very dense 4K video, our method outperforms the baseline approach.

Upon inspecting the detailed decomposition of the operations performed in each frame in Figure 11, we can see that the final evaluation is often not the most time consuming step. We also need to consider the client-side operations and the transfer time between the one client and the many used servers. Note that the attention evaluation stays negligible, as we are using additional servers for the concurrent computation of the next frame. In the case of 8K videos, the I/O time of opening and saving an image becomes a concern as well, even though it is performed on another thread.

TABLE 2: Accuracy. The table reports the AP75, AP50 and AP25 scores per dataset (PEViD "exchange, steal" clean, PEViD "exchange, steal" full, and our densely populated 4K scenes), together with each dataset's resolution, number of videos and number of frames, comparing the downscale baseline, our attention pipeline settings (attention and final crop settings with their pixel overlap) and the all crops baseline.

Figure 10: FPS analysis across multiple runs using the best setting.

Figure 11: Comparison of average run times on different datasets under their best performance.

Figure 12: Measured per-frame processing speed on our custom 4K video named "S10" with a variable number of servers. The numbers above the chart indicate the number of servers dedicated to attention precomputing.

Finally, we have explored the influence of the number of servers used for the attention precomputation stage and for the final evaluation stage in Figure 12. We can see that there is a point of saturation in scaling the number of workers. This is due to the finite number of crops generated in each frame - after a certain number of crops assigned to each server, we do not see any speedups from further division.

4. Conclusion

As a motivation of this paper we have stated two goals in processing high resolution data. The first goal consists of the ability to detect even the small details included in a 4K or 8K image and not losing them due to downscaling. Secondly, we wanted to achieve fast performance and save on the number of processed crops as compared with the baseline approach of processing every crop in each frame.

Our results show that we outperform the individual baseline approaches, while allowing the user to set the desired trade-off between accuracy and performance.

Acknowledgments

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number 1548562. Specifically, it used the Bridges system, which is supported by NSF award number 1445606, at the Pittsburgh Supercomputing Center (PSC). This work was also sponsored by the DARPA BRASS program under agreement FA8750-16-2-003 and the DARPA PERFECT program under agreement HR0011-13-2-0007.

References

[1] Shanghang Zhang, Guanhang Wu, João P. Costeira, and José M. F. Moura. FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras. CoRR, abs/1707.09476, 2017.
[2] J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255, June 2009.
[3] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[4] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. CoRR, abs/1612.08242, 2016.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[8] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[9] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[10] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
[11] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[12] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
[13] Chunhui Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1030-1037, June 2009.
[14] Ross B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[15] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
[16] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[17] Ilya Kalinowski and Vladimir Spitsyn. Compact convolutional neural network cascade for face detection. CoRR, abs/1508.01292, 2015.
[18] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. CoRR, abs/1512.04150, 2015.
[19] Efstratios Gavves, Basura Fernando, Cees G. M. Snoek, Arnold W. M. Smeulders, and Tinne Tuytelaars. Local alignments for fine-grained categorization. International Journal of Computer Vision, 111(2):191-212, Jan 2015.
[20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725-1732, June 2014.
[21] Pavel Korshunov and Touradj Ebrahimi. UHD video dataset for evaluation of privacy. In Sixth International Workshop on Quality of Multimedia Experience (QoMEX 2014), Singapore, 18-20 September 2014.
[22] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, Jun 2010.
