Tracking Without Re-recognition in Humans and Machines

Transcription

Tracking Without Re-recognition in Humans and Machines

Drew Linsley (1), Girik Malik (2), Junkyung Kim (3), Lakshmi N Govindarajan (1), Ennio Mingolla (2)†, Thomas Serre (1)†
(1) Carney Institute for Brain Science, Brown University, Providence, RI
(2) Northeastern University, Boston, MA
(3) DeepMind, London, UK
drew_linsley@brown.edu and malik.gi@northeastern.edu
† These authors contributed equally to this work.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Imagine trying to track one particular fruit fly in a swarm of hundreds. Higher biological visual systems have evolved to track moving objects by relying on both their appearance and their motion trajectories. We investigate if state-of-the-art spatiotemporal deep neural networks are capable of the same. For this, we introduce PathTracker, a synthetic visual challenge that asks human observers and machines to track a target object in the midst of identical-looking “distractor” objects. While humans effortlessly learn PathTracker and generalize to systematic variations in task design, deep networks struggle. To address this limitation, we identify and model circuit mechanisms in biological brains that are implicated in tracking objects based on motion cues. When instantiated as a recurrent network, our circuit model learns to solve PathTracker with a robust visual strategy that rivals human performance and explains a significant proportion of their decision-making on the challenge. We also show that the success of this circuit model extends to object tracking in natural videos. Adding it to a transformer-based architecture for object tracking builds tolerance to visual nuisances that affect object appearance, establishing the new state of the art on the large-scale TrackingNet challenge. Our work highlights the importance of understanding human vision to improve computer vision.

1 Introduction

Lettvin and colleagues [1] presciently noted, “The frog does not seem to see or, at any rate, is not concerned with the detail of stationary parts of the world around him. He will starve to death surrounded by food if it is not moving.” Object tracking is fundamental to survival, and higher biological visual systems have evolved the capacity for two distinct and complementary strategies to do it. Consider Figure 1: can you track the object labeled by the yellow arrow from left-to-right? The task is trivial when appearance cues, like color, make it possible to solve the temporal correspondence problem by “re-recognizing” the target in each frame (Fig. 1a). However, this strategy is not effective when objects cannot be discriminated by their appearance alone (Fig. 1b). In this case, integration of object motion over time is necessary for tracking. Humans are capable of tracking objects by their motion when appearance is uninformative [2, 3], but it is unclear if the current generation of neural networks for video analysis and tracking can do the same. To address this question we introduce PathTracker, a synthetic challenge for object tracking without re-recognition (Fig. 1c).

Leading models for video analysis rely on object classification pre-training. This gives them access to rich semantic representations that have supported state-of-the-art performance on a host of tasks, from action recognition to object tracking [4–6]. As object classification models have improved, so too have the video analysis models that depend on them. This trend in model development has made it unclear if video analysis models are effective at learning tasks when appearance cues are uninformative. The importance of diverse visual strategies has been highlighted by synthetic challenges like Pathfinder, a visual reasoning task that asks observers to trace long paths embedded in a static cluttered display [7, 8]. Pathfinder tests object segmentation when appearance cues like category or shape are missing. While humans can easily solve it [8], deep neural networks struggle, including state-of-the-art vision transformers [7–9]. Importantly, models that learn an appropriate visual strategy for Pathfinder also exhibit more efficient learning and improved generalization on object segmentation in natural images [10, 11]. Our PathTracker challenge extends this line of work into video by posing an object tracking problem where the target can be tracked by motion and spatiotemporal continuity, not category or appearance.

Figure 1: The appearance of objects makes them (a) easy or (b) hard to track. We introduce the PathTracker Challenge (c), which asks observers to track a particular green dot as it travels from the red-to-blue markers, testing object tracking when re-recognition is impossible.

Contributions. Humans effortlessly solve our novel PathTracker challenge. A variety of state-of-the-art models for object tracking and video analysis do not. We find that neural architectures including R3D [12] and state-of-the-art transformer-based TimeSformers [5] are strained by long PathTracker videos. Humans, on the other hand, are far more effective at solving these long PathTracker videos. We describe a solution to PathTracker: a recurrent network inspired by primate neural circuitry involved in object tracking, which renders decisions that are strongly correlated with humans. These same circuit mechanisms improve object tracking in natural videos through a motion-based strategy that builds tolerance to changes in target object appearance, resulting in the state-of-the-art score on TrackingNet [13]. We release all PathTracker data, code, and human psychophysics at http://bit.ly/InTcircuit to spur interest in the challenge of tracking without re-recognition.

2 Related Work

Models for video analysis. A major leap in the performance of models for video analysis came from using networks which are pre-trained for object recognition on large image datasets [4]. The recently introduced TimeSformer [5] achieved state-of-the-art performance with weights initialized from an image categorization transformer (ViT; [14]) that was pre-trained on ImageNet-21K. The modeling trends are similar in object tracking [15], where successful models rely on “backbone” feature extraction networks trained on ImageNet or Microsoft COCO [16] for object recognition or segmentation [6, 17].

Shortcut learning and synthetic datasets. A byproduct of the great power of deep neural network architectures is their vulnerability to learning spurious correlations between inputs and labels.
Perhaps because of this tendency, object classification models have trouble generalizing to novel contexts [18, 19], and render idiosyncratic decisions that are inconsistent with humans [20–22]. Synthetic datasets are effective at probing this vulnerability because they make it possible to control spurious image/label correlations – providing a fairer test of the computational abilities of these models. For example, the Pathfinder challenge was designed to test if neural architectures can trace long curves in clutter – a visual computation associated with the earliest stages of visual processing in primates.

That challenge identified diverging visual strategies between humans and transformers that are otherwise state of the art in natural image object recognition [9, 14]. Other challenges like Bongard-LOGO [23], cABC [8], SVRT [24], and PSVRT [25] have highlighted limitations of leading neural network architectures that would have been difficult to identify using natural image benchmarks like ImageNet [26]. These limitations have inspired algorithmic solutions based on neural circuits discussed in SI §1.

Translating circuits for biological vision into artificial neural networks. While the Pathfinder challenge [7] presents immense challenges for transformers and deep convolutional networks [8], it can be solved by a simple model of intrinsic connectivity in visual cortex, with orders-of-magnitude fewer parameters than standard models for image categorization. This model was developed by translating descriptive models of neural mechanisms from Neuroscience into an architecture that can be fit to data using gradient descent [7, 11]. Others have found success in modeling object tracking by drawing inspiration from “dual stream” theories of appearance and motion processing in visual cortex [27–30], or basing the architecture off of a partial connectome of the Drosophila visual system [31]. We adopt a similar approach in the current work, identifying mechanisms for object tracking without re-recognition in Neuroscience, and developing those into differentiable operations with parameters that can be optimized by gradient descent. This approach has the dual purpose of introducing task-relevant inductive biases into computer vision models, and developing theory on their relative utility for biological vision.

Multi-object tracking in computer vision. The classic psychological paradigms of multi-object tracking [2] motivated the application of models, like Kalman filters, which had tolerance to object occlusion when they relied on momentum models [32]. However, these models are computationally expensive, hand-tuned, and because of this, no longer commonly used in computer vision [33]. More recent approaches include flow tracking on graphs [34] and motion tracking models that are relatively computationally efficient [35, 36]. However, even current approaches to multi-object tracking are not learned, instead relying on extensive hand tuning [37, 38]. In contrast, the point of PathTracker is to understand the extent to which state-of-the-art neural networks are capable of tracking a single object in an array of distractors.

3 The PathTracker Challenge

Overview. PathTracker asks observers to decide whether or not a target dot reaches a goal location (Fig. 2). The target dot travels in the midst of a pre-specified number of distractors. All dots are identical, and the task is difficult because of this: (i) appearance is not useful for tracking the target, and (ii) the paths of the target and distractors can momentarily “cross” and occupy the same space, making it impossible to individuate them in that frame and meaning that observers cannot solely rely on the target’s location to solve the task. This challenge is inspired by object tracking paradigms in cognitive psychology [2, 3, 39], which suggest that humans might tap into mechanisms for motion perception, attention and working memory to solve a task like PathTracker.

The trajectories of target and distractor dots are randomly generated, and the target occasionally crosses distractors (Fig. 2). These object trajectories are smooth by design, giving the appearance of objects meandering through a scene, and the difference between the coordinates of any dot on successive frames is no more than 2 pixels, with less than 20° of angular displacement. In other words, dots never turn at acute angles. We develop different versions of PathTracker with varying degrees of complexity based on the number of distractors and/or the length of videos. These variables change the expected number of times that distractors cross the target and the amount of time that observers must track the target (Fig. 2). To make the task as visually simple as possible and maximize contrast between dots and markers, the dots, start marker, and goal marker are placed on different channels in 32 × 32 pixel three-channel images. Markers are stationary throughout each video and placed at random locations. Example videos can be viewed at http://bit.ly/InTcircuit.
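A minimal NumPy sketch of a PathTracker-style clip generator is shown below. It is illustrative rather than the released generator: marker size, dot intensity, and the exact heading-perturbation distribution are assumptions, and only the constraints stated above are enforced (identical dots, steps of at most 2 pixels with small heading changes, and stationary start/goal markers on separate channels of 32 × 32 three-channel frames).

```python
import numpy as np

def make_clip(n_frames=32, n_distractors=14, size=32, step=2.0,
              max_turn_deg=20.0, positive=True, rng=None):
    """Generate one PathTracker-style clip as a (n_frames, size, size, 3) array.

    Channel 0: dots (target + distractors), channel 1: start marker,
    channel 2: goal marker. Marker size and intensity are illustrative guesses.
    """
    rng = np.random.default_rng(rng)
    n_dots = n_distractors + 1                      # dot 0 is the target
    pos = rng.uniform(2, size - 2, size=(n_dots, 2))
    heading = rng.uniform(0, 2 * np.pi, size=n_dots)
    start = pos[0].copy()                           # start marker sits on the target's origin

    frames = np.zeros((n_frames, size, size, 3), dtype=np.float32)
    for t in range(n_frames):
        # Small heading perturbations keep trajectories smooth (no acute turns).
        heading += np.deg2rad(rng.uniform(-max_turn_deg, max_turn_deg, size=n_dots))
        pos += step * np.stack([np.cos(heading), np.sin(heading)], axis=1)
        pos = np.clip(pos, 0, size - 1)             # keep dots in frame (a real generator might reflect)
        for x, y in pos:
            frames[t, int(round(y)), int(round(x)), 0] = 1.0

    # Goal marker: the label is guaranteed by placing it on the target's end
    # point (positive) or on a distractor's end point (negative).
    goal = pos[0] if positive else pos[rng.integers(1, n_dots)]
    for t in range(n_frames):
        frames[t, int(round(start[1])), int(round(start[0])), 1] = 1.0
        frames[t, int(round(goal[1])), int(round(goal[0])), 2] = 1.0
    return frames, int(positive)

clip, label = make_clip(positive=True, rng=0)
print(clip.shape, label)  # (32, 32, 32, 3) 1
```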
Human benchmark. We began by testing if humans can solve PathTracker. We recruited 180 individuals using Amazon Mechanical Turk to participate in this study. Participants viewed PathTracker videos and pressed a button on their keyboard to indicate if the target object or a distractor reached the goal. These videos were played in web browsers at 256 × 256 pixels using HTML5, which helped ensure consistent frame rates [40]. The experiment began with an 8-trial “training” stage, which familiarized participants with the goal of PathTracker.

[Figure 2 panels: (a,b) 1 distractor and 32 frames; (c,d) 14 distractors and 32 frames; (e,f) 14 distractors and 64 frames; trajectories plotted over frame number, with crossings marked.] Figure 2: PathTracker is a synthetic visual challenge that asks observers to watch a video clip and answer if a target dot starting in a red marker travels to a blue marker. The target dot is surrounded by identical “distractor” dots, each of which travels in a randomly generated and curved path. In positive examples, the target dot’s path ends in the blue square. In negative examples, a “distractor” dot ends in the blue square. The challenge of the task is due to the identical appearance of target and distractor dots, which, as we will show, makes appearance-based tracking strategies ineffective. Moreover, the target dot can momentarily occupy the same location as a distractor when they cross each other’s paths, making it impossible to individuate them in that frame and compelling strategies like motion trajectory extrapolation or working memory to recover the target track. (b) A 3D visualization of the video in (a) depicts the trajectory of the target dot, traveling from red-to-blue markers. The target and distractor cross approximately half-way through the video. (c,d) We develop versions of PathTracker that test observers’ sensitivity to the number of distractors and length of videos (e,f). The number of distractors and video length interact to make it more likely for the target dot to cross a distractor in a video (compare the one X in b vs. two in d vs. three in f; see SI §2 for details).

Next, participants were tested on 72 videos. The experiment was not paced and lasted approximately 25 minutes, and participants were paid $8 for their time. See http://bit.ly/InTcircuit and SI §2 for an example and more details.

Participants were randomly entered into one of two experiments. In the first experiment, they were trained on the 32 frame and 14 distractor PathTracker, and tested on 32 frame versions with 1, 14, or 25 distractors. In the second experiment, they were trained on the 64 frame and 14 distractor PathTracker, and tested on 64 frame versions with 1, 14, or 25 distractors. All participants viewed unique videos to maximize our sampling over the different versions of PathTracker. Participants were significantly above chance on all tested conditions of PathTracker (p < 0.001, test details in SI §2). They also exhibited a significant negative trend in performance on the 64 frame datasets as the number of distractors increased (t = −2.74, p < 0.01). There was no such trend on the 32 frame datasets, and average accuracy between the two datasets was not significantly different. These results validate our initial design assumptions: humans can solve PathTracker, and manipulating distractors and video length increases difficulty.
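The exact statistical procedures are described in SI §2. Purely as an illustration of how such checks could be run, the sketch below uses made-up accuracies (not the study's data): a one-sided binomial test against chance for each condition, and a linear-trend test of accuracy against distractor count.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant accuracies (fractions correct); NOT the study's data.
rng = np.random.default_rng(0)
acc_by_distractors = {1: rng.normal(0.85, 0.05, 30),
                      14: rng.normal(0.80, 0.05, 30),
                      25: rng.normal(0.75, 0.05, 30)}

# Above-chance check: pool trials and run a one-sided binomial test against p = 0.5.
n_trials_per_subject = 24                      # illustrative trial count
for k, accs in acc_by_distractors.items():
    n_total = n_trials_per_subject * len(accs)
    n_correct = int(accs.mean() * n_total)
    res = stats.binomtest(n_correct, n_total, p=0.5, alternative="greater")
    print(f"{k} distractors: p = {res.pvalue:.3g}")

# Trend check: regress accuracy on distractor count and inspect the slope's t statistic.
x = np.concatenate([[k] * len(v) for k, v in acc_by_distractors.items()])
y = np.concatenate(list(acc_by_distractors.values()))
slope, intercept, r, p, se = stats.linregress(x, y)
print(f"slope = {slope:.4f}, t = {slope / se:.2f}, p = {p:.3g}")
```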

4 Solving the PathTracker challenge

Can leading models for video analysis match humans on PathTracker? To test this question we surveyed a variety of architectures that are the basis for leading approaches to many video analysis tasks, from object tracking to action classification. The models fall into four groups: (i) deep convolutional networks (CNNs), (ii) transformers, (iii) recurrent neural networks (RNNs), and (iv) Kalman filters.

[Figure 3 panels: (a) train on 14 distractors and 32 frames, test on 1/14/25 distractors and 32 frames; (b) train on 14 distractors and 64 frames, test on 1/14/25 distractors and 64 frames; y-axis: accuracy (40–100%), with mean human accuracy marked; x-axis: models.] Figure 3: Model accuracy on the PathTracker challenge. Video analysis models were trained to solve 32 (a) and 64 frame (b) versions of the challenge, which featured the target object and 14 identical distractors. Models were tested on PathTracker datasets with the same number of frames but 1, 14, or 25 distractors (left/middle/right). Colors indicate different instances of the same type of model. Grey hatched boxes denote 95% bootstrapped confidence intervals for humans. Only our InT Circuit rivaled humans on each dataset.

The deep convolutional networks include a 3D ResNet (R3D [12]), a space/time separated ResNet with “2D-spatial + 1D-temporal” convolutions (R(2+1)D [12]), and a ResNet with 3D convolutions in early residual blocks and 2D convolutions in later blocks (MC3 [12]). We trained versions of these models with random weight initializations and weights pretrained on ImageNet. We included an R3D trained from scratch without any downsampling, in case the small size of PathTracker videos caused learning problems (see SI §3 for details). We also trained a version of the R3D on optic flow encodings of PathTracker (SI §3). For transformers, we turned to the TimeSformer [5]. We evaluated two of its instances: (i) where attention is jointly computed for all locations across space and time in videos, and (ii) where temporal attention is applied before spatial attention, which results in massive computational savings. Both models performed similarly on PathTracker. We report the latter version here as it was marginally better (see SI §3 for performance of the other, joint space-time attention TimeSformer). We include a version of the TimeSformer trained from scratch, and a version pre-trained on ImageNet-21K. Note that state-of-the-art transformers for object tracking in natural videos feature similar deep and multi-headed designs [6]. For the RNNs, we include a convolutional gated recurrent unit (Conv-GRU) [41]. Finally, our exemplar Kalman filter is the standard Simple Online and Realtime Tracking (SORT) algorithm, which was fed coordinates of the objects extracted from each frame of every video [37].
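As a concrete example of how such off-the-shelf backbones can be pointed at PathTracker, the sketch below adapts the torchvision R3D to the binary reach-the-goal decision; it is an assumed setup for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# R3D backbone with a single-logit head for "did the target reach the goal?"
model = r3d_18(weights=None)                   # pass pretrained weights here if desired
model.fc = nn.Linear(model.fc.in_features, 1)  # binary logit instead of the default classes

# PathTracker clips: batch x channels x frames x height x width.
clips = torch.rand(4, 3, 32, 32, 32)           # 4 clips, 3 channels, 32 frames, 32x32 px
logits = model(clips).squeeze(1)               # shape (4,)
probs = torch.sigmoid(logits)                  # probability the target reached the goal
print(probs.shape)
```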
Method. The visual simplicity of PathTracker cuts two ways: it makes it possible to compare human and model strategies for tracking without re-recognition, as long as the task is not too easy. Prior synthetic challenges like Pathfinder constrain sample sizes for training to probe specific computations [7–9]. We adopted the following strategy to select a training set size that would help us test tracking strategies that do not depend on re-recognition. We took Inception 3D (I3D) networks [4], which have been a strong baseline architecture in video analysis over the past several years, and tested their ability to learn PathTracker as we adjusted the number of videos for training. As we discuss in SI §1, when this model was trained with 20K examples of the 32 frame and 14 distractor version of PathTracker it achieved good performance on the task without signs of overfitting. We therefore trained all models in subsequent experiments with 20K examples. By happenstance, this dataset size gives PathTracker a comparable number of frames to large-scale real world tracking challenges like LaSOT [42] and GOT-10K [43].

We measure the ability of models to learn PathTracker and systematically generalize to novel versions of the challenge when trained on 20K samples. We trained models using a similar approach as in our human psychophysics.
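Concretely, this amounts to training on a single condition and evaluating on held-out conditions that differ only in distractor count. The sketch below is illustrative only: the file names and .npz layout are assumptions, the model is the torchvision R3D from the previous sketch, and the loss, optimizer, early-stopping patience, and batch size follow the recipe spelled out in the next paragraph.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models.video import r3d_18

def load_condition(path):
    """Hypothetical .npz layout: "clips" (N, 3, T, 32, 32) and "labels" (N,)."""
    arrays = np.load(path)
    return TensorDataset(torch.from_numpy(arrays["clips"]).float(),
                         torch.from_numpy(arrays["labels"]).float())

# Train on 14 distractors; evaluate generalization to 1, 14, and 25 distractors.
train_loader = DataLoader(load_condition("pathtracker_32f_14dist_train.npz"),
                          batch_size=180, shuffle=True)
test_loaders = {d: DataLoader(load_condition(f"pathtracker_32f_{d}dist_test.npz"),
                              batch_size=180)
                for d in (1, 14, 25)}

model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)       # single logit: target reached goal?
criterion = nn.BCEWithLogitsLoss()                  # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # one of the swept learning rates

def accuracy(loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for clips, labels in loader:
            preds = (model(clips).squeeze(1) > 0).float()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

best, patience_left = 0.0, 200                      # stop after 200 epochs without improvement
for epoch in range(10000):
    model.train()
    for clips, labels in train_loader:
        optimizer.zero_grad()
        criterion(model(clips).squeeze(1), labels).backward()
        optimizer.step()
    acc = accuracy(test_loaders[14])                # early stopping on the 14-distractor split
    best, patience_left = (acc, 200) if acc > best else (best, patience_left - 1)
    if patience_left == 0:
        break

print({d: accuracy(loader) for d, loader in test_loaders.items()})
```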

Models were trained on one version of PathTracker and tested on other versions with the same number of frames and the same or a different number of distractors. In the first experiment, models were trained on the 32 frame and 14 distractor PathTracker, then tested on the 32 frame PathTracker datasets with 1, 14, or 25 distractors (Fig. 3a). In the second experiment, models were trained on the 64 frame and 14 distractor PathTracker, then tested on the 64 frame PathTracker datasets with 1, 14, or 25 distractors (Fig. 3b). Models were trained to detect if the target dot reached the blue goal marker using binary cross-entropy and the Adam optimizer [44], until performance on a test set of 20K videos with 14 distractors decreased for 200 straight epochs. In each experiment, we selected model weights that performed best on the 14 distractor dataset. Models were retrained three times on learning rates {1e-2, 1e-3, 1e-4, 3e-4, 1e-5} to optimize performance. The best performing model was then tested on the remaining 1 and 25 distractor datasets in the experiment. We used four NVIDIA GTX GPUs and a batch size of 180 for training.

Results. We treat human performance as the benchmark for models on PathTracker. Nearly all CNNs and the ImageNet-initialized TimeSformer performed well enough to reach the human 95% confidence interval on the 32 frame and 14 distractor PathTracker. However, all neural network models performed worse when systematically generalizing to PathTracker datasets with a different number of distractors, even when that number decreased (Fig. 3a, 1 distractor). Specifically, model performance on the 32 frame PathTracker datasets was worst when the videos contained 25 distractors: no CNN or transformer reached the 95% confidence interval of humans on this version of the dataset (Fig. 3a).

[Figure 4: schematic of the InT circuit, a model of neuroscience mechanisms of motion perception [45] and cognitive executive functions [46]. (a) The circuit receives input encodings of a video (z), which are processed by interacting recurrent inhibitory and excitatory units (i, e) [7, 10], along with an “attention” mechanism (a) that tracks the target location. (b) Spatial connections between units are implemented with convolutional weight kernels, and temporal connections with gates (g); nonlinearities include softplus and sigmoid (σ), with convolution and elementwise products between units. Model parameters are fit with gradient descent.]

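The circuit equations referenced in the Figure 4 caption are developed later in the paper. Purely as a loose illustration of the ingredients that caption names (recurrent excitatory and inhibitory units with convolutional spatial connections, temporal gates, and a multiplicative attention map), here is a generic PyTorch cell assembled from those pieces; it is a hedged sketch, not the authors' InT equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EILikeCell(nn.Module):
    """Generic conv-recurrent cell with excitatory/inhibitory states and an
    attention map -- an illustration of the ingredients in the Figure 4
    caption, NOT the published InT equations."""

    def __init__(self, channels, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.w_exc = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.w_inh = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.w_att = nn.Conv2d(channels, channels, 1)
        self.gate_i = nn.Conv2d(2 * channels, channels, 1)
        self.gate_e = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, z, e, i):
        a = torch.sigmoid(self.w_att(e))                      # attention map from excitation
        g_i = torch.sigmoid(self.gate_i(torch.cat([z, i], 1)))
        i = g_i * F.softplus(a * z + self.w_exc(e)) + (1 - g_i) * i   # gated inhibitory update
        g_e = torch.sigmoid(self.gate_e(torch.cat([z, e], 1)))
        e = g_e * F.softplus(z - self.w_inh(i)) + (1 - g_e) * e       # excitation minus inhibition
        return e, i

cell = EILikeCell(channels=16)
z = torch.rand(2, 16, 32, 32)          # encoding of one 32x32 frame, batch of 2
e = i = torch.zeros_like(z)
for _ in range(32):                    # unroll over a clip (same frame reused here)
    e, i = cell(z, e, i)
print(e.shape)                         # torch.Size([2, 16, 32, 32])
```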