
SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos

Adrien Deliège* (University of Liège), Anthony Cioppa* (University of Liège), Silvio Giancola* (KAUST), Meisam J. Seikavandi* (Aalborg University), Jacob V. Dueholm* (Aalborg University), Kamal Nasrollahi (Aalborg University, Milestone Systems), Bernard Ghanem (KAUST), Thomas B. Moeslund (Aalborg University), Marc Van Droogenbroeck (University of Liège)

(*) Equal contributions. More at https://soccer-net.org/. Contacts: adrien.deliege@uliege.be, anthony.cioppa@uliege.be, silvio.giancola@kaust.edu.sa, meisamjam@gmail.com, jvdu@create.aau.dk.

Abstract

Understanding broadcast videos is a challenging task in computer vision, as it requires generic reasoning capabilities to appreciate the content offered by the video editing. In this work, we propose SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet [24] video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production. Specifically, we release around 300k annotations within SoccerNet's 500 untrimmed broadcast soccer videos. We extend current tasks in the realm of soccer to include action spotting, camera shot segmentation with boundary detection, and we define a novel replay grounding task. For each task, we provide and discuss benchmark results, reproducible with our open-source adapted implementations of the most relevant works in the field. SoccerNet-v2 is presented to the broader research community to help push computer vision closer to automatic solutions for more general video understanding and production purposes.

1. Introduction

Sports is a profitable entertainment sector, capping 91 billion of annual market revenue over the last decade [15]. 15.6 billion alone came from the Big Five European Soccer Leagues (EPL, La Liga, Ligue 1, Bundesliga and Serie A) [16, 17, 18], with broadcasting and commercial activities being the main source of revenue for clubs [19]. TV broadcasters seek to attract the attention and indulge the curiosity of an audience, as they understand the game and edit the broadcasts accordingly. In particular, they select the best camera shots focusing on actions or players, allowing for semantic game analysis, talent scouting and advertisement placement. With almost 10,000 games a year for the Big Five alone, and an estimated audience of 500M people at each World Cup [60], automating the video editing process would have a broad impact on the other millions of games played in lower leagues across the world. Yet, it requires an understanding of the game and the broadcast production.

Figure 1. SoccerNet-v2 constitutes the most inclusive dataset for soccer video understanding and production, with 300k annotations, 3 computer vision tasks and multiple benchmark results.

Recent computer vision works on soccer broadcasts focused on low-level video understanding [50], e.g. localizing a field and its lines [13, 20, 32], detecting players [12, 82], their motion [21, 47], their pose [6, 87], their team [34], the ball [67, 72], or pass feasibility [3]. Understanding frame-wise information is useful to enhance the visual experience of sports viewers [59] and to gather player statistics [73], but it falls short of the higher-level game understanding needed for automatic editing purposes (e.g. camera shot selection, replay selection, and advertisement placement).

In this work, we propose a large-scale collection of manual annotations for holistic soccer video understanding and several benchmarks addressing automatic broadcast production tasks. In particular, we extend the previous SoccerNet [24] dataset with further tasks and annotations, and propose open challenges with public leaderboards. Specifically, we propose three tasks represented in Figure 1: (i) Action Spotting, an extension from 3 to 17 action classes of SoccerNet's main task, (ii) Camera Shot Understanding, a temporal segmentation task for camera shots and a camera shot boundary detection task, and (iii) Replay Grounding, a task of retrieving the replayed actions in the game. These tasks tackle three major aspects of broadcast soccer videos: action spotting addresses the understanding of the content of the game, camera shot segmentation and boundary detection deal with the video editing process, and replay grounding bridges those tasks by emphasizing salient actions, allowing for prominent moments retrieval.

Contributions. We summarize our contributions as follows. (i) Dataset. We publicly release SoccerNet-v2, the largest corpus of manual annotations for broadcast soccer video understanding and production, comprising 300k annotations temporally anchored within SoccerNet's 764 hours of video. (ii) Tasks. We define the novel task of replay grounding and further expand the tasks of action spotting, camera shot segmentation and boundary detection, for a holistic understanding of content, editing, and production of broadcast soccer videos. (iii) Benchmarks. We release reproducible benchmark results along with our code and public leaderboards to drive further research in the field.

2. Related Work

Video understanding datasets. Many video datasets propose challenging tasks around human action understanding [25, 68], with applications in movies [45, 48, 70], sports [40, 53, 61], cooking [14, 44, 62], and large-scale generic video classification [2, 41, 71]. While early efforts focused on trimmed video classification, more recent datasets provide fine-grained annotations of longer videos at a temporal [30, 38, 70, 84, 88] or spatio-temporal level [27, 49, 61, 80]. THUMOS14 [38] is the first benchmark for temporal activity localization, introducing 413 untrimmed videos, totalling 24 hours and 6k temporally anchored activities split into 20 classes, then extended to 65 classes in MultiTHUMOS [84]. ActivityNet [30] gathers the first large-scale dataset for activity understanding, with 849 hours of untrimmed videos, temporally annotated with 30k anchored activities split into 200 classes. A yearly ActivityNet competition highlights a variety of tasks with hundreds of submissions [22, 23]. Some datasets consider videos at an atomic level, with fine-grained temporal annotations from short snippets of longer videos [27, 51, 88]. Multi-Moments in Time [52] provides 2M action labels for 1M short clips of 3s, classified into 313 classes. Something-Something [26] collects 100k videos annotated with 147 classes of daily human-object interactions. Breakfast [44] and MPII-Cooking 2 [63] provide annotations for individual steps of cooking activities. EPIC-KITCHENS [14] scales up those approaches with 55 hours of cooking footage, annotated with around 40k action clips of 147 classes.
Soccer-related datasets. SoccerNet [24] is the first large-scale soccer video dataset, with 500 games from major European leagues and 6k annotations. It provides complete games with a distribution faithful to official TV broadcasts, but it only focuses on 3 action classes, making the task too simplistic and of moderate interest. SoccerDB [37] adds 7 classes and player bounding boxes for half of SoccerNet's videos and 76 extra games. However, it misses a complete set of possible actions and editing annotations to allow for a full understanding of the production of TV broadcasts. Yu et al. [85] released a dataset with 222 halves of soccer matches with annotations of actions, shot transitions, and player bounding boxes. They have few annotations and do not carry out any experiment nor propose any task. Pappalardo et al. [58] released a large-scale dataset of soccer events, localized in time and space. However, they focus on player and team statistics rather than video understanding, as they do not release any video. We address the limitations of these datasets by annotating all the interesting actions of the 500 SoccerNet games. Also, we provide valuable annotations for video editing, and we connect camera shots with actions to allow for salient moments retrieval.

Action spotting. Giancola et al. [24] define the task of action spotting in SoccerNet as finding the anchors of soccer events in a video and provide baselines based on temporal pooling. Rongved et al. [64] focus on applying a 3D ResNet directly to the video frames in a 5-second sliding window fashion. Vanderplaetse et al. [77] combine visual and audio features in a multimodal approach. Cioppa et al. [11] introduce a context-aware loss function to model the temporal context surrounding the actions. Similarly, Vats et al. [78] use a multi-tower CNN that accounts for the uncertainty of the action locations. Tomei et al. [74] fine-tune a feature extractor and use a masking strategy to focus on the frames after the actions. We build upon those works to provide benchmark results on our extended action spotting task.

Camera shot segmentation and boundary detection. Camera shot boundaries are typically detected by differences between frames, using pixels [5], histograms [56], motion [86] or deep features [1]. In soccer, Hu et al. [33] combine motion vectors and a filtration scheme to improve color-based methods. Lefèvre et al. [46] consider adaptive thresholds and features from a hue-saturation color space. Jackman [35] uses popular 2D and 3D CNNs but detects many false positives, as it appears difficult to efficiently process the temporal domain. Yet, these works are fine-tuned for only a few games. Regarding camera classification, Tong et al. [75] first detect logos to select non-replay camera shots, further classified as long, medium, close-up or out-of-field views based on color and texture features. Conversely, Wang et al. [79] classify camera shots for the task of replay detection. Sarkar et al. [66] classify each frame in the classes of [75] based on field features and player dimensions. Kolekar et al. [43] use audio features to detect exciting moments, further classified in camera shot classes for highlight generation. In this paper, we offer a unified and enlarged corpus of annotations that allows for a thorough understanding of the video editing process.

Replay grounding. In soccer, multiple works focus on detecting replays [66, 79, 81, 83, 89], using either logo transitions or slow-motion detection, but grounding the replays with their action in the broadcast has been mostly overlooked. Babaguchi et al. [4] tackle replay linking in American football but use a heuristic approach that can hardly generalize to other sports. Ouyang et al. [57] introduce a video abstraction task to find similarities between multiple cameras in various sports videos, yet their method requires camera parameters and is tested on a restricted dataset. Replay grounding can be likened to action similarity retrieval, as in [28, 39] for action recognition. Jain et al. [36] use a Siamese structure to compare the features of two actions, and Roy et al. [65] also quantify their similarity. We propose a task of replay grounding to connect replay shots with salient moments of broadcast videos, which could find further uses in action retrieval and highlight production.

3. SoccerNet-v2 Dataset

Overview. Table 1 compares SoccerNet-v2 with the relevant video understanding datasets proposing localization tasks. SoccerNet-v2 stands out as one of the largest overall, and the largest for soccer videos by far. In particular, we manually annotated 300k timestamps, temporally anchored in the 764 hours of the 500 games of SoccerNet [24]. We center the vocabulary of our classes on the soccer game and soccer broadcast domains, hence it is well-defined and consistent across games. Such regularity makes SoccerNet-v2 the largest dataset in terms of event instances per class, thus enabling deep supervised learning at scale. As shown in Figure 2, SoccerNet-v2 provides the densest annotations w.r.t. its soccer counterparts, and flirts with the largest fine-grained generic datasets in density and size.

Table 1. Datasets. Comparative overview of relevant datasets for action localization or spotting in videos, reporting for each dataset its context, duration (hrs), number of actions, number of classes, density (act./hr), average events per class, and average video length (sec). The compared datasets are THUMOS14 [38], ActivityNet [30], Charades [70], AVA [27], HACS [88], EPIC-KITCHENS [14], SoccerNet [24], SoccerDB [37], Yu et al. [85], and SoccerNet-v2 (actions, cameras, and replays). SoccerNet-v2 provides the second largest number of annotations and the largest in soccer. †: computed with the 116k annotations of the 200 fully annotated games.

Figure 2. Datasets comparison. The areas of the tiles represent the number of annotations per dataset. SoccerNet-v2 (SN-v2) extends the initial SoccerNet [24] (SN-v1) with more annotations and tasks, and it focuses on untrimmed broadcast soccer videos. (Axes: average video length in seconds vs. number of annotations per hour.)

We hired 33 annotators for the annotation process, all frequent observers of soccer, for a total of 1600 hours of annotations. The quality of the annotations was validated by observing a large consensus between our annotators on identical games at the start and at the end of their annotation process. More details are provided in supplementary material. The annotations are divided into 3 categories: actions, camera shots, and replays, discussed hereafter.
Figure 3. SoccerNet-v2 actions. Log-scale distribution of our shown and unshown actions among the 17 classes, and proportion that each class represents. The dataset is unbalanced, with some of the most important actions in the less abundant classes. (Classes: Ball out of play, Throw-in, Foul, Indirect free-kick, Clearance, Shots on target, Shots off target, Corner, Substitution, Kick-off, Direct free-kick, Offside, Yellow card, Goal, Penalty, Red card, Yellow→red card.)

Actions. We identify 17 types of actions from the most important in soccer, listed in Figure 3. Following [24], we annotate each action of the 500 games of SoccerNet with a single timestamp, defined by well-established soccer rules. For instance, for a corner, we annotate the last frame of the shot, i.e. showing the last contact between the player's foot and the ball. We provide the annotation guidelines in supplementary material. In total, we annotate 110,458 actions, on average 221 actions per game, or 1 action every 25 seconds. SoccerNet-v2 is a significant extension of the actions of SoccerNet [24], with 16x more timestamps and 14 extra classes. We represent the distribution of the actions in Figure 3. The natural imbalance of the data corresponds to the distribution of real-life broadcasts, making SoccerNet-v2 valuable for generalization and industrial deployment.

Additionally, we enrich each timestamp with a novel binary visibility tag that states whether the associated action is shown in the broadcast video or unshown, in which case the action must be inferred by the viewer. For example, this happens when the producer shows a replay of a shot off target that lasts past the clearance shot of the goalkeeper: the viewer knows that the clearance has been made although it was not shown on the TV broadcast. Spotting unshown actions is challenging because it requires a fine understanding of the game, beyond frame-based analysis, as it forces to consider the temporal context around the actions.

We annotate the timestamps of unshown actions with the best possible temporal interpolation. They represent 18% of the actions (see Figure 3), hence form a large set of actions whose spotting requires a sharp understanding of soccer. Finally, to remain consistent with SoccerNet [24], we annotate the team that performs each action as either home or away, but leave further analysis in that regard for future work.

Figure 4. Camera shots. Log-scale distribution of our camera shot timestamps among the classes in terms of instances (top) and video duration (bottom), separated in replays and live or other sequences, and percentage of timestamps that each bar represents.

Cameras. We annotate a total of 158,493 camera change timestamps, 116,687 of which are comprehensive for a subset of 200 games, the others delimiting replay shots in the remaining games (see hereafter). For the fully annotated games, this represents an average of 583 camera transitions per game, or 1 transition every 9 seconds. Those timestamps contain the type of camera shot that has been shown, among the 13 most common possibilities listed in Figure 4. We display their distribution in terms of number of occurrences and total duration. The class imbalance underpins a difficulty of this dataset, yet it represents a distribution consistent with broadcasts used in practical applications.

Besides, different types of transitions occur from one camera shot to the next, which we append to each timestamp. These can be abrupt changes between two cameras (71.4%), fading transitions between the frames (14.2%), or logo transitions (14.2%). Logos constitute an unusual type of transition compared with abrupt or fading ones, which are common in videos in the wild or in movies, yet they are widely used in sports broadcasts. They pose an interesting camera shot detection challenge, as each logo is different and algorithms must adapt to a wide variety thereof. For logo and fading camera changes, we locate the timestamps as precisely as possible at the middle of the transition, while we annotate the last frame before an abrupt change.

Eventually, we indicate whether the camera shot happens live (86.7%) with respect to the game, shows a replay of an action (10.9%), or shows another type of replay (2.4%). The distribution in Figure 4 provides per-class proportions of replay camera shots and groups other replays and live shots.
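To fix ideas, an action timestamp and a camera change can be pictured as small records of the following kind. This is purely illustrative: the field names and values below are our own and do not necessarily match the released annotation files.

# Hypothetical annotation entries; field names are illustrative, not the official format.
action_annotation = {
    "gameTime": "2 - 34:18",        # half - mm:ss of the anchored timestamp
    "label": "Corner",               # one of the 17 action classes
    "visibility": "shown",           # binary tag: "shown" or "unshown"
    "team": "home",                  # team performing the action
}

camera_annotation = {
    "gameTime": "1 - 12:05",
    "label": "Main camera center",   # one of the 13 camera shot classes (name illustrative)
    "change_type": "logo",           # "abrupt", "fading" or "logo" transition
    "replay": "live",                # live shot, replay of an action, or other replay
}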
Replays. For the 500 games of SoccerNet [24], we bound each video shot showing a replay of an action with two timestamps, annotated in the same way as the camera shot changes. For each replay shot, we also reference the timestamp of the action replayed. When several replays of the same action are shown consecutively with different views, we annotate all the replay shots separately. This gives one replay shot per type of view, all of which are linked to the same action. In total, 32,932 replay shots are associated with their corresponding action, which represents an average of 66 replay shots per game, for an average replay shot duration of 6.8 seconds.

Retrieving a replayed action is challenging because typically, 1 to 3 replays of the action are shown from different viewpoints hardly ever found in the original live broadcast video. This encourages a more general video understanding rather than an exact frame comparison.

4. Broadcast Video Understanding Tasks

We propose a comprehensive set of tasks to move computer vision towards a better understanding of broadcast soccer videos and alleviate the editing burden of video producers. More importantly, these tasks have broader implications as they can easily be transposed to other domains. This makes SoccerNet-v2 an ideal playground for developing novel ideas and implementing innovative solutions in the general field of video understanding.

In this work, we define three main tasks on SoccerNet-v2: action spotting, camera shot segmentation with boundary detection, and replay grounding, which are illustrated in Figure 5. They are further motivated and detailed hereafter.

Figure 5. Tasks overview. We define a 17-class action spotting task, 13-class camera shot segmentation and boundary detection tasks, and a novel replay grounding task, with their associated performance metrics. They respectively focus on understanding the content of broadcast soccer games, addressing broadcast video editing tasks, and retrieving salient moments of the game.

Action spotting. In order to understand the salient actions of a broadcast soccer game, SoccerNet [24] introduces the task of action spotting, which consists in finding all the actions occurring in the videos. Beyond soccer understanding, this task addresses the more general problem of retrieving moments with a specific semantic meaning in long untrimmed videos. As such, we foresee moment spotting applications in e.g. video surveillance or video indexing.

In this task, the actions are anchored with a single timestamp, contrary to the task of activity localization [30], where activities are delimited with start and stop timestamps. We assess the action spotting performance of an algorithm with the Average-mAP metric, defined as follows. A predicted action spot is positive if it falls within a given tolerance δ of a ground-truth timestamp from the same class. The Average Precision (AP), based on PR curves, is computed per class and then averaged over the classes (mAP), after which the Average-mAP is obtained as the AUC of the mAP computed at tolerances δ ranging from 5 to 60 seconds.

Camera shot segmentation and boundary detection. Selecting the proper camera at the right moment is the crucial task of the broadcast producer to trigger the strongest emotions in the viewer during a live game. Hence, identifying camera shots not only provides a better understanding of the editing process but is also a major step towards automating the broadcast production. This task naturally generalizes to any sports broadcast but could also prove interesting for e.g. cultural events or movie summarization.

Camera shot temporal segmentation consists in classifying each video frame among our 13 camera types and is evaluated with the mIoU metric. Concurrently, we define a task of camera shot boundary detection, where the objective is to find the timestamps of the transitions between the camera shots. For the evaluation, we use the spotting mAP metric with a single tolerance δ of 1 second, as transitions are precisely localized and happen within short durations.

Replay grounding. Our novel replay grounding task consists in retrieving the timestamp of the action shown in a given replay shot within the whole game. Grounding a replay with its action confers on it an estimate of importance, which is otherwise difficult to assess. Derived applications may be further built on top of this task, e.g. automatic highlight production, as the most replayed actions are usually the most relevant. Linking broadcast editing to meaningful content within the video not only bridges our previous tasks, but can also be applied to any domain focusing on salient moments retrieval. We use the Average-AP to assess performances on this task, computed as described for the spotting task but without averaging over the classes. We choose this metric as replay grounding can be seen as class-independent action spotting conditioned by the replay sequence.
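For illustration, the Average-mAP used for action spotting (and, without class averaging, for replay grounding) can be sketched as follows, assuming per-class lists of predicted (time, confidence) spots and ground-truth timestamps in seconds for one game. This is a simplified sketch rather than the official evaluation code, and it approximates the AUC over tolerances by a uniform average over δ values from 5 to 60 seconds.

import numpy as np

def average_map(predictions, ground_truth, tolerances=range(5, 65, 5)):
    # predictions: {class: [(time_sec, confidence), ...]}
    # ground_truth: {class: [time_sec, ...]}
    maps = []
    for delta in tolerances:
        aps = []
        for cls, preds in predictions.items():
            gts = list(ground_truth.get(cls, []))
            if not gts:
                continue
            preds_sorted = sorted(preds, key=lambda p: -p[1])  # by confidence
            tp, fp, matched = [], [], set()
            for t, _ in preds_sorted:
                # a prediction is positive if it falls within delta of an unmatched ground truth
                cands = [i for i, g in enumerate(gts)
                         if abs(g - t) <= delta and i not in matched]
                if cands:
                    matched.add(min(cands, key=lambda i: abs(gts[i] - t)))
                    tp.append(1); fp.append(0)
                else:
                    tp.append(0); fp.append(1)
            if not preds_sorted:
                aps.append(0.0)
                continue
            tp, fp = np.cumsum(tp), np.cumsum(fp)
            recall = tp / len(gts)
            precision = tp / np.maximum(tp + fp, 1)
            # AP as the area under the precision-recall curve
            aps.append(float(np.trapz(precision, recall)))
        maps.append(float(np.mean(aps)) if aps else 0.0)
    return float(np.mean(maps))  # AUC over tolerances, approximated by their mean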
5. Benchmark Results

General comments. SoccerNet [24] provides high and low quality videos of the 500 games. For easier experimentation, it also provides features from ResNet [29], I3D [8] and C3D [76] computed at 2 fps, further reduced with PCA to 512 dimensions. Following [11, 24], in our experiments, we use the ResNet 512-dimensional frame features acting as compressed video representations, as they yielded better results in early experiments. We adapt the most relevant existing methods to provide benchmark results on the SoccerNet [24] test set. We release our code to reproduce them, and we will host leaderboards on dedicated servers.

5.1. Action Spotting

Methods. We adapt or re-implement efficiently all the methods that released public code on SoccerNet [24].

1. MaxPool and NetVLAD [24]. Those models pool the ResNet features temporally before passing them through a classification layer. Non-overlapping segments of 20 seconds are classified as to whether they contain any action class. In testing, a sliding window of 20 seconds with a stride of 1 frame is used to infer an actionness score in time, reduced to an action spot using NMS. We consider the basic yet lightweight max pooling and a learnable NetVLAD pooling with 64 clusters. We re-implement the method based on the original code for a better scaling to 17 classes.

2. AudioVid [77]. The network uses NetVLAD to temporally pool 20-second chunks of ResNet features, as well as VGGish [31] synchronized audio features, subsampled at 2 fps. The two sets of features are temporally pooled, concatenated and fed to a classification module, as in [24]. Similarly, the spotting prediction is at the center of the video chunk. We scaled the classification module to 17 classes.
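As an illustration of these pooling baselines rather than the authors' released implementation, a minimal PyTorch sketch of a max-pooling chunk classifier with a simple 1D non-maximum suppression is given below. The extra background output, the sigmoid activation, the NMS radius and the threshold are our assumptions; only the 20-second windows at 2 fps (40 frames), the 512-dimensional ResNet features and the 17 classes follow the description above.

import torch
import torch.nn as nn

class MaxPoolSpotter(nn.Module):
    # Pool 2 fps ResNet features over a 20-second window (40 frames) and classify the chunk.
    def __init__(self, feat_dim=512, num_classes=17):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes + 1)  # + background output (assumption)

    def forward(self, feats):            # feats: (batch, 40, 512)
        pooled, _ = feats.max(dim=1)     # temporal max pooling -> (batch, 512)
        return torch.sigmoid(self.fc(pooled))

def spot(scores, window=40, nms_radius=40, threshold=0.5):
    # Turn per-window actionness scores for one class (one score per sliding-window
    # position, stride 1 frame) into spots with a simple 1D non-maximum suppression.
    spots = []
    scores = scores.clone()
    while scores.max() >= threshold:
        idx = int(scores.argmax())
        spots.append((idx + window // 2, float(scores[idx])))  # spot at the window center
        scores[max(0, idx - nms_radius): idx + nms_radius + 1] = 0.0
    return spots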

3. CALF [11]. This network handles 2-minute chunks of ResNet features and is composed of a spatio-temporal features extractor, kept as is, a temporal segmentation module, which we adapt for 17 classes, and an action spotting module, adapted to output at most 15 predictions per chunk, classified in 17 classes. The segmentation module is trained with a context-aware loss having four context slicing hyperparameters per class. Following [11], we determine optimal values for them with a Bayesian optimization [54]. We re-implement the method and optimize the training strategy based on the existing code to achieve a decent training time.

Results. We provide the leaderboard of our benchmark results for action spotting in Table 2. We further compute the performances on shown/unshown actions as the Average-mAP for predicted spots whose closest ground-truth timestamp is a shown/unshown action. We show qualitative results obtained with CALF in Figure 6.

Table 2. Leaderboard for action spotting (Average-mAP %). Methods with publicly available code were tested on SoccerNet-v2.

Method          SoccerNet-v1   SoccerNet-v2   Shown   Unshown
MaxPool [24]         –              18.6       21.5     15.0
NetVLAD [24]        49.7            31.4       34.3     23.3
AudioVid [77]       56.0            39.9       43.0     23.3
CALF [11]           62.5            40.7       42.1     29.0

Other SoccerNet-v1 results, but with no public code available: Rongved et al. [64]: 32.0; Vats et al. [78]: 60.1; Tomei et al. [74]: 75.1.

Figure 6. Action spotting result obtained from the adapted CALF: temporal segmentation, ground truth, and spotting predictions. The network performs well on corners with only one false positive, and moderately on fouls with a few false negatives.

The pooling approaches MaxPool and NetVLAD are not on par with the other methods on SoccerNet-v2. We believe the hard pruning with MaxPool has a restricted learning capacity, limited to a single fully connected layer. Similarly, NetVLAD may lag behind because of a non-optimal choice in the design of the spotting module, in particular the Non-Maximum Suppression that discards the results with a confidence score below 0.5. AudioVid prevails on the shown instances and on 5/17 action classes. Injecting audio features appears to help with visible actions, as the sound is usually synchronized with the image. Also, it performs best on actions preceded or followed by the whistle of the referee, underlining the importance of audio features. Yet, the audio features appear less useful on unshown instances. CALF performs best globally, on the unshown instances and on most action classes. The context-aware loss focuses on the temporal context to spot the actions, which is useful for this task. This emphasizes the benefits of the temporal context surrounding the actions, which contains valuable information.

5.2. Camera Segmentation and Boundary Detection

Methods. 1. Basic model. For our first baseline for the segmentation part, we train a basic model composed of 3 layers of 1D CNN with a kernel of 21 frames, hence aggregating in time, on top of ResNet features, and an MSE loss.
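To make this baseline concrete, a possible PyTorch sketch of such a 3-layer 1D CNN segmenter is shown below. Only the 21-frame kernel, the 512-dimensional ResNet inputs, the 13 camera classes and the MSE loss follow the description above; the hidden width, the padding, the sigmoid and the target encoding are assumptions.

import torch
import torch.nn as nn

class BasicCameraSegmenter(nn.Module):
    # Three 1D convolutions with a 21-frame temporal kernel over 512-d ResNet features,
    # producing per-frame scores for the 13 camera classes. Channel widths are assumptions.
    def __init__(self, feat_dim=512, hidden=128, num_classes=13, kernel=21):
        super().__init__()
        pad = kernel // 2  # keep the temporal length unchanged
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(hidden, num_classes, kernel, padding=pad),
        )

    def forward(self, feats):                # feats: (batch, frames, 512)
        return self.net(feats.transpose(1, 2)).transpose(1, 2)  # (batch, frames, 13)

# One possible training step with the MSE loss; the per-frame target encoding is an assumption.
model = BasicCameraSegmenter()
criterion = nn.MSELoss()
feats = torch.randn(1, 200, 512)             # 100 s of video at 2 fps
target = torch.zeros(1, 200, 13)
target[..., 0] = 1.0                         # dummy per-frame labels for illustration
loss = criterion(torch.sigmoid(model(feats)), target)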

2. CALF (seg.) [11]. We adapt CALF as it provides a segmentation module on top of a spatio-temporal features extractor. We replace its loss with the cross-entropy for easier experimentation and we focus on the segmentation by removing the spotting module. The number of parameters is reduced by a factor of 5 compared with the original model.

3. Content [9]. For the boundary detection task, we test the popular scene detection library PySceneDetect. We use the Content option, which triggers a camera change when the difference between two consecutive frames exceeds a particular threshold value. This method is tested directly on the broadcast videos provided in SoccerNet [24].

4. Histogram, Intensity [69]. We test two scene detection methods of the Scikit-Video library. The Histogram method reports a camera change when the intensity histogram difference between subsequent frames exceeds a given threshold [55]. The Intensity method reports a camera change when variations in color and intensity between frames exceed a given threshold. Those methods are tested directly on the broadcast videos provided in SoccerNet [24].

5. CALF (det.) [11]. Since we can see the camera
