Since the success of the previous ActivityNet Challenges and based on your feedback, we have worked hard on making this round richer and more inclusive. We are proud to announce that this year's challenge will be a packed half-day workshop with parallel tracks and will host 13 diverse challenges, which aim to push the limits of semantic visual understanding of videos as well as bridge visual content with human captions. Three out of the thirteen challenges are based on the ActivityNet Dataset. These tasks focus on tracing evidence of activities in time in the form of class labels, captions, and object entities. In this installment of the challenge, we will host ten guest tasks, which enrich the understanding of visual information in videos. These tasks focus on complementary aspects of the video understanding problem at large scale and involve challenging and recently compiled datasets.
The goal of this challenge is to temporally localize actions in untrimmed videos, in both (i) a supervised and (ii) a weakly-supervised setting. Please find more details on the website.
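For intuition, here is a minimal sketch of what a temporal localization prediction looks like and how such predictions are commonly matched to ground truth via temporal intersection-over-union (tIoU). The segment format, the example label, and the 0.5 threshold are illustrative assumptions, not the official evaluation protocol.

    # Minimal sketch: a localization prediction as (start, end) in seconds plus
    # a class label and confidence score, matched to an annotation by temporal IoU.

    def temporal_iou(pred, gt):
        """tIoU of two (start, end) segments given in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    # Hypothetical prediction and annotation for one untrimmed video.
    prediction = {"video": "v_example", "label": "Long jump", "segment": (12.0, 21.5), "score": 0.87}
    annotation = {"video": "v_example", "label": "Long jump", "segment": (10.0, 20.0)}

    tiou = temporal_iou(prediction["segment"], annotation["segment"])
    # With a commonly used tIoU threshold of 0.5, this prediction would count
    # as a true positive because the labels also match.
    print(f"tIoU = {tiou:.2f}")  # ~0.70

In the weakly-supervised setting, training typically relies on video-level labels only, while evaluation still expects segment-level predictions of this form.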
This challenge leverages the SoccerNet-V2 dataset, which contains over 500 games covering three seasons of the six major European football leagues. Given a professional soccer broadcast, it aims to encourage participants to spot the exact timestamps at which (i) various actions occur and (ii) the actions shown in replay sequences originally occurred.
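To make the spotting setup concrete, the following sketch shows one simple matching rule: a spotted timestamp counts as correct if it falls within a temporal tolerance of an annotated anchor of the same class. The tolerance value and the example classes are illustrative assumptions, and the official metric aggregates performance over a range of tolerances rather than a single one.

    # Minimal sketch (an assumed matching rule, not SoccerNet's official metric):
    # a predicted timestamp is correct if it lies within a temporal tolerance
    # of a ground-truth anchor of the same class.

    def is_correct_spot(pred_time, pred_class, annotations, tolerance=5.0):
        """pred_time in seconds; annotations is a list of (time, class) anchors."""
        return any(
            pred_class == gt_class and abs(pred_time - gt_time) <= tolerance
            for gt_time, gt_class in annotations
        )

    # Hypothetical anchors: a goal at 1843.0 s and a yellow card at 1921.5 s.
    annotations = [(1843.0, "Goal"), (1921.5, "Yellow card")]
    print(is_correct_spot(1845.2, "Goal", annotations))  # True within a 5 s tolerance
    print(is_correct_spot(1900.0, "Goal", annotations))  # False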
This task seeks to encourage the development of robust automatic activity detection algorithms for extended videos. Challenge participants will develop algorithms to detect and temporally localize instances of Known Activities, submitted via the ActEV Command Line Interface (CLI) and evaluated on the Unknown Facility EO video dataset.
This task aims to evaluate how grounded or faithful a description (either generated or ground-truth) is to the video it describes. An object word is first identified in the description and then localized in the video in the form of a spatial bounding box. The prediction is compared against the human annotation to determine its correctness and the overall localization accuracy.
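The sketch below illustrates the kind of box-level comparison involved: a predicted bounding box for an object word is matched against the human-annotated box by intersection-over-union (IoU). The (x1, y1, x2, y2) box format and the 0.5 threshold are common conventions assumed here for illustration, not necessarily the task's exact evaluation rule.

    # Minimal sketch: boxes as (x1, y1, x2, y2) in pixels; a predicted box for
    # an object word is judged correctly localized when its IoU with the
    # annotated box exceeds a threshold such as 0.5 (an assumed convention).

    def box_iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # Hypothetical boxes for the object word "guitar" in one described frame.
    predicted_box = (120.0, 60.0, 320.0, 300.0)
    annotated_box = (100.0, 50.0, 300.0, 310.0)
    print(box_iou(predicted_box, annotated_box) >= 0.5)  # True: counted as correctly localized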
This challenge evaluates the ability of vision algorithms to understand complex, related events in a video. Each event may be described by a verb corresponding to the most salient action in a video segment, together with its semantic roles. VidSRL involves three sub-tasks: (1) predicting a verb-sense describing the most salient action; (2) predicting the semantic roles for a given verb; and (3) predicting event relations given the verbs and semantic roles of two events.
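To illustrate the structured output these sub-tasks target, the snippet below sketches a hypothetical representation of two related events, each as a verb-sense with semantic roles, plus a relation between them. The role names, verb-sense identifiers, and relation label follow common semantic-role-labeling conventions and are assumptions for illustration, not the challenge's exact annotation schema.

    # Illustrative sketch only (hypothetical schema, not the official annotation format).
    event_1 = {
        "verb_sense": "throw.01",
        "roles": {"Arg0 (agent)": "man in red jacket",
                  "Arg1 (thing thrown)": "ball",
                  "Location": "park"},
    }
    event_2 = {
        "verb_sense": "catch.01",
        "roles": {"Arg0 (agent)": "dog",
                  "Arg1 (thing caught)": "ball"},
    }
    # Sub-task (3) asks how two such events relate; the label below is an assumption.
    relation = {"pair": ("event_1", "event_2"), "label": "caused by"}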
This challenge focuses on cross-modal video action understanding that addresses the shortcomings of visual-only approaches by leveraging both sensor- and vision-based modalities, in ways that can overcome the limitations imposed by the modality discrepancy between the training (sensor + video) and test (video only) phases. Two sub-tasks are provided: (1) Action Recognition; (2) Action Temporal Localization.
We are releasing Action Genome and Home Action Genome (HOMAGE). Action Genome is a compositional activity recognition benchmark based on Charades. Like Action Genome, Home Action Genome focuses on compositional activity recognition in the home, but adds multiple views and additional sensor modalities. We will have two tracks for this challenge: Activity Recognition and Scene Graph Detection.