Challenge Description

Challenge Introduction

Building on the success of the previous ActivityNet Challenges and on your feedback, we have worked hard to make this round richer and more inclusive. We are proud to announce that this year's challenge will be a packed half-day workshop with parallel tracks and will host 12 diverse challenges, which aim to push the limits of semantic visual understanding of videos and to bridge visual content with human captions.
Three out of the twelve challenges are based on the ActivityNet Dataset. These tasks focus on tracing evidence of activities in time in the form of class labels, captions, and object entities. In this installment of the challenge, we will host ten guest tasks, which enrich the understanding of visual information in videos. These tasks focus on complementary aspects of the video understanding problems at large scale and involve challenging and recently compiled datasets.

Action Recognition

Kinetics-700 Challenge

The Kinetics 2021 challenge will have two tracks: supervised and self-supervised classification. Both will be restricted to using RGB and/or audio modalities from videos in the Kinetics-700-2020 dataset.


This challenge focuses on recognizing tiny actions in videos. A video can contain multiple activities, and video resolutions range from 10x10 to 128x128 pixels. The task will run on the TinyVIRAT benchmark dataset.

Temporal Localization

ActivityNet Temporal Action Localization

This task is intended to evaluate the ability of algorithms to temporally localize activities in untrimmed video sequences. Here, videos can contain more than one activity instance, and multiple activity categories can appear in the video.
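The description above does not say how predicted segments are scored. A common way to compare a predicted time interval with an annotated one is temporal intersection-over-union (tIoU), and the following minimal sketch assumes that measure. The function name, the dictionary layout, and the example labels and timestamps are all hypothetical and are not taken from the challenge's evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predictions for one untrimmed video: several instances,
# possibly from different activity categories.
predictions = [
    {"label": "Long jump", "segment": (10.0, 20.0)},
    {"label": "Triple jump", "segment": (35.0, 50.0)},
]
# Hypothetical annotated instance from the same video.
annotation = {"label": "Long jump", "segment": (12.0, 22.0)}

# Overlap of each prediction with the annotated instance.
overlaps = [temporal_iou(p["segment"], annotation["segment"]) for p in predictions]
```

A prediction would typically count as a match only if its label agrees with the annotation and its tIoU exceeds a chosen threshold; the threshold is left open here because the text does not specify one.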

HACS Temporal Action Localization Challenge 2021

The goal of this challenge is to temporally localize actions in untrimmed videos, in (i) supervised and (ii) weakly-supervised manners. Please find more details on the website.

SoccerNet Challenge

This challenge leverages the SoccerNet-V2 dataset, which contains over 500 games covering three seasons of the six major European football leagues. It aims to encourage participants to spot the exact timestamps in the video at which: (i) various actions occur, and (ii) actions replayed in the sequences occur, given a professional soccer broadcast.

Spatio-Temporal Localization

AVA-Kinetics & Active Speakers

This challenge addresses two fundamental problems for spatio-temporal video understanding: (i) localizing action extents in space and time, and (ii) densely detecting active speakers in video sequences.

ActEV SDL Unknown Facility (UF)

This task seeks to encourage the development of robust automatic activity detection algorithms for an extended video. Challenge participants will develop algorithms to detect and temporally localize instances of Known Activities using an ActEV Command Line Interface (CLI) submission on the Unknown Facility EO video dataset.

Complex Event Understanding

ActivityNet Event Dense-Captioning

This task involves both detecting and describing events in a video. For this task, participants will use the ActivityNet Captions dataset, a new large-scale benchmark for dense-captioning events.

ActivityNet Entities Object Localization

This task aims to evaluate how grounded or faithful a description (either generated or ground-truth) is to the video it describes. An object word is first identified in the description and then localized in the video in the form of a spatial bounding box. The prediction is compared against the human annotation to determine the correctness and overall localization accuracy.
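The comparison between a predicted box and the human annotation can be sketched as follows. This is only an illustration under stated assumptions: boxes are taken to be (x1, y1, x2, y2) in pixels, a prediction is treated as correct when its intersection-over-union with the annotated box is at least 0.5, and accuracy is the fraction of correct predictions. The function names, the box format, and the 0.5 threshold are assumptions and are not stated in the task description.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region; zero if the boxes do not overlap.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(predicted, annotated, threshold=0.5):
    """Fraction of object words whose predicted box matches the annotation.

    `predicted` and `annotated` are parallel lists with one box per object word.
    The threshold value is an assumption made for this sketch.
    """
    if not annotated:
        return 0.0
    correct = sum(box_iou(p, g) >= threshold for p, g in zip(predicted, annotated))
    return correct / len(annotated)
```

For example, with two object words where one predicted box coincides with its annotation and the other overlaps its annotation only slightly, `localization_accuracy` returns 0.5.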

Video Semantic Role Labeling (VidSitu dataset)

This challenge evaluates the ability of vision algorithms to understand complex related events in a video. Each event may be described by a verb corresponding to the most salient action in a video segment and its semantic roles. VidSRL involves 3 sub-tasks: (1) predicting a verb-sense describing the most salient action; (2) predicting the semantic roles for a given verb; and (3) predicting event relations given the verbs and semantic roles for two events.

Multi-view & Cross-modal Video Understanding

MMAct Challenge

This challenge focuses on cross-modal video action understanding. It addresses the shortcomings of visual-only approaches by leveraging both sensor- and vision-based modalities in ways that can overcome the limitations imposed by the modality discrepancy between the training (sensor + video) and testing (video only) phases. Two sub-tasks are provided: (1) Action Recognition; (2) Action Temporal Localization.


We are releasing Action Genome and Home Action Genome (HOMAGE). Action Genome targets compositional activity recognition and is based on Charades. Like Action Genome, Home Action Genome focuses on compositional activity recognition in the home, but it adds multiple views and additional sensor modalities. We will have two tracks for this challenge: Activity Recognition and Scene Graph Detection.