Task B – Spatio-temporal Action Localization (AVA)

This task is intended to evaluate the ability of algorithms to localize human actions in space and time. Each labeled video segment can contain multiple subjects, each performing potentially multiple actions. The goal is to identify these subjects and actions over continuous 15-minute video clips extracted from movies.

For this task, participants will use the new AVA atomic visual actions dataset. The long term goal of this dataset is to enable modeling of complex activities by building on top of current work in recognizing atomic actions.

This task will be divided into two challenges. Challenge #1 is strictly computer vision, i.e. participants are requested not to use signals derived from audio, metadata, etc. Challenge #2 lifts this restriction, allowing creative solutions that leverage any input modalities. We ask only that users document the additional data and features they use. Performance will be ranked separately for the two challenges.

For more information on this task, or to ask questions, please subscribe to the Google Group ava-dataset-users.


The AVA dataset will be used for this task. It densely annotates 80 atomic visual actions in 351k movie clips, with actions localized in space and time, resulting in 1.65M action labels; multiple labels per person occur frequently. Clips are drawn from 15-minute contiguous segments of movies, which opens the door to temporal reasoning about activities. The dataset is split into 242 videos for training, 66 videos for validation, and 144 videos for test. More information about how to download the AVA dataset is available here.

The list of test videos will be released approximately one month before the challenge. It will be made available on the AVA website, and announced both here and on ava-dataset-users.

Evaluation Metric

The evaluation code used by the evaluation server will be made available soon.

The official metric used in this task is Frame-mAP at spatial IoU >= 0.5. Since action frequency in AVA follows the natural distribution, the metric is averaged across the top 60 most common action classes in AVA, listed here.
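To make the IoU >= 0.5 criterion concrete, the following is a minimal sketch of how the overlap between a predicted and a ground-truth person box could be computed. It assumes boxes are given as (x1, y1, x2, y2) in normalized coordinates; the function names box_iou and is_match are illustrative and are not taken from the official evaluation code, which may differ in its details.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: True if the boxes overlap enough to count as a match."""
    return box_iou(pred_box, gt_box) >= threshold
```

In a Frame-mAP computation, a detection would typically count as a true positive only if it passes this spatial test against a ground-truth box carrying the same action label; average precision is then computed per class and averaged over the evaluated classes.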


A pre-trained model will be available soon. Baseline results can be found in the paper on arXiv.org.

Submission Format

When submitting your results for this task, please use the same CSV format used for the ground truth AVA train/val files.

The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id

  • video_id: YouTube identifier
  • middle_frame_timestamp: in seconds from the start of the YouTube video.
  • person_box: top-left (x1, y1) and bottom-right (x2, y2) corners, normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left and (1.0, 1.0) corresponds to the bottom right.
  • action_id: integer identifier of an action class, from ava_action_list_v2.0_for_activitynet_2018.pbtxt.txt.
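As an illustration of the row layout described above, here is a minimal sketch that serializes and parses rows with Python's standard csv module. It assumes that person_box is written as four separate comma-separated columns (x1, y1, x2, y2), that three decimal places are sufficient, and that the timestamp is written as passed in; these are assumptions, so check them against the ground truth train/val files. The function names and the video identifier used in the usage note are hypothetical.

```python
import csv
import io


def format_row(video_id, timestamp, box, action_id):
    """Build one row: video_id, timestamp, x1, y1, x2, y2, action_id."""
    x1, y1, x2, y2 = box
    if not all(0.0 <= v <= 1.0 for v in box):
        raise ValueError("box coordinates must be normalized to [0, 1]")
    if x1 > x2 or y1 > y2:
        raise ValueError("expected top-left corner before bottom-right corner")
    # Assumption: three decimal places per coordinate.
    coords = [f"{v:.3f}" for v in box]
    return [str(video_id), str(timestamp)] + coords + [str(int(action_id))]


def rows_to_csv(rows):
    """Serialize (video_id, timestamp, box, action_id) tuples to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for video_id, timestamp, box, action_id in rows:
        writer.writerow(format_row(video_id, timestamp, box, action_id))
    return buf.getvalue()


def parse_row(fields):
    """Parse one 7-column row back into (video_id, timestamp, box, action_id)."""
    video_id, timestamp, x1, y1, x2, y2, action_id = fields
    box = (float(x1), float(y1), float(x2), float(y2))
    return video_id, timestamp, box, int(action_id)
```

For example, rows_to_csv([("HYPOTHETICAL_ID", 902, (0.1, 0.2, 0.3, 0.4), 12)]) produces a single line with seven comma-separated fields. A person performing several actions at the same timestamp would be written as several rows that share the same box and differ only in action_id.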

An example taken from the validation set is: