Task B – Spatio-temporal Action Localization (AVA)

This task is intended to evaluate the ability of algorithms to localize human actions in space and time. Each labeled video segment can contain multiple subjects, each performing potentially multiple actions. The goal is to identify these subjects and actions over continuous 15-minute video clips extracted from movies.

For this task, participants will use the new AVA atomic visual actions dataset. The long-term goal of this dataset is to enable modeling of complex activities by building on current work in recognizing atomic actions.

This task will be divided into two challenges. Challenge #1 is strictly computer vision, i.e. participants are requested not to use signals derived from audio, metadata, etc. Challenge #2 lifts this restriction, allowing creative solutions that leverage any input modalities. We ask only that users document the additional data and features they use. Performance will be ranked separately for the two challenges.

For more information on this task, or to ask questions, please subscribe to the Google Group ava-dataset-users.


Version 2.1 of the AVA dataset will be used for this task. The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute movie clips, where actions are localized in space and time, resulting in 1.58M action labels, with multiple labels per person occurring frequently. Clips are drawn from contiguous segments of movies, which opens the door to temporal reasoning about activities. The dataset is split into 235 videos for training, 64 videos for validation, and 131 videos for testing. More information about how to download the AVA dataset is available here.

The list of test videos is now available on the AVA website, along with details of which timestamps will be used for testing.

Evaluation Metric

The evaluation code used by the evaluation server can be found in the ActivityNet Github repository. Please contact the AVA team via this Google Group with any questions or issues about the code.

The official metric used in this task is Frame-mAP at spatial IoU >= 0.5. Since action frequency in AVA follows the natural distribution, the metric is averaged across the top 60 most common action classes in AVA, listed here.
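To make the spatial IoU >= 0.5 criterion concrete, the following is a minimal sketch of how the overlap between a predicted box and a ground-truth box could be checked. It is not the official evaluation code (which lives in the ActivityNet Github repository); the function names box_iou and is_match are hypothetical, and boxes are assumed to be (x1, y1, x2, y2) tuples in normalized coordinates.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    Illustrative helper only; coordinates are assumed to be normalized
    to [0, 1] with (x1, y1) the top-left and (x2, y2) the bottom-right.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlapping region (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """True when the predicted box overlaps the ground truth at IoU >= threshold."""
    return box_iou(pred_box, gt_box) >= threshold
```

Under this sketch, a prediction counts toward a class's average precision only when is_match holds for a ground-truth box carrying the same action label; the per-class values are then averaged over the evaluated classes.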


A basic pre-trained model will be available on the AVA website. Baseline results can be found in the paper on arXiv.org.

Submission Format

When submitting your results for this task, please use the same CSV format used for the ground truth AVA train/val files, with the addition of a score column for each box-label.

The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, score

  • video_id: YouTube identifier of the video.
  • middle_frame_timestamp: time of the middle frame, in seconds from the start of the video.
  • person_box: top-left (x1, y1) and bottom-right (x2, y2) corners, normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left and (1.0, 1.0) corresponds to the bottom right.
  • action_id: integer identifier of an action class, from ava_action_list_v2.1_for_activitynet_2018.pbtxt.
  • score: a float indicating the score for this labeled box.
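As a rough illustration of the row layout described above, the sketch below writes predictions to CSV with Python's standard csv module. It assumes that person_box is flattened into four separate columns (x1, y1, x2, y2), giving eight columns per row; the helper names format_row and write_rows, the numeric precision, and the placeholder values in the usage note are all hypothetical and are not taken from the dataset.

```python
import csv
import io


def format_row(video_id, timestamp, box, action_id, score):
    """Build one submission row from a single box-label prediction.

    Assumes the box is (x1, y1, x2, y2) in normalized coordinates and
    is written as four separate CSV columns.
    """
    x1, y1, x2, y2 = box
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"box is not normalized or is inverted: {box}")
    return [
        video_id,
        timestamp,          # passed through unchanged; formatting is up to the caller
        f"{x1:.3f}",
        f"{y1:.3f}",
        f"{x2:.3f}",
        f"{y2:.3f}",
        int(action_id),
        f"{score:.4f}",
    ]


def write_rows(predictions, fileobj):
    """Write an iterable of (video_id, timestamp, box, action_id, score) tuples."""
    writer = csv.writer(fileobj)
    for prediction in predictions:
        writer.writerow(format_row(*prediction))
```

For instance, calling write_rows([("example_video", 902, (0.1, 0.2, 0.5, 0.9), 12, 0.87)], buf) on an io.StringIO buffer buf produces a single eight-column line; a person performing several actions would appear as several rows sharing the same box, one per action_id.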

An example taken from the validation set is: