Task B – Spatio-temporal Action Localization (AVA)

This task is intended to evaluate the ability of algorithms to localize human actions in space and time, using the AVA Dataset. This year we'll continue with AVA Actions as the primary challenge, while also introducing a new secondary challenge based on the recently released AVA-ActiveSpeaker dataset. Performance will be ranked separately for the two challenges.

For more information on the challenges, or to ask questions, please subscribe to the Google Group: ava-dataset-users.

Challenge #1: AVA Actions

For this task, participants will use the AVA Actions dataset. The long-term goal of this dataset is to enable modeling of complex activities by building on top of current work in recognizing atomic actions.

Each labeled video segment can contain multiple subjects, each performing potentially multiple actions. The goal is to identify these subjects and actions over continuous 15-minute video clips extracted from movies.

Participants are allowed to leverage any input modalities (e.g. audio/video) or additional datasets, but are requested to document these in the report to be submitted to the challenge.

Dataset

The AVA Actions dataset (version 2.2) will be used for this task. The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute movie clips, where actions are localized in space and time, resulting in 1.62M action labels, with multiple labels per person occurring frequently. Clips are drawn from contiguous segments of movies, to open the door for temporal reasoning about activities. The dataset is split into 235 videos for training, 64 videos for validation, and 131 videos for test. More information about how to download the AVA dataset is available here.

The list of test videos is available on the AVA website, along with details of which timestamps will be used for testing.

Evaluation Metric

The evaluation code used by the evaluation server can be found in the ActivityNet Github repository. Please contact the AVA team via this Google Group with any questions or issues about the code.

The official metric used in this task is Frame-mAP at spatial IoU >= 0.5. Because action frequency in AVA follows the natural distribution, the metric is averaged across the top 60 most common action classes in AVA, listed here.
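
As a rough illustration of the matching criterion (this is not the official scoring code, which is in the ActivityNet repository above), the spatial IoU test that a detection must pass against a same-class ground-truth box on the same keyframe can be sketched as follows; all names here are illustrative:

          # Illustrative sketch only -- use the official ActivityNet evaluation
          # code for actual scoring. Boxes are (x1, y1, x2, y2) in normalized
          # [0, 1] coordinates, as in the AVA CSV files.

          def spatial_iou(box_a, box_b):
              """Intersection-over-union of two normalized boxes."""
              ax1, ay1, ax2, ay2 = box_a
              bx1, by1, bx2, by2 = box_b
              ix1, iy1 = max(ax1, bx1), max(ay1, by1)
              ix2, iy2 = min(ax2, bx2), min(ay2, by2)
              inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
              area_a = (ax2 - ax1) * (ay2 - ay1)
              area_b = (bx2 - bx1) * (by2 - by1)
              union = area_a + area_b - inter
              return inter / union if union > 0 else 0.0

          # A detection can match a ground-truth box of the same class on the
          # same keyframe only when spatial_iou(det, gt) >= 0.5; per-class AP
          # is then averaged over the 60 evaluated classes to give Frame-mAP.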

Baselines

A basic pre-trained model will be available on the AVA website. Baseline results on AVA v2.1 can be found in the results from last year's challenge.

Submission Format

When submitting your results for this task, please use the same CSV format used for the ground truth AVA train/val files, with the addition of a score column for each box-label.

The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, score

  • video_id: YouTube identifier
  • middle_frame_timestamp: in seconds from the start of the video.
  • person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0) corresponds to bottom right.
  • action_id: integer identifier of an action class, from ava_action_list_v2.2_for_activitynet_2019.pbtxt.
  • score: a float indicating the score for this labeled box.

An example taken from the validation set is:

          1j20qq1JyX4,0902,0.002,0.118,0.714,0.977,12,0.9
          1j20qq1JyX4,0905,0.193,0.016,1.000,0.978,11,0.8
          1j20qq1JyX4,0905,0.193,0.016,1.000,0.978,74,0.96
          20TAGRElvfE,0907,0.285,0.559,0.348,0.764,17,0.72
          ...
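
As a minimal sketch of how such a file might be produced (the detections list, variable names, and output filename below are placeholders; the four-digit zero-padded timestamp simply mirrors the example rows above):

          import csv

          # Hypothetical detections: (video_id, timestamp, (x1, y1, x2, y2), action_id, score)
          detections = [
              ("1j20qq1JyX4", 902, (0.002, 0.118, 0.714, 0.977), 12, 0.9),
          ]

          # Output filename is arbitrary; the evaluation server expects plain CSV.
          with open("ava_actions_submission.csv", "w", newline="") as f:
              writer = csv.writer(f)
              for video_id, ts, (x1, y1, x2, y2), action_id, score in detections:
                  writer.writerow([video_id, f"{ts:04d}",
                                   f"{x1:.3f}", f"{y1:.3f}", f"{x2:.3f}", f"{y2:.3f}",
                                   action_id, score])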
          

Challenge #2: Active Speaker Detection

The goal of this task is to evaluate whether algorithms can determine if and when a visible face is speaking.

Each labeled video segment and accompanying audio can contain multiple visible subjects. Each visible subject’s face bounding box will be provided, as well as box association over time. Your task will be to determine whether the specified faces are speaking at a given time.

For this task, participants will use the new AVA-ActiveSpeaker dataset. The purpose of this dataset is both to extend the AVA Actions dataset to the very useful task of active speaker detection, and to push the state of the art in multimodal perception, so participants are encouraged to use both the audio and the video data. If additional data is used, either other modalities or other datasets, we ask that participants document this in their report.

Dataset

The AVA-ActiveSpeaker dataset will be used for this task. The AVA-ActiveSpeaker dataset associates speaking activity with a visible face on the AVA v1.0 videos. It contains 3.65 million labeled frames drawn from 15-minute continuous video segments, split into 120 videos for training and 33 videos for validation.

The held-out test set for the challenge, containing a total of 2,053,509 frames across all the test videos, is now available at the Active Speaker Download page. The true label for these entries is not provided; instead the label column always contains SPEAKING_AUDIBLE.

More information about how to download the AVA dataset is available here, and information about how to submit model predictions to the evaluation server is provided in the Submission Format section below.

Evaluation Metric

The evaluation code used by the evaluation server can be found in the ActivityNet Github repository. Please contact the AVA team via this Google Group with any questions or issues about the code.

The official metric used in this task is mean Average Precision (mAP).
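
For a rough sense of how the SPEAKING_AUDIBLE scores are turned into this number (the official evaluation code in the ActivityNet repository remains the reference), average precision can be computed from binary ground-truth labels and the predicted scores, for example with scikit-learn; the labels and scores below are made up:

          from sklearn.metrics import average_precision_score

          # Toy example: 1 = SPEAKING_AUDIBLE, 0 = not speaking, paired with the
          # model's SPEAKING_AUDIBLE scores for the same face boxes.
          y_true = [1, 0, 1, 1, 0]
          y_score = [0.9, 0.2, 0.7, 0.4, 0.6]

          print(average_precision_score(y_true, y_score))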

Baselines

A basic pre-trained model will be available on the AVA website. Baseline results can be found in the paper on arXiv.org.

Submission Format

Submissions to the evaluation server will consist of a single CSV file containing model predictions for each entry in each video in the held-out test set, and must therefore contain a total of 2,053,509 lines.

When submitting your results for this task, please use the same CSV format used for the ground truth AVA-ActiveSpeaker train/val files, with the addition of a score column for each box-label.

The format of a row is the following: video_id, frame_timestamp, entity_box, label, entity_id, score

  • video_id: YouTube identifier
  • frame_timestamp: in seconds from the start of the video.
  • entity_box: face bounding box, top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0) corresponds to bottom right.
  • label: SPEAKING_AUDIBLE (other labels will be ignored).
  • entity_id: a unique string allowing this box to be linked to other boxes.
  • score: a float in [0.0, 1.0] indicating the score for the label. Larger values indicate higher confidence that the subject is SPEAKING_AUDIBLE.

An example taken from the validation set is:

          -IELREHX_js,1744.88,0.514803,0,0.919408,0.701754,SPEAKING_AUDIBLE,-IELREHX_js_1740_1800:5,0.713503
          ...
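
Before uploading, it may be worth sanity-checking the submission file locally. The sketch below (the filename is an assumption) checks the column layout, the score range, and the expected total of 2,053,509 rows:

          import csv

          EXPECTED_ROWS = 2_053_509  # one prediction per test-set entry

          rows = 0
          with open("ava_active_speaker_submission.csv", newline="") as f:  # assumed filename
              for row in csv.reader(f):
                  video_id, ts, x1, y1, x2, y2, label, entity_id, score = row
                  assert label == "SPEAKING_AUDIBLE"
                  assert 0.0 <= float(score) <= 1.0
                  rows += 1

          assert rows == EXPECTED_ROWS, f"expected {EXPECTED_ROWS} rows, found {rows}"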