Task C – Spatio-temporal Action Localization (AVA)

This task is intended to evaluate the ability of algorithms to localize human actions in space and time, using the AVA Dataset. This year we introduce the AVA-Kinetics Crossover challenge and the Active Speaker Detection challenge. Performance will be ranked separately for the two challenges.

For more information on the challenges, or to ask questions, please subscribe to the Google Group: ava-dataset-users.

Challenge #1: AVA-Kinetics Crossover

The AVA-Kinetics task is an umbrella for a crossover of the previous AVA and Kinetics tasks, where Kinetics has now been annotated with AVA labels (but AVA has not been annotated with Kinetics labels). There has always been some interaction between the two datasets; for example, many AVA methods are pre-trained on Kinetics. The new annotations should allow for improved performance on both tasks and also increase the diversity of the AVA evaluation set (which now also includes Kinetics clips).

For information related to this task, please contact: dross@google.com

Dataset

The AVA-Kinetics Dataset will be used for this task. The AVA-Kinetics dataset consists of the original 430 videos from AVA v2.2, together with 238k videos from the Kinetics-700 dataset.

AVA-Kinetics, our latest release, is a crossover between the AVA Actions and Kinetics datasets. To provide localized action labels on a wider variety of visual scenes, we have annotated videos from Kinetics-700 with AVA action labels, nearly doubling the total number of annotations and increasing the number of unique videos by over 500x. We hope this will expand the generalizability of localized action models and open the door to new approaches in multi-task learning. AVA-Kinetics is described in detail in the arXiv paper.

Evaluation Metric

The evaluation code used by the evaluation server can be found in the ActivityNet GitHub repository. Please contact the AVA team via this Google Group with any questions or issues about the code.

The official metric used in this task is Frame-mAP at spatial IoU >= 0.5. Since action frequencies in AVA follow the natural distribution, the metric is averaged across the top 60 most common action classes in AVA, listed here.
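
To make the matching criterion concrete, the sketch below computes the spatial IoU between two boxes given in the normalized (x1, y1, x2, y2) format used by the dataset (see Submission Format). This is only an illustration, not the official evaluation code linked above.

          def spatial_iou(box_a, box_b):
              # Boxes are (x1, y1, x2, y2) with coordinates normalized to [0, 1],
              # as in the submission format below.
              ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
              ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
              inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
              area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
              area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
              union = area_a + area_b - inter
              return inter / union if union > 0 else 0.0

          # A predicted box is eligible to match a ground-truth box only if IoU >= 0.5.
          print(spatial_iou((0.002, 0.118, 0.714, 0.977), (0.0, 0.1, 0.7, 1.0)))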

Baselines

A basic pre-trained model will be available on the AVA website. Baseline results on AVA v2.1 can be found in the results from last year's challenge.

Submission Format

When submitting your results for this task, please use the same CSV format used for the ground truth AVA train/val files, with the addition of a score column for each box-label.

The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, score

  • video_id: YouTube identifier
  • middle_frame_timestamp: in seconds from the start of the video.
  • person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0) corresponds to bottom right.
  • action_id: integer identifier of an action class, from ava_action_list_v2.2_for_activitynet_2019.pbtxt.
  • score: a float indicating the score for this labeled box.

An example taken from the validation set is:

          1j20qq1JyX4,0902,0.002,0.118,0.714,0.977,12,0.9
          1j20qq1JyX4,0905,0.193,0.016,1.000,0.978,11,0.8
          1j20qq1JyX4,0905,0.193,0.016,1.000,0.978,74,0.96
          20TAGRElvfE,0907,0.285,0.559,0.348,0.764,17,0.72
          ...
          
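For reference, a minimal sketch of writing predictions in this format with Python's csv module is shown below; the detections list and output file name are hypothetical placeholders, and only the column order and value formatting follow the specification above.

          import csv

          # Hypothetical predictions: (video_id, middle_frame_timestamp,
          # (x1, y1, x2, y2), action_id, score).
          detections = [
              ("1j20qq1JyX4", 902, (0.002, 0.118, 0.714, 0.977), 12, 0.9),
              ("20TAGRElvfE", 907, (0.285, 0.559, 0.348, 0.764), 17, 0.72),
          ]

          with open("ava_kinetics_submission.csv", "w", newline="") as f:
              writer = csv.writer(f)
              for video_id, timestamp, (x1, y1, x2, y2), action_id, score in detections:
                  writer.writerow([video_id, "%04d" % timestamp,
                                   "%.3f" % x1, "%.3f" % y1, "%.3f" % x2, "%.3f" % y2,
                                   action_id, score])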

Challenge #2: Active Speaker Detection

The goal of this task is to evaluate whether algorithms can determine if and when a visible face is speaking.

Each labeled video segment and accompanying audio can contain multiple visible subjects. Each visible subject’s face bounding box will be provided, as well as box association over time. Your task will be to determine whether the specified faces are speaking at a given time.

For this task, participants will use the new AVA-ActiveSpeaker dataset. The purpose of this dataset is both to extend the AVA Actions dataset to the very useful task of active speaker detection and to push the state of the art in multimodal perception. Participants are therefore encouraged to use both the audio and video data. If additional data is used, whether other modalities or other datasets, we ask that participants provide documentation.

Dataset

The AVA-ActiveSpeaker dataset will be used for this task. The AVA-ActiveSpeaker dataset associates speaking activity with a visible face on the AVA v1.0 videos. It contains 3.65 million frames in 15-minute continuous video segments, with 120 videos for training and 33 videos for validation.

The held-out test set for the challenge, containing a total of 2,053,509 frames across all the test videos, is now available at the Active Speaker Download page. The true label for these entries is not provided; instead the label column always contains SPEAKING_AUDIBLE.

More information about how to download the AVA dataset is available here, and information about how to submit model predictions to the evaluation server is provided in the Submission Format section below.

Evaluation Metric

The evaluation code used by the evaluation server can be found in the ActivityNet GitHub repository. Please contact the AVA team via this Google Group with any questions or issues about the code.

The official metric used in this task is mean Average Precision (mAP).
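
As a rough illustration of how such a score might be computed for a set of labeled face boxes, the sketch below uses scikit-learn's average_precision_score on hypothetical binary ground-truth labels (1 for SPEAKING_AUDIBLE) and predicted scores; the official number is produced by the evaluation code linked above.

          from sklearn.metrics import average_precision_score

          # Hypothetical inputs: one ground-truth label (1 = SPEAKING_AUDIBLE,
          # 0 = not speaking) and one predicted score per labeled face box.
          y_true = [1, 0, 1, 1, 0]
          y_score = [0.9, 0.2, 0.7, 0.4, 0.1]

          # Average precision over these boxes; the official mAP aggregates such
          # values over the full evaluation set.
          print(average_precision_score(y_true, y_score))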

Baselines

A basic pre-trained model will be available on the AVA website. Baseline results can be found in the paper on arXiv.org.

Submission Format

Submissions to the evaluation server will consist of a single CSV file, containing model predictions for each entry in each video in the held out test set, and must therefore contain a total of 2,053,509 lines.

When submitting your results for this task, please use the same CSV format used for the ground truth AVA-ActiveSpeaker train/val files, with the addition of a score column for each box-label.

The format of a row is the following: video_id, frame_timestamp, entity_box, label, entity_id, score

  • video_id: YouTube identifier
  • frame_timestamp: in seconds from the start of the video.
  • entity_box: face bounding box, top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top left, and (1.0, 1.0) corresponds to bottom right.
  • label: SPEAKING_AUDIBLE (other labels will be ignored).
  • entity_id: a unique string allowing this box to be linked to other boxes.
  • score: a float in [0.0, 1.0] indicating the score for the label. Larger values indicate higher confidence that the subject is SPEAKING_AUDIBLE.

An example taken from the validation set is:

          -IELREHX_js,1744.88,0.514803,0,0.919408,0.701754,SPEAKING_AUDIBLE,-IELREHX_js_1740_1800:5,0.713503
          ...
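
Before uploading, it can be worth sanity-checking the file, e.g. that it contains exactly 2,053,509 rows, that the label column is always SPEAKING_AUDIBLE, and that every score lies in [0.0, 1.0]. A minimal sketch, with a hypothetical file name:

          import csv

          EXPECTED_ROWS = 2053509  # one prediction per entry in the held-out test set

          with open("active_speaker_submission.csv", newline="") as f:
              rows = list(csv.reader(f))

          assert len(rows) == EXPECTED_ROWS, "wrong number of rows: %d" % len(rows)
          for video_id, ts, x1, y1, x2, y2, label, entity_id, score in rows:
              assert label == "SPEAKING_AUDIBLE"
              assert 0.0 <= float(score) <= 1.0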