Kinetics-700 Challenge 2021

The goal of the Kinetics dataset is to help the computer vision and machine learning communities advance models for video understanding. Given this large human action classification dataset, it may be possible to learn powerful video representations that transfer to different video tasks.

For information related to this task, please contact: joaoluis@google.com

Dataset

The Kinetics-700-2020 dataset will be used for this challenge. Kinetics-700-2020 is a large-scale, high-quality dataset of YouTube video URLs covering a diverse range of human-focused actions. The aim of the Kinetics dataset is to help the machine learning community create more advanced models for video understanding. It is an approximate super-set of Kinetics-400 (released in 2017), Kinetics-600 (released in 2018) and Kinetics-700 (released in 2019).

The dataset consists of approximately 650,000 video clips, and covers 700 human action classes with at least 700 video clips for each action class. Each clip lasts around 10 seconds and is labeled with a single class. All of the clips have been through multiple rounds of human annotation, and each is taken from a unique YouTube video. The actions cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.

More information about how to download the Kinetics dataset is available here.
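As a rough illustration of the annotation format, the sketch below reads one of the annotation CSVs with pandas. The file name is a placeholder and the column names (label, youtube_id, time_start, time_end, split) are assumed from common Kinetics releases, so check the files you actually download:

```python
import pandas as pd

# Placeholder file name; use the annotation CSV shipped with the dataset.
annotations = pd.read_csv("kinetics700_2020_train.csv")

# Assumed columns: label, youtube_id, time_start, time_end, split.
# Count how many ~10-second clips are available per action class.
clips_per_class = annotations["label"].value_counts()
print(clips_per_class.head())

# Clip boundaries (in seconds within the source YouTube video) for one entry.
row = annotations.iloc[0]
print(row["youtube_id"], row["time_start"], row["time_end"])
```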

Tasks

The Kinetics 2021 challenge will have two tracks: supervised and self-supervised classification. Both will be restricted to using RGB and/or audio modalities from videos in the Kinetics-700-2020 dataset (no other modalities, and no external data).

The supervised track is similar to that of previous years, except for the evaluation. For both tracks, this year we will ask participants to upload one 512-d feature vector for each training and test video -- not class scores anymore. We will then train a linear classifier ourselves on top of these feature vectors to determine top-1 and top-5 accuracy and decide on the winning model.
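For concreteness, here is a minimal sketch of this kind of linear evaluation, using scikit-learn on hypothetical feature and label arrays; our actual classifier and hyper-parameters may differ, so treat this only as an illustration of the protocol:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: one 512-d feature vector per video, as uploaded by a participant.
# train_features: (num_train_videos, 512), train_labels: (num_train_videos,)
# test_features:  (num_test_videos, 512), with test_labels held by the organisers.
train_features = np.load("train_features.npy")   # placeholder file names
train_labels = np.load("train_labels.npy")
test_features = np.load("test_features.npy")
test_labels = np.load("test_labels.npy")

# Linear classifier on top of the frozen features (plain logistic regression here;
# the challenge does not prescribe the exact model or regularisation).
clf = LogisticRegression(max_iter=1000)
clf.fit(train_features, train_labels)

# Top-1 and top-5 accuracy from the predicted class probabilities.
probs = clf.predict_proba(test_features)              # (num_test_videos, num_classes)
top5_idx = np.argsort(probs, axis=1)[:, -5:]          # indices of the 5 most likely classes
top1_pred = clf.classes_[top5_idx[:, -1]]
top1_acc = np.mean(top1_pred == test_labels)
top5_acc = np.mean([y in clf.classes_[idx] for y, idx in zip(test_labels, top5_idx)])
print(f"top-1: {top1_acc:.3f}  top-5: {top5_acc:.3f}")
```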

For the self-supervised track we will ask participants to train on videos from a subset of classes (found here) to test for out-of-domain generalization. Class labels should otherwise not be used for this track -- the goal is to learn representations without them! Participants in this track will be asked to upload feature vectors for videos from all classes in both the train and test splits (including videos from classes the model was not trained on).
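As a sketch of what this means for data handling, the snippet below filters training videos to the allowed class subset and then extracts one 512-d feature per video across the full split; the file names and the encode_video function are hypothetical placeholders, not part of any official toolkit:

```python
import numpy as np
import pandas as pd

# Placeholder file names; use the official annotation CSV and class-subset list.
annotations = pd.read_csv("kinetics700_2020_train.csv")
with open("selfsup_class_subset.txt") as f:
    allowed_classes = {line.strip() for line in f if line.strip()}

# Self-supervised training sees only videos whose class is in the allowed subset,
# and the labels themselves are never used as a training signal.
train_ids = annotations.loc[annotations["label"].isin(allowed_classes), "youtube_id"]

def encode_video(video_id):
    """Hypothetical self-supervised encoder returning a 512-d feature."""
    return np.zeros(512, dtype=np.float32)

# The submission, by contrast, contains one feature vector for every video in the
# split, including videos from classes the model never saw during training.
features = np.stack([encode_video(vid) for vid in annotations["youtube_id"]])
np.save("train_split_features.npy", features)   # repeat for the test split
```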

We will release our own leaderboard for both tasks (stay tuned). The deadline for submitting results is the 9th of June. For general questions about the challenge please join the following group: kinetics-dataset-users@googlegroups.com.

Challenge participants are also encouraged to take part in the AVA-Kinetics challenge, which shares much of the same training data. We are excited to see if anyone can improve performance by co-training on the two tasks, or through other creative uses of the two label sets.

FAQ

1. Is it possible to use ImageNet checkpoints?
We allow fine-tuning from public ImageNet checkpoints for the supervised track -- but a link to the specific checkpoint should be provided with each submission.

2. Is it possible to use optical flow?
Optical flow can be used as long as the flow model is not trained on external datasets, unless those datasets are synthetic.

3. Can we train on test data without labels (e.g. transductive)?
No.

4. Can we use semantic class label information?
Yes, for the supervised track.

5. Will there be special tracks for methods using fewer FLOPs / smaller models, or for RGB-only vs. RGB+audio, in the self-supervised track?
We will ask participants to report the total number of model parameters and the modalities used, and we plan to give special mentions to entries that do well in each setting, but there will be no separate tracks.