Task 2: Trimmed Action Recognition

The goal of the Kinetics dataset is to help the computer vision and machine learning communities advance models for video understanding. Given this large human action classification dataset, it may be possible to learn powerful video representations that transfer to different video tasks.
For information related to this task, please contact: brianzhang@google.com, joaoluis@google.com

Dataset

The Kinetics dataset will be used for this challenge. Kinetics is a large-scale, high-quality dataset of YouTube video URLs covering a diverse range of human-focused actions. The dataset consists of approximately 300K video clips and covers 400 human action classes, with at least 400 video clips per class. Each clip lasts around 10s and is labeled with a single class. All of the clips have been through multiple rounds of human annotation, and each is taken from a unique YouTube video. The actions span a broad range of classes, including human-object interactions such as 'playing instruments' as well as human-human interactions such as 'shaking hands' and 'hugging'. More information about how to download the Kinetics dataset is available here.

Evaluation Metric

The evaluation code used by the evaluation server can be found here.
For each video, an algorithm will produce $k$ labels $l_{j}$, $j = 1, \dots, k$. The ground truth label for the video is $g$. The error of the algorithm for that video is:

$e = \min_{j} d(l_{j}, g)$,
where $d(x,y) = 0$ if $x=y$ and $1$ otherwise. The overall error score for an algorithm is the average error over all videos. We will use $k=1$ and $k=5$, and the winner of the challenge will be selected based on the average of these two errors.
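In other words, a video contributes an error of 0 if the ground truth label appears among the top-$k$ predicted labels, and 1 otherwise. The Python sketch below computes this metric as defined above; it is illustrative only (the official code is linked above), and the predictions/ground_truth dictionary names are assumptions, not part of the evaluation server API:

def topk_error(predictions, ground_truth, k):
    # predictions: dict mapping video id -> list of predicted labels, best first.
    # ground_truth: dict mapping video id -> the single true label g.
    errors = []
    for video_id, true_label in ground_truth.items():
        top_k = predictions[video_id][:k]
        # d(l_j, g) is 0 on a match and 1 otherwise, so the per-video error
        # e = min_j d(l_j, g) is 0 iff g appears among the top k labels.
        errors.append(0.0 if true_label in top_k else 1.0)
    return sum(errors) / len(errors)

def challenge_score(predictions, ground_truth):
    # Average of the top-1 and top-5 errors, used to rank entries.
    return (topk_error(predictions, ground_truth, 1) +
            topk_error(predictions, ground_truth, 5)) / 2.0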

Baselines

Pre-trained models and pre-computed features will be available soon.

Submission Format

Please use the following JSON format when submitting your results for the challenge:

{
  "version": "KINETICS VERSION 1.0",
  "results": {
    "-3B32lodo2M": [
      {
        "label": "abseiling",
        "score": 0.65
      },
      {
        "label": "shot put",
        "score": 0.15
      },
      {
        "label": "skydiving",
        "score": 0.08
      },
      {
        "label": "smoking hookah",
        "score": 0.04
      },
      {
        "label": "cleaning windows",
        "score": 0.04
      }
    ]
  },
  "external_data": {
    "used": true, # Boolean flag. True indicates the use of external data.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # This string details what kind of external data you used and how you used it.
  }
}
            

The example above is illustrative. Comments must be removed in your submission. Please note that at most 5 predictions are allowed per video.
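As a convenience, the Python sketch below shows one way to serialize predictions into this format. It is not an official tool; the write_submission function name and the video_predictions argument (a mapping from YouTube video id to a list of (label, score) pairs sorted by descending score) are assumptions:

import json

def write_submission(video_predictions, path, used_external, details):
    # video_predictions: dict mapping video id -> [(label, score), ...],
    # sorted by descending score (hypothetical structure).
    submission = {
        "version": "KINETICS VERSION 1.0",
        "results": {
            video_id: [
                {"label": label, "score": float(score)}
                for label, score in preds[:5]  # at most 5 predictions per video
            ]
            for video_id, preds in video_predictions.items()
        },
        "external_data": {
            "used": used_external,  # True if any external data was used
            "details": details      # free-text description of that data
        }
    }
    with open(path, "w") as f:
        json.dump(submission, f, indent=2)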

Awards

The winner of Task 2 (trimmed action recognition) will receive 4,000 USD and a Qualcomm gift. The second-place entry will receive 2,000 USD.