Task A – Trimmed Action Recognition

The goal of the Kinetics dataset is to help the computer vision and machine learning communities advance models for video understanding. Given this large human action classification dataset, it may be possible to learn powerful video representations that transfer to different video tasks.

For information related to this task, please contact: enoland@google.com, joaoluis@google.com

Dataset

The Kinetics-600 dataset will be used for this challenge. Kinetics is a large-scale, high-quality dataset of YouTube video URLs that cover a diverse range of human-focused actions. The dataset consists of approximately 500K video clips and covers 600 human action classes, with at least 600 video clips for each action class. Each clip lasts around 10s and is labeled with a single class. All of the clips have been through multiple rounds of human annotation, and each is taken from a unique YouTube video. The actions cover a broad range of classes, including human-object interactions such as 'playing instruments' and human-human interactions such as 'shaking hands' and 'hugging'. More information about how to download the Kinetics dataset is available here.

Evaluation Metric

The evaluation code used by the evaluation server can be found here. For each video, an algorithm produces \( k \) labels \( l_j \), where \( j = 1, \dots, k \), and the ground-truth label for the video is \( g \). The error of the algorithm for that video is: $$ e = \min_j d \left( l_j, g \right) $$ with \( d \left( x, y \right) = 0 \) if \( x = y \) and \( 1 \) otherwise. The overall error score for an algorithm is the average error over all videos. We will use \( k = 1 \) and \( k = 5 \) (top-1 and top-5 error), and the winner of the challenge will be selected based on the average of these two errors.
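The metric above can be sketched in a few lines of Python. This is only an illustration of the formula, not the evaluation server's code: the function names, the clip identifiers, and the choice to count a video with no predictions as an error are all assumptions made for the example. Predicted labels are assumed to be ordered from most to least confident.

```python
from typing import Dict, Sequence


def top_k_error(
    predictions: Dict[str, Sequence[str]],
    ground_truth: Dict[str, str],
    k: int,
) -> float:
    """Average of e = min_j d(l_j, g) over all videos, using the first k labels.

    A video contributes 0 if its ground-truth label is among its first k
    predicted labels, and 1 otherwise.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one video")
    errors = 0
    for video_id, true_label in ground_truth.items():
        # Assumption: a video without predictions counts as an error.
        top_k = list(predictions.get(video_id, []))[:k]
        errors += 0 if true_label in top_k else 1
    return errors / len(ground_truth)


def challenge_score(
    predictions: Dict[str, Sequence[str]],
    ground_truth: Dict[str, str],
) -> float:
    """Average of the top-1 and top-5 errors."""
    top1 = top_k_error(predictions, ground_truth, k=1)
    top5 = top_k_error(predictions, ground_truth, k=5)
    return 0.5 * (top1 + top5)


# Hypothetical clip identifiers and predictions, for illustration only.
example_truth = {"clip_a": "abseiling", "clip_b": "abseiling"}
example_predictions = {
    "clip_a": ["abseiling", "skydiving"],  # correct at rank 1
    "clip_b": ["shot put", "abseiling"],   # correct only at rank 2
}

example_top1 = top_k_error(example_predictions, example_truth, k=1)
example_top5 = top_k_error(example_predictions, example_truth, k=5)
example_score = challenge_score(example_predictions, example_truth)
```

In this toy case one of the two clips is wrong at \( k = 1 \) and none is wrong at \( k = 5 \), so the two errors are 0.5 and 0.0 and their average is 0.25.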

Baselines

Pre-trained models and pre-computed features will be available soon.

Submission Format

Please use the following JSON format when submitting your results for the challenge:

{
  "version": "KINETICS VERSION 1.0",
  "results": {
    # Each key in "results" is a unique clip identifier of the form "youtubeid_timestart_timeend".
    "p76-UdadD7w_47_57": [
      {
        "label": "abseiling",
        "score": 0.65
      },
      {
        "label": "shot put",
        "score": 0.15
      },
      {
        "label": "skydiving",
        "score": 0.08
      },
      {
        "label": "smoking hookah",
        "score": 0.04
      },
      {
        "label": "cleaning windows",
        "score": 0.04
      }
    ]
  },
  "external_data": {
    "used": true, # Boolean flag. True indicates the use of external data.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # This string details what kind of external data you used and how you used it.
  }
}

The example above is illustrative. Comments must be removed in your submission. Note that at most 5 predictions are allowed per video.
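One way to produce a file in this format is to build the structure in Python and serialize it with the standard `json` module, which emits no comments and quotes all keys. The sketch below is a minimal illustration under assumptions: `build_submission`, its parameters, and the input layout (a mapping from clip identifier to per-label scores) are hypothetical names chosen for this example, not part of the challenge tooling.

```python
import json
from typing import Dict


def build_submission(
    clip_scores: Dict[str, Dict[str, float]],
    used_external_data: bool,
    external_data_details: str,
    max_predictions: int = 5,
) -> dict:
    """Keep the highest-scoring labels per clip and wrap them in the submission layout."""
    results = {}
    for clip_id, label_scores in clip_scores.items():
        # Sort labels by descending score and keep at most max_predictions of them.
        ranked = sorted(label_scores.items(), key=lambda item: item[1], reverse=True)
        results[clip_id] = [
            {"label": label, "score": float(score)}
            for label, score in ranked[:max_predictions]
        ]
    return {
        "version": "KINETICS VERSION 1.0",
        "results": results,
        "external_data": {
            "used": used_external_data,
            "details": external_data_details,
        },
    }


# Scores for the clip from the example above, plus one extra label that gets dropped.
example_scores = {
    "p76-UdadD7w_47_57": {
        "abseiling": 0.65,
        "shot put": 0.15,
        "skydiving": 0.08,
        "smoking hookah": 0.04,
        "cleaning windows": 0.04,
        "hugging": 0.01,
    }
}

submission = build_submission(
    example_scores,
    used_external_data=True,
    external_data_details=(
        "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set"
    ),
)
payload = json.dumps(submission, indent=2)
```

Writing `payload` to a file gives a comment-free JSON document with at most five predictions per clip. Truncating inside the builder rather than at prediction time keeps the limit enforced in one place.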

Awards

Stay tuned 😉