Task B – Kinetics Challenge

The goal of the Kinetics dataset is to help the computer vision and machine learning communities advance models for video understanding. Given this large human action classification dataset, it may be possible to learn powerful video representations that transfer to different video tasks.

For information related to this task, please contact: enoland@google.com, joaoluis@google.com

Dataset

The Kinetics-700 dataset will be used for this challenge. Kinetics-700 is a large-scale, high-quality dataset of YouTube video URLs covering a diverse range of human-focused actions. Our aim in releasing the Kinetics dataset is to help the machine learning community advance models for video understanding. It is an approximate superset of both Kinetics-400, released in 2017, and Kinetics-600, released in 2018.

The dataset consists of approximately 650,000 video clips, and covers 700 human action classes with at least 600 video clips for each action class. Each clip lasts around 10 seconds and is labeled with a single class. All of the clips have been through multiple rounds of human annotation, and each is taken from a unique YouTube video. The actions cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.

More information about how to download the Kinetics dataset is available here.
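As an illustration of how the released annotations can be consumed, here is a minimal sketch. It assumes the CSV column layout used by earlier Kinetics releases (label, youtube_id, time_start, time_end); the files you download may differ, so adjust the column names to match the actual header row.

import csv

# Minimal sketch of reading a Kinetics annotation CSV. The column names
# below are an assumption based on earlier Kinetics releases; adjust them
# to match the header row of the file you actually download.
def load_annotations(csv_path):
    clips = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            clips.append({
                "label": row["label"],
                "youtube_id": row["youtube_id"],
                # Start/end offsets (in seconds) of the ~10-second clip
                # within the source YouTube video.
                "time_start": int(row["time_start"]),
                "time_end": int(row["time_end"]),
            })
    return clips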

Challenge participants are encouraged to leverage the newly released AVA-Kinetics training set. We are excited to see if anyone can improve performance using co-training on the two tasks, or other creative uses of the two label sets.

Evaluation Metric

The evaluation code used by the evaluation server can be found here. For each video, an algorithm will produce \( k \) labels \( l_j \), where \( j = 1, \dots, k \). The ground-truth label for the video is \( g \). The error of the algorithm on that video is $$ e = \min_j d \left( l_j, g \right), $$ with \( d \left( x, y \right) = 0 \) if \( x = y \) and \( 1 \) otherwise. The overall error score of an algorithm is the average error over all videos. We will use \( k = 1 \) and \( k = 5 \), and the winner of the challenge will be selected based on the average of these two errors.
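In code, the metric looks like the following (a minimal sketch, not the official evaluation code linked above; predictions is assumed to map each video id to its predicted labels sorted by decreasing score, and ground_truth to map each video id to its single true label):

def topk_error(predictions, ground_truth, k):
    """Average over videos of e = min_j d(l_j, g) over the top-k labels."""
    errors = []
    for video_id, true_label in ground_truth.items():
        top_labels = predictions.get(video_id, [])[:k]
        # d(l_j, g) = 0 if l_j == g and 1 otherwise, so e = 0 exactly
        # when g appears among the top-k predicted labels.
        errors.append(0.0 if true_label in top_labels else 1.0)
    return sum(errors) / len(errors)

def challenge_score(predictions, ground_truth):
    """Average of the top-1 and top-5 errors, used to rank entries."""
    return (topk_error(predictions, ground_truth, 1) +
            topk_error(predictions, ground_truth, 5)) / 2.0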

Baselines

Pre-trained models and pre-computed features will be available soon.

Submission Format

Please use the following JSON format when submitting your results for the challenge:

{
  "version": "KINETICS VERSION 1.0",
  "results": {
    # Each result is keyed by a unique "youtubeid_timestart_timeend" identifier.
    "p76-UdadD7w_47_57": [
      {
        "label": "abseiling",
        "score": 0.65
      },
      {
        "label": "shot put",
        "score": 0.15
      },
      {
        "label": "skydiving",
        "score": 0.08
      },
      {
        "label": "smoking hookah",
        "score": 0.04
      },
      {
        "label": "cleaning windows",
        "score": 0.04
      }
    ]
  },
  "external_data": {
    "used": true,  # Boolean flag. True indicates the use of external data.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set"  # A string describing what kind of external data you used and how you used it.
  }
}

The example above is illustrative; comments must be removed in an actual submission. Note that only 5 predictions are allowed per video.
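To make the format concrete, here is a minimal sketch of assembling a submission file (the function name and inputs are illustrative placeholders, not part of the challenge tooling):

import json

def make_submission(results, used_external_data, external_details, out_path):
    """results maps "youtubeid_timestart_timeend" keys to lists of
    (label, score) pairs sorted by decreasing score; only the top 5
    predictions per video are kept, per the challenge rules."""
    submission = {
        "version": "KINETICS VERSION 1.0",
        "results": {
            key: [{"label": label, "score": float(score)}
                  for label, score in preds[:5]]
            for key, preds in results.items()
        },
        "external_data": {
            "used": bool(used_external_data),
            "details": external_details,
        },
    }
    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)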

Leaderboard

Co-winners: Team Google Cloud AI

Co-winners: Team CUHK-Sensetime

Runners-up: Team USTC-BAIDU