Task 1: Untrimmed Video Classification

Video data is naturally untrimmed. For example, it is extremely difficult to find YouTube videos containing only a single activity, such as walking the dog; videos usually contain a large amount of background/context information. This has motivated computer vision researchers to develop algorithms that analyze a video globally and produce a video-level classification. In this task, we evaluate the capability of such methods to recognize activities in untrimmed video sequences. Here, a video can contain more than one activity, and large portions of a video are typically unrelated to any activity of interest.
For information related to this task, please contact: fabian.caba@kaust.edu.sa, humam.alwassel@kaust.edu.sa, victor.escorcia@kaust.edu.sa


Dataset

The ActivityNet Version 1.3 dataset will be used for this challenge. The dataset consists of more than 648 hours of untrimmed video from a total of ~20K videos, with ~1.5 activity annotations per video. It covers 200 different daily activities, such as 'walking the dog', 'long jump', and 'vacuuming floor'. The videos are split into training, validation, and testing subsets of ~50%, ~25%, and ~25%, respectively. The dataset annotations can be downloaded directly from here.

Evaluation Metric

The evaluation code used by the evaluation server can be found here. We will use the top-1 error ($e$) as the official metric for this task. Given $m$ videos, where video $i$ ($i = 1,\dots,m$) has $n$ ground-truth annotations $g_{ij}$ ($j = 1,\dots,n$), the top-1 error for a set of $m$ predictions $p_i$ is computed as:

$e = \frac{1}{m} \sum_{i=1}^{m} f(i)$
$f(i) = \frac{1}{n} \sum_{j=1}^{n} d(p_i, g_{ij})$
with $d(p_i, g_{ij}) = 0$ if $p_i = g_{ij}$, and $d(p_i, g_{ij}) = 1$ otherwise. Note that this metric accounts for the fact that a video can have multiple ground-truth labels.
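The metric above can be sketched in a few lines of Python. This is a minimal illustration of the formula, not the official evaluation code; the dictionary-based input format is an assumption for the example.

```python
def top1_error(predictions, ground_truth):
    """Top-1 error e, as defined by the formulas above.

    predictions:  dict mapping video id -> single top-1 predicted label p_i
    ground_truth: dict mapping video id -> list of ground-truth labels g_ij
    """
    m = len(ground_truth)
    total = 0.0
    for vid, labels in ground_truth.items():
        p = predictions.get(vid)
        n = len(labels)
        # f(i): average of d(p_i, g_ij) over the n labels of video i,
        # where d = 0 on a match and 1 otherwise.
        total += sum(0 if p == g else 1 for g in labels) / n
    return total / m  # e: average of f(i) over all m videos
```

For instance, if a video has two ground-truth labels and the prediction matches one of them, that video contributes $f(i) = 0.5$ to the average.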


Please refer to the last challenge summary for information about baselines and state-of-the-art methods.

Submission Format

Please use the following JSON format when submitting your results for the challenge:

{
  "version": "VERSION 1.3",
  "results": {
    "5n7NCViB5TU": [
      {
        "label": "Discus throw",
        "score": 0.95
      },
      {
        "label": "Shot put",
        "score": 0.77
      }
    ]
  },
  "external_data": {
    "used": true, # Boolean flag. True indicates the use of external data.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # This string details what kind of external data you used and how you used it.
  }
}

The example above is illustrative; the comments must be removed from an actual submission. A sample submission file can be downloaded here.
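A submission file in this format can be assembled programmatically. The sketch below is a hypothetical helper (the function name and input layout are assumptions, not part of the challenge toolkit); it writes valid JSON, so no comments end up in the file.

```python
import json

def write_submission(video_predictions, used_external, details, path):
    """Assemble a Task 1 submission dict and write it to `path` as JSON.

    video_predictions: dict mapping video id -> list of (label, score) pairs
    used_external:     bool, whether external data was used
    details:           string describing the external data, if any
    """
    submission = {
        "version": "VERSION 1.3",
        "results": {
            vid: [{"label": label, "score": float(score)}
                  for label, score in preds]
            for vid, preds in video_predictions.items()
        },
        "external_data": {"used": used_external, "details": details},
    }
    with open(path, "w") as f:
        json.dump(submission, f, indent=2)
    return submission
```

Because `json.dump` emits strict JSON, the resulting file contains no comments and should match the structure expected by the evaluation server.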


The winner of Task 1 (untrimmed classification) will receive 2,000 USD, an Nvidia graphics card, and a Qualcomm gift.