To enter the competition, you need to create an account on the Evaluation tab.
This account allows you to upload your results to the evaluation server and participate in the ActivityNet Challenge 2016.
A maximum of one (1) submission is allowed per team per week. This limit will be strictly enforced.
Only results that are submitted before the challenge deadline and posted to the leaderboard will be considered valid.
Uploading a JSON file in an invalid format for either the classification or detection task will trigger an error from the ActivityNet server.
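Before uploading, it can help to run a quick structural check locally. The Python sketch below is a hypothetical helper, not an official validator; it checks for the fields used in the submission formats shown later in this section:

import json

def check_submission(path):
    """Rough structural check of a submission file; not the official validator."""
    with open(path) as f:
        sub = json.load(f)
    # Top-level fields taken from the example submission formats below.
    for key in ("version", "results", "external_data"):
        if key not in sub:
            raise ValueError(f"missing top-level field: {key}")
    for video_id, predictions in sub["results"].items():
        if not predictions:
            raise ValueError(f"no predictions for video {video_id}")
        for pred in predictions:
            if "label" not in pred or "score" not in pred:
                raise ValueError(f"prediction for {video_id} lacks label/score")

# Example usage (file name is an assumption):
check_submission("classification_submission.json")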
You will also need to upload a notebook paper that describes your method in detail.
After the server processes your submission, a file with your results will appear for download.
Additionally, you will be able to compare your results to the state-of-the-art on the Leaderboard tab.
This challenge allows participants to use external data to train algorithms and tune their parameters.
We are committed to keeping track of this practice. Therefore, each submission must explicitly cite the kind of external data used and which modules benefit from it.
For each submission to the server, you have to attach the following files: the JSON file with your results and the notebook paper describing your method.
The dataset for the challenge consists of more than 648 hours of untrimmed videos from ActivityNet Release 1.3. There is a total of ~20K videos distributed among 200 activity categories. The distribution among training, validation, and testing is ~50%, ~25%, and ~25%, respectively. ActivityNet Release 1.3 can be obtained from the download page or via its direct download link.
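For orientation, the Python sketch below counts how many videos fall into each split. It assumes the release metadata file (activity_net.v1-3.min.json) keeps its published layout, with a top-level "database" dictionary keyed by video ID and a "subset" field per video; the file path is an assumption, so adjust it to your copy:

import json
from collections import Counter

# Load the release metadata (path is an assumption; adjust to your copy).
with open("activity_net.v1-3.min.json") as f:
    release = json.load(f)

# Count videos per subset (training / validation / testing).
subset_counts = Counter(video["subset"] for video in release["database"].values())
print(subset_counts)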
Our challenge includes two types of tasks as described below:
This task is intended to evaluate the ability of algorithms to predict activities in untrimmed video sequences. Here, videos can contain more than one activity, and large portions of the video are typically unrelated to any activity of interest.
This task is intended to evaluate the ability of algorithms to temporally localize activities in untrimmed video sequences. Here, videos can contain more than one activity instance, and multiple activity categories can appear in the video.
Bear in mind the following format for each submission:
Please format your results as illustrated in the example below (the # comments are explanatory only and must not appear in the uploaded JSON file). You can also download this example classification submission file.
{
  "version": "VERSION 1.3",
  "results": {
    "5n7NCViB5TU": [
      {
        "label": "Discus throw", # At least one prediction per video is required.
        "score": 1
      },
      {
        "label": "Shot put",
        "score": 0.777
      }
    ]
  },
  "external_data": {
    "used": true, # Boolean flag. true indicates that external data was used.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # String describing your external data.
  }
}
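For reference, such a file can also be generated programmatically. The minimal Python sketch below assembles the example above and writes it with json.dump, which produces valid JSON (quoted keys, no comments); the output file name is an assumption:

import json

# Build the submission dictionary following the format above.
submission = {
    "version": "VERSION 1.3",
    "results": {
        "5n7NCViB5TU": [
            {"label": "Discus throw", "score": 1.0},
            {"label": "Shot put", "score": 0.777},
        ],
    },
    "external_data": {
        "used": True,  # Serialized as JSON true.
        "details": "First fully-connected layer from VGG-16 pre-trained "
                   "on ILSVRC-2012 training set",
    },
}

# Write the file to upload to the evaluation server.
with open("classification_submission.json", "w") as f:
    json.dump(submission, f, indent=2)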
Please format your results as illustrated in the example below. You can also download this example detection submission file.
{
  "version": "VERSION 1.3",
  "results": {
    "5n7NCViB5TU": [
      {
        "label": "Discus throw",
        "score": 0.64,
        "segment": [24.25, 38.08]
      },
      {
        "label": "Shot put",
        "score": 0.77,
        "segment": [11.25, 19.37]
      }
    ]
  },
  "external_data": {
    "used": true, # Boolean flag. true indicates that external data was used.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # String describing your external data.
  }
}
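Detection results are judged by how well predicted segments overlap the ground-truth segments in time; a standard overlap measure is temporal intersection over union (tIoU). Below is a minimal Python sketch for the [start, end] segments (in seconds) used in this format; the example ground-truth segment is hypothetical:

def temporal_iou(segment_a, segment_b):
    """Temporal intersection over union of two [start, end] segments (seconds)."""
    start_a, end_a = segment_a
    start_b, end_b = segment_b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0

# Example: overlap between a prediction and a hypothetical ground-truth segment.
print(temporal_iou([24.25, 38.08], [22.00, 35.00]))  # ~0.67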
Two different metrics are used to evaluate performance in this challenge:
This challenge allows participants to use external data to train their algorithms or tune parameters.
Each submission should explicitly cite the kind of external data used and which modules of the system benefit from it.
This information must be available in both the notebook paper and the JSON file.
Some popular forms of external data usage include (but are not limited to):
This academic challenge aims to highlight automated algorithms that understand the audio-visual content of videos. To serve this purpose and to allow for fair competition, we request that ALL participants:
The winner of each challenge task will be awarded a GPU card sponsored by Nvidia Corp.