Challenge Guidelines

To enter the competition, you need to create an account on the Evaluation Tab. With a registered account you will be able to upload your results to the evaluation server and participate in the ActivityNet Challenge 2017.

Please be advised that we have changed our submission policy this year: each participant is limited to 1 submission per task per week, and the Evaluation Server will enforce a waiting time of 7 days between submissions to the same task. This gives each participant a total of 3 submissions per task before the evaluation server closes. Each team is limited to 4 submissions (in total) per task. Only results that are submitted during the challenge period (before the deadline) and posted to the leaderboard will be considered valid. You will also need to upload a notebook paper that describes your method in detail.

This challenge allows the use of external data to train and tune algorithm parameters, and we are committed to keeping track of this practice. Each submission must therefore explicitly cite the kind of external data used and which modules benefit from it.

Challenge Tasks

The ActivityNet Challenge 2017 includes five different tasks, described below:

Task 1: Untrimmed Video Classification (ActivityNet)

This task is intended to evaluate the ability of algorithms to predict activities in untrimmed video sequences. Here, videos can contain more than one activity, and typically large portions of the video are not related to any activity of interest. [Details]

Task 2: Trimmed Action Recognition (Kinetics) [New]

This task is intended to evaluate the ability of algorithms to recognize activities in trimmed video sequences. Here, videos contain a single activity, and all the clips have a standard duration of ten seconds. For this task, participants will use the Kinetics dataset, a new large-scale benchmark for trimmed action classification. [Details]

Task 3: Temporal Action Proposals (ActivityNet) [New]

This task is intended to evaluate the ability of algorithms to generate high-quality action proposals. The goal is to produce a set of candidate temporal segments that are likely to contain a human action. [Details]
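Candidate segments in this task are typically compared to ground-truth action instances by their temporal intersection-over-union (tIoU); please refer to the task page for the official evaluation metric. The snippet below is a minimal, illustrative tIoU computation; the function name and the (start, end) segment format are our own choices, not part of the challenge toolkit.

    def temporal_iou(proposal, ground_truth):
        """Temporal overlap between two segments given as (start, end) in seconds."""
        p_start, p_end = proposal
        g_start, g_end = ground_truth
        intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
        union = (p_end - p_start) + (g_end - g_start) - intersection
        return intersection / union if union > 0 else 0.0

    # A 10-second proposal overlapping half of a 10-second ground-truth action:
    print(temporal_iou((5.0, 15.0), (10.0, 20.0)))  # ~0.33

A proposal method with high recall should cover most ground-truth instances with at least one high-tIoU candidate while keeping the number of candidates per video small.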

Task 4: Temporal Action Localization (ActivityNet)

This task is intended to evaluate the ability of algorithms to temporally localize activities in untrimmed video sequences. Here, videos can contain more than one activity instance, and multiple activity categories can appear in the video. [Details]

Task 5: Dense-Captioning Events in Videos (ActivityNet Captions) [New]

This task involves both detecting and describing events in a video. For this task, participants will use the ActivityNet Captions dataset, a new large-scale benchmark for dense-captioning events. [Details]

Additional Data

Global features (Task [1, 3-5])

  • ImageNetShuffle. We provide CNN features based on the pool5 layer of a Google Inception network (GoogLeNet), extracted at a rate of 2 FPS. The per-frame features are mean-pooled across the video and then L1-normalized; see the first sketch below. [Download]
  • MBH Features. The MBH features are extracted using the Improved Trajectories executable released by its authors. The features are then encoded using the GMM + Fisher Vector pipeline; see the second sketch below. [Download]
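For reference, the following sketch illustrates the pooling and normalization applied to the per-frame features. It is an approximation of the described pipeline, not the exact script used to produce the download; the function name and array layout are our own.

    import numpy as np

    def video_level_feature(frame_features):
        """Mean-pool per-frame pool5 activations (sampled at 2 FPS) over a video,
        then L1-normalize the pooled vector."""
        frame_features = np.asarray(frame_features, dtype=np.float64)  # (num_frames, dim)
        pooled = frame_features.mean(axis=0)
        norm = np.abs(pooled).sum()
        return pooled / norm if norm > 0 else pooled

Similarly, here is a compact sketch of a GMM + Fisher Vector encoding such as the one used for the MBH features, assuming a diagonal-covariance GMM fitted on training descriptors; the vocabulary size in the usage comment is an assumption.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fisher_vector(descriptors, gmm):
        """Encode a set of local descriptors (e.g. MBH) against a diagonal-covariance GMM."""
        q = gmm.predict_proba(descriptors)            # soft assignments, shape (N, K)
        n = descriptors.shape[0]
        means, weights = gmm.means_, gmm.weights_     # (K, D), (K,)
        sigmas = np.sqrt(gmm.covariances_)            # (K, D) for covariance_type='diag'

        diff = (descriptors[:, None, :] - means) / sigmas                         # (N, K, D)
        fv_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(weights))[:, None]
        fv_sig = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * weights))[:, None]

        fv = np.hstack([fv_mu.ravel(), fv_sig.ravel()])
        fv = np.sign(fv) * np.sqrt(np.abs(fv))        # power normalization
        return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalization

    # Example usage (vocabulary size of 256 is an assumption):
    # gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(train_descriptors)
    # video_fv = fisher_vector(video_descriptors, gmm)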
Frame-based features (Task [1, 3-5])

  • C3D. The publicly available pre-trained C3D model, which has a temporal resolution of 16 frames, is used to extract frame-based features. This network is not fine-tuned on the challenge data. We reduce the dimensionality of the activations from the second fully-connected layer (fc7) of the visual encoder from 4096 to 500 dimensions using PCA. [Download]
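For participants extracting their own C3D features, a rough sketch of the dimensionality-reduction step is given below. The file name and array layout are hypothetical, and the provided download already contains the PCA-reduced features.

    import numpy as np
    from sklearn.decomposition import PCA

    # fc7 activations for all clips of the training set, shape (num_clips, 4096).
    # The file name is hypothetical.
    fc7 = np.load("c3d_fc7_train.npy")

    pca = PCA(n_components=500)
    pca.fit(fc7)                      # learn the 4096 -> 500 projection on training clips
    reduced = pca.transform(fc7)      # shape (num_clips, 500)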
Temporal action proposals (Task [1, 3-5])

  • Agnostic Temporal Activity Proposals. We provide these proposals to encourage participation in the activity detection task. They can be applied as a preliminary stage to split untrimmed videos into high-recall trimmed temporal segments; see the sketch below. [Download]
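The sketch below shows one possible way to consume such proposals, keeping only the highest-scoring candidate segments for each untrimmed video. The file name and the "segment"/"score" field names are assumptions about the download format, not its documented schema.

    import json

    # Hypothetical layout: a JSON file mapping video IDs to a list of proposals,
    # each with a temporal segment (start, end) in seconds and a confidence score.
    with open("activity_proposals.json") as f:
        proposals = json.load(f)

    def top_segments(video_id, k=20, min_score=0.0):
        """Keep the k highest-scoring candidate segments for one untrimmed video."""
        candidates = [p for p in proposals[video_id] if p["score"] >= min_score]
        candidates.sort(key=lambda p: p["score"], reverse=True)
        return [tuple(p["segment"]) for p in candidates[:k]]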
Use of External Data Policy

This challenge allows participants to use external data to train their algorithms or tune parameters. Each submission should explicitly cite the kind of external data used and which modules of the system benefit from it. Some popular forms of external data usage include (but are not limited to):

  • additional videos or images for tuning parameters, and
  • external modules such as CNNs or DPMs trained on other datasets.
If your case is not listed above, please contact us as soon as possible.

Honor Code

This academic challenge aims to highlight automated algorithms that understand the audio-visual content of videos. To serve this purpose and to allow for fair competition, we request that ALL participants:

  • generate results on the testing set by analyzing audio-visual content only,
  • not use the testing set for training or parameter tuning, and
  • refrain from using any auxiliary information about the testing set (e.g. human annotations, URL metadata) other than the provided videos themselves.
If a submission is found to violate any of the above guidelines, the challenge organizers reserve the right to disqualify the violating team.

Most Innovative Solution

This year, we will award a Panasonic Lumix DC-GH5 camera (+lens) and an Nvidia graphics card to the participant with the most innovative solution. This solution does not necessarily have to be the winner of any task. A technical committee will make this decision based on the participants' submitted reports. So, please make sure to submit a detailed description of your solution on time.