ActivityNet Temporal Action Localization

Despite recent advances in large-scale video analysis, temporal action localization remains one of the most challenging unsolved problems in computer vision. This search problem hinders various real-world applications ranging from consumer video summarization to surveillance, crowd monitoring, and elderly care. Therefore, we are committed to pushing forward the development of efficient and accurate automated methods that can search and retrieve events and activities in video collections. This task is intended to encourage computer vision researchers to design high-performance action localization systems.

For information related to this task, please contact:


The ActivityNet Version 1.3 dataset will be used for this challenge. The dataset consists of more than 648 hours of untrimmed video from a total of ~20K videos. It contains 200 different daily activities, such as 'walking the dog', 'long jump', and 'vacuuming floor'. The videos are split among training, validation, and testing at ~50%, ~25%, and 25% of the total, respectively. The dataset annotations can be downloaded directly from here.

Evaluation Metric

The evaluation code used by the evaluation server can be found here.

Interpolated Average Precision (AP) is used as the metric for evaluating the results on each activity category. The AP is then averaged over all activity categories to obtain the mean AP (mAP). To determine whether a detection is a true positive, we compute its temporal intersection over union (\(\text{tIoU}\)) with a ground truth segment and check whether it is greater than or equal to a given threshold (e.g. \(\text{tIoU} \geq 0.5\)). The official metric used in this task is the average mAP, which is defined as the mean of all mAP values computed with tIoU thresholds between \( 0.5 \) and \( 0.95 \) (inclusive) with a step size of \( 0.05 \).
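The tIoU test and the threshold sweep described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the function names `temporal_iou`, `is_true_positive`, and `average_map` are hypothetical, segments are assumed to be `[start, end]` pairs in seconds, and the matching of each ground truth segment to at most one detection is not handled here.

```python
def temporal_iou(pred, gt):
    """Temporal intersection over union of two [start, end] segments."""
    # Overlap is zero when the segments are disjoint.
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold):
    """A detection counts as correct when its tIoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


# Thresholds 0.5, 0.55, ..., 0.95 (inclusive), i.e. ten values in total.
TIOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_map(map_per_threshold):
    """Mean of the mAP values, one per tIoU threshold."""
    return sum(map_per_threshold) / len(map_per_threshold)
```

For example, a detection spanning `[0, 10]` against a ground truth segment `[5, 15]` overlaps for 5 seconds out of a 15-second union, so its tIoU is 1/3 and it would not count as a true positive at any of the ten thresholds.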


Please refer to the last challenge's summary for information about baselines and state-of-the-art methods.

Getting started

To encourage participation in this task, we teamed up with other researchers to make the following resources available:

[NEW] Pre-extracted TSP Features: [Download]

RGB frames extracted at 5 FPS (~200GB).

Frame-level features for the frames above (~89GB).

Please take a look at the README for more details.

Submission Format

Please use the following JSON format when submitting your results for the challenge:

{
  "version": "VERSION 1.3",
  "results": {
    "5n7NCViB5TU": [
      {
        "label": "Discus throw",
        "score": 0.64,
        "segment": [24.25, 38.08]
      },
      {
        "label": "Shot put",
        "score": 0.77,
        "segment": [11.25, 19.37]
      }
    ]
  },
  "external_data": {
    "used": true, # Boolean flag. True indicates the use of external data.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # This string details what kind of external data you used and how you used it.
  }
}

The example above is illustrative. Comments must be removed in your submission. A sample submission file can be downloaded here.
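A submission in this format can be assembled with Python's standard `json` module, which also guarantees that the output contains no comments. The sketch below is only one possible way to do it: the helper name `build_submission` and the tuple layout of `detections` are assumptions made for illustration, and the values are taken from the example above.

```python
import json


def build_submission(detections, used_external, details):
    """Group (video_id, label, score, (start, end)) tuples by video."""
    results = {}
    for video_id, label, score, (start, end) in detections:
        results.setdefault(video_id, []).append(
            {
                "label": label,
                "score": float(score),
                "segment": [float(start), float(end)],
            }
        )
    return {
        "version": "VERSION 1.3",
        "results": results,
        "external_data": {"used": bool(used_external), "details": details},
    }


# Hypothetical detections mirroring the illustrative example.
detections = [
    ("5n7NCViB5TU", "Discus throw", 0.64, (24.25, 38.08)),
    ("5n7NCViB5TU", "Shot put", 0.77, (11.25, 19.37)),
]
submission = build_submission(
    detections,
    True,
    "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set",
)
# Serialize to a JSON string; write this string to a file to submit it.
submission_json = json.dumps(submission, indent=2)
```

Parsing `submission_json` back with `json.loads` before uploading is a cheap check that the file is valid JSON.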