Task 2 – Temporal Action Localization

Despite recent advances in large-scale video analysis, temporal action localization remains one of the most challenging unsolved problems in computer vision. The lack of a reliable solution to this search problem hinders various real-world applications, ranging from consumer video summarization to surveillance, crowd monitoring, and elderly care. We are therefore committed to pushing forward the development of efficient and accurate automated methods that can search and retrieve events and activities in video collections. This task is intended to encourage computer vision researchers to design high-performance action localization systems.

Dataset

The ActivityNet Version 1.3 dataset will be used for this challenge. The dataset consists of ~20K untrimmed videos totaling more than 648 hours. It contains 200 different daily activities, such as 'walking the dog', 'long jump', and 'vacuuming floor'. The videos are split among training, validation, and testing at ~50%, ~25%, and 25% of the total respectively. The dataset annotations can be downloaded directly from here.

Evaluation Metric

The evaluation code used by the evaluation server can be found here.

Interpolated Average Precision (AP) is used as the metric for evaluating the results on each activity category. The AP is then averaged over all activity categories to obtain the mean AP (mAP). To determine whether a detection is a true positive, we compute its temporal intersection over union ($$\text{tIoU}$$) with a ground truth segment and check whether it is greater than or equal to a given threshold (e.g. $$\text{tIoU} \geq 0.5$$). The official metric used in this task is the average mAP, which is defined as the mean of all mAP values computed with tIoU thresholds between $$0.5$$ and $$0.95$$ (inclusive) with a step size of $$0.05$$.
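The tIoU test and the threshold sweep described above can be sketched in a few lines of Python. This is only an illustration and not the evaluation code used by the server. The function names (temporal_iou, average_map) and the per-threshold mAP values in the usage lines are hypothetical, and segments are assumed to be [start, end] pairs in seconds.

```python
def temporal_iou(pred, gt):
    """Temporal intersection over union of two [start, end] segments."""
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# tIoU thresholds from 0.5 to 0.95 (inclusive) with a step size of 0.05.
TIOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_map(map_per_threshold):
    """Mean of the mAP values obtained at each tIoU threshold."""
    return sum(map_per_threshold) / len(map_per_threshold)


# Usage with made-up numbers: one detection against one ground truth segment.
tiou = temporal_iou([24.25, 38.08], [25.0, 37.0])
is_match_at_half = tiou >= TIOU_THRESHOLDS[0]

# Hypothetical mAP values, one per threshold, only to show the final averaging.
example_maps = [0.40, 0.37, 0.34, 0.30, 0.26, 0.22, 0.17, 0.12, 0.07, 0.02]
final_score = average_map(example_maps)
```

The sketch leaves out the AP computation itself (ranking detections by score, matching each ground truth segment at most once, and interpolating the precision-recall curve); refer to the official evaluation code for those details.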

Baselines

Please refer to the summary of the last challenge for information about baselines and state-of-the-art methods.

Getting started

To encourage participation in this task, we have teamed up with other researchers to make the following resources available:

RGB frames extracted at 5FPS (~200GB).

Frame-level features for the frames above (~89GB).

Submission Format

Please use the following JSON format when submitting your results for the challenge:

{
  "version": "VERSION 1.3",
  "results": {
    "5n7NCViB5TU": [
      {
        "label": "Discus throw",
        "score": 0.64,
        "segment": [24.25, 38.08]
      },
      {
        "label": "Shot put",
        "score": 0.77,
        "segment": [11.25, 19.37]
      }
    ]
  },
  "external_data": {
    "used": true,
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set"
  }
}

In external_data, used is a Boolean flag, where true indicates the use of external data, and details is a string describing what kind of external data you used and how you used it.
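A file in the format above can be produced with Python's standard json module. The following is a minimal sketch, not an official tool. The helper name build_submission and the tuple layout of the detections list are assumptions made for illustration, and the values are taken from the example above.

```python
import json


def build_submission(detections, used_external_data, details):
    """Group (video_id, label, score, start, end) tuples into the submission layout."""
    results = {}
    for video_id, label, score, start, end in detections:
        # Each video ID maps to a list of scored, labeled segments.
        results.setdefault(video_id, []).append(
            {
                "label": label,
                "score": float(score),
                "segment": [float(start), float(end)],
            }
        )
    return {
        "version": "VERSION 1.3",
        "results": results,
        "external_data": {"used": bool(used_external_data), "details": details},
    }


# Hypothetical detections, reusing the values from the example above.
detections = [
    ("5n7NCViB5TU", "Discus throw", 0.64, 24.25, 38.08),
    ("5n7NCViB5TU", "Shot put", 0.77, 11.25, 19.37),
]
submission = build_submission(
    detections,
    used_external_data=True,
    details="First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set",
)

# Serialize to a JSON string; json.dump(submission, file_handle) would write it to a file.
submission_text = json.dumps(submission, indent=2)
```

Serializing through json.dumps guarantees quoted keys, lowercase true/false, and no trailing commas, which hand-written files often get wrong.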