Task 2 – Dense-Captioning Events in Videos

Most natural videos contain numerous events. For example, a video of a 'man playing a piano' might also contain another 'man dancing' or 'a crowd clapping'. This challenge studies the task of dense-captioning events, which involves both detecting and describing events in a video. This challenge uses the ActivityNet Captions dataset, a new large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20K videos amounting to 849 video hours with 100K total descriptions, each with its own start and end time.

For information related to this task, please contact: ranjaykrishna@gmail.com, shyamal@cs.stanford.edu


The ActivityNet Captions dataset will be used for this challenge. The dataset connects videos to a series of temporally annotated sentence descriptions. Each sentence covers a unique segment of the video, describing multiple events that occur. These events may occur over very long or short periods of time and are not limited in any capacity, allowing them to co-occur. On average, each of the 20K videos in ActivityNet Captions contains \(3.65\) temporally localized sentences, resulting in a total of 100K sentences. We find that the number of sentences per video follows a roughly normal distribution. Furthermore, as the video duration increases, the number of sentences also increases. Each sentence has an average length of \(13.48\) words, and sentence length is also roughly normally distributed.

Evaluation Metric

The evaluation code used by the evaluation server can be found here.

Inspired by the dense-image captioning metric, we use a similar metric to measure the joint ability of a model to both localize and caption events. This metric computes the average precision (AP) across temporal intersection over union (tIoU) thresholds of \(0.3\), \(0.5\), \(0.7\), and \(0.9\) when captioning the top 1000 proposals. We measure the precision of the captions using traditional evaluation metrics: BLEU, METEOR, and CIDEr.
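To make the tIoU thresholds concrete, the following is a minimal Python sketch of how the temporal overlap between a predicted segment and a ground-truth segment could be computed. The function names and the use of `>=` when comparing against a threshold are assumptions made for illustration. The sketch does not reproduce the caption scoring, and the evaluation code used by the server remains the authoritative definition of the metric.

```python
# Thresholds listed in the metric description.
TIOU_THRESHOLDS = (0.3, 0.5, 0.7, 0.9)


def temporal_iou(pred, gt):
    """Return the tIoU of two [start, end] segments given in seconds."""
    # Length of the overlap, clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def matched_thresholds(pred, gt):
    """List the thresholds at which pred would count as localizing gt.

    Assumption: a segment matches when its tIoU is >= the threshold.
    """
    tiou = temporal_iou(pred, gt)
    return [t for t in TIOU_THRESHOLDS if tiou >= t]
```

For instance, a prediction of [0, 10] against a ground truth of [5, 15] overlaps for 5 seconds out of a 15-second union, so under these assumptions it would only count at the 0.3 threshold.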


Baseline results are reported in the paper, available here.

Getting started

To encourage participation in this task, we have teamed up with other researchers to make the following resources available:

RGB frames extracted at 5FPS (~200GB).

Frame-level features for the frames above (~89GB).

Please take a look at the README for more details.

Submission Format

Please use the following JSON format when submitting your results for the challenge:

{
  "version": "VERSION 1.0",
  "results": {
    "v_5n7NCViB5TU": [
      {
        "sentence": "One player moves all around the net holding the ball", # String description of an event.
        "timestamp": [1.23, 4.53] # The start and end times of the event (in seconds).
      },
      {
        "sentence": "A small group of men are seen running around a basketball court playing a game",
        "timestamp": [5.24, 18.23]
      }
    ]
  },
  "external_data": {
    "used": true, # Boolean flag. True indicates the use of external data.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # This string details what kind of external data you used and how you used it.
  }
}

The example above is illustrative. Comments must be removed from your submission. A sample submission file can be downloaded here.
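As a rough illustration of producing a file in this format, the Python sketch below assembles the submission as a dictionary and serializes it with the standard `json` module, which emits no comments. The helper name `build_submission` and its sanity checks (string sentences, non-negative and ordered timestamps) are assumptions made for this example and are not requirements stated by the challenge.

```python
import json


def build_submission(results, used_external_data, details, version="VERSION 1.0"):
    """Assemble a submission dict after basic sanity checks on each event.

    The checks are illustrative assumptions, not rules from the challenge.
    """
    for video_id, events in results.items():
        for event in events:
            if not isinstance(event["sentence"], str):
                raise ValueError(f"{video_id}: sentence must be a string")
            start, end = event["timestamp"]
            if not 0 <= start <= end:
                raise ValueError(f"{video_id}: invalid timestamp {event['timestamp']}")
    return {
        "version": version,
        "results": results,
        "external_data": {"used": bool(used_external_data), "details": details},
    }


# Events taken from the example format above.
example_results = {
    "v_5n7NCViB5TU": [
        {"sentence": "One player moves all around the net holding the ball",
         "timestamp": [1.23, 4.53]},
        {"sentence": "A small group of men are seen running around a basketball court playing a game",
         "timestamp": [5.24, 18.23]},
    ]
}

submission = build_submission(
    example_results,
    used_external_data=True,
    details="First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set",
)

# Serialize to a JSON string; write this string to a file to submit it.
submission_json = json.dumps(submission, indent=2)
```

Building the structure in code and serializing it at the end avoids hand-written syntax errors such as missing braces or trailing commas.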