Challenge Description

Challenge Introduction

We are proud to announce that this year the challenge will host six diverse tasks that aim to push the limits of semantic visual understanding of videos and to bridge visual content with human captions. Three of the six tasks are based on the ActivityNet dataset, which was introduced at CVPR 2015 and is organized hierarchically in a semantic taxonomy. These tasks focus on tracing evidence of activities in time in the form of proposals, class labels, and captions.

In this installment of the challenge, we will host three guest tasks which enrich the understanding of visual information in videos. These tasks focus on complementary aspects of the activity recognition problem at large scale and involve challenging and recently compiled activity/action datasets, including Kinetics (Google DeepMind), AVA (Berkeley and Google), and Moments in Time (MIT and IBM Research).

ActivityNet Tasks

Task 1

2nd edition

Temporal Action Proposals (ActivityNet)

This task is intended to evaluate the ability of algorithms to generate high quality action proposals. The goal is to produce a set of candidate temporal segments that are likely to contain a human action.
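To make the expected output concrete, here is a minimal sketch (not the official evaluation code) that assumes proposals are represented as (start, end, score) triples in seconds and compares a proposal against a ground-truth segment with temporal intersection-over-union (tIoU), a common way to judge how well a candidate segment covers an action.

```python
def temporal_iou(proposal, ground_truth):
    """Temporal IoU between two (start, end) segments given in seconds."""
    start = max(proposal[0], ground_truth[0])
    end = min(proposal[1], ground_truth[1])
    intersection = max(0.0, end - start)
    union = (proposal[1] - proposal[0]) + (ground_truth[1] - ground_truth[0]) - intersection
    return intersection / union if union > 0 else 0.0

# Hypothetical candidate segments for one untrimmed video.
proposals = [(12.0, 25.5, 0.92), (30.1, 48.7, 0.81), (5.0, 9.3, 0.40)]
ground_truth = (11.0, 27.0)  # assumed annotation, for illustration only
for start, end, score in proposals:
    print(f"[{start:.1f}, {end:.1f}] score={score:.2f} "
          f"tIoU={temporal_iou((start, end), ground_truth):.2f}")
```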

Task 2

3rd edition

Temporal Action Localization (ActivityNet)

This task is intended to evaluate the ability of algorithms to temporally localize activities in untrimmed video sequences. Here, videos can contain more than one activity instance, and multiple activity categories can appear in the video.
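The sketch below illustrates one plausible way to organize localization results; it is an assumed structure, not the official submission schema. Each detection pairs a temporal segment with a class label and a confidence score, and a single untrimmed video may carry several detections of different classes.

```python
# Hypothetical video identifier and labels, for illustration only.
detections = {
    "video_id_001": [
        {"segment": [15.2, 42.8], "label": "Long jump", "score": 0.88},
        {"segment": [60.0, 95.4], "label": "Triple jump", "score": 0.63},
    ],
}

for video_id, items in detections.items():
    for det in items:
        start, end = det["segment"]
        print(f"{video_id}: {det['label']} in [{start}, {end}] (score {det['score']})")
```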

Task 3

2nd edition

Dense-Captioning Events in Videos (ActivityNet Captions)

This task involves both detecting and describing events in a video. For this task, participants will use the ActivityNet Captions dataset, a new large-scale benchmark for dense-captioning events.
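As a rough illustration (an assumed structure, not the official ActivityNet Captions format), a dense-captioning result both localizes each event in time and attaches a natural-language sentence to it:

```python
# Hypothetical video identifier, segments, and sentences, for illustration only.
dense_captions = {
    "video_id_002": [
        {"segment": [0.0, 18.5], "sentence": "A woman stretches next to a track."},
        {"segment": [18.5, 47.0], "sentence": "She sprints down the lane and leaps into the sand pit."},
    ],
}

for video_id, events in dense_captions.items():
    for event in events:
        start, end = event["segment"]
        print(f"{video_id} [{start}-{end}s]: {event['sentence']}")
```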

Guest Tasks

Task A

2nd edition

Trimmed Activity Recognition (Kinetics)

This task is intended to evaluate the ability of algorithms to recognize activities in trimmed video sequences. Here, videos contain a single activity, and all the clips have a standard duration of ten seconds. For this task, participants will use the Kinetics dataset, a large-scale benchmark for trimmed action classification.
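Since each clip carries a single activity, the task reduces to assigning one label per roughly ten-second clip. The sketch below is illustrative only: it assumes ranked class predictions per clip and scores them with top-k accuracy, a common metric for this kind of classification (not necessarily the official evaluation script).

```python
def top_k_accuracy(ranked_predictions, ground_truth_labels, k=5):
    """Fraction of clips whose true label appears among the top-k predictions."""
    hits = sum(1 for preds, label in zip(ranked_predictions, ground_truth_labels)
               if label in preds[:k])
    return hits / len(ground_truth_labels)

# Hypothetical predictions for three clips (most confident class first).
ranked_predictions = [
    ["riding a bike", "motorcycling", "skateboarding"],
    ["playing guitar", "playing ukulele", "strumming"],
    ["swimming", "snorkeling", "diving"],
]
ground_truth_labels = ["riding a bike", "playing ukulele", "diving"]
print(top_k_accuracy(ranked_predictions, ground_truth_labels, k=3))  # 1.0
```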

Task B

New

Spatio-temporal Action Localization (AVA)

This task is intended to evaluate the ability of algorithms to localize human actions in space and time. Each labeled video segment can contain multiple subjects, each performing potentially multiple actions. The goal is to identify these subjects and actions over continuous 15-minute video clips extracted from movies. For this task, participants will use the new AVA (Atomic Visual Actions) dataset.
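The following sketch shows one way such spatio-temporal output could be laid out; the clip identifier, timestamp, and action names are assumptions, not the official AVA format. At a given timestamp, each detected person is described by a bounding box plus one or more concurrent action labels.

```python
# Hypothetical clip id and timestamp (in seconds), for illustration only.
frame_annotations = {
    ("movie_clip_007", 902.0): [
        {"box": [0.12, 0.20, 0.45, 0.88],  # normalized [x1, y1, x2, y2]
         "actions": ["sit", "talk to a person"],
         "score": 0.91},
        {"box": [0.50, 0.15, 0.85, 0.90],
         "actions": ["stand", "listen to a person"],
         "score": 0.84},
    ],
}

for (clip_id, timestamp), people in frame_annotations.items():
    for person in people:
        print(f"{clip_id} @ {timestamp}s: box={person['box']} actions={person['actions']}")
```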

Task C

New

Trimmed Event Recognition (Moments in Time)

This task is intended to evaluate the ability of algorithms to classify events in trimmed 3-second videos. Here, each clip contains a single activity and has a standard duration of 3 seconds. There will be two tracks. The first track will use the Moments in Time dataset, a new large-scale dataset for video understanding with 800K videos in its training set. For the second track, participants will use the Moments in Time Mini dataset, a subset of Moments in Time with 100K videos in its training set.