We are proud to announce that this year the challenge will host seven diverse tasks, which aim to push the limits of semantic visual understanding of videos and to bridge visual content with human captions. Three of the seven tasks are based on the ActivityNet dataset, which was introduced at CVPR 2015 and is organized hierarchically in a semantic taxonomy. These tasks focus on tracing evidence of activities in time in the form of proposals, class labels, and captions.
In this installment of the challenge, we will host four guest tasks which enrich the understanding of visual information in videos. These tasks focus on complementary aspects of the activity recognition problem at large scale and involve challenging and recently compiled video understanding datasets, including Kinetics (Google DeepMind), AVA (Google), EPIC-Kitchens (University of Bristol), and VIRAT (NIST).
This task is intended to evaluate the ability of algorithms to recognize activities in trimmed video sequences. Here, videos contain a single activity, and all the clips have a standard duration of ten seconds. For this task, participants will use the Kinetics dataset, a large-scale benchmark for trimmed action classification.
This task is intended to evaluate the ability of algorithms to localize human actions in space and time. Each labeled video segment can contain multiple subjects, each potentially performing multiple actions. The goal is to identify these subjects and actions over continuous 15-minute video clips extracted from movies. For this task, participants will use the new AVA (Atomic Visual Actions) dataset.
This task is intended to evaluate the ability of algorithms to understand daily activities in egocentric videos. There will be three tracks, which focus on classifying actions in trimmed segments, detecting objects in egocentric videos, and anticipating future actions.
This task seeks to encourage the development of robust automatic activity detection algorithms for extended videos. Challenge participants will develop algorithms to detect and temporally localize instances of 18 different activities.