This task is intended to evaluate the ability of algorithms to localize human actions in space and time. Each labeled video segment can contain multiple subjects, each performing potentially multiple actions. The goal is to identify these subjects and actions over continuous 15-minute video clips extracted from movies.
For this task, participants will use the new AVA atomic visual actions dataset. The long-term goal of this dataset is to enable modeling of complex activities by building on current work in recognizing atomic actions.
This task will be divided into two challenges. Challenge #1 is strictly computer vision, i.e. participants are requested not to use signals derived from audio, metadata, etc. Challenge #2 lifts this restriction, allowing creative solutions that leverage any input modalities. We ask only that users document the additional data and features they use. Performance will be ranked separately for the two challenges.
For more information on this task, or to ask questions, please subscribe to the Google Group ava-dataset-users.
The AVA Dataset version v2.1 will be used for this task. The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute movie clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per human occurring frequently. Clips are drawn from contiguous segments of movies, to open the door for temporal reasoning about activities. The dataset is split into 235 videos for training, 64 videos for validation, and 131 videos for test. More information about how to download the AVA dataset is available here.
The list of test videos is now available on the AVA website, along with details of which timestamps will be used for testing.
The official metric used in this task is the Frame-mAP at spatial IoU >= 0.5. Since action frequency in AVA follows the natural distribution, the metric is averaged across the top 60 most common action classes in AVA, listed here.
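The IoU >= 0.5 criterion can be illustrated with a minimal sketch. It assumes each box is given as (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2; the function names box_iou and is_match are hypothetical and are not taken from the official evaluation code, which may differ in detail.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    Assumes x1 <= x2 and y1 <= y2 for both boxes (an assumption of this sketch).
    """
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(predicted_box, ground_truth_box, threshold=0.5):
    """True when the predicted box overlaps the ground truth at IoU >= threshold."""
    return box_iou(predicted_box, ground_truth_box) >= threshold
```

Under this sketch, a predicted box only counts toward a class's average precision when is_match holds against a ground-truth box carrying the same action label in the same frame.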
When submitting your results for this task, please use the same CSV format used for the ground truth AVA train/val files, with the addition of a score column for each box-label.
The format of a row is the following: video_id, middle_frame_timestamp, person_box, action_id, score
An example taken from the validation set is:
1j20qq1JyX4,0902,0.002,0.118,0.714,0.977,12,0.9
1j20qq1JyX4,0905,0.193,0.016,1.000,0.978,11,0.8
1j20qq1JyX4,0905,0.193,0.016,1.000,0.978,74,0.96
20TAGRElvfE,0907,0.285,0.559,0.348,0.764,17,0.72
...
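Rows in this format could be read and written along the following lines. This is a sketch, not an official tool: it assumes person_box expands to four comma-separated values in the order x1, y1, x2, y2, so that each row has eight fields, and the names Detection, parse_rows, and format_row are hypothetical.

```python
import csv
import io
from typing import NamedTuple, Tuple


class Detection(NamedTuple):
    video_id: str
    timestamp: str  # kept as text so leading zeros such as "0902" survive
    box: Tuple[float, float, float, float]  # assumed order: x1, y1, x2, y2
    action_id: int
    score: float


def parse_rows(text):
    """Parse submission rows from CSV text, skipping lines without 8 fields."""
    detections = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != 8:
            continue  # blank or malformed line
        video_id, timestamp, x1, y1, x2, y2, action_id, score = record
        box = (float(x1), float(y1), float(x2), float(y2))
        detections.append(
            Detection(video_id, timestamp, box, int(action_id), float(score))
        )
    return detections


def format_row(detection):
    """Serialize one detection back into a submission row."""
    x1, y1, x2, y2 = detection.box
    return (
        f"{detection.video_id},{detection.timestamp},"
        f"{x1:.3f},{y1:.3f},{x2:.3f},{y2:.3f},"
        f"{detection.action_id},{detection.score}"
    )
```

Keeping the timestamp as a string is a deliberate choice here: converting it to an integer would drop the leading zero seen in the example rows.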