The ActivityNet-Entities Object Localization Task evaluates how grounded, or faithful, a description (either generated or ground-truth) is to the video it describes.
An object word is first identified in the description and then localized in the video in the form of a spatial bounding box. The prediction is compared against the human annotation to determine the correctness and overall localization accuracy.
The challenge evaluation server will be similar to the existing Codalab server.
For any questions, please contact Luowei (luozhou[AT]umich.edu).
ActivityNet-Entities is based on the video description dataset ActivityNet Captions and augments it with 158k bounding box annotations, each grounding a noun phrase (NP). In this challenge, we will use pre-processed object-based annotations that link individual object words to their corresponding regions in the video. This gives 432 unique object categories.
The original dataset consists of 10k/2.5k/2.5k videos for training/validation/testing. The three splits contain 35k/8.6k/8.5k event segments (each with one sentence description) and 105k/26.5k/26.1k bounding box annotations, respectively. We are actively collecting a new “hidden” test set based on the ActivityNet Captions test data, for which the video descriptions are not public. This enables us to evaluate both video description quality and object localization quality.
For the bounding box annotation, we first uniformly sample 10 frames from each event segment. Each object mentioned in the description is then annotated sparsely, in only one of those frames, chosen so that the object can be clearly observed. More details regarding the annotation and download instructions can be found here.
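As an illustration of the uniform sampling step, the sketch below picks 10 timestamps from an event segment. The function name and the choice to sample at the centers of equal-width bins are assumptions made for this example; the exact offsets used when building the dataset may differ.

```python
def sample_frame_times(start, end, num_frames=10):
    """Return num_frames timestamps (in seconds) evenly spaced over a segment.

    Assumption: each timestamp sits at the center of one of num_frames
    equal-width bins. The dataset's actual sampling offsets may differ.
    """
    if end <= start:
        raise ValueError("segment end must be greater than segment start")
    step = (end - start) / num_frames
    return [start + step * (i + 0.5) for i in range(num_frames)]


# Example: a 10-second event segment starting at t = 0.
times = sample_frame_times(0.0, 10.0)
```

Only one of the sampled frames carries a human annotation for a given object, which is why the evaluation later looks at a single frame per object.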
Due to the sparsity of the annotation, we ask all participants to submit localization results for every object category that appears in the target sentence, on all 10 sampled frames. Only the prediction on the frame that carries the GT annotation is assessed: it is compared against the human annotation and counted as correct if its IoU with the annotated box exceeds 50%, and as incorrect otherwise. Localization accuracy is computed per object category and then averaged over the unique object categories. Depending on the availability of the video description during inference, we divide the challenge into two sub-tasks:
Sub-task I: Grounding on GT Sentences (public test set). The same data as the ActivityNet-Entities test set, which comes from the ActivityNet Captions val set. GT sentences are provided.
Sub-task II: Grounding on Generated Sentences (hidden test set). Based on the ActivityNet Captions test set. Collection is ongoing, and the video IDs will be announced during the challenge. GT sentences will NOT be provided, so participants must submit their own predicted sentences for evaluation.
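The localization accuracy described above (an IoU threshold of 50% on the annotated frame, followed by averaging over object categories) can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function names, the `(x1, y1, x2, y2)` box format, and the input layout are assumptions made for the example.

```python
from collections import defaultdict


def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(samples, iou_threshold=0.5):
    """Class-averaged localization accuracy.

    samples: iterable of (category, predicted_box, gt_box), one entry per
    annotated object, where predicted_box is the prediction on the frame
    that carries the GT annotation (hypothetical input layout).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, pred_box, gt_box in samples:
        totals[category] += 1
        # Strictly greater than the threshold, following ">50% IoU".
        if box_iou(pred_box, gt_box) > iou_threshold:
            hits[category] += 1
    if not totals:
        return 0.0
    per_class = [hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


samples = [
    ("dog", (0, 0, 10, 10), (0, 0, 10, 10)),     # exact match, correct
    ("dog", (0, 0, 10, 10), (20, 20, 30, 30)),   # no overlap, incorrect
    ("ball", (0, 0, 10, 10), (0, 0, 10, 8)),     # IoU 0.8, correct
]
accuracy = localization_accuracy(samples)
```

Averaging per category rather than per box keeps frequent categories from dominating the score, so rare objects count as much as common ones.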
Evaluation metrics are defined in detail in Sec. 5.1 and A.2 of [Zhou et al. CVPR 2019]. The winner is determined by the highest localization accuracy on Sub-task I and by the highest F1_all on Sub-task II.
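For intuition about a class-averaged F1 score such as F1_all, the sketch below combines per-category precision and recall. It assumes that a prediction counts as correct only when the object word is both correctly generated and correctly localized, that precision divides by the number of predicted occurrences, that recall divides by the number of GT occurrences, and that the average runs over categories present in either set. These are assumptions for illustration; the definitions in the cited paper are authoritative, and the function name and input layout are hypothetical.

```python
def class_averaged_f1(pred_counts, gt_counts, correct_counts):
    """Average of per-category F1 scores.

    pred_counts[c]:    times category c was mentioned in generated sentences
    gt_counts[c]:      times category c occurs in the GT annotations
    correct_counts[c]: predictions of c that are both correctly generated
                       and correctly localized (assumed definition)
    """
    categories = set(pred_counts) | set(gt_counts)
    if not categories:
        return 0.0
    scores = []
    for c in categories:
        correct = correct_counts.get(c, 0)
        n_pred = pred_counts.get(c, 0)
        n_gt = gt_counts.get(c, 0)
        precision = correct / n_pred if n_pred else 0.0
        recall = correct / n_gt if n_gt else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


score = class_averaged_f1(
    pred_counts={"dog": 4, "ball": 2},
    gt_counts={"dog": 2, "ball": 2},
    correct_counts={"dog": 2, "ball": 1},
)
```

Unlike the accuracy used for Sub-task I, a score of this kind penalizes both object words that are generated but not grounded and GT objects that the generated sentence never mentions.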
The baseline code GVD is available at: https://github.com/facebookresearch/grounded-video-description