This task evaluates the ability of algorithms to understand daily activities in egocentric videos. It is divided into three challenges:
Challenge #1 - Object Detection: Detect and localise objects from 290 classes in individual frames; the class distribution is long-tailed.
Challenge #2 - Action Recognition: Given the start and end times of a segment in an untrimmed video, classify the varying-length segment into its action class, composed of a verb and a noun (125 verb classes, 331 noun classes).
Challenge #3 - Action Anticipation: Predict the action class (again 125 verb classes and 331 noun classes) of an upcoming action segment, observing only video that precedes the action's start time by at least a fixed anticipation time of 1 second.
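To make the anticipation protocol concrete, the following sketch computes which frames a model is allowed to observe before an action. The observation duration is a free choice of the participant (the 2-second value here is purely illustrative, not prescribed by the challenge); only the 1-second anticipation gap and the 60fps frame rate come from the task description.

```python
FPS = 60  # EPIC-KITCHENS footage is recorded at 60fps


def observed_segment(action_start_s, anticipation_s=1.0, observation_s=2.0):
    """Return (start_frame, end_frame) of the video a model may observe
    before an action, for the anticipation challenge.

    The observed segment must end `anticipation_s` seconds before the
    action starts; `observation_s` (how far back the model looks) is an
    illustrative assumption, not fixed by the challenge.
    """
    end_s = action_start_s - anticipation_s    # stop 1s before the action
    start_s = max(0.0, end_s - observation_s)  # clip at the video start
    return int(start_s * FPS), int(end_s * FPS)


# For an action starting at t=10s, the model may observe frames 420-540,
# i.e. the window from 7s to 9s under the illustrative 2s observation.
print(observed_segment(10.0))
```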
EPIC-KITCHENS is the largest egocentric video benchmark, recorded by 32 participants in their native kitchen environments. The videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants of 10 different nationalities, resulting in highly diverse cooking styles. All footage is shot on a head-mounted camera in Full HD at 60fps. The dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. For each of the three challenges above, the test set is split between 28 previously seen kitchens and 4 unseen kitchens, allowing us to assess how well methods generalise to novel environments. All challenges are set up on CodaLab with baseline results, and are open for submissions.
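For the detection challenge, matching a predicted bounding box against one of the ground-truth boxes is conventionally done with intersection-over-union (IoU). The sketch below is a minimal, generic IoU implementation under the common (x1, y1, x2, y2) corner convention; the specific overlap threshold and scoring protocol used by the challenge are not restated here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2. IoU is the
    standard overlap criterion for deciding whether a detection matches
    a ground-truth box (e.g. IoU >= 0.5 in PASCAL-style evaluation).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle: overlap of the two boxes (empty -> area 0).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of areas minus the double-counted intersection.
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0
```

Usage: two unit boxes offset by one pixel in each direction overlap in a 1x1 region out of a 7-unit union, giving IoU = 1/7.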