Abstract: Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. Such mistakes are inherently open-set: unforeseen or novel errors may occur, so robust detection systems cannot rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual-branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch leverages the pattern-matching capabilities of Large Language Models (LLMs) to predict the next action token from previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection and show the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method proves robust and effective in online applications.
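The mismatch rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the step labels, the function names (`aggregate_frames`, `detect_mistake`, `toy_anticipate`), and the lookup table standing in for the LLM-based anticipation branch are all hypothetical, and the loop runs over a complete label sequence rather than a live stream, purely for brevity.

```python
from itertools import groupby
from typing import Callable, List, Optional, Sequence

def aggregate_frames(frame_labels: Sequence[str]) -> List[str]:
    """Collapse runs of identical frame-level labels into action tokens."""
    return [label for label, _ in groupby(frame_labels)]

def detect_mistake(
    frame_labels: Sequence[str],
    anticipate: Callable[[List[str]], Optional[str]],
) -> Optional[int]:
    """Return the index of the first action token that disagrees with the
    anticipated one, or None if no mismatch is found."""
    tokens = aggregate_frames(frame_labels)
    for i in range(1, len(tokens)):
        expected = anticipate(tokens[:i])  # prediction from the history only
        if expected is not None and tokens[i] != expected:
            return i
    return None

# Hypothetical stand-in for the anticipation branch: a fixed lookup table
# instead of an LLM prompted with the previously predicted action tokens.
NEXT_STEP = {"take_screw": "insert_screw", "insert_screw": "tighten_screw"}

def toy_anticipate(history: List[str]) -> Optional[str]:
    """Predict the next step from the last recognized one (None if unknown)."""
    return NEXT_STEP.get(history[-1])

correct = ["take_screw"] * 3 + ["insert_screw"] * 2 + ["tighten_screw"] * 2
skipped = ["take_screw"] * 3 + ["tighten_screw"] * 2

print(detect_mistake(correct, toy_anticipate))
print(detect_mistake(skipped, toy_anticipate))
```

Because the rule only compares a recognized step against an expected one, it needs no examples of failure, which is what makes the open-set setting tractable in this sketch.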

EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos Referring to Procedural Texts

PREGO: online mistake detection in PRocedural EGOcentric videos

TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

The Bystander Affect Detection (BAD) Dataset for Failure Detection in HRI

Oops! Predicting Unintentional Action in Video

CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

PARSE-Ego4D: Personal Action Recommendation Suggestions for Egocentric Videos

Anticipating Object State Changes in Long Procedural Videos

Object Aware Egocentric Online Action Detection

Egok360: A 360 Egocentric Kinetic Human Activity Video Dataset

Precise Affordance Annotation for Egocentric Action Video Datasets

Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

Hexamethyidisiloxane: A 13-week subchronic whole-body vapor inhalation toxicity study in Fischer 344 rats.