EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos with Procedural Texts

Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, Shinsuke Mori
2024-10-07
Abstract: Mistake action detection from egocentric videos is crucial for developing intelligent archives that detect workers' errors and provide feedback. Previous studies have been limited to specific domains, focused on detecting mistakes from videos without procedural texts, and analyzed whether actions are mistakes. To address these limitations, in this paper, we propose the EgoOops dataset, which includes egocentric videos, procedural texts, and three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. EgoOops covers five procedural domains and includes 50 egocentric videos. The video-text alignment allows the model to detect mistakes based on both videos and procedural texts. The mistake labels and descriptions enable detailed analysis of real-world mistakes. Based on EgoOops, we tackle two tasks: video-text alignment and mistake detection. For video-text alignment, we enhance the recent StepFormer model with an additional loss for fine-tuning. Based on the alignment results, we propose a multi-modal classifier to predict mistake labels. In our experiments, the proposed methods achieve higher performance than the baselines. In addition, our ablation study demonstrates the effectiveness of combining videos and texts. We will release the dataset and codes upon publication.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses three limitations of existing research on mistake action detection:

1. **Relying solely on video data**: Most previous studies detect mistakes from videos alone, without procedural texts. Procedural activities are multimodal, however: people performing a task rely on text instructions as well as visual information. Combining videos and texts is therefore necessary to detect mistakes more accurately.
2. **Coarse-grained annotation of mistake types**: Most works only analyze whether an action is a mistake, without classifying mistakes at a finer granularity. Mistakes take many forms, such as using the wrong object or skipping a step, so detailed mistake labels are needed to analyze common mistake patterns in the real world.
3. **Domain limitations**: Existing datasets concentrate on domains such as assembly and cooking. Collecting data from a wider range of domains is necessary for a comprehensive understanding.

To address these problems, the authors propose a new dataset, **EgoOops**, with the following characteristics:

- **Egocentric videos and procedural texts**: EgoOops contains 50 egocentric (first-person-view) videos with corresponding procedural texts, covering five domains: circuit assembly, color-mixing experiments, ion-reaction experiments, toy building block construction, and cardboard crafts.
- **Three types of annotations**:
  - Video-text alignment
  - Mistake labels
  - Descriptions explaining the mistakes

Based on EgoOops, the authors tackle two tasks:

1. **Video-text alignment**: The authors extend the StepFormer model with an additional loss function (StepFormer++) to align video segments with the steps of the procedural text.
2. **Mistake action detection**: Based on the alignment results, a multimodal classifier is trained to predict the label ("correct", "wrong", or "corrected") of each video segment.

With these contributions, the authors aim to enable an intelligent video archive system that records workers' activities, detects their mistakes, and provides feedback, thereby improving work efficiency and safety.
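
To make the annotation structure concrete, the following is a minimal sketch of how one recording with its three annotation types might be represented. The class names `Segment` and `Recording`, all field names, and the example values are hypothetical and chosen only for illustration; the summary does not describe the dataset's actual release format, and the label strings are taken from the list above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Label set as listed in the summary; the exact strings are an assumption.
LABELS = ("correct", "wrong", "corrected")


@dataclass
class Segment:
    """One video segment aligned to a procedural step (hypothetical schema)."""
    start_sec: float
    end_sec: float
    step_index: int                      # index into Recording.steps
    label: str = "correct"               # mistake label for this segment
    description: Optional[str] = None    # free-text explanation of a mistake

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")
        if self.end_sec <= self.start_sec:
            raise ValueError("segment must have positive duration")


@dataclass
class Recording:
    """One egocentric video with its procedural text and annotations."""
    domain: str
    steps: List[str]                     # procedural text, one entry per step
    segments: List[Segment] = field(default_factory=list)

    def label_counts(self) -> Counter:
        """Count segments per mistake label."""
        return Counter(seg.label for seg in self.segments)

    def unaligned_steps(self) -> List[int]:
        """Indices of steps that no segment is aligned to."""
        covered = {seg.step_index for seg in self.segments}
        return [i for i in range(len(self.steps)) if i not in covered]


# Toy values for illustration only; these are not taken from the dataset.
rec = Recording(
    domain="circuit assembly",
    steps=["Place the battery holder.", "Connect the LED.", "Close the switch."],
    segments=[
        Segment(0.0, 12.5, 0),
        Segment(12.5, 30.0, 1, label="wrong",
                description="LED inserted with reversed polarity."),
        Segment(30.0, 41.0, 1, label="corrected",
                description="LED re-inserted correctly."),
    ],
)
print(rec.label_counts())
print(rec.unaligned_steps())  # [2]
```

Keeping the alignment (`step_index`), the label, and the description on the same segment record mirrors the idea that mistake detection operates on aligned video-text pairs; a step with no aligned segment is one simple signal that a step may have been skipped.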