GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension

Jiafeng Liang, Shixin Jiang, Zekun Wang, Haojie Pan, Zerui Chen, Zheng Chu, Ming Liu, Ruiji Fu, Zhongyuan Wang, Bing Qin
2024-06-26
Abstract: There are substantial instructional videos on the Internet, which provide us tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level, lacking experiential guidelines at the task level, which can lead to beginners struggling to learn new tasks due to the lack of relevant experience. Moreover, the specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guide of guideline. We evaluate plenty of foundation models with GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe that it can be used as a better benchmark for instructional video comprehension.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses three shortcomings of existing instructional video datasets:

1. **Lack of Systematicity**: Existing datasets annotate specific steps only at the video level and omit task-level guidelines, so beginners, who lack relevant experience, struggle to learn new tasks.
2. **Incoherent Specific Steps**: Without a guideline to organize them, specific steps appear trivial and unsystematic, which makes it hard to provide a clear tutorial.
3. **Increased Learning Difficulty**: Instructional videos for the same task often differ significantly in details and step order, which makes the task harder for beginners to learn.

To address these problems, the authors propose the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos covering 560 instructional tasks in 8 domains of daily life. Each task is annotated with a guideline that represents the common pattern shared by all videos of that task. On this basis, each video is annotated with systematic specific steps: for every specific step, the annotation records the guideline step it belongs to, a description of the step, and its timestamps.

The paper also defines three sub-tasks to evaluate how well models comprehend instructional videos:

- **Step Captioning**: The model generates captions for specific steps from the video.
- **Guideline Summarization**: The model mines the common pattern in task-related videos and summarizes it as a guideline.
- **Guideline-Guided Captioning**: The model generates captions for specific steps under the guidance of the guideline.

The authors evaluate a range of foundation models on GUIDE and argue that, given its diversity and practicality, it can serve as a better benchmark for instructional video comprehension and help beginners learn new tasks more quickly.
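To make the annotation hierarchy concrete, the following is a minimal Python sketch of how a task-level guideline and per-video specific steps could be represented. The class names, field names, the `steps_by_guideline` helper, and the example task are all hypothetical and chosen for illustration only; they are not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpecificStep:
    """One video-level step (hypothetical field names)."""
    guideline_step: int   # index of the associated guideline step
    description: str      # specific step description
    start_sec: float      # timestamp: start of the step in the video
    end_sec: float        # timestamp: end of the step in the video


@dataclass
class VideoAnnotation:
    video_id: str
    steps: List[SpecificStep] = field(default_factory=list)


@dataclass
class InstructionalTask:
    name: str
    domain: str
    guideline: List[str]  # task-level guideline shared by all videos
    videos: List[VideoAnnotation] = field(default_factory=list)


def steps_by_guideline(task: InstructionalTask,
                       video: VideoAnnotation) -> Dict[int, List[str]]:
    """Group a video's specific step descriptions under guideline steps."""
    grouped: Dict[int, List[str]] = {i: [] for i in range(len(task.guideline))}
    for step in video.steps:
        grouped[step.guideline_step].append(step.description)
    return grouped


# Invented example, not taken from the dataset.
task = InstructionalTask(
    name="Brew pour-over coffee",
    domain="Cooking",
    guideline=["Prepare the equipment", "Brew the coffee", "Serve"],
)
video = VideoAnnotation(
    video_id="example_video",
    steps=[
        SpecificStep(0, "Rinse the paper filter", 5.0, 12.0),
        SpecificStep(0, "Grind the beans", 12.0, 30.0),
        SpecificStep(1, "Pour water in slow circles", 30.0, 95.0),
        SpecificStep(2, "Pour the coffee into a cup", 95.0, 104.0),
    ],
)
task.videos.append(video)
grouped = steps_by_guideline(task, video)
```

In this sketch, Step Captioning corresponds to predicting each `description` from the video, Guideline Summarization to predicting `guideline` from several videos of the same task, and Guideline-Guided Captioning to predicting each `description` when `guideline` is also given as input.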