Abstract: There is a substantial number of instructional videos on the Internet, which provide tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level and lack experiential guidelines at the task level, which can leave beginners struggling to learn new tasks for want of relevant experience. Moreover, specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate the comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guidance of the guideline. We evaluate many foundation models with GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe that it can be used as a better benchmark for instructional video comprehension.

What problem does this paper attempt to address?

The paper aims to address the following issues in existing instructional video datasets:

1. **Lack of Systematicity**: Existing datasets annotate specific steps only at the video level and neglect task-level guidelines, so beginners struggle to learn new tasks without the relevant experience.
2. **Incoherent Specific Steps**: Without guidelines, specific steps appear trivial and unsystematic, which makes it hard to provide a clear tutorial.
3. **Increased Learning Difficulty**: Videos for the same task often differ significantly in details and step order, which increases the learning difficulty for beginners.

To solve these problems, the authors propose the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos covering 560 instructional tasks in 8 domains of daily life. Each task is annotated with a guideline representing the common pattern shared by all task-related videos. On this basis, systematic specific steps are annotated, each with its associated guideline step, a specific step description, and timestamps.

The paper also defines three sub-tasks to evaluate a model's understanding of instructional videos:

- **Step Captioning**: Requires the model to generate captions for specific steps from the video.
- **Guideline Summarization**: Requires the model to mine the common pattern from task-related videos and summarize it into a guideline.
- **Guideline-Guided Captioning**: Requires the model to generate captions for specific steps under the guidance of the guideline.

Through these efforts, the authors hope that GUIDE can become a better benchmark for evaluating models' understanding of instructional videos and accelerate the learning process for beginners.
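The two annotation levels described above (a task-level guideline, plus video-level specific steps that each point back to a guideline step) can be pictured as a small data model. The following is only an illustrative sketch: the class names, field names, and example values are hypothetical and are not taken from the released dataset's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpecificStep:
    """One annotated step in a single video (hypothetical field names)."""
    guideline_step: int  # index of the associated step in the task guideline
    description: str     # video-specific description of the step
    start: float         # start timestamp in seconds
    end: float           # end timestamp in seconds


@dataclass
class Video:
    video_id: str
    steps: list[SpecificStep] = field(default_factory=list)


@dataclass
class Task:
    name: str
    domain: str
    guideline: list[str]  # common pattern shared by all videos of the task
    videos: list[Video] = field(default_factory=list)


def group_by_guideline_step(task: Task) -> dict[int, list[str]]:
    """Collect specific step descriptions from all videos under each guideline step."""
    grouped: dict[int, list[str]] = {i: [] for i in range(len(task.guideline))}
    for video in task.videos:
        for step in video.steps:
            grouped[step.guideline_step].append(step.description)
    return grouped


# Invented example data, for illustration only.
task = Task(
    name="make coffee",
    domain="cooking",
    guideline=["Prepare ingredients", "Brew", "Serve"],
    videos=[
        Video("v1", [
            SpecificStep(0, "Grind the beans", 0.0, 12.5),
            SpecificStep(1, "Pour hot water over the grounds", 12.5, 40.0),
        ]),
        Video("v2", [
            SpecificStep(0, "Measure instant coffee", 3.0, 9.0),
            SpecificStep(2, "Pour into a mug", 30.0, 35.0),
        ]),
    ],
)
grouped = group_by_guideline_step(task)
```

Grouping specific steps by guideline step in this way shows how videos of the same task can differ in detail (grinding beans versus measuring instant coffee) while still following one shared guideline, which is the pattern the Guideline Summarization sub-task asks models to recover.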

GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension
