Abstract: Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve-&-Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve: filters irrelevant transcripts and (ii) Swap: acquires high-quality text by replacing transcripts with human-written instructions from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve-&-Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. We have released code and dataset.
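The two stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Segment` type, the `sieve_and_swap` function, the token-overlap (Jaccard) similarity, and the `threshold` value are all assumptions chosen to keep the example self-contained. The abstract does not specify how relevance or matching is actually computed.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcript sentence, or the instruction swapped in for it


def tokens(text: str) -> Set[str]:
    """Lowercase word set with surrounding punctuation stripped."""
    return {w.strip(".,!?;:").lower() for w in text.split()} - {""}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-overlap similarity; a stand-in for whatever matcher is really used."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sieve_and_swap(segments: List[Segment], instructions: List[str],
                   threshold: float = 0.2) -> List[Segment]:
    """Hypothetical sketch of the two stages.

    Sieve: drop transcript segments that match no instruction well enough.
    Swap: replace the text of each kept segment with its best-matching
    human-written instruction, keeping the segment's timestamps.
    """
    kept = []
    for seg in segments:
        seg_tokens = tokens(seg.text)
        best = max(instructions,
                   key=lambda ins: jaccard(seg_tokens, tokens(ins)),
                   default=None)
        if best is None:
            continue  # no instructions available, nothing to swap in
        if jaccard(seg_tokens, tokens(best)) >= threshold:
            kept.append(Segment(seg.start, seg.end, best))
    return kept
```

For example, given a greeting segment and a segment that says "so now chop the onions finely", together with the instructions "Chop the onions." and "Boil the pasta for ten minutes.", the greeting is sieved out and the second segment keeps its timestamps but carries the text "Chop the onions."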

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the challenges of step localization and instruction generation in understanding procedural videos (such as cooking tutorials). Specifically, it focuses on the following aspects:

1. **Step localization**: How to accurately locate the time span of each step in the video.
2. **Instruction generation**: How to generate detailed text instructions corresponding to the video steps.
3. **Data annotation cost**: Manually annotating steps and writing instructions is costly, which limits the size of existing datasets and hinders effective model training.
4. **Resource requirements of large-scale pre-training**: Pre-training on large but noisy video-transcript datasets can improve performance, but it requires substantial computing resources.
5. **Text style differences**: Video transcripts usually contain irrelevant content and differ in style from human-written instructions, which degrades multi-modal learning.

### Solutions

To address these problems, the paper proposes the following methods:

1. **Sieve-&-Swap technique**:
   - **Sieve**: Filters out irrelevant video transcripts.
   - **Swap**: Replaces relevant transcripts with human-written instructions taken from a text-only recipe dataset.
2. **Procedure Transformer (ProcX)**:
   - An end-to-end model for step localization and instruction generation.
   - Combining pre-training and fine-tuning, it can be trained efficiently on smaller datasets.

### Main contributions

1. **Sieve-&-Swap method**:
   - Generates a smaller but higher-quality pre-training dataset with less text noise.
   - The resulting dataset (about 48,000 videos) is three orders of magnitude smaller than existing web-scale datasets, yet it can still train the model effectively.
2. **Sieve-&-Swap dataset**:
   - Used for pre-training models, easing the compute and storage burden of large-scale datasets.
3. **Procedure Transformer (ProcX)**:
   - Introduces key-aware deformable attention and contrastive transformers, improving the model's multi-modal learning ability.
   - Uses an IoU-aware confidence score to improve confidence prediction.

### Experimental results

- Experiments on the YouCook2 and Tasty datasets show that ProcX pre-trained with Sieve-&-Swap achieves state-of-the-art performance with less training data.
- Compared with a model pre-trained on the original transcripts, the Sieve-&-Swap pre-trained model generates more coherent instructions (SODA-C metric).

Through these methods, the paper addresses the key problems in procedural video understanding with a more efficient, higher-quality training pipeline.
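The IoU-aware confidence score mentioned above builds on the temporal intersection-over-union between a predicted step segment and a ground-truth segment. The summary does not say how ProcX turns this quantity into a confidence target, so the sketch below shows only the standard temporal IoU computation. The function name `temporal_iou` and the `(start, end)` tuple format are assumptions made for illustration.

```python
from typing import Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0
```

A predicted segment of 0 to 10 seconds against a ground-truth segment of 5 to 15 seconds overlaps for 5 seconds out of a 15-second union, giving an IoU of one third. Disjoint segments give 0 and identical segments give 1.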

Efficient Pre-training for Localized Instruction Generation of Videos

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

Procedure-Aware Pretraining for Instructional Video Understanding

Ingredient-enriched Recipe Generation from Cooking Videos

A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos

Learning and Verification of Task Structure in Instructional Videos

Learning To Recognize Procedural Activities with Distant Supervision

GePSAn: Generative Procedure Step Anticipation in Cooking Videos

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

Order-Based Pre-training Strategies for Procedural Text Understanding

Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks

A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos

What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

Video-based Recipe Retrieval

Step Differences in Instructional Video

A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks

Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos

StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos

Multi-Sentence Grounding for Long-term Instructional Video

COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark