Efficient Pre-training for Localized Instruction Generation of Videos

Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
2024-07-21
Abstract: Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve-&-Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve: filters irrelevant transcripts and (ii) Swap: acquires high-quality text by replacing transcripts with human-written instruction from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve-&-Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. We have released code and dataset.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the challenges of step localization and instruction generation in understanding procedural videos (such as cooking tutorials). Specifically, it focuses on the following aspects:

1. **Step localization**: How to accurately locate the temporal segment of each step in the video.
2. **Instruction generation**: How to generate detailed textual instructions corresponding to the video steps.
3. **Data annotation cost**: Manually annotating steps and writing instructions is costly, which limits the size of existing datasets and hinders effective training.
4. **Compute cost of large-scale pre-training**: Pre-training on large but noisy video-transcript datasets can improve performance, but it requires substantial computing resources.
5. **Text style differences**: Video transcripts usually contain irrelevant content and differ in style from human-written instructions, which hurts multimodal learning.

### Solutions

To address these problems, the paper proposes the following methods:

1. **Sieve-&-Swap**:
   - **Sieve**: Filters out irrelevant transcripts.
   - **Swap**: Replaces relevant transcripts with human-written instructions taken from a text-only recipe dataset.
2. **Procedure Transformer (ProcX)**:
   - An end-to-end model for step localization and instruction generation.
   - Pre-trained on the curated dataset and then fine-tuned, it trains efficiently on smaller datasets.

### Main contributions

1. **Sieve-&-Swap method**:
   - Generates a smaller but higher-quality pre-training dataset with less text noise.
   - The resulting dataset (about 48,000 videos) is three orders of magnitude smaller than existing web-scale datasets, yet it can still train the model effectively.
2. **Sieve-&-Swap dataset**:
   - Used for pre-training, it reduces the compute and storage burden of large-scale datasets.
3. **Procedure Transformer (ProcX)**:
   - Introduces key-aware deformable attention and contrastive transformers, improving the model's multimodal learning ability.
   - Improves confidence prediction with an IoU-aware confidence score.

### Experimental results

- Experiments on the YouCook2 and Tasty datasets show that ProcX pre-trained with Sieve-&-Swap achieves state-of-the-art performance with less training data.
- Compared with the model pre-trained on the original transcripts, the Sieve-&-Swap pre-trained model generates more coherent instructions, as measured by the SODA-C metric.

Together, these methods address the key problems in procedural video understanding with a more data-efficient, higher-quality training pipeline.
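
The Sieve-&-Swap idea summarized above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`tokens`, `jaccard`, `sieve_and_swap`), the token-overlap similarity, and the `threshold` value are all assumptions chosen for illustration, standing in for whatever relevance scoring the paper actually uses. The sketch only shows the two-stage shape: drop transcript segments that match no recipe step (Sieve), then replace the surviving segments with the best-matching human-written instruction while keeping their timestamps (Swap).

```python
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for a learned text encoder (assumption)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sieve_and_swap(transcript_segments, recipe_steps, threshold=0.2):
    """Hypothetical sketch of the Sieve-&-Swap idea.

    transcript_segments: list of (start_sec, end_sec, transcript_text)
    recipe_steps: list of human-written instruction strings
    Returns a list of (start_sec, end_sec, instruction) pairs.
    """
    step_tokens = [tokens(step) for step in recipe_steps]
    curated = []
    for start, end, text in transcript_segments:
        seg = tokens(text)
        scores = [jaccard(seg, st) for st in step_tokens]
        if not scores:
            continue  # no recipe text to compare against
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            continue  # Sieve: segment matches no instruction well enough
        # Swap: keep the timestamps, replace the transcript with the instruction
        curated.append((start, end, recipe_steps[best]))
    return curated


# Toy data, invented for this example.
segments = [
    (0.0, 8.0, "hey guys welcome back to my channel please subscribe"),
    (8.0, 20.0, "so now we chop the onions and the garlic really fine"),
    (20.0, 35.0, "then fry the onions in olive oil until they are golden"),
]
steps = [
    "Finely chop the onions and garlic.",
    "Fry the onions in olive oil until golden.",
]

curated = sieve_and_swap(segments, steps)
for start, end, instruction in curated:
    print(f"{start:5.1f}-{end:5.1f}s  {instruction}")
```

In this toy run the greeting segment is sieved out because it shares no words with either recipe step, and the two cooking segments are paired with the human-written instructions. A real pipeline would likely use sentence embeddings rather than word overlap, but the filter-then-replace structure would be the same.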