Abstract: Our goal is to learn a video representation that is useful for downstream procedure understanding tasks in instructional videos. Because few annotations are available, a key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos, such as the identity of the task (e.g., 'make latte'), its steps (e.g., 'pour milk'), or the potential next steps given partial progress in its execution. Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks, and that this structure can be well represented by a Procedural Knowledge Graph (PKG), where nodes are discrete steps and edges connect steps that occur sequentially in the instructional activities. This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form and generalizes to multiple procedure understanding tasks. We build a PKG by combining information from a text-based procedural knowledge database and an unlabeled instructional video corpus, and then use it to generate training pseudo labels with four novel pre-training objectives. We call this PKG-based pre-training procedure and the resulting model Paprika, Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask for procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art, with accuracy gains of up to 11.23% in 12 evaluation settings. Implementation is available at https://github.com/salesforce/paprika.
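To make the graph structure described in the abstract concrete, the following is a minimal toy sketch, not the Paprika implementation. It assumes each video has already been reduced to an ordered list of discrete step names, and it builds a graph whose nodes are steps and whose edges link steps that occur consecutively. The function names `build_pkg` and `next_step_candidates`, the example step sequences, and the 'make americano' task are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_pkg(step_sequences):
    """Build a toy procedural knowledge graph from ordered step sequences.

    Returns a dict mapping each step to {next_step: transition_count}.
    Assumption: steps are already discrete labels (plain strings here).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for steps in step_sequences:
        # Register every step as a node, even if it has no outgoing edge.
        for step in steps:
            graph[step]
        # Add one edge per pair of consecutive steps.
        for current, nxt in zip(steps, steps[1:]):
            graph[current][nxt] += 1
    return {node: dict(successors) for node, successors in graph.items()}


def next_step_candidates(graph, step):
    """List successors of `step`, most frequent first, ties broken by name."""
    successors = graph.get(step, {})
    return sorted(successors, key=lambda s: (-successors[s], s))


# Hypothetical step sequences, one per video.
videos = [
    ["grind beans", "brew espresso", "steam milk", "pour milk"],  # 'make latte'
    ["grind beans", "brew espresso", "add hot water"],  # 'make americano'
    ["steam milk", "pour milk"],
]

pkg = build_pkg(videos)
print(next_step_candidates(pkg, "brew espresso"))
```

In this toy graph, the step 'brew espresso' is shared by two tasks and has two possible successors, which is the kind of cross-task structure a next-step pseudo label could be derived from. How Paprika actually matches video segments to steps and defines its four pre-training objectives is not shown here.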

A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos

Ingredient-enriched Recipe Generation from Cooking Videos

COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Knowledge Graph Extraction from Videos

What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

Efficient Pre-training for Localized Instruction Generation of Videos

Recipe Generation from Unsegmented Cooking Videos

Video-based Recipe Retrieval

A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks

How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos

Learning To Recognize Procedural Activities with Distant Supervision

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

Knowledge-Aware Procedural Text Understanding with Multi-Stage Training

Semi-automatic annotation process for procedural texts: An application on cooking recipes

Exploring Object-Centered External Knowledge for Fine-Grained Video Paragraph Captioning

GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension

Learning and Verification of Task Structure in Instructional Videos

Multi-modal Cooking Workflow Construction for Food Recipes

Learning Structural Representations for Recipe Generation and Food Retrieval

Procedure-Aware Pretraining for Instructional Video Understanding

Learning Program Representations for Food Images and Cooking Recipes