Abstract: Instructional videos are an important resource to learn procedural tasks from human demonstrations. However, the instruction steps in such videos are typically short and sparse, with most of the video being irrelevant to the procedure. This motivates the need to temporally localize the instruction steps in such videos, i.e. the task called key-step localization. Traditional methods for key-step localization require video-level human annotations and thus do not scale to large datasets. In this work, we tackle the problem with no human supervision and introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video. StepFormer is a transformer decoder that attends to the video with learnable queries, and produces a sequence of slots capturing the key-steps in the video. We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision. In particular, we supervise our system with a sequence of text narrations using an order-aware loss function that filters out irrelevant phrases. We show that our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization by a large margin on three challenging benchmarks. Moreover, our model demonstrates an emergent property to solve zero-shot multi-step localization and outperforms all relevant baselines at this task.
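The query-based decoding described in the abstract can be pictured with a small sketch. This is not the authors' implementation: it shows a single cross-attention step with no learned projections, layers, or training, and the names `cross_attend`, `queries`, and `frames` are hypothetical. Each query scores every frame feature, and the resulting slot is the attention-weighted average of the frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Single-head cross-attention without learned projections (illustrative only).

    queries: K vectors of size d (stand-ins for learnable queries).
    frames:  N vectors of size d (stand-ins for per-frame video features).
    Returns (slots, weights): K slot vectors and the K x N attention weights.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    slots, weights = [], []
    for query in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [scale * sum(q * f for q, f in zip(query, frame)) for frame in frames]
        attn = softmax(scores)
        # The slot is the attention-weighted average of the frame features.
        slot = [sum(attn[t] * frames[t][d] for t in range(len(frames))) for d in range(dim)]
        slots.append(slot)
        weights.append(attn)
    return slots, weights


# Toy input: two queries, three frames with 2-dimensional features.
toy_queries = [[10.0, 0.0], [0.0, 10.0]]
toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_slots, toy_weights = cross_attend(toy_queries, toy_frames)
print(toy_weights)
```

In this toy input the first query puts most of its weight on the first two frames and the second query on the last frame, which is the sense in which each slot summarizes one part of the video.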

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos" aims to solve the problem of automatically discovering and localizing key steps in instructional videos. Specifically, the paper proposes a self-supervised model, StepFormer, which can discover and localize key steps in long instructional videos without manual annotation.

### Background and motivation

1. **Importance of instructional videos**:
   - Instructional videos are important resources for learning procedural tasks and can effectively impart skills through human demonstrations.
   - However, these videos usually contain a large amount of content irrelevant to the task, such as title screens, close-ups of people, and product advertisements, making it difficult to extract key steps.
2. **Limitations of existing methods**:
   - **Fully-supervised methods**: Require precise annotation of the start and end times of each step, which is difficult to achieve on large-scale datasets.
   - **Weakly-supervised methods**: Rely on a set of steps or a partially ordered step description for the video and still require a large amount of manual annotation.
   - **Unsupervised methods**: Although not relying on any prior knowledge, existing unsupervised methods usually require video-level task labels and can only be applied to small datasets.

### Solutions

1. **StepFormer model**:
   - StepFormer is a Transformer-based decoder model that generates a series of slots capturing key steps by attending to key segments of the video through learned queries.
   - The model is trained using only the automatically generated subtitles of the video as a supervision signal, without any manual annotation.
2. **Training and inference**:
   - **Training**: The model is trained on the large-scale instructional video dataset HowTo100M, using video subtitles as the only supervision source. The order-aware loss function is used to ensure that the generated step slots follow the temporal order.
   - **Inference**: In the inference stage, the model only needs the video as input to generate ordered step slots, and aligns these slots with video frames through the Drop-DTW algorithm, thereby achieving step localization.
3. **Performance evaluation**:
   - The model is evaluated on three standard instructional video benchmark datasets: CrossTask, ProceL, and COIN.
   - The experimental results show that StepFormer significantly outperforms all previous unsupervised and weakly-supervised methods in the unsupervised step detection and localization tasks.
   - In addition, the model also demonstrates the ability of zero-shot multi-step localization, that is, it can perform well on unseen datasets.

### Main contributions

1. **Propose a new self-supervised method, StepFormer**, for discovering and localizing key steps in instructional videos.
2. **Explicitly model the temporal order of steps** and design effective training and inference methods.
3. **Train only with video subtitles on a large-scale dataset** and achieve new state-of-the-art performance on multiple downstream datasets without any fine-tuning.

### Related work

- **Research on step localization in instructional videos**: According to the type of supervision, it can be divided into fully-supervised, weakly-supervised, and unsupervised methods.
- **Learning from visual-text information**: Utilizing the complementarity of visual and text information has become an important learning method in recent years.
- **Sequence alignment**: Widely used in representation learning and step localization tasks, especially when dealing with multi-modal data.

### Technical methods

1. **StepFormer architecture**:
   - Input: A video N seconds long.
   - Output: K step slots, each of which is a vector capturing the ordered steps in the video.
   - Use the UniVL pre-trained model to extract video and text features and generate step slots through the Transformer decoder.
2. **Sequence alignment**:
   - Use the Drop-DTW algorithm for sequence alignment, allowing
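
The alignment step mentioned in the summary can be illustrated with a simplified, order-preserving dynamic program. This is a stand-in written for illustration, not the published Drop-DTW algorithm: it assumes a precomputed cost matrix between step slots and frames and a fixed per-frame `drop_cost`, and the function name `align_with_drops` and the toy numbers are hypothetical. Each frame is either matched to a slot or dropped as background, and matched slot indices never decrease over time.

```python
def align_with_drops(cost, drop_cost):
    """Order-preserving slot-to-frame alignment with frame dropping (simplified sketch).

    cost[k][t] is the cost of matching slot k to frame t.
    Returns (total_cost, labels), where labels[t] is a slot index or -1 if dropped.
    """
    num_slots, num_frames = len(cost), len(cost[0])
    # best[k]: cheapest labeling of the frames seen so far with the slot pointer at k.
    best = [0.0] * num_slots
    back = []  # back[t][k] = (previous pointer, whether frame t was dropped)
    for t in range(num_frames):
        new_best, row = [], []
        prefix_min, prefix_arg = float("inf"), 0
        for k in range(num_slots):
            # The pointer may stay at k or advance from any earlier slot j <= k.
            if best[k] < prefix_min:
                prefix_min, prefix_arg = best[k], k
            dropped = drop_cost < cost[k][t]
            step = drop_cost if dropped else cost[k][t]
            new_best.append(prefix_min + step)
            row.append((prefix_arg, dropped))
        best = new_best
        back.append(row)
    # Backtrack from the cheapest final pointer position.
    k = min(range(num_slots), key=best.__getitem__)
    total = best[k]
    labels = [-1] * num_frames
    for t in range(num_frames - 1, -1, -1):
        prev_k, dropped = back[t][k]
        if not dropped:
            labels[t] = k
        k = prev_k
    return total, labels


# Toy example: 2 slots, 5 frames; low cost means a good match.
toy_cost = [
    [0.9, 0.1, 0.2, 0.9, 0.9],  # slot 0 matches frames 1 and 2
    [0.9, 0.9, 0.9, 0.1, 0.8],  # slot 1 matches frame 3
]
total, labels = align_with_drops(toy_cost, drop_cost=0.5)
print(labels)  # [-1, 0, 0, 1, -1]
```

In the toy example the first and last frames are cheaper to drop than to match, so they are labeled as background, while the remaining frames are assigned to the two slots in order. A slot that never receives a frame is skipped at no cost in this sketch.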

StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos

STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos

Learning and Verification of Task Structure in Instructional Videos

Learning to Ground Instructional Articles in Videos through Narrations

Learning To Recognize Procedural Activities with Distant Supervision

Step Differences in Instructional Video

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Multi-Sentence Grounding for Long-term Instructional Video

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

Unsupervised Discovery of Actions in Instructional Videos

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Efficient Pre-training for Localized Instruction Generation of Videos

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Positional Information is All You Need: A Novel Pipeline for Self-Supervised SVDE from Videos

P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision

Hierarchical Video-Moment Retrieval and Step-Captioning

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

Order-Constrained Representation Learning for Instructional Video Prediction

Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

Reconstructing and grounding narrated instructional videos in 3D

Unsupervised Alignment of Natural Language Instructions with Video Segments