Abstract: Generic Boundary Detection (GBD) aims to locate the general boundaries that divide videos into semantically coherent, taxonomy-free units, and could serve as an important pre-processing step for long-form video understanding. Previous works often handle the different types of generic boundaries separately, with task-specific deep network designs ranging from simple CNNs to LSTMs. In this paper, we instead present Temporal Perceiver, a general Transformer-based architecture that offers a unified solution to the detection of arbitrary generic boundaries, from shot-level and event-level to scene-level. Our core design is to introduce a small set of latent feature queries as anchors that compress the redundant video input into a fixed dimension via cross-attention blocks. Because the number of latent units is fixed, the complexity of the attention operation drops from quadratic to linear in the number of input frames. Specifically, to explicitly leverage the temporal structure of videos, we construct two types of latent feature queries: boundary queries and context queries, which handle semantic incoherence and coherence, respectively. Moreover, to guide the learning of the latent feature queries, we propose an alignment loss on the cross-attention maps that explicitly encourages the boundary queries to attend to the top boundary candidates. Finally, we present a sparse detection head on the compressed representation that directly outputs the final boundary detection results without any post-processing module. We test our Temporal Perceiver on a variety of GBD benchmarks. Our method obtains state-of-the-art results on all benchmarks with RGB single-stream features: SoccerNet-v2 (81.9 percent average-mAP), Kinetics-GEBD (86.0 percent average-f1), TAPOS (73.2 percent average-f1), MovieScenes (51.9 percent AP and 53.1 percent $M_{iou}$) and MovieNet (53.3 percent AP and 53.2 percent $M_{iou}$), demonstrating the generalization ability of our Temporal Perceiver.
To further pursue a general GBD model, we combine various tasks to train a class-agnostic Temporal Perceiver and evaluate its performance across all benchmarks. Results show that the class-agnostic Perceiver achieves detection accuracy comparable to, and generalization ability better than, its dataset-specific counterparts.
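
The compression step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned query/key/value projections, does not model the split into boundary and context queries, and uses hypothetical names (`cross_attention`, `make_vectors`, `NUM_LATENTS`) with random vectors in place of real video features. It only shows that a fixed set of K latent queries yields a K x D output for any number of frames T, at a cost of K * T dot products.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Compress T frame features into K latent vectors (hypothetical helper).

    queries:  K x D list of latent query vectors (K is fixed and small)
    features: T x D list of per-frame features (T grows with video length)
    Returns (K x D compressed output, K x T attention map).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    output, attn_map = [], []
    for q in queries:
        # One score per frame: K * T dot products overall, linear in T.
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in features]
        weights = softmax(scores)
        attn_map.append(weights)
        # Attention-weighted sum of the frame features for this query.
        output.append([
            sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)
        ])
    return output, attn_map


def make_vectors(rng, count, dim):
    """Random stand-ins for learned queries or extracted frame features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
NUM_LATENTS, DIM = 4, 8  # assumed sizes, chosen only for illustration
latents = make_vectors(rng, NUM_LATENTS, DIM)
short_video = make_vectors(rng, 16, DIM)
long_video = make_vectors(rng, 64, DIM)

short_out, short_attn = cross_attention(latents, short_video)
long_out, long_attn = cross_attention(latents, long_video)

# The compressed size depends on the number of latents, not on video length.
print(len(short_out), len(long_out))
```

The attention map returned here (one row of frame weights per latent query) is the kind of object on which the abstract's alignment loss would operate; the form of that loss is not specified in the abstract, so it is not sketched.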

Multimodal High-order Relation Transformer for Scene Boundary Detection

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Temporal Scene Montage for Self-Supervised Video Scene Boundary Detection

Video Scene Detection Using Transformer Encoding Linker Network (TELNet)

Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection

Toward a Unified Transformer-Based Framework for Scene Graph Generation and Human-Object Interaction Detection

A New Approach For Video Scene Boundary Detection

RelTR: Relation Transformer for Scene Graph Generation

End-to-End Video Scene Graph Generation with Temporal Propagation Transformer

MRFTrans: Multimodal Representation Fusion Transformer for monocular 3D semantic scene completion

Online video visual relation detection with hierarchical multi-modal fusion

Video Visual Relation Detection Via Multi-modal Feature Fusion

High-Order Relation Learning Transformer for Satellite Video Object Tracking

Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

Towards Grouping in Large Scenes with Occlusion-aware Spatio-temporal Transformers

Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection

Scene Text Recognition via Transformer

Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection

Efficient Movie Scene Detection using State-Space Transformers

OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification