Abstract: This paper studies the multimedia problem of temporal sentence grounding (TSG), which aims to locate the specific segment of an untrimmed video that corresponds to a given sentence query. Traditional TSG methods mainly follow a top-down or bottom-up framework and are not end-to-end; they rely heavily on time-consuming post-processing to refine the grounding results. Recently, several transformer-based approaches have been proposed to model the fine-grained semantic alignment between video and query efficiently and effectively. Although these methods achieve strong performance, they treat the frames of the video and the words of the query equally as transformer inputs when correlating them, and so fail to capture their different levels of granularity, each with distinct semantics. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) that leverages this hierarchy information and models the interactions between different levels of granularity and between different modalities in order to learn more fine-grained multi-modal representations. Specifically, we first split the video and the query into individual clips and phrases and learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between the local-level and global-level semantics for better multi-modal reasoning. In addition, we develop a new cross-modal cycle-consistency loss to enforce interaction between the two modalities and encourage semantic alignment between them. Finally, we design a new cross-modal parallel transformer decoder that integrates the encoded visual and textual features for the final grounding. Extensive experiments on three challenging datasets show that the proposed HLGT achieves new state-of-the-art performance.
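The distinction the abstract draws between local context (adjacent dependency) and global correlation (long-range dependency) can be illustrated with attention masks. The following is only a minimal sketch under stated assumptions, not the HLGT implementation: the function names, the `window` parameter, the use of identity projections, and the toy clip features are all hypothetical and do not come from the paper.

```python
import math


def local_attention_mask(n, window):
    """mask[i][j] is True when position j lies within `window` steps of i.

    `window` is an assumed hyperparameter, not a value from the paper.
    """
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def global_attention_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def masked_self_attention(features, mask):
    """Single-head dot-product self-attention with identity projections.

    Queries, keys and values are the raw features (a simplifying assumption);
    masked-out pairs receive zero attention weight.
    """
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        scores = []
        for j, key in enumerate(features):
            if mask[i][j]:
                dot = sum(a * b for a, b in zip(query, key))
                scores.append(dot / math.sqrt(dim))
            else:
                scores.append(float("-inf"))
        # Numerically stable softmax; the diagonal is always unmasked,
        # so the maximum score is finite.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, features)) for d in range(dim)]
        )
    return outputs


# Toy clip features (hypothetical): two similar clips followed by two others.
clips = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
local_out = masked_self_attention(clips, local_attention_mask(len(clips), 1))
global_out = masked_self_attention(clips, global_attention_mask(len(clips)))
```

In this sketch the local variant lets each clip aggregate only from its immediate neighbours, while the global variant aggregates over the whole sequence; a hierarchical model would then combine the two views, which is the role the abstract assigns to the global-local transformer.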

HMTV: hierarchical multimodal transformer for video highlight query on baseball

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

MCT-VHD: Multi-modal contrastive transformer for video highlight detection

Hierarchical multimodal transformer to summarize videos

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

Hierarchical Local-Global Transformer for Temporal Sentence Grounding.

Automatically extracting highlights for TV Baseball programs.

VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Temporal Cue Guided Video Highlight Detection with Low-Rank Audio-Visual Fusion

Video Transformer based Video Quality Assessment with Spatiotemporally adaptive Token Selection and Assembly

Multi-Scale Temporal Difference Transformer for Video-Text Retrieval

Multimodal Analysis for Deep Video Understanding with Video Language Transformer

Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval

Context-Aware Hierarchical Transformer for Fine-Grained Video-Text Retrieval

A Semi-Automatic Feature Selecting Method For Sports Video Highlight Annotation

End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval