Abstract: The task of text-video retrieval aims to understand the correspondence between language and vision and has gained increasing attention in recent years. Recent works have demonstrated the superiority of local spatial-temporal relation learning with graph-based models. However, most existing graph-based models are handcrafted and depend heavily on expert knowledge and empirical feedback, so they may be unable to mine high-level, fine-grained visual relations effectively. As a result, they cannot distinguish videos that share the same visual components but differ in the relations among them. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the transformer architecture to bridge the text and video modalities in a complementary manner by combining local spatial-temporal relations and global temporal information. Specifically, local video representations are encoded using multiple transformer blocks and additional residual blocks to learn fine-grained spatio-temporal relations and long-term temporal dependencies; we call this module the Fine-grained Spatio-temporal Transformer (FST). Meanwhile, global video representations are encoded using a multi-layer transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval. Extensive experiments on the MSR-VTT, MSVD, and YouCook2 datasets demonstrate the effectiveness of the proposed model. Our code is publicly available at: https://github.com/lionel-hing/BiC-Net.
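
The two-space alignment described in the abstract can be illustrated with a minimal sketch. It is not the BiC-Net implementation: the encoders are replaced by precomputed toy vectors, and the weighted sum of two cosine similarities (`local_weight`), the function names, and the video identifiers are all assumptions made for illustration. The abstract does not state how the two branches are combined at retrieval time.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(
    text_local: Vector,
    text_global: Vector,
    videos: Dict[str, Tuple[Vector, Vector]],
    local_weight: float = 0.5,  # assumed fusion weight, not from the paper
) -> List[Tuple[str, float]]:
    """Rank videos by a weighted sum of similarities in two embedding spaces.

    Each video maps to (local_relation_embedding, global_temporal_embedding).
    The text is assumed to be projected into both spaces beforehand.
    """
    scored = []
    for video_id, (video_local, video_global) in videos.items():
        local_sim = cosine(text_local, video_local)
        global_sim = cosine(text_global, video_global)
        score = local_weight * local_sim + (1.0 - local_weight) * global_sim
        scored.append((video_id, score))
    # Highest fused similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: both videos share the same global embedding (same components)
# but differ in the local relation embedding.
videos = {
    "video_a": ([1.0, 0.0], [1.0, 1.0]),
    "video_b": ([0.0, 1.0], [1.0, 1.0]),
}
ranking = rank_videos([1.0, 0.0], [1.0, 1.0], videos)
for video_id, score in ranking:
    print(video_id, round(score, 3))
```

In this toy setup the global branch alone cannot separate the two videos, since their global embeddings are identical; only the local relation branch breaks the tie. This mirrors the motivation stated in the abstract.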

Towards Robust Video Text Detection with Spatio-Temporal Attention Modeling and Text Cues Fusion

A new video text detection method.

Video Text Detection by Attentive Spatiotemporal Fusion of Deep Convolutional Features

Robust Video Text Detection Through Parametric Shape Regression, Propagation and Fusion

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

A Robust Approach for Scene Text Detection and Tracking in Video

Video Text Detection with Fully Convolutional Network and Tracking

A Multi-Level Feature Fusion Network for Scene Text Detection with Text Attention Mechanism

Video Text Tracking With a Spatio-Temporal Complementary Model

Scene Text Detection and Tracking in Video with Background Cues

Multi-Spectral Fusion Based Approach for Arbitrarily Oriented Scene Text Detection in Video Images

Coarse-to-fine dual-level attention for video-text cross modal retrieval

Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network

A Deep Convolutional Deblurring And Detection Neural Network For Localizing Text In Videos

CVTD: A Robust Car-Mounted Video Text Detector

Automatic video superimposed text detection based on Nonsubsampled Contourlet Transform

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Robust Scene Text Recognition Through Adaptive Image Enhancement

End-to-end video text detection with online tracking

Detecting both superimposed and scene text with multiple languages and multiple alignments in video