Abstract: Recently, linguistic-modeling approaches to scene text recognition have gradually become mainstream. They mainly consist of a vision model (VM), a language model (LM), and an optional fusion module, and typically use the LM and fusion module to iteratively refine the VM's predictions. However, the VM usually consists of a Transformer on top of a ResNet, which means the attention mechanism is applied only to the highest layers of the VM and ignores the internal image dependencies in the dense features at multiple scales. As a result, the VM's predictions become the performance bottleneck. Meanwhile, the visual and language features of these methods reside in separate spaces, and the methods skip alignment before fusion, so they fail to achieve maximum information interaction. To address these issues, we propose Visual cOllaboration and duaL-stream fusion for scene TExt Recognition, VOLTER for short. First, a multi-stage Local-Global Collaboration Vision Model (LGC-VM) is constructed to focus on both local and global features at multiple scales, breaking the vision bottleneck to provide better vision predictions. Second, to explicitly align the feature spaces of the VM and LM, we introduce a Vision-Language Contrastive (VLC) module that encourages positive vision-language pairs to have similar representations. Moreover, a Dual-Stream Feature Enhancement (DSFE) module is proposed for the unidirectional interaction of visual-language features to alleviate the alignment problem between modalities and further carry out fusion. Extensive experiments on benchmark datasets demonstrate that the proposed framework achieves state-of-the-art performance.
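The idea of encouraging positive vision-language pairs to have similar representations can be illustrated with a generic contrastive objective. The following is only a minimal sketch, assuming a symmetric InfoNCE-style loss over a batch of paired feature vectors; the abstract does not state the exact loss used by the VLC module, and the names `contrastive_loss`, `l2_normalize`, and the `temperature` value are hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(vision_feats, language_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed stand-in, not the paper's loss).

    vision_feats[i] and language_feats[i] form the positive pair; every other
    combination in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in vision_feats]
    t = [l2_normalize(x) for x in language_feats]
    n = len(v)

    # Cosine similarity of every vision/language combination, temperature-scaled.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the vision-to-language and language-to-vision directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))
```

Under these assumptions, a batch whose vision and language features already match pairwise yields a loss near zero, while a batch with the pairs swapped yields a much larger loss, which is the pressure that would pull the two feature spaces together before fusion.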

VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition