Abstract: Multi-label image classification requires recognizing multiple objects within a single image. Because labels carry valuable semantic information and images carry the essential visual features, tight visual-linguistic interaction plays a vital role in improving classification performance. Moreover, since objects within a single image can vary in size and appearance, attending to features at different scales helps to discover the objects present. Recently, Transformer-based methods have achieved great success in multi-label image classification by modeling long-range dependencies, but they have several limitations. First, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space. Second, they extract visual features and perform cross-modal fusion at only a single scale, neglecting objects with different characteristics. To address these issues, we propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs: (1) a hierarchical multi-scale architecture with a Cross-Scale Aggregation module, which leverages joint multi-modal features extracted at multiple scales to recognize objects of varying sizes and appearances; and (2) Interactive Visual-Linguistic Attention, a novel attention module that tightly integrates cross-modal interaction, enabling the joint updating of visual, linguistic, and multi-modal features. We evaluate our method on three benchmark datasets. The experimental results demonstrate that HSVLT surpasses state-of-the-art methods with lower computational cost.
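To make the idea of fusing label and visual features at several scales concrete, the following is a minimal sketch, not the HSVLT modules themselves. It assumes label embeddings act as attention queries over visual tokens at each scale and that the per-scale results are simply averaged; the function names (`cross_attend`, `aggregate_scales`, `label_scores`), the averaging rule, and the toy dimensions are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, tokens):
    """Each query (a label embedding) attends over the visual tokens of one scale."""
    dim = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(dim) for t in tokens])
        out.append([sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)])
    return out


def aggregate_scales(per_scale):
    """Average label-wise features across scales (an assumed, simplified aggregation)."""
    n_scales = len(per_scale)
    n_labels = len(per_scale[0])
    dim = len(per_scale[0][0])
    return [
        [sum(s[lbl][i] for s in per_scale) / n_scales for i in range(dim)]
        for lbl in range(n_labels)
    ]


def label_scores(label_embs, scales):
    """Return one sigmoid score per label from multi-scale attended features."""
    per_scale = [cross_attend(label_embs, tokens) for tokens in scales]
    fused = aggregate_scales(per_scale)
    return [1 / (1 + math.exp(-dot(e, f))) for e, f in zip(label_embs, fused)]


# Toy data: 3 labels, feature dimension 4, three scales with 16, 4, and 1 tokens.
random.seed(0)
dim = 4
labels = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
scales = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)] for n in (16, 4, 1)]
scores = label_scores(labels, scales)
```

Each entry of `scores` is an independent per-label probability, which matches the multi-label setting where several labels can be present in the same image. A learned model would replace the random embeddings and the plain averaging with trained projections and a trained aggregation step.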

HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
