Abstract: Vision-Language Models (VLMs) have achieved strong performance on a variety of tasks (e.g., image-text retrieval, visual question answering). However, most VLMs rely on coarse-grained image-caption pairs for alignment, depending on data volume to resolve ambiguities and ground linguistic concepts in images. The richer semantic and syntactic structure within text is largely overlooked. To address this, we propose HIerarchically STructured Learning (HIST), which enhances VLM training without any additional supervision by hierarchically decomposing captions into their constituent Subject, Noun Phrases, and Composite Phrases. Entailment between these constituent components allows us to formulate additional regularization constraints on the VLM attention maps. Specifically, we introduce two novel loss functions: (1) Subject Loss, which aligns image content with the subject of the corresponding phrase, acting as an entailment of standard contrastive/matching losses at the Phrase level; (2) Addition Loss, which balances attention across multiple objects. HIST is general and can be applied to any VLM for which attention between vision and language can be computed; we illustrate its efficacy on BLIP and ALBEF. HIST outperforms baseline VLMs, achieving up to +9.8% improvement in visual grounding, +6.3% in multi-object referring segmentation, +1.1% in image-text retrieval, and +0.2% in visual question answering, underscoring the value of structured learning in VLMs.

What problem does this paper attempt to address?

The paper addresses a limitation of existing Vision-Language Models (VLMs): training relies on coarse-grained image-caption pairs and largely ignores the richer semantic and syntactic structure of the text. Specifically:

1. **Problems of existing VLMs**:
   - Most VLMs align images and text using coarse-grained image-caption pairs and depend on data volume to resolve ambiguities and ground linguistic concepts in images.
   - They overlook the semantic and syntactic structure of the text. In multi-object scenes in particular, the model may attend mainly to the most prominent object and align the other relevant objects poorly.
2. **Solution proposed by the paper**:
   - HIerarchically STructured Learning (HIST), a framework that enhances VLM training by hierarchically decomposing captions into their constituent Subject, Noun Phrases, and Composite Phrases.
   - Two novel loss functions that act as regularization constraints on the VLM attention maps:
     1. **Subject Loss**: aligns image content with the subject of the corresponding phrase, acting as an entailment of the standard contrastive/matching losses at the phrase level.
     2. **Addition Loss**: balances attention across multiple objects, encouraging the model to attend to all of them rather than only the most prominent one.
3. **Objectives**:
   - Without additional supervision, HIST aims to improve VLM performance on visual grounding, multi-object referring segmentation, image-text retrieval, and visual question answering. The paper demonstrates this on BLIP and ALBEF.

In summary, the paper uses the hierarchical structure of caption text to improve image-text alignment in VLMs, especially in complex multi-object scenes.
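The idea of balancing attention across objects can be illustrated with a toy sketch. This is not the paper's formulation, which the summary above does not give. It shows one plausible reading: the attention map of a composite phrase is penalized when it departs from the average of the attention maps of its noun phrases. The names `PhraseNode`, `normalize`, and `addition_penalty`, the flattened four-patch attention maps, and the squared-error form are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PhraseNode:
    """Hypothetical node in a caption hierarchy: a phrase, its subject, and sub-phrases."""
    text: str
    subject: str
    children: list = field(default_factory=list)


def normalize(attn):
    """Scale a non-negative, flattened attention map so that it sums to 1."""
    total = sum(attn)
    return [a / total for a in attn] if total > 0 else list(attn)


def addition_penalty(composite_attn, noun_phrase_attns):
    """Assumed stand-in for an addition-style loss.

    Mean squared gap between the composite phrase's normalized attention map
    and the average of the normalized noun-phrase attention maps.
    """
    parts = [normalize(a) for a in noun_phrase_attns]
    target = [sum(vals) / len(parts) for vals in zip(*parts)]
    comp = normalize(composite_attn)
    return sum((c - t) ** 2 for c, t in zip(comp, target)) / len(comp)


# Toy caption hierarchy (hand-written, not produced by a parser).
caption = PhraseNode(
    text="a dog next to a cat",
    subject="dog",
    children=[PhraseNode("a dog", "dog"), PhraseNode("a cat", "cat")],
)

# Flattened attention over four image patches (made-up values).
dog_attn = [1.0, 0.0, 0.0, 0.0]
cat_attn = [0.0, 0.0, 0.0, 1.0]
balanced = [0.5, 0.0, 0.0, 0.5]   # composite phrase attends to both objects
dog_only = [1.0, 0.0, 0.0, 0.0]   # composite phrase attends only to the dog

balanced_penalty = addition_penalty(balanced, [dog_attn, cat_attn])
dog_only_penalty = addition_penalty(dog_only, [dog_attn, cat_attn])
print(balanced_penalty, dog_only_penalty)
```

In this toy setup the balanced map incurs no penalty, while the map that collapses onto the most prominent object is penalized, which is the behavior the summary attributes to the Addition Loss.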

Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses

Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models

Teaching Structured Vision&Language Concepts to Vision&Language Models

Locality Alignment Improves Vision-Language Models

Discriminative Fine-tuning of LVLMs

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Hierarchical Vision and Language Transformer for Efficient Visual Dialog

Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Visually-Augmented Language Modeling

Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Towards Multimodal In-Context Learning for Vision & Language Models

ViLTA: Enhancing Vision-Language Pre-training Through Textual Augmentation

The Neglected Tails in Vision-Language Models

Improving the Efficiency of Visually Augmented Language Models

Natural Language Inference Improves Compositionality in Vision-Language Models

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

Rethinking VLMs and LLMs for Image Classification

Unified Lexical Representation for Interpretable Visual-Language Alignment

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs