Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses

Jiayun Luo, Mir Rayat Imtiaz Hossain, Boyang Li, Leonid Sigal
2024-12-11
Abstract: Vision-Language Models (VLMs) have achieved strong performance on a variety of tasks (e.g., image-text retrieval, visual question answering). However, most VLMs rely on coarse-grained image-caption pairs for alignment, depending on data volume to resolve ambiguities and ground linguistic concepts in images. The richer semantic and syntactic structure within text is largely overlooked. To address this, we propose HIerarchically STructured Learning (HIST), which enhances VLM training without any additional supervision by hierarchically decomposing captions into their constituent Subject, Noun Phrases, and Composite Phrases. Entailment between these constituent components allows us to formulate additional regularization constraints on the VLM attention maps. Specifically, we introduce two novel loss functions: (1) Subject Loss, which aligns image content with the subject of the corresponding phrase, acting as an entailment of standard contrastive/matching losses at the Phrase level; (2) Addition Loss, which balances attention across multiple objects. HIST is general and can be applied to any VLM for which attention between vision and language can be computed; we illustrate its efficacy on BLIP and ALBEF. HIST outperforms baseline VLMs, achieving up to +9.8% improvement in visual grounding, +6.3% in multi-object referring segmentation, +1.1% in image-text retrieval, and +0.2% in visual question answering, underscoring the value of structuring learning in VLMs.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper targets a limitation of existing Vision-Language Models (VLMs): their training relies heavily on coarse-grained image-caption pairs and ignores the richer semantic and syntactic structure of the text. Specifically:

1. **Problems of existing VLMs**:
   - Most VLMs align vision and language using large numbers of image-caption pairs, depending on data volume to resolve ambiguities and ground linguistic concepts in images.
   - They overlook the semantic and syntactic structure of the text. In multi-object scenes in particular, the model may focus too much on the most prominent object, so the other objects mentioned in the caption are poorly aligned.
2. **Solution proposed by the paper**:
   - The paper introduces the HIerarchically STructured Learning (HIST) framework, which enhances VLM training by hierarchically decomposing captions into their constituents: the Subject, Noun Phrases, and Composite Phrases.
   - It proposes two novel loss functions:
     1. **Subject Loss**: aligns the image content with the subject of the corresponding phrase, acting as an entailment of the standard contrastive/matching losses at the phrase level.
     2. **Addition Loss**: balances attention across multiple objects, encouraging the model to attend to all of them rather than only the most prominent one.
3. **Objectives**:
   - With these losses, HIST improves VLM performance on visual grounding, multi-object referring segmentation, image-text retrieval, and visual question answering, without any additional supervision.

In summary, the paper aims to improve image-text alignment in VLMs by exploiting the hierarchical structure of the text, particularly in complex multi-object scenes.
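
To make the Addition Loss idea more concrete, here is a minimal sketch of one plausible way to penalize attention that collapses onto a single object. It is an illustration under stated assumptions, not the paper's formulation: the summary above does not give the exact loss, so the choice of a mean squared error between the composite phrase's attention map and the normalized sum of its noun phrases' attention maps, as well as the names `normalize` and `addition_loss`, are hypothetical. Attention maps are represented as flattened lists of non-negative patch weights.

```python
from typing import List, Sequence

AttentionMap = List[float]  # flattened attention weights over image patches


def normalize(att: Sequence[float]) -> AttentionMap:
    """Scale a non-negative attention map so that its entries sum to 1."""
    total = sum(att)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [a / total for a in att]


def addition_loss(
    composite_att: Sequence[float],
    noun_phrase_atts: Sequence[Sequence[float]],
) -> float:
    """Hypothetical addition-style loss (an assumption, not the paper's formula).

    Returns the mean squared difference between the normalized attention map
    of a composite phrase and the normalized sum of the attention maps of the
    noun phrases it contains.
    """
    if not noun_phrase_atts:
        raise ValueError("at least one noun-phrase attention map is required")
    if any(len(att) != len(composite_att) for att in noun_phrase_atts):
        raise ValueError("all attention maps must have the same length")

    # Element-wise sum of the noun-phrase maps, then renormalize.
    summed = [sum(values) for values in zip(*noun_phrase_atts)]
    target = normalize(summed)
    pred = normalize(composite_att)

    squared_gaps = [(p - t) ** 2 for p, t in zip(pred, target)]
    return sum(squared_gaps) / len(squared_gaps)


# Toy example with four image patches and two objects.
dog_att = [1.0, 0.0, 0.0, 0.0]   # attention for the first noun phrase
ball_att = [0.0, 1.0, 0.0, 0.0]  # attention for the second noun phrase

balanced = addition_loss([0.5, 0.5, 0.0, 0.0], [dog_att, ball_att])
collapsed = addition_loss([1.0, 0.0, 0.0, 0.0], [dog_att, ball_att])
print(balanced, collapsed)
```

In this toy case, a composite-phrase map that covers both objects incurs no penalty, while a map that attends only to the first object is penalized. In actual training, such a term would be computed on differentiable cross-attention tensors and added to the VLM's standard objectives.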