MATE: Meet At The Embedding -- Connecting Images with Long Texts

Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim

2024-06-26

Abstract: While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this paper, we introduce Meet At The Embedding (MATE), a novel approach that combines the capabilities of VLMs with Large Language Models (LLMs) to overcome this challenge without the need for additional image-long text pairs. Specifically, we replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts. To bridge the gap between VLM and LLM, MATE incorporates a projection module that is trained in a multi-stage manner. It starts by aligning the embeddings from the VLM text encoder with those from the LLM using extensive text pairs. This module is then employed to seamlessly align image embeddings closely with LLM embeddings. We propose two new cross-modal retrieval benchmarks to assess the task of connecting images with long texts (lengthy captions / documents). Extensive experimental results demonstrate that MATE effectively connects images with long texts, uncovering diverse semantic relationships.

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the limited ability of existing vision-language models (VLMs) to align images with long texts, such as lengthy captions or documents. Existing VLMs have substantially improved alignment between images and short captions, but their focus on short text limits how well they handle longer, more complex text. The paper proposes Meet At The Embedding (MATE), which combines the capabilities of VLMs and large language models (LLMs) to connect images with long texts without requiring additional image-long text pairs. MATE consists of the following parts:

1. **Replace the text encoder**: The text encoder in the VLM is replaced with a pretrained LLM-based encoder, which is better at understanding long texts.
2. **Projection module**: A projection module, trained in multiple stages, gradually aligns the embedding spaces of the VLM and the LLM. First, large-scale text pairs (such as short text-long text pairs) are used to align the VLM text encoder with the LLM encoder. Then, a small number of image-text pairs are used to align the image embeddings with the LLM embedding space.
3. **New benchmarks**: Two new cross-modal retrieval benchmarks are proposed to evaluate how well images are connected with long texts (lengthy captions / documents).

With these components, MATE connects images with long texts, uncovers diverse semantic relationships between them, and performs well on cross-modal retrieval tasks.
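The multi-stage idea described above can be illustrated with a toy sketch. It is not the paper's implementation: the linear projection, the squared-error objective, the 3-dimensional random vectors, and the fixed `HIDDEN` transform standing in for the relationship between the two embedding spaces are all assumptions made for illustration, since this summary does not specify the architecture or the losses. The sketch only shows why a projection fitted on text pairs can carry image embeddings into the LLM space: the VLM already places images near their captions.

```python
import math
import random

random.seed(0)
DIM = 3  # toy embedding size (assumption; real encoders are far larger)

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def unit(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def train_projection(W, pairs, lr, epochs):
    """Fit W in place by per-sample gradient descent on ||W x - y||^2."""
    for _ in range(epochs):
        for x, y in pairs:
            err = [p - t for p, t in zip(matvec(W, x), y)]
            for i in range(len(W)):
                for j in range(len(x)):
                    W[i][j] -= lr * 2 * err[i] * x[j]
    return W

def retrieve(query, candidates):
    """Return the index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))

# Hypothetical stand-in for the LLM encoder: a fixed linear map of the
# VLM text space (here a coordinate permutation).
HIDDEN = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

# Stage 1: text-only pairs (VLM text embedding, LLM embedding).
text_vlm = [unit([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(50)]
stage1_pairs = [(t, matvec(HIDDEN, t)) for t in text_vlm]
W = [[0.0] * DIM for _ in range(DIM)]
train_projection(W, stage1_pairs, lr=0.1, epochs=200)
W_stage1 = [row[:] for row in W]  # snapshot of the text-only projection

# Stage 2: a few image-caption pairs. Each toy image embedding is its
# caption's VLM embedding plus small noise, mimicking VLM alignment.
captions_vlm = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
                [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
image_vlm = [unit([c + random.gauss(0, 0.05) for c in cap]) for cap in captions_vlm]
stage2_pairs = [(img, matvec(HIDDEN, cap)) for img, cap in zip(image_vlm, captions_vlm)]
train_projection(W, stage2_pairs, lr=0.05, epochs=20)

# Retrieval: project each image into the LLM space and rank the texts.
long_text_llm = [matvec(HIDDEN, cap) for cap in captions_vlm]
hits = sum(retrieve(matvec(W, img), long_text_llm) == i
           for i, img in enumerate(image_vlm))
print(f"images matched to their own text: {hits}/{len(image_vlm)}")
```

In this toy setup the projection learned in stage 1 never sees an image, yet it already maps image embeddings close to the right LLM embeddings. Stage 2 only makes a small correction using a handful of image-caption pairs, which mirrors the claim that no additional image-long text pairs are needed.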
