Abstract: Research on Multi-modal Large Language Models (MLLMs) for multi-image cross-modal instruction has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step pipeline: first, they extract visual tokens independently for each input image; then, they align the visual tokens from the different images with the Large Language Model (LLM) in its textual feature space. However, because the visual tokens are extracted independently, the first step may prioritize different semantics for different images, so the information that links the images is not preserved for the subsequent LLM analysis. This issue becomes more serious in scenarios where the images vary significantly (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). By introducing bidirectional semantic guidance between different images into the visual-token extraction process, SAM aims to better preserve linking information for coherent analysis and to align the semantics of different images before they are fed into the LLM. As a test bed, we propose a large-scale dataset named MmLINK, consisting of 69K samples. Unlike most existing datasets for MLLM fine-tuning, MmLINK comprises multi-modal instructions with significantly diverse images. Extensive experiments on the group captioning task and the storytelling task demonstrate the effectiveness of our SAM model, which surpasses the state-of-the-art methods by a large margin (+37% for group captioning and +22% for storytelling on CIDEr score). Project page: https://mccartney01.github.io/SAM
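
The contrast between independent and cross-image-guided token extraction can be illustrated with a toy sketch. This is not the SAM implementation: the function names (`extract_token`, `guided_tokens`), the mean-pooled context, and the additive query shift are all assumptions chosen only to make the idea concrete with two-dimensional patch features.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(patches):
    # Average the patch features of one image into a single context vector.
    n = len(patches)
    return [sum(col) / n for col in zip(*patches)]

def extract_token(patches, query):
    # Attention-pool the patch features into one visual token using `query`.
    weights = softmax([dot(query, p) for p in patches])
    dim = len(patches[0])
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]

def guided_tokens(patches_a, patches_b, base_query):
    # Hypothetical bidirectional guidance: each image's query is shifted by
    # the other image's context before tokens are extracted (an assumption,
    # not the mechanism described in the paper).
    ctx_a, ctx_b = mean_pool(patches_a), mean_pool(patches_b)
    query_a = [q + c for q, c in zip(base_query, ctx_b)]
    query_b = [q + c for q, c in zip(base_query, ctx_a)]
    return extract_token(patches_a, query_a), extract_token(patches_b, query_b)

# Image A contains two different semantics; image B contains only the second.
patches_a = [[1.0, 0.0], [0.0, 1.0]]
patches_b = [[0.0, 1.0], [0.0, 1.0]]
base_query = [0.0, 0.0]

# Step-one style extraction: image A is summarized without seeing image B.
independent_a = extract_token(patches_a, base_query)

# Guided extraction: image A's token leans toward the semantics shared with B.
guided_a, guided_b = guided_tokens(patches_a, patches_b, base_query)
```

In this toy setup the independently extracted token for image A weights both semantics equally, while the guided token puts more weight on the dimension that image B also contains, which is the kind of linking information the abstract refers to.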

Multimodal Food Image Classification with Large Language Models

FoodLMM: A Versatile Food Assistant using Large Multi-modal Model

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Multimodal Large Language Models: A Survey

Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment

Probing Multimodal Large Language Models for Global and Local Semantic Representations

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Multimodal Large Language Models for Bioimage Analysis

A Survey of Multimodal Large Language Model from A Data-centric Perspective

Improving Context Understanding in Multimodal Large Language Models Via Multimodal Composition Learning

Unified Generative and Discriminative Training for Multi-modal Large Language Models

FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

Improving Multimodal Large Language Models Using Continual Learning

Semantic Alignment for Multimodal Large Language Models

Food Classification using Joint Representation of Visual and Textual Data

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training