Abstract: Research on Multi-modal Large Language Models (MLLMs) for multi-image cross-modal instructions has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step pipeline: first, they extract visual tokens independently for each input image; then, they align these visual tokens from different images with the Large Language Model (LLM) in its textual feature space. However, extracting visual tokens independently for each image may cause different semantics to be prioritized for different images in the first step, so that the linking information among images is not preserved for subsequent LLM analysis. This issue becomes more serious when the images vary significantly (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). By introducing bidirectional semantic guidance between different images into the visual-token extraction process, SAM aims to better preserve linking information for coherent analysis and to align the semantics of different images before feeding them into the LLM. As a test bed, we propose a large-scale dataset named MmLINK consisting of 69K samples. Unlike most existing datasets for MLLM fine-tuning, MmLINK comprises multi-modal instructions with significantly diverse images. Extensive experiments on the group captioning and storytelling tasks demonstrate the effectiveness of SAM, which surpasses state-of-the-art methods by a large margin (+37% for group captioning and +22% for storytelling on CIDEr score). Project page: https://mccartney01.github.io/SAM

Context-Based Semantic-Aware Alignment for Semi-Supervised Multi-Label Learning

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

Semi-Supervised Dual Relation Learning for Multi-Label Classification

Learning From Semi-Supervised Weak-Label Data

Semi-supervised Active Learning Based on Semantic-aware Crop Consistency

SLED: Semantic Label Embedding Dictionary Representation for Multi-label Image Annotation

Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels

Semantic context learning with large-scale weakly-labeled image set

Semantic-Aware Dual Contrastive Learning for Multi-label Image Classification

Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning

Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels

Multi-label learning with semantic embeddings

Semi-supervised semantic segmentation with directional context-aware consistency

SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition

Semantic Alignment for Multimodal Large Language Models

A Multi-Level Label-Aware Semi-Supervised Framework for Remote Sensing Scene Classification

Learning Semantic-Specific Graph Representation for Multi-Label Image Recognition

Active Learning with Label Correlation Exploration for Multi-Label Image Classification

Weakly Supervised Multi-Label Learning Via Label Enhancement

Effective Multi-Modal Multi-Label Learning for Automatic Image Annotation

Enhancing Weakly Supervised Semantic Segmentation with Multi-label Contrastive Learning and LLM Features Guidance