Abstract: Research on Multi-modal Large Language Models (MLLMs) for multi-image cross-modal instruction has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step pipeline: first, visual tokens are extracted independently for each input image; then, the visual tokens from the different images are aligned with the Large Language Model (LLM) in its textual feature space. However, independent extraction may prioritize different semantics for different images in the first step, so the linking information among images is not preserved for subsequent LLM analysis. This issue becomes more serious when the images vary significantly (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). By introducing bidirectional semantic guidance between different images into the visual-token extraction process, SAM aims to preserve linking information for coherent analysis and to align the semantics of different images before feeding them into the LLM. As a test bed, we propose a large-scale dataset named MmLINK consisting of 69K samples. Unlike most existing datasets for MLLM fine-tuning, MmLINK comprises multi-modal instructions with significantly diverse images. Extensive experiments on the group captioning task and the storytelling task demonstrate the effectiveness of SAM, which surpasses state-of-the-art methods by a large margin (+37% for group captioning and +22% for storytelling on CIDEr score). Project page: https://mccartney01.github.io/SAM
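
The contrast between independent and guided visual-token extraction described in the abstract can be illustrated with a toy sketch. This is not the SAM architecture: the function names (`attend`, `extract_independent`, `extract_guided`, `extract_pair_bidirectional`), the use of a mean-pooled context vector, and the `alpha` mixing weight are all assumptions made for illustration. The sketch only shows the general idea that conditioning each image's extraction query on the other image can steer both extractions toward shared semantics.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, patches):
    """Pool patch features into one visual token with scaled dot-product attention."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in patches])
    dim = len(patches[0])
    return [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]


def mean_vector(patches):
    """Average patch feature, used here as a crude summary of an image."""
    count = len(patches)
    return [sum(p[i] for p in patches) / count for i in range(len(patches[0]))]


def extract_independent(query, patches):
    """Step-one baseline: each image is summarized without seeing the others."""
    return attend(query, patches)


def extract_guided(query, patches, other_patches, alpha=1.0):
    """Hypothetical guidance: shift the query toward the other image's summary."""
    context = mean_vector(other_patches)
    guided_query = [q + alpha * c for q, c in zip(query, context)]
    return attend(guided_query, patches)


def extract_pair_bidirectional(query, patches_a, patches_b, alpha=1.0):
    """Apply the guidance in both directions (A guided by B, and B guided by A)."""
    token_a = extract_guided(query, patches_a, patches_b, alpha)
    token_b = extract_guided(query, patches_b, patches_a, alpha)
    return token_a, token_b


if __name__ == "__main__":
    image_a = [[1.0, 0.0], [0.0, 1.0]]  # two patches with different content
    image_b = [[0.0, 4.0], [0.0, 4.0]]  # dominated by the second feature
    neutral_query = [0.0, 0.0]
    print(extract_independent(neutral_query, image_a))
    print(extract_pair_bidirectional(neutral_query, image_a, image_b))
```

In this toy setup, the neutral query weights both patches of the first image equally, while the guided query favors the patch that matches what the second image emphasizes. A real system would learn the queries and the guidance mechanism rather than use a fixed mean-pooled shift.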

SAMU-XLSR: Semantically-Aligned Multimodal Utterance-Level Cross-Lingual Speech Representation

On the Use of Semantically-Aligned Speech Representations for Spoken Language Understanding

Semantic enrichment towards efficient speech representations

Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

Semantic Alignment for Multimodal Large Language Models

mSLAM: Massively multilingual joint pre-training for speech and text

Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding

Unsupervised Cross-Lingual Sentence Representation Learning via Linguistic Isomorphism

Crossing language identification: Multilingual ASR framework based on semantic dataset creation & Wav2Vec 2.0

Towards Robust Speech Representation Learning for Thousands of Languages

DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model

Universal Multimodal Representation for Language Understanding

Unified Lexical Representation for Interpretable Visual-Language Alignment

Exploring Multilingual Syntactic Sentence Representations

FC-MTLF: A Fine- and Coarse-grained Multi-Task Learning Framework for Cross-Lingual Spoken Language Understanding

CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning

SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data