Abstract: Large Language Models (LLMs) have demonstrated exceptional text understanding, and existing works explore their application to text embedding tasks. However, few works use LLMs to assist multimodal representation tasks. In this work, we investigate the potential of LLMs to enhance multimodal representation in multimodal item-to-item (I2I) recommendations. One feasible approach is to transfer Multimodal Large Language Models (MLLMs) to representation tasks. However, pre-training MLLMs usually requires collecting high-quality, web-scale multimodal data, resulting in complex training procedures and high costs. This leads the community to rely heavily on open-source MLLMs, which hinders customized training for representation scenarios. We therefore aim to design an end-to-end training method that can integrate any existing LLM and vision encoder to construct efficient multimodal representation models. Preliminary experiments show that LLMs fine-tuned with this end-to-end method tend to overlook image content. To overcome this challenge, we propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation. We propose two ways to enhance the focus on visual information. The first takes the prompt viewpoint: it separates multimodal content into visual content and textual content, and NoteLLM-2 adopts a multimodal In-Context Learning method to teach LLMs to focus on both modalities and aggregate key information. The second takes the model-architecture viewpoint: it uses a late fusion mechanism to fuse visual information directly into textual information. Extensive experiments have been conducted to validate the effectiveness of our method.
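
The abstract does not specify the form of the late fusion mechanism. As a rough illustration only, the sketch below assumes a common choice: a sigmoid gate computed from the concatenated visual and textual embeddings, which then mixes the two element-wise. The function name `late_fusion`, the gate form, and the parameters `gate_weights` and `gate_bias` are hypothetical and are not taken from NoteLLM-2.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def late_fusion(
    visual: List[float],
    text: List[float],
    gate_weights: List[List[float]],
    gate_bias: List[float],
) -> List[float]:
    """Gated late fusion (an assumed form, not the authors' implementation).

    gate  = sigmoid(W @ [visual; text] + b)
    fused = gate * visual + (1 - gate) * text

    `visual` and `text` are embeddings of the same dimension d.
    `gate_weights` has shape (d, 2d) and `gate_bias` has shape (d,).
    """
    if len(visual) != len(text):
        raise ValueError("visual and text embeddings must share a dimension")

    concat = visual + text  # list concatenation, length 2d
    fused = []
    for i, (v, t) in enumerate(zip(visual, text)):
        # One gate value per output dimension.
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        fused.append(g * v + (1.0 - g) * t)
    return fused


if __name__ == "__main__":
    d = 2
    zero_weights = [[0.0] * (2 * d) for _ in range(d)]
    # With zero weights and bias the gate is 0.5, so the result is the mean.
    print(late_fusion([1.0, 0.0], [0.0, 1.0], zero_weights, [0.0, 0.0]))
```

In this sketch, a gate near 1 lets the visual embedding dominate the fused representation and a gate near 0 favours the text embedding. In a trained model the gate parameters would be learned jointly with the rest of the network.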
