Abstract: The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, which are essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey of the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research across three perspectives: (I) Data, (II) Model, (III) Training & Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches related to both architecture analysis and design, and explore training and inference strategies that enhance transparency. By comparing various methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.

From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

Large Multimodal Agents: A Survey

Improving Causal Reasoning in Large Language Models: A Survey

A Survey on Multimodal Large Language Models

Efficient Multimodal Large Language Models: A Survey

Cross-modal Large Language Models: Progress and Prospects

A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks

A Survey on Interpretable Cross-modal Reasoning

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Attention Heads of Large Language Models: A Survey

A Survey of Multimodal Large Language Model from A Data-centric Perspective

A Survey on Benchmarks of Multimodal Large Language Models

A Survey on Evaluation of Multimodal Large Language Models

LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models

A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery

Towards Reasoning in Large Language Models: A Survey

A Survey on Multimodal Benchmarks: In the Era of Large AI Models