Abstract: Generating natural and meaningful responses to multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn, single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multiple turns and multiple images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ a clustering algorithm to find relevant images and textual descriptions from the open-source Wikipedia, and human annotators construct the question-answer pairs with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLMs and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.

Multi-turn Classroom Dialogue Dataset: Assessing Student Performance from One-on-one Conversations

Assessing Student Performance with Multi-granularity Attention from Online Classroom Dialogue

S2M: Converting Single-Turn to Multi-Turn Datasets for Conversational Question Answering

MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

The NCTE Transcripts: A Dataset of Elementary Math Classroom Transcripts

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

MUTLA: A Large-Scale Dataset for Multimodal Teaching and Learning Analytics

Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching

MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation

Multi-Scale Audio Spectrogram Transformer for Classroom Teaching Interaction Recognition

Multi-Task Learning based Online Dialogic Instruction Detection with Pre-trained Language Models

Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems

Riding an emotional roller-coaster: A multimodal study of young child's math problem solving activities

Developing a Tutoring Dialog Dataset to Optimize LLMs for Educational Use

Utilizing Natural Language Processing for Automated Assessment of Classroom Discussion

Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning