Abstract: Recent advancements in artificial intelligence have sparked interest in scientific assistants that could support researchers across the full spectrum of scientific workflows, from literature review to experimental design and data analysis. A key capability for such systems is the ability to process and reason about scientific information in both visual and textual forms - from interpreting spectroscopic data to understanding laboratory setups. Here, we introduce MaCBench, a comprehensive benchmark for evaluating how vision-language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental understanding, and results interpretation. Through a systematic evaluation of leading models, we find that while these systems show promising capabilities in basic perception tasks - achieving near-perfect performance in equipment identification and standardized data extraction - they exhibit fundamental limitations in spatial reasoning, cross-modal information synthesis, and multi-step logical inference. Our insights have important implications beyond chemistry and materials science, suggesting that developing reliable multimodal AI scientific assistants may require advances in curating suitable training data and approaches to training those models.

What problem does this paper attempt to address?

This paper probes the limitations of multimodal language models (MLLMs) in chemistry and materials science research. The authors introduce a comprehensive benchmark, MaCBench, and use it to evaluate how these models handle real-world chemistry and materials science tasks. The tasks cover three core aspects: data extraction, experimental understanding, and results interpretation.

#### Main problems include:

1. **Multimodal information processing ability**:
   - Can current multimodal language models effectively integrate visual and textual information to support complex scientific workflows? Examples include extracting information from the literature, conducting experiments, and analyzing data.
2. **Spatial reasoning and cross-modal information synthesis**:
   - How do the models perform on tasks that require spatial reasoning? For example, identifying isomeric relationships between molecular structures or analyzing crystal structures.
   - Can the models effectively synthesize information across modalities? For example, combining image information with textual descriptions for reasoning.
3. **Multi-step logical reasoning**:
   - How do the models perform on tasks that require multi-step logical reasoning? For example, in X-ray diffraction (XRD) pattern analysis, not only identifying the highest peak but also ranking the relative intensities of multiple peaks.
4. **Pattern matching vs. scientific understanding**:
   - Do the models rely on pattern matching against their training data, or do they truly possess scientific reasoning ability? For example, do the models recognize some common crystal structures simply because those structures appear more frequently on the Internet?
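To make the XRD example in item 3 concrete, here is a minimal sketch of the two steps once peaks have already been read off a pattern. The peak list, the helper names `strongest_peak` and `rank_by_relative_intensity`, and the normalization to 100 are illustrative assumptions and are not taken from MaCBench.

```python
from typing import List, Tuple

# A peak is (2-theta position in degrees, intensity in arbitrary units).
Peak = Tuple[float, float]


def strongest_peak(peaks: List[Peak]) -> Peak:
    """Single-step task: return the peak with the highest intensity."""
    return max(peaks, key=lambda p: p[1])


def rank_by_relative_intensity(peaks: List[Peak]) -> List[Peak]:
    """Multi-step task: normalize to the strongest peak (set to 100),
    then order all peaks from most to least intense."""
    top_intensity = strongest_peak(peaks)[1]
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)
    return [(pos, 100.0 * inten / top_intensity) for pos, inten in ranked]


# Hypothetical, made-up peak list used only for illustration.
peaks = [(47.3, 550.0), (28.4, 1000.0), (69.1, 60.0), (56.1, 300.0)]

print(strongest_peak(peaks))
print(rank_by_relative_intensity(peaks))
```

The first function needs only one comparison pass, while the second chains normalization and ordering. This mirrors the difference between the single-step and multi-step questions described above.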
#### Specific goals of the MaCBench framework:

- **Data extraction**: Evaluate the model's ability to extract information from the literature, including extracting data from tables and charts and interpreting chemical structures.
- **Experimental understanding**: Evaluate the model's understanding of laboratory work, such as identifying equipment, assessing safety conditions, and understanding crystal structures.
- **Results interpretation**: Evaluate the model's ability to interpret various types of scientific data, such as spectral analysis and electronic structure interpretation.

Through systematic evaluation and ablation studies, the authors reveal fundamental limitations of current multimodal language models in handling complex scientific tasks and point out directions for future improvement. This helps in developing more reliable AI research assistants and provides a reference for building automated systems that can assist scientists in creative work.

### Conclusion

Although current multimodal language models perform well on simple perception tasks, they still have significant limitations on complex tasks that require integrating visual and conceptual understanding. Moving toward genuine scientific reasoning will likely require advances not only in model architecture and training approaches but also in the diversity and quality of training data, with particular attention to spatial reasoning and cross-modal information synthesis.

Probing the limitations of multimodal language models for chemistry and materials research
