Abstract: Recent advancements in Large Multi-modal Models (LMMs) underscore the importance of scaling by increasing image-text paired data, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, generalist models are primarily trained on web-scale datasets dominated by natural images, resulting in the sacrifice of specialized capabilities for domain-specific tasks that require extensive domain prior knowledge. Moreover, directly integrating expert models tailored for specific domains is challenging due to the representational gap and imbalanced optimization between the generalist model and experts. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. This results in a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which are challenging tasks for assessing existing LMMs.

What problem does this paper attempt to address?

This paper addresses the weak performance of large multimodal models (LMMs) on domain-specific tasks. Existing LMMs perform well on a wide range of general tasks, but they are trained mainly on web-scale datasets dominated by natural images, so their performance degrades on tasks that require extensive domain prior knowledge. These tasks include:

1. **Multimodal reasoning**: for example, mathematical reasoning over charts, tables, geometric figures, and function plots.
2. **Visual content extraction**: for example, extracting structured information from charts, tables, and documents.

Directly integrating domain-specific expert models into a generalist model is also difficult, mainly because of the representational gap and the imbalanced optimization between the generalist model and the experts. To address both problems, the authors introduce Chimera, a scalable and low-cost multimodal pipeline that strengthens existing LMMs with domain-specific experts.

### Key points of the solution

1. **Progressive training strategy**: A step-by-step training strategy integrates the features from expert models into the input of the generalist LMM.
2. **Generalist-Specialist Collaboration Masking (GSCM)**: The well-aligned general visual encoder causes imbalanced optimization; the GSCM mechanism counters this so that the expert features are fused more effectively.
3. **Routing module**: At inference time, the router decides from the visual input whether to call the corresponding domain expert model, yielding a versatile model that performs well in the chart, table, math, and document domains.

With these methods, Chimera achieves state-of-the-art performance on challenging multimodal reasoning and visual content extraction benchmarks, and approaches or exceeds the performance of expert models on multiple domain-specific tasks.
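To make the routing and masking ideas more concrete, here is a minimal Python sketch of how they might fit together. It is not the authors' implementation. The summary does not specify the router's scoring rule or how GSCM masks tokens, so the sketch assumes that the router picks the highest-scoring domain above a threshold, and that masking means zeroing a random fraction of generalist tokens during training. All names (`route`, `mask_general_tokens`, `build_visual_input`, `threshold`, `mask_ratio`) are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence

Token = List[float]  # one visual token as a plain feature vector


def route(domain_scores: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring expert domain, or None to use only the generalist.

    Assumption: the router emits one score per domain; the source only says
    that it decides from the visual input whether to call an expert.
    """
    if not domain_scores:
        return None
    domain, score = max(domain_scores.items(), key=lambda kv: kv[1])
    return domain if score >= threshold else None


def mask_general_tokens(tokens: Sequence[Token], mask_ratio: float,
                        rng: random.Random) -> List[Token]:
    """Zero out a random fraction of generalist tokens (training time only).

    Assumption: hiding generalist tokens pushes the language model to rely on
    the expert tokens, which is one way to counter imbalanced optimization.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    n_masked = int(len(tokens) * mask_ratio)  # rounded down
    masked_idx = set(rng.sample(range(len(tokens)), n_masked))
    return [[0.0] * len(t) if i in masked_idx else list(t)
            for i, t in enumerate(tokens)]


def build_visual_input(general: Sequence[Token],
                       expert: Optional[Sequence[Token]],
                       training: bool,
                       mask_ratio: float = 0.3,
                       rng: Optional[random.Random] = None) -> List[Token]:
    """Concatenate generalist tokens with expert tokens when an expert is used."""
    if expert is None:
        # No expert was routed to: the generalist tokens pass through unchanged.
        return [list(t) for t in general]
    if training:
        general = mask_general_tokens(general, mask_ratio, rng or random.Random())
    return [list(t) for t in general] + [list(t) for t in expert]
```

In this sketch, a caller would first run `route` on the router's scores, fetch tokens from the chosen expert (if any), and then pass both token lists to `build_visual_input`. Masking is applied only when `training=True`, so inference always sees the full generalist features.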

Chimera: Improving Generalist Model with Domain-Specific Experts
