Chimera: Improving Generalist Model with Domain-Specific Experts

Tianshuo Peng, Mingsheng Li, Hongbin Zhou, Renqiu Xia, Renrui Zhang, Lei Bai, Song Mao, Bin Wang, Conghui He, Aojun Zhou, Botian Shi, Tao Chen, Bo Zhang, Xiangyu Yue
2024-12-09
Abstract: Recent advancements in Large Multi-modal Models (LMMs) underscore the importance of scaling by increasing image-text paired data, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, generalist models are primarily trained on web-scale datasets dominated by natural images, resulting in the sacrifice of specialized capabilities for domain-specific tasks that require extensive domain prior knowledge. Moreover, directly integrating expert models tailored for specific domains is challenging due to the representational gap and imbalanced optimization between the generalist model and experts. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. This results in a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which are challenging tasks for assessing existing LMMs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the weak performance of Large Multi-modal Models (LMMs) on domain-specific tasks. Existing LMMs perform well on a wide range of general tasks, but they are trained mainly on web-scale datasets dominated by natural images, so their performance degrades on tasks that require extensive domain prior knowledge. Specifically, these domain-specific tasks include:

1. **Multimodal reasoning**: For example, mathematical reasoning over charts, tables, geometric figures, and function graphs.
2. **Visual content extraction**: For example, extracting structured information from charts, tables, and documents.

In addition, directly integrating domain-specific expert models into a generalist model is difficult, mainly because of the representational gap and the imbalanced optimization between the generalist model and the experts. To solve these problems, the authors introduce Chimera, a scalable and low-cost multimodal pipeline that enhances the domain-specific performance of existing LMMs.

### Key points of the solution

1. **Progressive training strategy**: A step-by-step training strategy integrates the features from expert models into the input of the generalist LMM.
2. **Generalist-Specialist Collaboration Masking (GSCM) mechanism**: To address the imbalanced optimization caused by the well-aligned general visual encoder, GSCM is proposed to promote better fusion of generalist and expert features.
3. **Routing module**: During inference, a router decides from the visual input whether to call the corresponding domain expert model, yielding a versatile model that performs well in the chart, table, math, and document domains.
Through these methods, Chimera achieves state-of-the-art performance on challenging multimodal reasoning and visual content extraction benchmarks, and approaches or exceeds the performance of expert models on multiple domain-specific tasks.
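To make the routing and masking ideas concrete, here is a minimal Python sketch of how the visual input to the LMM might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not specify how GSCM selects what to mask, so the sketch assumes a random fraction of generalist tokens is zeroed out during training. The function names (`route`, `mask_generalist_tokens`, `build_visual_input`), the score threshold, and the default mask ratio are all hypothetical.

```python
import random
from typing import Dict, List, Optional

Token = List[float]  # one visual token, represented as a plain feature vector


def route(domain_scores: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring expert domain, or None to use the generalist alone.

    The scores and the threshold are hypothetical stand-ins for a learned router.
    """
    if not domain_scores:
        return None
    domain, score = max(domain_scores.items(), key=lambda kv: kv[1])
    return domain if score >= threshold else None


def mask_generalist_tokens(
    tokens: List[Token], mask_ratio: float, rng: random.Random
) -> List[Token]:
    """Zero out a random fraction of generalist tokens.

    ASSUMPTION: random token-level masking is only one plausible reading of GSCM.
    """
    n_mask = int(len(tokens) * mask_ratio)
    masked = set(rng.sample(range(len(tokens)), n_mask))
    return [[0.0] * len(t) if i in masked else list(t) for i, t in enumerate(tokens)]


def build_visual_input(
    general_tokens: List[Token],
    expert_tokens_by_domain: Dict[str, List[Token]],
    domain_scores: Dict[str, float],
    training: bool,
    mask_ratio: float = 0.3,  # hypothetical value, not taken from the paper
    rng: Optional[random.Random] = None,
) -> List[Token]:
    """Concatenate generalist tokens with the routed expert's tokens, if any."""
    domain = route(domain_scores)
    if domain is None:
        # No expert selected: the LMM sees only the generalist features.
        return [list(t) for t in general_tokens]
    if training:
        # Masking is applied only during training, so the model cannot lean
        # entirely on the already well-aligned generalist encoder.
        general_tokens = mask_generalist_tokens(
            general_tokens, mask_ratio, rng or random.Random()
        )
    expert_tokens = expert_tokens_by_domain[domain]
    return [list(t) for t in general_tokens] + [list(t) for t in expert_tokens]
```

In this sketch, an image the router scores as a chart would get the chart expert's tokens appended to the generalist tokens, while a natural image falling below the threshold would pass through with generalist tokens only. Masking only at training time reflects the stated goal of rebalancing optimization toward the expert features without degrading the input at inference.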