Abstract: Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications demand adaptation to specific subdomains, tasks, or concepts, spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining with a research test bed and provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) a data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner's guide to continual multimodal pretraining for real-world deployment. Our benchmark and code are available at https://github.com/ExplainableML/fomo_in_flux.

A Practitioner's Guide to Continual Multimodal Pretraining