Abstract: Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules, and in other modalities such as audio and depth. Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively. A comprehensive list of foundational models studied in this work is available at https://github.com/awaisrauf/Awesome-CV-Foundational-Models.

What problem does this paper attempt to address?

The paper addresses the problem of how to construct foundational models in computer vision that can understand and process complex visual scenes. These models need to learn across different modalities (such as vision, text, and audio) and, with the support of large-scale training data, achieve contextual reasoning, generalization, and prompting capability at test time. Specifically, the paper focuses on the following aspects:

1. **Multimodal Fusion**: How to design model architectures that effectively combine information from different modalities, such as vision and text, to improve the model's understanding and application scope.
2. **Training Objectives**: Exploring different training objectives, such as contrastive learning and generative learning, and how they affect the model's performance.
3. **Pre-training Datasets**: Discussing the selection and construction of large-scale pre-training datasets and their impact on model performance.
4. **Fine-tuning Mechanisms**: Investigating how to adapt pre-trained models to specific tasks through fine-tuning, especially in scenarios with limited labeled data.
5. **Prompt Engineering**: Exploring how to guide models to complete specific tasks through prompts, particularly in zero-shot or few-shot learning scenarios.

The paper also discusses the current challenges faced by foundational models, including difficulties in evaluation and benchmarking, gaps in real-world understanding, limitations in contextual understanding, model biases, vulnerability to adversarial attacks, and interpretability issues. Finally, the paper systematically reviews recent advances in this field and discusses open research directions.

Foundational Models Defining a New Era in Vision: A Survey and Outlook

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision

Towards Foundation Models for 3D Vision: How Close Are We?

Foundation Models for Video Understanding: A Survey

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

A Survey for Foundation Models in Autonomous Driving

Foundation Models Meet Visualizations: Challenges and Opportunities

Foundation Models for Remote Sensing and Earth Observation: A Survey

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey

Foundation Models for Decision Making: Problems, Methods, and Opportunities

Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Foundation models in robotics: Applications, challenges, and the future

Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

Training and Serving System of Foundation Models: A Comprehensive Survey

AI Foundation Models in Remote Sensing: A Survey

Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Robot Learning in the Era of Foundation Models: A Survey