Abstract: For specialized domains, there is often not a wealth of data with which to train large machine learning models. In such limited data / compute settings, various methods exist aiming to $\textit{do more with less}$, such as finetuning from a pretrained model, modulating difficulty levels as data are presented to a model (curriculum learning), and considering the role of model type / size. Approaches to efficient $\textit{machine}$ learning also take inspiration from $\textit{human}$ learning by considering use cases where machine learning systems have access to approximately the same number of words experienced by a 13-year-old child (100M words). We investigate the role of three primary variables in a limited data regime as part of the multimodal track of the BabyLM challenge. We contrast: (i) curriculum learning, (ii) pretraining (with text-only data), and (iii) model type. We modulate these variables and assess them on two types of tasks: (a) multimodal (text+image) tasks, and (b) unimodal (text-only) tasks. We find that curriculum learning benefits multimodal evaluations over non-curriculum learning models, particularly when combined with text-only pretraining. On text-only tasks, curriculum learning appears to help models with smaller trainable parameter counts. We suggest possible reasons, based on architectural differences and training designs, as to why one might observe such results.

What problem does this paper attempt to address?

The paper asks how to improve the performance of Vision-Language Models (VLMs) on multimodal tasks (image + text) and unimodal tasks (text only) through curriculum learning (CL), pretraining, and model type selection when data and computing resources are limited. Specifically, it focuses on the following issues:

1. **Performance improvement with limited data**: How to effectively improve the performance of VLMs when only a small amount of data is available (for example, 100 million words, roughly the number of words a 13-year-old child has been exposed to).
2. **Application of curriculum learning**: Whether curriculum learning can improve the performance of VLMs when data is limited. Curriculum learning mimics the human learning process by gradually increasing task difficulty, with the aim of improving the model's learning efficiency.
3. **Influence of pretraining**: Whether pretraining on text-only data and then adapting to multimodal data further improves model performance on certain evaluation tasks.
4. **Comparison of different model types**: How two different VLM architectures (GIT and Flamingo) differ in performance under different conditions, especially with respect to the influence of the parameter update mechanism.

### Main variables

The paper mainly explores three variables:

- **Curriculum learning**: Optimize the learning process by adjusting the difficulty level at which data are presented.
- **Pretraining**: Pretrain on text-only data and then fine-tune on multimodal data.
- **Model type**: Select different VLM architectures (such as GIT and Flamingo) and compare their performance.
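The curriculum idea described above can be sketched as follows. This is a minimal illustration, not the authors' method: the difficulty measure (caption length in words), the helper names `curriculum_stages` and `caption_length`, and the example captions are all assumptions made for the sketch, since the summary does not state how difficulty was scored.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    samples: Sequence[T],
    difficulty: Callable[[T], float],
    num_stages: int,
) -> List[List[T]]:
    """Sort samples from easiest to hardest and split them into stages.

    Stages are contiguous slices of the sorted data, so a training loop
    that consumes them in order sees easier samples first.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(samples, key=difficulty)
    if not ordered:
        return []
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


def caption_length(caption: str) -> float:
    """Hypothetical difficulty proxy: number of whitespace-separated words."""
    return float(len(caption.split()))


# Toy captions, invented for illustration only.
captions = [
    "a dog",
    "a small brown dog runs across the wet grass",
    "two cats sleeping",
    "a man in a red jacket rides a bicycle past a market stall",
]

stages = curriculum_stages(captions, caption_length, num_stages=2)
for index, stage in enumerate(stages):
    print(f"stage {index}: {stage}")
```

A non-curriculum baseline would instead shuffle the same samples and present them in random order, which is the comparison the study draws.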
### Experimental design

To verify the effects of these variables, the authors conducted the following experiments:

- Use the multimodal dataset provided by the BabyLM challenge (including approximately 2.9 million image-caption pairs).
- Train two VLMs: GIT and Flamingo.
- Design four model variants: two baseline models (standard training) and two curriculum learning models.
- Evaluate the models on multiple benchmark datasets, including multimodal tasks such as Winoground, VQAv2, and DevBench, and unimodal tasks such as BLiMP, (Super)GLUE, and EWoK.

### Results

The experimental results show:

- **Curriculum learning**: On multimodal tasks, curriculum learning significantly improves model performance, especially when combined with text pretraining.
- **Pretraining**: Text pretraining significantly improves model performance, especially on the VQAv2 and DevBench datasets.
- **Model type**: The GIT model performs well on multiple tasks, which may be related to its ability to update the visual encoder parameters.

Overall, the paper uses a systematic experimental design to examine how curriculum learning, pretraining, and the choice of model architecture affect VLM performance when data and computing resources are limited.
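The point about parameter update mechanisms can be made concrete by counting trainable parameters under two setups: one in which the visual encoder is updated during training, and one in which it is frozen and only a small adapter is trained. The sketch below is illustrative only: the component names and every parameter count are hypothetical placeholders, not figures from the paper or from the GIT or Flamingo implementations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """One block of a model, with a flag for whether training updates it."""
    name: str
    num_params: int
    trainable: bool


def trainable_params(components: List[Component]) -> int:
    """Sum the parameters of all components that receive gradient updates."""
    return sum(c.num_params for c in components if c.trainable)


# Hypothetical setup A: every component, including the visual encoder, is updated.
encoder_updated = [
    Component("vision_encoder", 80_000_000, trainable=True),
    Component("text_decoder", 120_000_000, trainable=True),
]

# Hypothetical setup B: encoder and language model are frozen; only adapters train.
encoder_frozen = [
    Component("vision_encoder", 80_000_000, trainable=False),
    Component("language_model", 120_000_000, trainable=False),
    Component("cross_attention_adapters", 40_000_000, trainable=True),
]

print("setup A trainable:", trainable_params(encoder_updated))
print("setup B trainable:", trainable_params(encoder_frozen))
```

Under these placeholder numbers the second setup has far fewer trainable parameters, which is the kind of difference the abstract refers to when it discusses models with smaller trainable parameter counts.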

Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training
