Abstract: Large language models are effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 14 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (image classification, visual QA, and object localization). We observe that many-shot ICL, including up to almost 2,000 demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. We also find that open-weights multimodal foundation models like Llama 3.2-Vision do not benefit from the demonstrating examples, highlighting an important gap between open and closed multimodal foundation models. Given the high inference costs required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro learns more quickly than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains.
Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL.

Focused Large Language Models are Stable Many-Shot Learners

Large Language Models Know What Makes Exemplary Contexts

Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

Many-Shot In-Context Learning

Scaling In-Context Demonstrations with Structured Attention

Why Larger Language Models Do In-context Learning Differently?

AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning

Reducing Distraction in Long-Context Language Models by Focused Learning

Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks

SeCoKD: Aligning Large Language Models for In-Context Learning with Fewer Shots

Iterative Forward Tuning Boosts In-Context Learning in Language Models

ParaICL: Towards Robust Parallel In-Context Learning

Task-Level Thinking Steps Help Large Language Models for Challenging Classification Task

Improving In-context Learning via Bidirectional Alignment

In-Context Learning Demonstration Selection via Influence Analysis

Many-Shot In-Context Learning in Multimodal Foundation Models

Misconfidence-based Demonstration Selection for LLM In-Context Learning

ICLEval: Evaluating In-Context Learning Ability of Large Language Models

FocusLLM: Precise Understanding of Long Context by Dynamic Condensing

Investigating the Learning Behaviour of In-Context Learning: A Comparison with Supervised Learning

Unraveling the Mechanics of Learning-Based Demonstration Selection for In-Context Learning