Abstract: Today's largest foundation models have increasingly general capabilities, yet when used as agents, they often struggle with simple reasoning and decision-making tasks, even though they possess good factual knowledge of the task and how to solve it. In this paper, we present a benchmark to pressure-test these models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether they can learn from a large number of expert demonstrations in their context. We evaluate a wide range of state-of-the-art frontier models as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We measure the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, o1-mini, and o1-preview under increasing amounts of expert demonstrations in the context – from no demonstrations up to 512 full episodes, pushing these models' multimodal long-context reasoning capabilities to their limits. Across our tasks, today's frontier models rarely manage to fully reach expert performance, showcasing the difficulty of our benchmark. Presenting more demonstrations often has little effect, but some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. Overall, our results suggest that even today's most capable models often struggle to imitate desired behavior by generalizing purely from in-context demonstrations. To help quantify the impact of other approaches and future innovations aiming to tackle this problem, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.

What problem does this paper attempt to address?

The paper addresses the following problem: although today's state-of-the-art large foundation models (LMs) have a wide range of capabilities, they often perform poorly when acting as agents on simple reasoning and decision-making tasks. Even when these models have good knowledge of a task and of how to solve it, they still struggle to make effective decisions in practice. Specifically, the paper proposes and evaluates a new benchmark, LMAct, to test these models' multimodal decision-making capabilities in the long-context regime (up to one million tokens) and to explore whether they can learn from a large number of expert demonstrations. The benchmark covers a variety of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari games, navigating grid worlds, solving crossword puzzles, and controlling a simulated cheetah.

### Main problems

1. **Multimodal decision-making capabilities in long context**: How do current frontier models perform when handling long contexts (for example, up to one million tokens)?
2. **Ability to learn from a large number of expert demonstrations**: Can these models improve their decision-making by observing a large amount of expert behavior?
3. **Impact of different prompt formats**: How do different prompt formats (such as text- or image-encoded observations and chain-of-thought prompting) affect model performance?

### Research objectives

- Evaluate the performance of current state-of-the-art multimodal foundation models on various interactive decision-making tasks.
- Explore whether increasing the number of expert demonstrations can significantly improve the models' decision-making capabilities.
- Analyze the impact of different prompt formats on model performance to optimize prompt design.
- Provide an open-source benchmark (LMAct) that future research can build on.

By exploring these problems, the paper aims to reveal the limitations of current large language models in long-context multimodal decision-making tasks and to provide directions for future improvement.
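
To make the zero-, few-, and many-shot setup concrete, here is a minimal sketch of how expert demonstrations might be placed in a model's context before it is asked for an action. It is not the LMAct implementation: the `Episode` type, the `build_prompt` function, and the plain-text "Observation/Action" layout are all hypothetical names and formats chosen for illustration, and only text observations are handled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    """One expert demonstration: a list of (observation, action) pairs.

    Hypothetical container for illustration; not part of LMAct.
    """
    steps: List[Tuple[str, str]]


def format_episode(episode: Episode, index: int) -> str:
    """Render one demonstration episode as plain text (assumed layout)."""
    lines = [f"Demonstration {index}:"]
    for observation, action in episode.steps:
        lines.append(f"Observation: {observation}")
        lines.append(f"Action: {action}")
    return "\n".join(lines)


def build_prompt(demos: List[Episode], num_demos: int, observation: str) -> str:
    """Prepend the first `num_demos` expert episodes to the current observation.

    With num_demos == 0 this reduces to the zero-shot prompt.
    """
    if num_demos < 0:
        raise ValueError("num_demos must be non-negative")
    parts = [format_episode(ep, i + 1) for i, ep in enumerate(demos[:num_demos])]
    parts.append(f"Observation: {observation}\nAction:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Two toy tic-tac-toe demonstrations with made-up board encodings.
    demos = [
        Episode(steps=[("X..|...|...", "4")]),
        Episode(steps=[("...|.O.|...", "0")]),
    ]
    # Sweep the number of in-context demonstrations, as a benchmark run might.
    for k in (0, 1, 2):
        prompt = build_prompt(demos, k, "X.O|...|...")
        print(f"--- {k} demonstration(s), {len(prompt)} characters ---")
        print(prompt)
```

Sweeping `num_demos` over a range of values mirrors the paper's evaluation from no demonstrations up to 512 full episodes; in a real run each prompt would be sent to the model under test and the returned action applied to the environment.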

LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Are Human-generated Demonstrations Necessary for In-context Learning?

The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis

VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought

Misconfidence-based Demonstration Selection for LLM In-Context Learning

PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain

Scaling In-Context Demonstrations with Structured Attention

Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

Unifying Demonstration Selection and Compression for In-Context Learning

What Makes Good In-context Demonstrations for Code Intelligence Tasks with LLMs?

Enhancing In-Context Learning via Implicit Demonstration Augmentation

Unraveling the Mechanics of Learning-Based Demonstration Selection for In-Context Learning

Demonstration Notebook: Finding the Most Suited In-Context Learning Example from Interactions

It Takes One to Tango but More Make Trouble? In-Context Training with Different Number of Demonstrations

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

In-Context Learning with Iterative Demonstration Selection

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Imitator Learning: Achieve Out-of-the-Box Imitation Ability in Variable Environments

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

MileBench: Benchmarking MLLMs in Long Context