LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein
2024-12-02
Abstract: Today's largest foundation models have increasingly general capabilities, yet when used as agents, they often struggle with simple reasoning and decision-making tasks, even though they possess good factual knowledge of the task and how to solve it. In this paper, we present a benchmark to pressure-test these models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether they can learn from a large number of expert demonstrations in their context. We evaluate a wide range of state-of-the-art frontier models as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We measure the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, o1-mini, and o1-preview under increasing amounts of expert demonstrations in the context (from no demonstrations up to 512 full episodes), pushing these models' multimodal long-context reasoning capabilities to their limits. Across our tasks, today's frontier models rarely manage to fully reach expert performance, showcasing the difficulty of our benchmark. Presenting more demonstrations often has little effect, but some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. Overall, our results suggest that even today's most capable models often struggle to imitate desired behavior by generalizing purely from in-context demonstrations. To help quantify the impact of other approaches and future innovations aiming to tackle this problem, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the following problem: although today's largest foundation models have increasingly general capabilities, they often struggle with simple reasoning and decision-making tasks when used as agents, even when they possess good factual knowledge of the task and how to solve it. The paper presents and evaluates a benchmark, LMAct, that tests these models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigates whether they can learn from a large number of expert demonstrations in their context. The benchmark covers a range of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari games, navigating grid worlds, solving crosswords, and controlling a simulated cheetah.

### Main problems

1. **Multimodal decision-making in long contexts**: How do current frontier models perform as policies when the context is very long (up to one million tokens)?
2. **Learning from many expert demonstrations**: Can these models improve their decision-making by observing more expert demonstrations in context (from none up to 512 full episodes)?
3. **Impact of prompt format**: How do choices such as encoding observations as text or as images, or using chain-of-thought prompting, affect model performance?

### Research objectives

- Evaluate the performance of current state-of-the-art multimodal foundation models on a variety of interactive decision-making tasks.
- Examine whether increasing the number of expert demonstrations improves the models' decision-making.
- Analyze the effect of observation encoding and chain-of-thought prompting on model performance, to inform prompt design.
- Provide an open-source benchmark (LMAct) that covers the zero-, few-, and many-shot regimes, so that future work can build on it.

By examining these questions, the paper aims to expose the limitations of current frontier models in long-context multimodal decision-making and to point to directions for future improvement.
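
The evaluation setup described above (a model acting as a policy while the number of expert demonstration episodes in its context varies) can be illustrated with a minimal sketch. This is not the LMAct code or its API: the names `build_prompt` and `demo_counts`, the text-only prompt layout, and the powers-of-two schedule are all assumptions made for illustration. Only the range, from no demonstrations up to 512 full episodes, comes from the abstract.

```python
from typing import List, Sequence, Tuple

# One step of an expert episode: (observation, action), both as plain text.
# Hypothetical representation; the benchmark also supports image observations.
Step = Tuple[str, str]
Episode = List[Step]


def build_prompt(demos: Sequence[Episode], observation: str) -> str:
    """Concatenate expert episodes and the current observation into one prompt."""
    parts: List[str] = []
    for i, episode in enumerate(demos):
        parts.append(f"Demonstration episode {i + 1}:")
        for obs, action in episode:
            parts.append(f"Observation: {obs}")
            parts.append(f"Action: {action}")
    # The model is asked to continue the text after the final "Action:".
    parts.append("Current episode:")
    parts.append(f"Observation: {observation}")
    parts.append("Action:")
    return "\n".join(parts)


def demo_counts(max_demos: int = 512) -> List[int]:
    """Zero-shot plus powers of two up to max_demos (an assumed schedule)."""
    counts = [0]
    n = 1
    while n <= max_demos:
        counts.append(n)
        n *= 2
    return counts
```

With zero demonstrations the prompt contains only the current observation, which corresponds to the zero-shot regime. Larger counts place progressively more full expert episodes ahead of it, which is how the context can grow toward the very long-context regime the paper targets.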