Abstract: Large-scale generative language and vision-language models (LLMs and VLMs) excel at few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations in their context window. In this work, we ask: Can LLMs and VLMs generate their own examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience from sub-optimal demonstrations and human feedback. Given a task demonstration that may contain inefficiencies or mistakes, a VLM abstracts the trajectory into a generalized program of thought by correcting inefficient actions and annotating cognitive abstractions: causal relationships, object state changes, temporal subgoals, and task-relevant visual elements. These programs of thought are iteratively improved through human feedback while the agent executes the trajectory in a similar environment. The resulting examples significantly improve decision-making in retrieval-augmented LLM and VLM agents. Moreover, as the agent's library of examples grows, it becomes more efficient, relying less on human feedback and requiring fewer environment interactions per demonstration. Our ICAL agent surpasses the SOTA in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over few-shot GPT-4V. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show that fine-tuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on manual prompt engineering and consistently outperforms in-context learning from action plans that lack such programs of thought.
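
As a rough illustration of the retrieval-augmented setup the abstract describes, the sketch below stores abstracted examples in a memory, retrieves the ones most similar to a new instruction, and assembles them into a few-shot prompt. It is not the authors' implementation: the names `Example`, `ExampleMemory`, and `build_prompt` are hypothetical, and the bag-of-words cosine similarity is an assumption standing in for whatever retrieval model a real agent would use. The abstraction and human-feedback steps are not modeled here; examples are assumed to arrive already abstracted.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def _bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Example:
    """Hypothetical record: an instruction, its abstractions, and its actions."""
    instruction: str
    abstractions: list
    actions: list


@dataclass
class ExampleMemory:
    """Hypothetical growing library of abstracted examples."""
    examples: list = field(default_factory=list)

    def add(self, example):
        self.examples.append(example)

    def retrieve(self, instruction, k=2):
        """Return the k stored examples most similar to the instruction."""
        query = _bag_of_words(instruction)
        ranked = sorted(
            self.examples,
            key=lambda ex: _cosine(query, _bag_of_words(ex.instruction)),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(memory, instruction, k=2):
    """Format retrieved examples, then the new instruction, as a few-shot prompt."""
    parts = []
    for ex in memory.retrieve(instruction, k):
        parts.append(
            f"Instruction: {ex.instruction}\n"
            f"Abstractions: {'; '.join(ex.abstractions)}\n"
            f"Actions: {', '.join(ex.actions)}"
        )
    parts.append(f"Instruction: {instruction}\nActions:")
    return "\n\n".join(parts)
```

In this sketch, the returned prompt string would be passed to an LLM or VLM, and any newly abstracted example would be added to the memory with `add` so that later retrievals can draw on it.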

Semantic HELM: A Human-Readable Memory for Reinforcement Learning

Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning

Large Language Models Are Semi-Parametric Reinforcement Learning Agents

Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models

A Machine with Short-Term, Episodic, and Semantic Memory Systems

Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception

Remember the Past for Better Future: Memory-Augmented Offline RL

A Machine With Human-Like Memory Systems

GLIMMER: generalized late-interaction memory reranker

RecallM: An Adaptable Memory Mechanism with Temporal Understanding for Large Language Models

In-Memory Learning: A Declarative Learning Framework for Large Language Models

VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought

"My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents

AdaMemento: Adaptive Memory-Assisted Policy Optimization for Reinforcement Learning

Reflexion: an autonomous agent with dynamic memory and self-reflection

Interpretable Concept-Based Memory Reasoning

Episodic Memory Verbalization using Hierarchical Representations of Life-Long Robot Experience

Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents

Learning Symbolic Task Representation from a Human-Led Demonstration: A Memory to Store, Retrieve, Consolidate, and Forget Experiences

Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning

Mastering Memory Tasks with World Models