Abstract: We introduce a new benchmark, LLF-Bench (Learning from Language Feedback Benchmark; pronounced "elf-bench"), to evaluate the ability of AI agents to interactively learn from natural language feedback and instructions. Learning from language feedback (LLF) is essential for people, largely because the rich information this feedback provides can help a learner avoid much of the trial and error and thereby speed up the learning process. Large Language Models (LLMs) have recently enabled AI agents to comprehend natural language, and hence AI agents can potentially benefit from language feedback during learning, as humans do. But existing interactive benchmarks do not assess this crucial capability: they either use numeric reward feedback or require no learning at all (only planning or information retrieval). LLF-Bench is designed to fill this gap. LLF-Bench is a diverse collection of sequential decision-making tasks that includes user recommendation, poem writing, navigation, and robot control. The objective of an agent is to interactively solve these tasks based on their natural-language instructions and the feedback received after taking actions. Crucially, to ensure that the agent actually "learns" from the feedback, LLF-Bench implements several randomization techniques (such as paraphrasing and environment randomization) so that the task is not familiar to the agent and the agent is robust to various verbalizations. In addition, LLF-Bench provides a unified OpenAI Gym interface for all its tasks and allows users to easily configure the information the feedback conveys (among suggestion, explanation, and instantaneous performance) to study how agents respond to different types of feedback. Together, these features make LLF-Bench a unique research platform for developing and testing LLF agents.
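The interaction pattern described in the abstract (a Gym-style loop in which the agent receives an instruction, acts, and then gets configurable, paraphrased language feedback) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `ToyGuessEnv`, `bisect_agent`, the observation keys, and the feedback templates are hypothetical and are not taken from the LLF-Bench codebase. The class only mimics the classic `reset`/`step` convention and does not import Gym.

```python
import random


class ToyGuessEnv:
    """Hypothetical Gym-style task: guess a hidden integer in [0, 10]."""

    # Several phrasings per feedback type, mimicking paraphrase randomization.
    TEMPLATES = {
        "performance": ["Your reward is {reward}.", "You scored {reward}."],
        "explanation": ["Your guess is too {direction}.",
                        "That guess was too {direction}."],
        "suggestion": ["Try a {change} number.", "Consider guessing {change}."],
    }

    def __init__(self, feedback_types=("performance", "explanation", "suggestion"),
                 seed=None):
        self.feedback_types = tuple(feedback_types)  # which information to convey
        self.rng = random.Random(seed)
        self.target = None

    def reset(self):
        # Environment randomization: a new hidden target each episode.
        self.target = self.rng.randint(0, 10)
        return {"instruction": "Guess the hidden integer between 0 and 10.",
                "observation": None, "feedback": None}

    def step(self, action):
        done = action == self.target
        reward = -abs(action - self.target)
        if done:
            feedback = "Correct."
        else:
            too_high = action > self.target
            slots = {"reward": reward,
                     "direction": "high" if too_high else "low",
                     "change": "lower" if too_high else "higher"}
            # One randomly paraphrased sentence per configured feedback type.
            feedback = " ".join(
                self.rng.choice(self.TEMPLATES[t]).format(**slots)
                for t in self.feedback_types)
        obs = {"instruction": None, "observation": action, "feedback": feedback}
        return obs, reward, done, {}


def bisect_agent(env, max_steps=10):
    """Toy agent that narrows its guess using only the explanation text.

    Requires "explanation" to be among the environment's feedback types.
    Returns the number of steps taken, or None if the budget runs out.
    """
    env.reset()
    low, high = 0, 10
    for step in range(1, max_steps + 1):
        guess = (low + high) // 2
        obs, _reward, done, _info = env.step(guess)
        if done:
            return step
        if "too high" in obs["feedback"]:
            high = guess - 1
        else:
            low = guess + 1
    return None


if __name__ == "__main__":
    env = ToyGuessEnv(feedback_types=("explanation",), seed=0)
    print("solved in", bisect_agent(env), "steps")
```

The agent here ignores the numeric reward entirely and relies on the textual explanation, which is the kind of behavior the benchmark is meant to probe. Changing `feedback_types` in the sketch loosely corresponds to configuring which information the feedback conveys.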

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation

Large Language Models as Automated Aligners for benchmarking Vision-Language Models

Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy

TaskBench: Benchmarking Large Language Models for Task Automation

MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation

VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition

NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation

LEMMA: Learning Language-Conditioned Multi-Robot Manipulation

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study

LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation

AgentBench: Evaluating LLMs as Agents

LLF-Bench: Benchmark for Interactive Learning from Language Feedback

RoboCoder: Robotic Learning from Basic Skills to General Tasks with Large Language Models

ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models

Generalizable Long-Horizon Manipulations with Large Language Models

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?