Abstract: Current robot autonomy struggles to operate beyond its assumed Operational Design Domain (ODD), the specific set of conditions and environments in which the system is designed to function, while the real world is rife with uncertainties that may lead to failures. Automating recovery remains a significant challenge. Traditional methods either rely on human intervention to address failures manually or require exhaustive enumeration of failure cases and the design of a specific recovery policy for each scenario; both approaches are labor-intensive. Foundational Vision-Language Models (VLMs), which demonstrate remarkable common-sense generalization and reasoning capabilities, have broader, potentially unbounded ODDs. However, limited spatial reasoning remains a common challenge for many VLMs when they are applied to robot control and motion-level error recovery. In this paper, we investigate how optimizing visual and text prompts can enhance the spatial reasoning of VLMs, enabling them to function effectively as black-box controllers for both motion-level position correction and task-level recovery from unknown failures. Specifically, the optimizations include identifying key visual elements in visual prompts, highlighting these elements in text prompts for querying, and decomposing the reasoning process for failure detection and control generation. In experiments, prompt optimizations significantly outperform pre-trained Vision-Language-Action Models in correcting motion-level position errors and improve accuracy by 65.78% compared to VLMs with unoptimized prompts. Additionally, for task-level failures, optimized prompts improved VLMs' success rates in detecting failures, analyzing issues, and generating recovery plans by 5.8%, 5.8%, and 7.5%, respectively, across a wide range of unknown errors in Lego assembly.
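
The three optimizations named in the abstract (marking key visual elements, referring to those elements in the text prompt, and splitting failure detection from control generation) can be illustrated with a minimal Python sketch. This is an illustration only, not the authors' implementation: the `KeyElement` class, the function names, the marker labels, the prompt wording, and the set of allowed directions are all hypothetical, and no particular VLM API is assumed.

```python
from dataclasses import dataclass


@dataclass
class KeyElement:
    """A visual element marked on the image (hypothetical structure)."""
    label: str        # marker drawn on the image, e.g. "A"
    description: str  # what the marker denotes


def describe_elements(elements):
    """List the marked elements so the text prompt refers to the visual prompt."""
    return "\n".join(f"- Marker {e.label}: {e.description}" for e in elements)


def build_detection_prompt(task, elements):
    """Stage 1 (assumed wording): ask only whether a failure has occurred."""
    return (
        f"Task: {task}\n"
        f"The image highlights these elements:\n{describe_elements(elements)}\n"
        "Question: Has the task failed? Answer 'yes' or 'no' with one reason."
    )


def build_control_prompt(task, elements, detection_answer, source, target):
    """Stage 2 (assumed wording): ask for a correction, given the stage-1 answer."""
    return (
        f"Task: {task}\n"
        f"The image highlights these elements:\n{describe_elements(elements)}\n"
        f"Earlier assessment: {detection_answer}\n"
        f"Question: In which direction should marker {source} move to align "
        f"with marker {target}? Answer with one of: left, right, forward, backward."
    )


# Hypothetical usage: the strings would be sent to a VLM together with the
# annotated image; the model call itself is deliberately left out.
elements = [
    KeyElement("A", "the brick held by the gripper"),
    KeyElement("B", "the target stud position on the plate"),
]
detection_prompt = build_detection_prompt("Place the brick on the plate", elements)
control_prompt = build_control_prompt(
    "Place the brick on the plate", elements, "yes, the brick is offset", "A", "B"
)
```

Keeping the two stages as separate queries mirrors the decomposition described in the abstract: the second prompt is built only after the first answer is available, so each query asks the model a single, narrower question.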

Failures Are Fated, But Can Be Faded: Characterizing and Mitigating Unwanted Behaviors in Large-Scale Vision and Language Models

LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"

What could go wrong? Discovering and describing failure modes in computer vision

RoboFail: Analyzing Failures in Robot Learning Policies

Failure Modes in Machine Learning Systems

Adaptive Testing of Computer Vision Models

Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents

Catastrophic Forgetting in Deep Learning: A Comprehensive Taxonomy

Not all Failure Modes are Created Equal: Training Deep Neural Networks for Explicable (Mis)Classification

Failure-Proof Non-Contrastive Self-Supervised Learning

fAIlureNotes: Supporting Designers in Understanding the Limits of AI Models for Computer Vision Tasks

Task Success is not Enough: Investigating the Use of Video-Language Models as Behavior Critics for Catching Undesirable Agent Behaviors

Systematic Evaluation of Deep Learning Models for Log-based Failure Prediction

Training Efficiency and Robustness in Deep Learning

Leveraging generative models to characterize the failure conditions of image classifiers

Identifying and Exploiting Structures for Reliable Deep Learning

Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning

DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation

Automating Robot Failure Recovery Using Vision-Language Models With Optimized Prompts

Mass-Producing Failures of Multimodal Systems with Language Models