Abstract: In dialogue, the addressee may initially misunderstand the speaker and respond erroneously, often prompting the speaker to correct the misunderstanding in the next turn with a Third Position Repair (TPR). The ability to process and respond appropriately to such repair sequences is thus crucial in conversational AI systems. In this paper, we first collect, analyse, and publicly release BlockWorld-Repairs: a dataset of multi-modal TPR sequences in an instruction-following manipulation task that is, by design, rife with referential ambiguity. We employ this dataset to evaluate several state-of-the-art Vision and Language Models (VLM) across multiple settings, focusing on their capability to process and accurately respond to TPRs and thus recover from miscommunication. We find that, compared to humans, all models significantly underperform in this task. We then show that VLMs can benefit from specialised losses targeting relevant tokens during fine-tuning, achieving better performance and generalising better to new scenarios. Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings where repairs are common, and highlight the need to design training regimes and objectives that facilitate learning from interaction. Our code and data are available at http://www.github.com/JChiyah/blockworld-repairs

What problem does this paper attempt to address?

The paper attempts to address the issue of handling user corrections in dialogue systems. Specifically, when one party in a conversation misunderstands the other and responds based on this misunderstanding, the dialogue system needs to correctly handle the user's correction information (referred to as Third Position Repair, TPR) to restore proper communication. The paper evaluates the performance of existing Vision and Language Models (VLM) in handling TPR sequences by constructing a multimodal dataset **BlockWorld-Repairs** and explores how to improve these models' performance through specialized training methods.

### Main Research Questions

1. **Dataset Construction**: Collect, analyze, and publicly release the **BlockWorld-Repairs** dataset, which contains multimodal TPR sequences occurring in an instruction-following manipulation task.
2. **Model Performance Evaluation**: Use this dataset to evaluate the ability of several state-of-the-art VLMs to handle and respond to TPR sequences, particularly their ability to recover from misunderstandings.
3. **Improving Model Performance**: Explore specific loss functions and fine-tuning methods to enhance VLM performance in handling TPR sequences.

### Background

- **Natural Language Understanding (NLU)**: NLU is not just a one-way passive process but an interactive process where people continuously collaborate in everyday conversations to achieve mutual understanding and coordination.
- **Repair Mechanism**: Repair mechanisms are crucial for handling misunderstandings in conversations, including self-corrections made by users upon discovering misunderstandings.
- **Multimodal Dialogue**: In multimodal environments, repair mechanisms are especially important as the combination of visual and language information can provide richer context, helping systems better understand and correct misunderstandings.

### Methods

1. **Dataset Construction**: Based on the Block World task, extend the original dataset to include short dialogues containing TPR.
2. **Experimental Setup**: Evaluate VLM performance in handling initial instructions and TPR sequences, using different loss functions for fine-tuning to optimize the model's repair capabilities.
3. **Performance Evaluation**: Analyze model performance in handling TPR sequences through human baselines and comparisons with different models, particularly the differences in zero-shot and fine-tuned performance.

### Results

- **Zero-Shot Performance**: Most models perform better than random baselines in zero-shot settings but still have a significant gap compared to humans.
- **Fine-Tuning Effect**: Fine-tuning improves model performance in handling TPR sequences, especially when using specific loss functions (e.g., calculating loss only for user turns).
- **Error Analysis**: Models still struggle with handling some references that are easily recognizable by humans, indicating that current models need improvement in repair mechanisms for complex multimodal tasks.

### Conclusion

- **Model Deficiencies**: Existing VLMs still have significant deficiencies in handling TPR sequences, especially in multimodal collaborative scenarios.
- **Future Directions**: More effective training methods and objective functions need to be designed to improve model performance in handling repair mechanisms, bringing them closer to human-level performance.
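
### Illustrative Sketch: Loss Restricted to Selected Tokens

The fine-tuning idea summarised under Results, computing the loss only on a chosen subset of tokens, can be illustrated with a masked cross-entropy. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `masked_cross_entropy`, the `loss_mask` argument, and the choice of which positions count as "relevant" are all hypothetical. Frameworks such as PyTorch commonly obtain a similar effect with the `ignore_index` argument of their cross-entropy loss.

```python
import math
from typing import Sequence


def masked_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    loss_mask: Sequence[int],
) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1.

    logits[i] holds the unnormalised vocabulary scores at position i,
    targets[i] is the index of the gold token at that position, and
    loss_mask[i] is 1 if the position should contribute to the loss.
    Which positions are marked 1 is an assumption made by the caller.
    """
    if not (len(logits) == len(targets) == len(loss_mask)):
        raise ValueError("logits, targets and loss_mask must have equal length")

    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # masked-out position: no contribution to the loss
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        count += 1

    # With no selected positions there is nothing to average over.
    return total / count if count else 0.0


# Hypothetical three-token sequence over a four-word vocabulary.
example_logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
example_targets = [0, 0, 2]

full_loss = masked_cross_entropy(example_logits, example_targets, [1, 1, 1])
selected_loss = masked_cross_entropy(example_logits, example_targets, [0, 1, 0])
```

In this toy example the mask `[0, 1, 0]` keeps only the second position, so `selected_loss` reflects that token alone while `full_loss` averages over all three. Restricting the average this way concentrates the gradient signal on the selected tokens instead of spreading it across the whole sequence.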

Repairs in a Block World: A New Benchmark for Handling User Corrections with Multi-Modal Language Models
