Abstract: We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We find that for a fixed collection budget it is beneficial to take advantage of cheaper human collection along with robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making the system deployable with human oversight even if imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa trained on our dataset that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline and is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a lot of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs with an average error rate reduction of 19% across all VQA tasks.
Data and videos are available at https://robovqa.github.io

VideoABC: A Real-World Video Dataset for Abductive Visual Reasoning

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos

From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

3D Concept Learning and Reasoning from Multi-View Images

RAVEN: A Dataset for Relational and Analogical Visual rEasoNing

BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind

DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

A Unified View of Abstract Visual Reasoning Problems

AVoE: A Synthetic 3D Dataset on Understanding Violation of Expectation for Artificial Cognition

Multi-modal Action Chain Abductive Reasoning

Joint Answering and Explanation for Visual Commonsense Reasoning

Visual Reasoning: from State to Transformation

AVQA: A Dataset for Audio-Visual Question Answering on Videos

Visual Explanation by High-Level Abduction: On Answer-Set Programming Driven Reasoning about Moving Objects

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Video Question Answering: Datasets, Algorithms and Challenges