Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video

Zachary Chavis, Hyun Soo Park, Stephen J. Guy
2024-07-19
Abstract: Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models lack the spatial understanding necessary for robotics applications where the agent must reason about the affordances provided by the 3D world around them. We present a system which trains on spatially-localized egocentric videos in order to connect visual input and task descriptions to predict a task's spatial affordance, that is, the location where a person would go to accomplish the task. We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images. Our learning-based approach has less error both on predicting where a task may take place and on predicting what tasks are likely to happen at the current location. The resulting system enables robots to use egocentric sensing to navigate to physical locations of novel tasks specified in natural language.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of predicting a task's spatial affordance, that is, the location where a person would go to perform the task, using only egocentric images. Specifically, the authors propose a method that combines a natural language description of the task with images captured from a first-person perspective to predict the relative location where the task occurs. The goal is to enable robots to use first-person sensing to navigate to the physical location of novel tasks specified in natural language. The method is reported to outperform the baseline on new tasks and new viewpoints in known environments, and, with a small number of demonstrations, it can be fine-tuned in new environments to further improve accuracy.
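
For intuition, the baseline named in the abstract (using a VLM to map the similarity of a task's description over a set of location-tagged images) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are assumed to come from some VLM text and image encoder and are replaced here by small hand-written vectors, and the function name predict_location_by_similarity, the use of cosine similarity, and the choice of returning the single best-matching image's location are all hypothetical.

```python
import numpy as np


def predict_location_by_similarity(text_emb, image_embs, image_locations):
    """Return the location tagged on the image most similar to the task text.

    text_emb:        (d,) embedding of the task description (assumed to come from a VLM).
    image_embs:      (n, d) embeddings of the location-tagged images (same assumed VLM).
    image_locations: (n, 3) location tag of each image.
    """
    text_emb = np.asarray(text_emb, dtype=float)
    image_embs = np.asarray(image_embs, dtype=float)
    image_locations = np.asarray(image_locations, dtype=float)

    # Normalize so that the dot product equals cosine similarity.
    text_unit = text_emb / np.linalg.norm(text_emb)
    image_units = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

    # One similarity score per location-tagged image.
    sims = image_units @ text_unit

    # Hypothetical decision rule: take the location of the best-matching image.
    best = int(np.argmax(sims))
    return image_locations[best], sims


# Toy stand-ins for VLM embeddings and location tags (illustrative values only).
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_locations = [[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [5.0, -1.0, 0.0]]
text_emb = [0.1, 0.9, 0.2]

location, sims = predict_location_by_similarity(text_emb, image_embs, image_locations)
print(location)
```

In this toy setup the task embedding is closest to the second image, so the sketch returns that image's location tag. The learning-based system described in the abstract instead trains on spatially-localized egocentric video to predict the location.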