CognitiveDog: Large Multimodal Model Based System to Translate Vision and Language into Action of Quadruped Robot

Artem Lykov, Mikhail Litvinov, Mikhail Konenkov, Rinat Prochii, Nikita Burtsev, Ali Alridha Abdulkarim, Artem Bazhenov, Vladimir Berman, Dzmitry Tsetserukou
2024-01-18
Abstract: This paper introduces CognitiveDog, a pioneering quadruped robot driven by a Large Multimodal Model (LMM) that can not only communicate with humans verbally but also physically interact with the environment through object manipulation. The system was realized on a Unitree Go1 robot dog equipped with a custom gripper and demonstrated autonomous decision-making, independently determining the most appropriate actions and interactions with various objects to fulfill user-defined tasks. These tasks do not necessarily include direct instructions, challenging the robot to comprehend and execute them based on natural language input and environmental cues. The paper describes the system, the characteristics of its dataset, and the software architecture. Key to this development is the robot's proficiency in navigating space using Visual-SLAM, manipulating and transporting objects, and providing natural language commentary during task execution. Experimental results highlight the robot's task comprehension and adaptability, underscoring its potential in real-world applications. The dataset used to fine-tune the robot-dog behavior generation model is provided at the following link:
Robotics
What problem does this paper attempt to address?
The paper addresses the problem of building a quadruped robot system that can perform complex tasks through visual and language understanding. It introduces CognitiveDog, a system based on large multimodal models (LMM) that can both communicate with humans through language and physically interact with the environment through object manipulation. CognitiveDog aims to strengthen the robot's autonomous decision-making so that it can understand and execute tasks from natural language input and environmental cues, without direct instructions. The paper focuses on the following aspects:

1. **Multimodal Understanding and Interaction**: How to enable the robot to understand its surroundings by analyzing environmental images and processing natural language instructions, and to take appropriate actions based on this understanding.
2. **Autonomous Task Execution**: How to allow the robot to autonomously complete user-defined tasks without explicit guidance.
3. **System Architecture and Model Selection**: A detailed introduction to the system's software architecture, including the selection and integration of large language models (such as Mistral 7B) and vision-language models (such as MiniGPT4-v2).
4. **Experimental Evaluation**: A series of experiments that validate CognitiveDog's performance across different tasks and environments, particularly its generalization ability and emergent capabilities (such as symbolic understanding, reasoning, and face recognition).

Overall, the paper aims to advance the autonomy and adaptability of quadruped robots in practical applications, enabling them to work effectively in complex and dynamic environments.
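To make the division of labor between the vision-language model and the language model more concrete, the following is a minimal Python sketch of a perceive, plan, act loop. It is an illustration under stated assumptions, not the authors' implementation: the `name(argument)` action format, the functions `parse_plan` and `run_task`, the stand-ins `fake_vlm` and `fake_llm`, and the "I am thirsty" request are all hypothetical. In a real system the two stand-ins would be replaced by calls to models such as MiniGPT4-v2 and Mistral 7B, and `execute` by the robot's motion and gripper controllers.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Action:
    """One step of a behavior plan (hypothetical format, not from the paper)."""
    name: str      # e.g. "go_to", "grasp", "say"
    argument: str  # e.g. an object label or a sentence to speak


def parse_plan(plan_text: str) -> List[Action]:
    """Parse lines of the form 'name(argument)'; malformed lines are skipped."""
    actions = []
    for line in plan_text.splitlines():
        line = line.strip()
        if not line or "(" not in line or not line.endswith(")"):
            continue
        name, _, rest = line.partition("(")
        actions.append(Action(name.strip(), rest[:-1].strip()))
    return actions


def run_task(
    request: str,
    image: Any,
    describe_scene: Callable[[Any], str],  # stand-in for a vision-language model
    generate_plan: Callable[[str], str],   # stand-in for a language model
    execute: Callable[[Action], None],     # stand-in for the robot controller
) -> List[Action]:
    """Describe the scene, ask for a plan, then execute each parsed action."""
    scene = describe_scene(image)
    prompt = f"Scene: {scene}\nRequest: {request}\nPlan:"
    plan = parse_plan(generate_plan(prompt))
    for action in plan:
        execute(action)
    return plan


def fake_vlm(image: Any) -> str:
    # Assumption: a fixed description in place of a real model call.
    return "a water bottle on the floor, a person sitting on a sofa"


def fake_llm(prompt: str) -> str:
    # Assumption: a fixed plan in place of a real model call.
    return (
        "say(I will bring you water)\n"
        "go_to(bottle)\n"
        "grasp(bottle)\n"
        "go_to(person)\n"
        "release(bottle)"
    )


if __name__ == "__main__":
    executed: List[Action] = []
    run_task("I am thirsty", None, fake_vlm, fake_llm, executed.append)
    for step in executed:
        print(step.name, step.argument)
```

Passing the models and the controller in as plain callables keeps the loop testable without hardware, and it reflects the point that the request itself ("I am thirsty") contains no direct instruction: the plan has to be inferred from the request combined with the scene description.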