Abstract: We present a framework for robots to learn novel visual concepts and tasks via in-situ linguistic interactions with human users. Previous approaches have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchies by enabling them to learn novel concepts and solve unseen robotics tasks with them. To enable a visual concept learner to solve robotics tasks one-shot, we develop two distinct techniques. Firstly, we propose a novel approach, Hi-Viscont (HIerarchical VISual CONcept learner for Task), which adds information about a novel concept to its parent nodes within a concept hierarchy. This information propagation allows all concepts in a hierarchy to update as novel concepts are taught in a continual learning setting. Secondly, we represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ. We present two sets of results. Firstly, we compare Hi-Viscont with the baseline model (FALCON) on visual question answering (VQA) in three domains. While comparable to the baseline on leaf-level concepts, Hi-Viscont achieves an average improvement of over 9% on non-leaf concepts. Secondly, we compare our framework against the FALCON baseline, where it achieves a 33% improvement in the success rate metric and a 19% improvement in object-level accuracy. With both of these results, we demonstrate the ability of our model to learn tasks and concepts in a continual learning setting on the robot.

Interactive Visual Task Learning for Robots
