Learning Visual Affordance Grounding from Demonstration Videos

Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao
DOI: https://doi.org/10.1109/tnnls.2023.3298638
IF: 14.255
2023-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract: Visual affordance grounding aims to segment all possible interaction regions between people and objects from an image/video, which benefits many applications, such as robot grasping and action recognition. Prevailing methods depend predominantly on the appearance features of objects to segment each region of the image, which raises two problems: 1) an object contains multiple possible regions that people interact with and 2) the same object region admits multiple possible human interactions. To address these problems, we propose a hand-aided affordance grounding network (HAG-Net) that leverages the auxiliary cues provided by the position and action of the hand in demonstration videos to eliminate these ambiguities and better locate the interaction regions on the object. Specifically, HAG-Net adopts a dual-branch structure to process the demonstration video and the object image. For the video branch, we introduce hand-aided attention to enhance the region around the hand in each video frame and then use a long short-term memory (LSTM) network to aggregate the action features. For the object branch, we introduce a semantic enhancement module (SEM) that makes the network focus on different parts of the object according to the action class, and we use a distillation loss to align the output features of the object branch with those of the video branch, transferring knowledge from the video branch to the object branch. Quantitative and qualitative evaluations on two challenging datasets show that our method achieves state-of-the-art results for affordance grounding. The source code is available at: https://github.com/lhc1224/HAG-Net.
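The abstract names two mechanisms without giving formulas: attention that enhances the region around the hand, and a distillation loss that aligns object-branch features with video-branch features. The following is a minimal NumPy sketch of that idea under stated assumptions, not the authors' implementation: the Gaussian mask, the residual re-weighting, the `sigma` parameter, the mean-squared-error form of the loss, and all function names are hypothetical, and temporal mean pooling stands in for the LSTM aggregation. The released code at the repository above is the reference for the actual method.

```python
import numpy as np


def hand_attention_mask(height, width, hand_box, sigma=2.0):
    """Gaussian mask peaked at the centre of a hand bounding box.

    hand_box is (x_min, y_min, x_max, y_max) in feature-map coordinates.
    The Gaussian form and sigma are assumptions made for illustration.
    """
    x_min, y_min, x_max, y_max = hand_box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    ys, xs = np.mgrid[0:height, 0:width]
    dist_sq = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))


def enhance_frame(features, hand_box, sigma=2.0):
    """Residually re-weight a (C, H, W) feature map around the hand."""
    _, height, width = features.shape
    mask = hand_attention_mask(height, width, hand_box, sigma)
    return features * (1.0 + mask[None, :, :])


def distillation_loss(object_feat, video_feat):
    """Mean squared error between the two branches' feature vectors."""
    return float(np.mean((object_feat - video_feat) ** 2))


# Toy data: 4 frames, 8 channels, 16x16 feature maps, random values.
rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 8, 16, 16))
hand_boxes = [(4, 4, 8, 8), (5, 4, 9, 8), (6, 5, 10, 9), (6, 6, 10, 10)]

enhanced = np.stack([enhance_frame(f, b) for f, b in zip(frames, hand_boxes)])
# Temporal mean pooling is a stand-in for the LSTM aggregation.
video_feat = enhanced.mean(axis=(0, 2, 3))   # shape (8,)
object_feat = rng.standard_normal(8)         # placeholder object-branch output
loss = distillation_loss(object_feat, video_feat)
```

In this sketch the loss is zero only when the two feature vectors coincide, which is the sense in which minimizing it pulls the object branch toward the video branch.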