Hyper-Transformer for Amodal Completion

Jianxiong Gao, Xuelin Qian, Longfei Liang, Junwei Han, Yanwei Fu
2024-05-30
Abstract: Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information. Learning shape priors is crucial for effective amodal completion, but traditional methods often rely on two-stage processes or additional information, leading to inefficiencies and potential error accumulation. To address these shortcomings, we introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN). This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks. Specifically, H-TAN uses a dual-branch structure to extract multi-scale features from both images and masks. The multi-scale features from the image branch guide the hyper transformer in learning shape priors and in generating the weights for dynamic convolution tailored to each instance. The dynamic convolution head then uses the features from the mask branch to predict precise amodal masks. We extensively evaluate our model on three benchmark datasets: KINS, COCOA-cls, and D2SA, where H-TAN demonstrated superior performance compared to existing methods. Additional experiments validate the effectiveness and stability of the novel hyper transformer in our framework.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the problem of **Amodal Object Completion**. Specifically, amodal object completion is a complex task in computer vision that aims to predict the invisible parts of objects when they are partially occluded. Traditional methods often rely on a two-stage process or additional information, leading to inefficiencies and potential error accumulation. To overcome these issues, the authors propose a new framework called the **Hyper-Transformer Amodal Network (H-TAN)**.

### Main Problems and Challenges

1. **Shape Prior Learning**: Effective amodal completion requires learning the shape priors of objects, but traditional methods often rely on a two-stage process or additional information, leading to inefficiencies and error accumulation.
2. **Multi-Scale Feature Extraction**: It is necessary to extract multi-scale features from images and masks to accurately predict amodal masks.
3. **Dynamic Convolution Head**: A dynamic convolution head is needed that can generate weights according to the specific needs of each instance to improve the accuracy of amodal mask prediction.

### Solutions

1. **Dual-Branch Structure**: H-TAN adopts a dual-branch structure to extract multi-scale features from images and masks respectively. The image branch uses an adapted ResNet to extract multi-scale features, while the mask branch combines these features with mask details through skip connections, gradually refining the feature maps of the amodal mask.
2. **Hyper-Transformer**: A hyper-transformer is introduced to generate the weights of the dynamic convolution head using image features. The hyper-transformer processes features through cross-attention and self-attention mechanisms, ultimately generating weights for the dynamic convolution head.
3. **Dynamic Convolution Head**: The dynamic convolution head uses the feature maps extracted from the mask branch, combined with the weights generated by the hyper-transformer, to accurately predict the amodal mask.
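
The interaction between the hyper-transformer and the dynamic convolution head can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention query, a single attention head, no learned projections, and a 1x1 dynamic convolution whose `C` kernel weights and one bias are read directly from the attended vector. The function names (`cross_attention`, `dynamic_conv1x1`, `predict_amodal_mask`) and the parameter layout are hypothetical, chosen only to show how per-instance weights generated from image features might be applied to mask-branch features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        outputs.append([sum(a * v[j] for a, v in zip(attn, values)) for j in range(dim_v)])
    return outputs


def dynamic_conv1x1(features, kernel, bias):
    """Per-instance 1x1 convolution: features is C x H x W, kernel holds C weights."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [bias + sum(kernel[c] * features[c][y][x] for c in range(channels)) for x in range(width)]
        for y in range(height)
    ]


def predict_amodal_mask(image_tokens, query, mask_features):
    """Generate conv parameters from image tokens, then apply them to mask-branch features."""
    channels = len(mask_features)
    # Assumed layout (hypothetical): the attended vector is C kernel weights followed by 1 bias.
    params = cross_attention([query], image_tokens, image_tokens)[0]
    kernel, bias = params[:channels], params[channels]
    logits = dynamic_conv1x1(mask_features, kernel, bias)
    # Sigmoid turns each logit into a per-pixel mask probability.
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]


if __name__ == "__main__":
    random.seed(0)
    C, H, W, N = 4, 3, 3, 6  # toy sizes, not taken from the paper
    image_tokens = [[random.uniform(-1, 1) for _ in range(C + 1)] for _ in range(N)]
    query = [random.uniform(-1, 1) for _ in range(C + 1)]
    mask_features = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
    mask = predict_amodal_mask(image_tokens, query, mask_features)
    print(len(mask), len(mask[0]))
```

The point of the sketch is that the convolution kernel is an output of attention over image features rather than a fixed learned parameter, so each instance gets its own kernel. A real model would presumably add learned projections, multiple layers, self-attention, and multi-scale inputs, none of which are modeled here.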
### Experimental Results

The authors conducted extensive experiments on three benchmark datasets (KINS, COCOA-cls, and D2SA). The results show that H-TAN outperforms existing methods on all three datasets, especially in handling occluded parts. Ablation studies also validated the effectiveness of each component, particularly the contributions of the hyper-transformer and the multi-scale fusion module.

### Contributions

1. Proposed H-TAN, a new amodal segmentation framework that combines a hyper-transformer with a dynamic convolution head to learn object shape priors and predict amodal masks.
2. Adopted a dual-branch structure for feature extraction, with the image branch using an adapted ResNet to extract multi-scale features and the mask branch gradually refining the feature maps of the amodal mask through skip connections.
3. Demonstrated experimentally that H-TAN achieves new state-of-the-art results on amodal completion.

In summary, the paper addresses key issues in amodal object completion by proposing the H-TAN framework, which improves the accuracy and efficiency of amodal mask prediction.