Abstract: Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information. Learning shape priors is crucial for effective amodal completion, but traditional methods often rely on two-stage processes or additional information, leading to inefficiencies and potential error accumulation. To address these shortcomings, we introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN). This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks. Specifically, H-TAN uses a dual-branch structure to extract multi-scale features from both images and masks. The multi-scale features from the image branch guide the hyper transformer in learning shape priors and in generating the weights for dynamic convolution tailored to each instance. The dynamic convolution head then uses the features from the mask branch to predict precise amodal masks. We extensively evaluate our model on three benchmark datasets: KINS, COCOA-cls, and D2SA, where H-TAN demonstrated superior performance compared to existing methods. Additional experiments validate the effectiveness and stability of the novel hyper transformer in our framework.

What problem does this paper attempt to address?

The paper attempts to address the problem of **Amodal Object Completion**. Specifically, amodal object completion is a complex task in computer vision that aims to predict the invisible parts of objects when they are partially occluded. Traditional methods often rely on a two-stage process or additional information, leading to inefficiencies and potential error accumulation. To overcome these issues, the authors propose a new framework called the **Hyper-Transformer Amodal Network (H-TAN)**.

### Main Problems and Challenges

1. **Shape Prior Learning**: Effective amodal completion requires learning the shape priors of objects, but traditional methods often rely on a two-stage process or additional information, leading to inefficiencies and error accumulation.
2. **Multi-Scale Feature Extraction**: It is necessary to extract multi-scale features from images and masks to accurately predict amodal masks.
3. **Dynamic Convolution Head**: A dynamic convolution head is needed that can generate weights according to the specific needs of each instance to improve the accuracy of amodal mask prediction.

### Solutions

1. **Dual-Branch Structure**: H-TAN adopts a dual-branch structure to extract multi-scale features from images and masks respectively. The image branch uses an adapted ResNet to extract multi-scale features, while the mask branch combines these features with mask details through skip connections, gradually refining the feature maps of the amodal mask.
2. **Hyper-Transformer**: A hyper-transformer is introduced to generate the weights of the dynamic convolution head using image features. The hyper-transformer processes features through cross-attention and self-attention mechanisms, ultimately generating weights for the dynamic convolution head.
3. **Dynamic Convolution Head**: The dynamic convolution head uses the feature maps extracted from the mask branch, combined with the weights generated by the hyper-transformer, to accurately predict the amodal mask.
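
The interaction between the hyper-transformer and the dynamic convolution head can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single learned query, one cross-attention step (the self-attention layers are omitted), and a single 1x1 dynamic convolution. All function names, parameter names, and shapes here are hypothetical and chosen only to show how image features can produce per-instance weights that are then applied to mask-branch features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def generate_head_params(image_tokens, query, w_k, w_v, w_out):
    """Cross-attend one query over image tokens, then project to head parameters.

    image_tokens: (N, D) flattened image-branch features for one instance
    query:        (D,)   learned query vector (assumed, for illustration)
    w_k, w_v:     (D, D) key and value projections
    w_out:        (D, C + 1) maps the attended context to C weights plus a bias
    """
    keys = image_tokens @ w_k                          # (N, D)
    values = image_tokens @ w_v                        # (N, D)
    scores = keys @ query / np.sqrt(query.shape[0])    # (N,)
    attn = softmax(scores)                             # attention over tokens
    context = attn @ values                            # (D,)
    params = context @ w_out                           # (C + 1,)
    return params[:-1], params[-1]


def dynamic_conv_head(mask_features, weight, bias):
    """Apply a 1x1 convolution whose weights are specific to this instance.

    mask_features: (C, H, W) features from the mask branch
    weight:        (C,)      generated per instance
    """
    logits = np.tensordot(weight, mask_features, axes=([0], [0])) + bias  # (H, W)
    return 1.0 / (1.0 + np.exp(-logits))               # per-pixel probability


def predict_amodal_mask(image_tokens, mask_features, params, threshold=0.5):
    """Generate head weights from image features, then apply them to mask features."""
    weight, bias = generate_head_params(image_tokens, **params)
    prob = dynamic_conv_head(mask_features, weight, bias)
    return prob, prob >= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, C, N, H, W = 8, 4, 16, 5, 6
    params = {
        "query": rng.normal(size=D),
        "w_k": rng.normal(size=(D, D)),
        "w_v": rng.normal(size=(D, D)),
        "w_out": rng.normal(size=(D, C + 1)),
    }
    tokens = rng.normal(size=(N, D))
    feats = rng.normal(size=(C, H, W))
    prob, mask = predict_amodal_mask(tokens, feats, params)
    print(prob.shape, mask.dtype)
```

The point of the design is that the convolution weights are an output of the network rather than fixed parameters, so two instances with different image features are segmented by two different heads even though the mask-branch features pass through the same code path.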
### Experimental Results

The authors conducted extensive experiments on three benchmark datasets (KINS, COCOA-cls, and D2SA), and the results show that H-TAN outperforms existing methods on all datasets, especially in handling occluded parts. Additionally, ablation studies validated the effectiveness of each component, particularly the contributions of the hyper-transformer and multi-scale fusion module to performance.

### Contributions

1. Proposed H-TAN, a new amodal segmentation framework that combines an innovative hyper-transformer and dynamic convolution head for learning object shape priors and predicting amodal masks.
2. Adopted a dual-branch structure for feature extraction, with the image branch using an adapted ResNet to extract multi-scale features and the mask branch gradually refining the feature maps of the amodal mask through skip connections.
3. Experimental results demonstrate that H-TAN achieves a new state-of-the-art level in the task of amodal completion.

In summary, this paper effectively addresses key issues in amodal object completion by proposing the H-TAN framework, improving the accuracy and efficiency of amodal mask prediction.

Hyper-Transformer for Amodal Completion

DMAT: A Dynamic Mask-Aware Transformer for Human De-occlusion

ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation

HA-Transformer: Harmonious aggregation from local to global for object detection

Amodal Ground Truth and Completion in the Wild

AMANet: Adaptive Multi-Path Aggregation for Learning Human 2D-3D Correspondences

Image Amodal Completion: A Survey

Cmf-transformer: cross-modal fusion transformer for human action recognition

Meta-Transformer: A Unified Framework for Multimodal Learning

Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

Open-World Amodal Appearance Completion

Adaptive Masked Autoencoder Transformer for Image Classification

Soft Masked Transformer for Point Cloud Processing with Skip Attention-Based Upsampling

Attention-Guided Contrastive Masked Image Modeling for Transformer-Based Self-Supervised Learning

Hierarchical Interactive Multimodal Transformer for Aspect-Based Multimodal Sentiment Analysis

Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion

MxT: Mamba x Transformer for Image Inpainting

High-Fidelity and Efficient Pluralistic Image Completion with Transformers

MCANet: Hierarchical cross-fusion lightweight transformer based on multi-ConvHead attention for object detection

Coarse-to-Fine Amodal Segmentation with Shape Prior

Structure-Aware Cross-Modal Transformer for Depth Completion