Tube-Embedded Transformer for Pixel Prediction

Xiaoya Zhang, Shumin Zhang, Zhen Cui, Zechao Li, Jin Xie, Jian Yang
DOI: https://doi.org/10.1109/TMM.2022.3147664
IF: 7.3
2023-01-01
IEEE Transactions on Multimedia
Abstract: Multi-task pixel-level learning, which aims to exploit inter-task interactions to improve the learning of each task, is an important but challenging problem in visual perception and multimedia applications. Taking both inter-task correlation and intra-task specificity into account, we propose a tube-embedded transformer (TET) framework for robust multi-task pixel prediction. To facilitate inter-task interactions, we aggregate and project all tasks into a shared tube pool, which generates the latent multi-task representation during the coarse-to-fine decoding stages. The resulting task-tube interactions replace pairwise task-task interactions and significantly reduce model complexity. In addition, we introduce a transformer mechanism to adaptively transfer tube features to the target task. Concretely, multi-task features are first aggregated in the tube to generate shared feature representation bases; then, based on the task-tube association and complementarity, the tube outputs the query entry and the weighting coefficients for the target task. On the joint learning of semantic segmentation, depth estimation, and surface normal estimation, comparison experiments show that TET outperforms other state-of-the-art multi-task learning approaches, and ablation experiments verify the effectiveness of the TET mechanism.
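The task-tube idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that aggregation is a plain per-position average over tasks (the learned projection is omitted), and that the transfer step is generic scaled dot-product cross-attention with a residual connection, which may differ from the paper's query and weighting scheme. The function names `aggregate_tube` and `transfer_from_tube` and the toy feature values are hypothetical.

```python
import math

def aggregate_tube(task_features):
    """Assumed aggregation: average per-position features over all tasks."""
    tasks = list(task_features.values())
    n_pos, dim = len(tasks[0]), len(tasks[0][0])
    return [[sum(t[p][d] for t in tasks) / len(tasks) for d in range(dim)]
            for p in range(n_pos)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def transfer_from_tube(target, tube):
    """Assumed transfer: target positions attend over tube positions,
    and the attended tube feature is added back to the target feature."""
    dim = len(tube[0])
    out = []
    for q in target:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in tube]
        weights = softmax(scores)
        attended = [sum(w * v[d] for w, v in zip(weights, tube))
                    for d in range(dim)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out

# Toy features: 3 tasks, 2 spatial positions, 2 channels each (made-up values).
features = {
    "segmentation": [[1.0, 0.0], [0.0, 1.0]],
    "depth": [[0.5, 0.5], [1.0, 1.0]],
    "normals": [[0.0, 1.0], [1.0, 0.0]],
}
tube = aggregate_tube(features)
refined = {name: transfer_from_tube(feats, tube)
           for name, feats in features.items()}
```

In this sketch each task interacts only with the shared tube, so every task needs a single transfer step rather than a separate interaction with each of the other tasks.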