Enhancing Unimodal Features Matters: A Multimodal Framework for Building Extraction.

Xiaofeng Shi, Junyu Gao, Yuan
DOI: https://doi.org/10.1109/tgrs.2024.3392631
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: In recent years, deep learning and multimodal data have substantially advanced the development of building extraction models. However, prevailing multimodal methods struggle with two challenges: 1) modal laziness: the training error is minimized before the model has learned extensive unimodal patterns; and 2) modal imbalance: the backpropagation process is easily dominated by a single modality. As a result, unimodal feature learning is insufficient, which limits model performance on the intricate foreground and background contexts surrounding buildings. In this article, we address this problem from the perspectives of algorithm and model evaluation. At the algorithmic level, we propose a unimodal feature enhancement (UFE) framework. Specifically, UFE is model-agnostic and comprises two distinct components: adaptive gradient enhancement (AGE) for modal laziness and consistency constraint loss (CCL) for modal imbalance. AGE dynamically modulates the original gradient by monitoring the representation effects of unimodal features and multimodal fusion features. CCL imposes mutual constraints on the different modal branches at the semantic level to reconcile the optimization process. At the model evaluation level, a new metric, named unimodal utilization ratio (UUR), is presented to assess models through the learning efficacy of unimodal features. Experimental results on two building extraction datasets, including variants of UUR, demonstrate a substantial performance improvement from UFE. Moreover, UFE proves adaptable when integrated with various model components and generalizes to other multimodal image-related tasks.
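The abstract does not state the form of CCL. As a purely illustrative sketch of what a mutual constraint between two modal branches could look like, the following assumes a symmetric KL divergence between the per-pixel class distributions predicted by each branch. The function names, the input layout (one logit vector per pixel), and the choice of symmetric KL are all assumptions made here for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_a, logits_b):
    """Mean symmetric KL between the predictions of two modal branches.

    logits_a, logits_b: lists of per-pixel logit vectors, one list per
    branch (hypothetical layout; an assumption, not the paper's CCL).
    """
    if len(logits_a) != len(logits_b) or not logits_a:
        raise ValueError("branches must provide the same non-zero number of pixels")
    total = 0.0
    for la, lb in zip(logits_a, logits_b):
        pa, pb = softmax(la), softmax(lb)
        # Penalize disagreement in both directions so neither branch dominates.
        total += 0.5 * (kl_divergence(pa, pb) + kl_divergence(pb, pa))
    return total / len(logits_a)
```

Under these assumptions the loss is zero when both branches predict identical distributions and grows as they disagree, so adding it to the task loss would pull the branches toward semantically consistent outputs.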