Abstract: In the change detection (CD) task, the UNet architecture has achieved strong results. However, because convolution operations have an inherently local receptive field, UNet is inadequate at learning global context and long-range spatial relations. Transformers can capture long-range feature dependencies, but their lack of low-level detail may limit localization. This article therefore proposes TransUNetCD, an end-to-end encoder-decoder hybrid transformer model for CD that combines the advantages of transformers and UNet. The encoder takes tokenized image patches from a convolutional neural network (CNN) feature map and extracts rich global context information. The decoder upsamples the encoded features, connects them with higher-resolution multiscale features through skip connections to learn local-global semantic features, and restores the full spatial resolution of the feature map to achieve precise localization. The proposed model addresses two problems of the UNet framework: redundant information is generated when extracting low-level features, and the relationships between feature layers cannot be fully modeled, so the optimal feature difference representation cannot be obtained. On this basis, we introduce a difference enhancement module that generates a difference feature map containing rich change information. Weighting each pixel and selectively aggregating features improves both the effectiveness of the network and the accuracy of the extracted change features. Results on multiple datasets demonstrate that, compared to state-of-the-art methods, TransUNetCD further reduces false alarms and missed alarms and delineates the edges of changed areas more accurately. The model scores higher than the other baseline models on every metric and shows robust generalization ability.
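
The two ideas named in the abstract, tokenizing a CNN feature map into patches and weighting a bi-temporal difference map per pixel, can be illustrated with a minimal NumPy sketch. This is not the TransUNetCD implementation: the abstract does not specify the patch size, the weighting function, or any layer details, so the function names `tokenize_feature_map` and `difference_enhance`, the patch size of 2, and the sigmoid-of-channel-mean weighting are all assumptions chosen only for illustration.

```python
import numpy as np


def tokenize_feature_map(feat, patch=2):
    """Split a (C, H, W) feature map into flattened, non-overlapping patches.

    Hypothetical helper: returns an (N, C * patch * patch) array of tokens,
    one row per patch, in row-major patch order.
    """
    c, h, w = feat.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    # (C, H/p, p, W/p, p) -> (H/p, W/p, C, p, p) -> (N, C*p*p)
    x = feat.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, c * patch * patch)


def difference_enhance(feat_t1, feat_t2):
    """Per-pixel weighting of the absolute difference of two (C, H, W) maps.

    Assumed stand-in for a difference enhancement step: the weight is a
    sigmoid of the channel-averaged difference, so it lies in [0.5, 1) and
    grows with the magnitude of change at each pixel.
    """
    diff = np.abs(feat_t1 - feat_t2)              # (C, H, W)
    score = diff.mean(axis=0, keepdims=True)      # (1, H, W)
    weight = 1.0 / (1.0 + np.exp(-score))         # sigmoid, broadcast over C
    return diff * weight
```

In a real model the tokens would be fed to a transformer encoder and the weights would be learned rather than computed by a fixed formula; the sketch only shows the data flow and tensor shapes under the stated assumptions.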

TransMarker: A Pure Vision Transformer for Facial Landmark Detection

Transformer Union Convolution Network for Visual Object Tracking

TransUNetCD: A Hybrid Transformer Network for Change Detection in Optical Remote-Sensing Images

End-to-End Spatial Transform Face Detection and Recognition

Lantra: Taming Transformers for Robust Facial Landmark Detection

Cascaded Dual Vision Transformer for Accurate Facial Landmark Detection

Hybrid Token Transformer for Deep Face Recognition

Lightweight facial landmark detection network based on improved MobileViT

Towards Accurate Facial Landmark Detection via Cascaded Transformers

1DFormer: a Transformer Architecture Learning 1D Landmark Representations for Facial Landmark Tracking

TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network

Unifying Global-Local Representations in Salient Object Detection with Transformer

Precise Facial Landmark Detection by Reference Heatmap Transformer

Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

IFTSDNet: An Interact-Feature Transformer Network With Spatial Detail Enhancement Module for Change Detection

HA-Transformer: Harmonious aggregation from local to global for object detection

Conformer: Local Features Coupling Global Representations for Visual Recognition

WaterFormer: A Global–Local Transformer for Underwater Image Enhancement With Environment Adaptor

Supervised Transformer Network for Efficient Face Detection

3-D Facial Landmarks Detection for Intelligent Video Systems