LanTra: Taming Transformers for Robust Facial Landmark Detection

Wenyan Wu, Yici Cai, Qiang Zhou
DOI: https://doi.org/10.2139/ssrn.4121072
2022-01-01
SSRN Electronic Journal
Abstract: We present Landmark Transformer (LanTra), a method for facial landmark detection that uses vision transformers as the backbone in place of convolutional networks. Specifically, we deploy a symmetric encoder-decoder architecture, with pure transformer networks as the encoder and convolutional networks as the decoder. In the Transformer-Encoder, the input image is split into patches, which are treated as tokens and fed through several stages of multi-head self-attention modules. In the Convolution-Decoder, the output representation of each transformer-encoder stage is fused into the corresponding decoder stage by the proposed Structural-Aware Attention module, which is designed for feature dimension adaptation and facial structure attention. Owing to the global receptive field and unchanging feature resolution of the proposed transformer-based encoder, we can extract rich context information while keeping local information intact, which is critical for the robustness of the predictions. Extensive experiments demonstrate state-of-the-art results on several popular academic datasets, i.e., WFLW, COFW and 300W, showing the strong potential of vision transformers for facial landmark detection.
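The patch-splitting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no patch size, image size, or embedding details, so the function name `split_into_patches`, the 2x2 patch size, and the toy 4x4 grayscale image are all assumptions made for illustration. In a real vision transformer each flattened patch would additionally pass through a learned linear projection before reaching the self-attention stages.

```python
from typing import List

# Hypothetical type alias: an H x W grayscale image as nested lists.
Image = List[List[float]]


def split_into_patches(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles and
    flatten each tile (row-major) into one token vector."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    # Walk the tiles left to right, top to bottom.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# Toy 4x4 image with pixel values 0..15 (assumed data, not from the paper).
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = split_into_patches(image, patch=2)
print(len(tokens), len(tokens[0]))
```

Each of the four resulting tokens holds the four pixel values of one 2x2 tile, so the token count depends only on the image and patch sizes, not on the image content.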