TransMarker: A Pure Vision Transformer for Facial Landmark Detection.

Wenyan Wu, Yici Cai, Qiang Zhou
DOI: https://doi.org/10.1109/icpr56361.2022.9956248
2022-01-01
Abstract: In recent years, Convolutional Neural Networks (CNNs) have achieved impressive results in the facial landmark detection task. In particular, the u-shaped architecture, also known as U-net, has become the de-facto standard and achieved tremendous success. However, due to the locality of the convolution operation, it is limited in modeling global and long-range semantic interactions, which are essential in localization tasks. In this work, we propose TransMarker, a Unet-like pure Transformer method that gives a new perspective on facial landmark detection by tackling it in a sequence-to-sequence manner. We first split the input image into non-overlapping patches, which are treated like tokens in NLP tasks. Then, we feed the image patches into a symmetric u-shaped Encoder-Decoder architecture for local-global semantic feature learning. In addition, we introduce a Dense Skip-Connection scheme to leverage multi-level information across different resolutions. Note that, unlike the conventional U-net architecture, we design the network with pure Transformer blocks, without any convolution operations. Extensive experiments demonstrate the state-of-the-art performance of our method on several standard datasets, i.e., WFLW, COFW and 300W, where it remarkably outperforms previous convolution-based methods.
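The patch-splitting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name split_into_patches, the 8x8 toy image, and the patch size of 4 are all assumptions made for illustration, and the abstract does not state the patch size or how patches are embedded. The sketch only shows how an image might be cut into non-overlapping patches and flattened into a token sequence.

```python
import numpy as np


def split_into_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping, flattened patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per token.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Toy input (assumed sizes): an 8x8 RGB image split into 4x4 patches.
image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
tokens = split_into_patches(image, patch_size=4)
print(tokens.shape)  # (4, 48): four tokens, each a flattened 4x4x3 patch
```

In a full model, each flattened patch would typically be passed through a learned linear projection before entering the Transformer blocks; that step is omitted here because the abstract does not describe it.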