Abstract: A flexible 3D landmark detector that is trained using only 2D ground truth and contains a spatial transformer network for built‐in face detection. In this paper, we examine three important issues in the practical use of state‐of‐the‐art facial landmark detectors and show how a combination of specific architectural modifications can directly improve their accuracy and temporal stability. First, many facial landmark detectors require a face normalization step as a pre‐processing stage, often accomplished by a separately trained neural network that crops and resizes the face in the input image. There is no guarantee that this pre‐trained network performs optimal face normalization for the task of landmark detection. Thus, we instead analyse the use of a spatial transformer network that is trained alongside the landmark detector in an unsupervised manner, so that a single neural network jointly learns an optimal face normalization and landmark detection. Second, we show that modifying the output head of the landmark predictor to infer landmarks in a canonical 3D space rather than directly in 2D can further improve accuracy. To convert the predicted 3D landmarks into screen‐space, we additionally predict the camera intrinsics and head pose from the input image. As a side benefit, this allows us to predict the 3D face shape from a given image using only 2D landmarks as supervision, which is useful for determining landmark visibility, among other things. Third, when training a landmark detector on multiple datasets at the same time, annotation inconsistencies across datasets force the network to produce a sub‐optimal average. We propose to add a semantic correction network to address this issue. This additional lightweight neural network is trained alongside the landmark detector, without requiring any additional supervision.
While the insights of this paper can be applied to most common landmark detectors, we specifically target a recently proposed continuous 2D landmark detector to demonstrate how each of our additions leads to meaningful improvements over the state‐of‐the‐art on standard benchmarks.
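
To make the second point of the abstract more concrete, the following is a minimal sketch of how landmarks predicted in a canonical 3D space could be mapped to screen space and compared against 2D ground truth. It is not the authors' implementation: the abstract does not specify how head pose or camera intrinsics are parameterized, so the Euler-angle rotation, the single focal length, the pinhole model, and all function names (`rotation_from_euler`, `project_landmarks`, `mean_l2_error`) are assumptions chosen for illustration.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_euler(yaw, pitch, roll):
    """Hypothetical head-pose parameterization: R = Rz(roll) @ Ry(yaw) @ Rx(pitch).

    Angles are in radians. The paper may use a different representation.
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rz, matmul3(ry, rx))


def project_landmarks(points_3d, rotation, translation, focal, principal_point):
    """Project canonical 3D landmarks to 2D with an assumed pinhole camera.

    Each point is rotated and translated into camera space (head pose),
    then divided by depth and scaled by the focal length (intrinsics).
    """
    cx, cy = principal_point
    projected = []
    for p in points_3d:
        x, y, z = (sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
                   for i in range(3))
        if z <= 0.0:
            raise ValueError("landmark lies behind the camera")
        projected.append((focal * x / z + cx, focal * y / z + cy))
    return projected


def mean_l2_error(predicted_2d, target_2d):
    """Mean Euclidean distance between projected and annotated 2D landmarks."""
    distances = [math.hypot(px - tx, py - ty)
                 for (px, py), (tx, ty) in zip(predicted_2d, target_2d)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
    pose = rotation_from_euler(0.0, 0.0, 0.0)
    screen = project_landmarks(canonical, pose, (0.0, 0.0, 2.0), 100.0, (50.0, 50.0))
    print(screen)
    print(mean_l2_error(screen, [(50.0, 50.0), (100.0, 50.0), (50.0, 0.0)]))
```

Because the error is measured only after projection, a training signal of this kind needs nothing beyond 2D annotations, while the intermediate 3D points remain available for tasks such as visibility estimation. In an actual training setup the same computation would be written in a differentiable framework so that gradients reach the 3D landmarks, pose, and intrinsics.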

Lantra: Taming Transformers for Robust Facial Landmark Detection

TransMarker: A Pure Vision Transformer for Facial Landmark Detection

Cascaded Dual Vision Transformer for Accurate Facial Landmark Detection

Real-Time Facial Landmark Detection by Attention-driven Lightweight Network

Towards Accurate Facial Landmark Detection via Cascaded Transformers

1DFormer: a Transformer Architecture Learning 1D Landmark Representations for Facial Landmark Tracking

DATR: Domain-adaptive transformer for multi-domain landmark detection

Precise Facial Landmark Detection by Reference Heatmap Transformer

Landmark Detection using Transformer Toward Robot-assisted Nasal Airway Intubation

Enhancing Landmark Detection in Cluttered Real-World Scenarios with Vision Transformers

Wavelet Tree Transformer: Multihead Attention With Frequency-Selective Representation and Interaction for Remote Sensing Object Detection

TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network

Unconstrained Fashion Landmark Detection via Hierarchical Recurrent Transformer Networks

Lightweight facial landmark detection network based on improved MobileViT

Case report: adverse granulomatous reaction (Granuloma formation) and pseudomonas superinfection after lip augmentation by the new filler DermaLive®

Unifying Global-Local Representations in Salient Object Detection with Transformer

Infinite 3D Landmarks: Improving Continuous 2D Facial Landmark Detection

Face Transformer for Recognition

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

3-D Facial Landmarks Detection for Intelligent Video Systems

Facial Expression Recognition Based on Fine-Tuned Channel–Spatial Attention Transformer