Abstract: Although facial landmark detection (FLD) has made significant progress, existing FLD methods still suffer from performance drops on partially non-visible faces, such as faces with occlusions or under extreme lighting conditions or poses. To address this issue, we introduce ORFormer, a novel transformer-based method that can detect non-visible regions and recover their missing features from visible parts. Specifically, ORFormer associates each image patch token with one additional learnable token called the messenger token. The messenger token aggregates features from all patches except its own. This way, the consensus between a patch and the other patches can be assessed by the similarity between its regular and messenger embeddings, enabling non-visible region identification. Our method then recovers occluded patches with the features aggregated by the messenger tokens. Leveraging the recovered features, ORFormer compiles high-quality heatmaps for the downstream FLD task. Extensive experiments show that our method generates heatmaps resilient to partial occlusions. By integrating the resultant heatmaps into existing FLD methods, our method performs favorably against the state of the art on challenging datasets such as WFLW and COFW.

What problem does this paper attempt to address?

This paper addresses the performance degradation of facial landmark detection (FLD) on partially non-visible faces, such as faces under occlusion, extreme lighting conditions, or extreme head poses. Existing FLD methods lose accuracy in these scenarios because the features extracted from non-visible regions are corrupted.

To address this issue, the authors propose a transformer model named **ORFormer**, which identifies non-visible regions and recovers their missing features by introducing learnable "messenger tokens". The approach works as follows:

1. **Identifying non-visible regions**: Each image patch token is paired with an additional learnable token, the messenger token. The messenger token aggregates features from all patches except its own, so it captures what the rest of the face implies about that patch. Comparing a patch's regular embedding with its messenger embedding measures how consistent the patch is with the others; low similarity indicates a non-visible region.
2. **Recovering missing features**: For occluded patches, ORFormer replaces the corrupted features with the features aggregated by the messenger tokens. The recovered features are then used to generate high-quality heatmaps for the downstream FLD task.
3. **Improving robustness**: The resulting heatmaps remain reliable under partial occlusion, extreme lighting, or extreme head poses, and they can be integrated into existing FLD methods to improve overall performance.

### Main contributions

1. **A new occlusion-robust transformer**: ORFormer uses messenger tokens to simulate potential occlusions and recover missing features, enabling the transformer to detect and handle non-visible tokens in a general-purpose way.
2. **Application to robust heatmap generation**: The high-quality heatmaps generated by ORFormer provide supplementary information for existing FLD methods, enhancing their ability to handle partially non-visible faces.
3. **Strong performance on benchmark datasets**: Experiments show that FLD methods augmented with ORFormer's heatmaps perform favorably against state-of-the-art methods on challenging datasets such as WFLW and COFW.

### Summary

By introducing ORFormer, this paper addresses the performance degradation of existing FLD methods on partially non-visible faces and provides a robust solution.
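The messenger-token mechanism described above can be illustrated with a small single-head attention sketch in NumPy. This is not the authors' implementation: the function names, the weight matrices `w_q`, `w_k`, `w_v`, the use of cosine similarity as the consensus score, and the clipped convex blend used for recovery are all assumptions chosen for illustration. The only properties taken from the text are that each messenger token aggregates features from every patch except its own, and that the similarity between regular and messenger embeddings drives detection and recovery.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; entries set to -inf receive weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def messenger_attention(patches, messengers, w_q, w_k, w_v):
    """Single-head attention with one messenger token per patch.

    patches:    (N, D) patch token embeddings
    messengers: (N, D) learnable messenger tokens (hypothetical layout)
    Returns regular embeddings, messenger embeddings, and messenger weights.
    """
    d = w_q.shape[1]
    k = patches @ w_k
    v = patches @ w_v

    # Regular path: every patch attends to all patches, itself included.
    regular = softmax((patches @ w_q) @ k.T / np.sqrt(d)) @ v

    # Messenger path: queries come from the messenger tokens, and each one
    # is blocked from its own patch by masking the diagonal.
    scores = (messengers @ w_q) @ k.T / np.sqrt(d)
    np.fill_diagonal(scores, -np.inf)
    weights = softmax(scores)
    return regular, weights @ v, weights

def recover(regular, messenger, eps=1e-8):
    """Blend regular and messenger embeddings by their agreement.

    Assumption: cosine similarity is the consensus score, and it is clipped
    to [0, 1] to act as a per-patch trust weight.
    """
    num = (regular * messenger).sum(axis=-1)
    den = np.linalg.norm(regular, axis=-1) * np.linalg.norm(messenger, axis=-1) + eps
    similarity = num / den
    trust = np.clip(similarity, 0.0, 1.0)[:, None]
    # Low trust (likely non-visible patch) leans on the messenger features.
    recovered = trust * regular + (1.0 - trust) * messenger
    return recovered, similarity

# Toy usage with random data: 6 patches, 8-dimensional embeddings.
rng = np.random.default_rng(0)
num_patches, dim = 6, 8
patches = rng.normal(size=(num_patches, dim))
messengers = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

regular, messenger, weights = messenger_attention(patches, messengers, w_q, w_k, w_v)
recovered, similarity = recover(regular, messenger)
print(similarity.round(2))
```

Because the diagonal of the messenger attention is masked and the messenger queries do not depend on the patch content, corrupting one patch leaves that patch's messenger embedding unchanged in this sketch. That independence is what lets the messenger embedding serve as a reference for judging, and replacing, an occluded patch.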

ORFormer: Occlusion-Robust Transformer for Accurate Facial Landmark Detection
