Improving Facial Landmark Detection Accuracy and Efficiency with Knowledge Distillation

Zong-Wei Hong, Yu-Chen Lin
2024-04-09
Abstract: The domain of computer vision has experienced significant advancements in facial-landmark detection, which has become increasingly essential across applications such as augmented reality, facial recognition, and emotion analysis. Unlike object detection or semantic segmentation, which focus on identifying objects and outlining boundaries, facial-landmark detection aims to precisely locate and track critical facial features. However, deploying deep learning-based facial-landmark detection models on embedded systems with limited computational resources poses challenges due to the complexity of facial features, especially in dynamic settings. Additionally, ensuring robustness across diverse ethnicities and expressions presents further obstacles. Existing datasets often lack comprehensive representation of facial nuances, particularly within populations like those in Taiwan. This paper introduces a novel approach to address these challenges through the development of a knowledge distillation method. By transferring knowledge from larger models to smaller ones, we aim to create lightweight yet powerful deep learning models tailored specifically for facial-landmark detection tasks. Our goal is to design models capable of accurately locating facial landmarks under varying conditions, including diverse expressions, orientations, and lighting environments. The ultimate objective is to achieve high accuracy and real-time performance suitable for deployment on embedded systems. This method was successfully implemented and achieved 6th place out of 165 participants in the IEEE ICME 2024 PAIR competition.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main goal of this paper is to achieve efficient and accurate facial landmark detection on resource-constrained embedded systems. Specifically, the paper aims to address the following issues:

1. **Deploying complex deep learning models on embedded systems with limited computational resources**: Facial landmark detection requires high precision and real-time performance, but embedded systems have limited computational power, which makes it difficult to deploy large deep learning models directly.
2. **Robustness across diverse ethnicities and expressions**: Existing datasets often fail to cover the full range of variation across ethnicities and expressions, so models trained on them generalize poorly in practical applications.

To address these issues, the authors propose a knowledge distillation-based approach. By transferring knowledge from large models (such as Swin Transformer) to smaller models (such as MobileViT-v2), they develop a lightweight yet efficient deep learning model. This approach is intended to improve the small model's accuracy while keeping it fast enough for real-time use on embedded devices. According to the paper, the proposed MobileViT-v2 model performs well on the validation set and achieved 6th place out of 165 participants in the IEEE ICME 2024 PAIR competition.
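
The summary above does not state the exact loss the authors use, so the following is only a minimal sketch of how output-level knowledge distillation is commonly set up for landmark regression. It assumes the student is trained on a weighted sum of two terms: its error against the ground-truth landmarks and its error against the teacher's predictions. The function names `mean_l1` and `distillation_loss`, the L1 distance, the weight `alpha`, and the example coordinates are all illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) landmark in normalized image coordinates


def mean_l1(pred: List[Point], target: List[Point]) -> float:
    """Mean absolute coordinate error between two equal-length landmark lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("landmark lists must be non-empty and of equal length")
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / (2 * len(pred))  # two coordinates per landmark


def distillation_loss(student: List[Point],
                      teacher: List[Point],
                      ground_truth: List[Point],
                      alpha: float = 0.5) -> float:
    """Weighted sum of a supervised term and a teacher-imitation term.

    alpha = 1.0 ignores the teacher; alpha = 0.0 ignores the labels.
    The weighting scheme is an assumption made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    supervised = mean_l1(student, ground_truth)  # student vs. labels
    imitation = mean_l1(student, teacher)        # student vs. teacher output
    return alpha * supervised + (1.0 - alpha) * imitation


# Hypothetical two-landmark example (values invented for illustration).
ground_truth = [(0.30, 0.40), (0.70, 0.40)]
teacher = [(0.31, 0.41), (0.69, 0.40)]  # e.g. output of a large model
student = [(0.35, 0.45), (0.65, 0.38)]  # e.g. output of a small model

print(distillation_loss(student, teacher, ground_truth, alpha=0.5))
```

In a real training loop the same combination would be computed on tensors inside a deep learning framework so that gradients flow into the student only, with the teacher kept frozen. Other distillation variants (for example matching intermediate features or heatmaps) are also possible, and the source does not say which one the authors chose.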