Unsupervised NIR-VIS Face Recognition Via Homogeneous-to-Heterogeneous Learning and Residual-Invariant Enhancement
Yiming Yang,Weipeng Hu,Haifeng Hu
DOI: https://doi.org/10.1109/tifs.2023.3346176
IF: 7.231
2023-01-01
IEEE Transactions on Information Forensics and Security
Abstract: Near-Infrared and Visible light (NIR-VIS) face recognition methods have achieved remarkable success in the fields of security surveillance, criminal investigation, and multimedia information retrieval. However, existing methods rely heavily on carefully annotated labels, which makes manual labelling expensive and limits deployment flexibility. This motivates us to design unsupervised methods that address NIR-VIS recognition without relying on label information. To this end, we propose a novel homogeneous-to-HEterogeneous learning and Residual-invariant Enhancement (HERE) network for Unsupervised NIR-VIS Heterogeneous Face Recognition (NIR-VIS-UHFR). As the name suggests, the optimization of HERE follows a "homogeneous-to-heterogeneous learning" strategy to fully explore complementary and common semantic information across different modalities. During the homogeneous learning phase, Modality-Adversarial Contrastive Learning (MACL) combines modality contrastive learning with adversarial learning. On the one hand, MACL learns compact and discriminative intra-modal representations for NIR and VIS data, respectively. On the other hand, MACL ensures that NIR and VIS data conform to a common feature distribution in a shared feature space, effectively reducing modality differences even in the absence of identity information between modalities. In the heterogeneous learning phase, K-reciprocal-Encoding-based Cross-modal Labeling (KECL) is introduced as a robust pseudo-label estimator that fully explores cross-modal relationships and groups cross-modal features into clusters. With the pseudo labels provided by KECL, Refined cross-modal Contrastive Learning (RCL) is developed with modality-invariant averaging initialization and dynamic focus weighting strategies to extract modality-invariant features. Finally, Residual-invariant Representations Enhancement (RRE) mines partial features from cross-modal faces for robust matching.
Compared with supervised methods, our unsupervised HERE achieves comparable performance on multiple datasets while offering greater scalability and practicality in deployment by reducing data acquisition requirements and costs.
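The abstract does not spell out how KECL estimates pseudo labels, so the following is only a minimal sketch of the general idea behind k-reciprocal neighbours across modalities, not the authors' method. It assumes cosine similarity on toy feature vectors, and it swaps in a simple union-find grouping in place of the k-reciprocal encoding and clustering the paper refers to. All function names (`cosine`, `top_k`, `k_reciprocal_pairs`, `pseudo_labels`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, gallery, k):
    """Indices of the k gallery vectors most similar to the query."""
    ranked = sorted(range(len(gallery)),
                    key=lambda j: cosine(query, gallery[j]),
                    reverse=True)
    return set(ranked[:k])


def k_reciprocal_pairs(nir, vis, k):
    """(nir_index, vis_index) pairs that appear in each other's top-k lists."""
    nir_to_vis = [top_k(f, vis, k) for f in nir]
    vis_to_nir = [top_k(f, nir, k) for f in vis]
    return [(i, j)
            for i in range(len(nir))
            for j in nir_to_vis[i]
            if i in vis_to_nir[j]]


def pseudo_labels(nir, vis, k=1):
    """Group samples linked by k-reciprocal pairs (assumed simplification).

    Returns one list of cluster ids for NIR samples and one for VIS samples.
    """
    n = len(nir)
    # Union-find over NIR nodes 0..n-1 and VIS nodes n..n+len(vis)-1.
    parent = list(range(n + len(vis)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in k_reciprocal_pairs(nir, vis, k):
        parent[find(i)] = find(n + j)

    # Map each root to a compact cluster id in order of first appearance.
    roots = {}
    labels = [roots.setdefault(find(x), len(roots)) for x in range(len(parent))]
    return labels[:n], labels[n:]


# Toy features: two identities, one NIR and one VIS sample each.
nir_feats = [[1.0, 0.0], [0.0, 1.0]]
vis_feats = [[0.9, 0.1], [0.1, 0.9]]
nir_labels, vis_labels = pseudo_labels(nir_feats, vis_feats, k=1)
```

In this toy setting each NIR sample and its VIS counterpart rank each other first, so they receive the same cluster id. Requiring the neighbour relation to hold in both directions is what makes such pseudo labels more robust than a one-way nearest-neighbour match.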