Face Mask Removal with Region-attentive Face Inpainting

Minmin Yang
2024-09-11
Abstract: During the COVID-19 pandemic, face masks have become ubiquitous in our lives. Face masks can cause some face recognition models to fail since they cover a significant portion of a face. In addition, removing face masks from captured images or videos can be desirable, e.g., for better social interaction and for image/video editing and enhancement purposes. Hence, we propose a generative face inpainting method to effectively recover/reconstruct the masked part of a face. Face inpainting is more challenging than traditional inpainting, since it requires high fidelity while maintaining the identity at the same time. Our proposed method includes a Multi-scale Channel-Spatial Attention Module (M-CSAM) to mitigate the spatial information loss and learn the inter- and intra-channel correlation. In addition, we introduce an approach enforcing the supervised signal to focus on masked regions instead of the whole image. We also synthesize our own Masked-Faces dataset from the CelebA dataset by incorporating five different types of face masks, including surgical masks, regular masks and scarves, which also cover the neck area. The experimental results show that our proposed method outperforms different baselines in terms of structural similarity index measure, peak signal-to-noise ratio and ℓ1 loss, while also providing better outputs qualitatively. The code will be made publicly available. Code is available at GitHub.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two problems: the performance degradation of face recognition models caused by mask wearing during the COVID-19 pandemic, and the need to remove masks during image or video editing. Specifically, it proposes a generative face inpainting method that restores and reconstructs the parts of the face occluded by a mask.

### Problem Background

1. **Face Recognition Challenges**: Masks cover important areas of the face, causing many face recognition models to fail.
2. **Social Interaction and Image Editing Requirements**: Removing masks from images or videos can improve social interaction and meet the needs of image/video editing and enhancement.

### Solutions

To address these challenges, the paper proposes the following innovations:

1. **Multi-scale Channel-Spatial Attention Module (M-CSAM)**:
   - **Function**: Alleviates the loss of spatial information and learns the correlations between and within channels.
   - **Structure**: Combines a spatial pyramid structure with a channel-spatial attention mechanism, improving the representational power of the network.
2. **Regional Attention Supervision**:
   - **Method**: Focuses the supervision signal only on the area occluded by the mask rather than on the entire image, which limits the variance of the generated content.
   - **Advantage**: Reduces the uncertainty of the generated content and improves the inpainting quality.
3. **Synthetic Masked-Faces Dataset**:
   - **Source**: Generated from the CelebA dataset, containing five different types of masks (such as medical masks, ordinary masks, scarves, etc.).
   - **Use**: Used for training and evaluating the proposed face inpainting method.
4. **Improved Encoder-Decoder Structure**:
   - **Feature**: Adopts gated convolution layers, which dynamically select features and fill the occluded areas gradually and adaptively.
   - **Advantage**: Compared with traditional convolution operations, gated convolution handles occluded pixels better and avoids visual artifacts and structural distortion.

### Experimental Results

The experimental results show that the proposed method outperforms several baseline methods in terms of structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and ℓ1 loss, and also produces qualitatively better outputs.

### Formula Summary

- **Formula for Synthetically Generated Image**:
  \[ I_{\text{syn}} = (1 - I_m) \times I_{\text{inp}} + I_m \times I_r \]
  where \(I_{\text{syn}}\) is the synthetically generated image, \(I_m\) is the binary mask image, \(I_{\text{inp}}\) is the input image, and \(I_r\) is the output of the generator.
- **Perceptual Loss Formula**:
  \[ L_p = \sum_{i} W_i \times H_i \times C_i \left\| F^{\text{syn}}_i - F^{\text{gt}}_i \right\|_2 \]
  where \(F^{\text{syn}}_i\) and \(F^{\text{gt}}_i\) are the activation maps of the \(i\)-th layer of the VGG-16 backbone network for the synthesized and ground-truth images, respectively.
- **Style Loss Formula**:
  \[ L_s = \sum_{i} C_i \times C_i \left\| G^{\text{syn}}_i - G^{\text{gt}}_i \right\|_1 \]
  where \(G^{\text{syn}}_i\) and \(G^{\text{gt}}_i\) are the Gram matrices calculated from the \(i\)-th layer activation maps.
- **Total Loss Function**:
  \[ L = \lambda_r L_r + \lambda_p L_p + \lambda_s L_s + \lambda_{\text{adv}} L_{\text{adv}} \]
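
The compositing formula, the region-restricted supervision, and the weighted total loss can be illustrated with a small NumPy sketch. This is not the authors' implementation, and several details are assumptions made here for illustration: the function names (`composite`, `masked_l1`, `gram_matrix`, `total_loss`) are hypothetical, \(L_r\) is taken to be a mean absolute error restricted to masked pixels, the Gram matrix is left unnormalized, and the \(\lambda\) weights are placeholder values. The mask follows the convention of the compositing formula, where \(I_m = 1\) marks occluded pixels.

```python
import numpy as np


def composite(i_inp, i_r, i_m):
    """I_syn = (1 - I_m) * I_inp + I_m * I_r (I_m is 1 on occluded pixels)."""
    return (1.0 - i_m) * i_inp + i_m * i_r


def masked_l1(i_r, i_gt, i_m, eps=1e-8):
    """Mean absolute error over masked pixels only.

    Assumed form of the region-restricted reconstruction term L_r;
    the exact definition is not given in the summary.
    """
    diff = np.abs(i_r - i_gt) * i_m
    n_masked = np.broadcast_to(i_m, diff.shape).sum()
    return diff.sum() / (n_masked + eps)


def gram_matrix(features):
    """Unnormalized Gram matrix (C x C) of an H x W x C activation map."""
    flat = features.reshape(-1, features.shape[-1])  # (H*W, C)
    return flat.T @ flat


def total_loss(l_r, l_p, l_s, l_adv,
               lam_r=1.0, lam_p=1.0, lam_s=1.0, lam_adv=1.0):
    """L = lam_r*L_r + lam_p*L_p + lam_s*L_s + lam_adv*L_adv.

    The default weights are placeholders, not values from the paper.
    """
    return lam_r * l_r + lam_p * l_p + lam_s * l_s + lam_adv * l_adv


# Toy 4x4 RGB example: the lower half of the image is occluded.
i_gt = np.full((4, 4, 3), 0.5)      # ground-truth face
i_m = np.zeros((4, 4, 1))
i_m[2:, :, 0] = 1.0                 # binary mask, 1 = occluded
i_inp = i_gt * (1.0 - i_m)          # input with the occluded area zeroed
i_r = np.full((4, 4, 3), 0.7)       # stand-in for the generator output

i_syn = composite(i_inp, i_r, i_m)  # input pixels kept outside the mask
l_r = masked_l1(i_r, i_gt, i_m)     # error measured inside the mask only
g = gram_matrix(np.ones((2, 2, 3)))
loss = total_loss(1.0, 2.0, 3.0, 4.0,
                  lam_r=1.0, lam_p=0.5, lam_s=2.0, lam_adv=0.1)
```

In this toy setup the unmasked half of `i_syn` equals the input, while the masked half comes from the generator stand-in, so errors the generator makes outside the occluded region affect neither the composite nor `l_r`.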