Feature Reconstruction Disentangling for Pose-invariant Face Recognition Supplementary Material

Xi Peng, Xiang Yu, Kihyuk Sohn, Dimitris N. Metaxas, Manmohan Chandraker
Abstract: Pose-variant face generation. We designed a network to predict 3DMM parameters from a single face image. The design is mainly based on VGG16 [4]. We use the same number of convolutional layers as VGG16 but replace all max pooling layers with stride-2 convolutions. The fully connected (fc) layers also differ: we first use two fc layers of 1024 neurons each to connect with the convolutional modules; then an fc layer of 30 neurons predicts the identity parameters, an fc layer of 29 neurons predicts the expression parameters, and an fc layer of 7 neurons predicts the pose parameters. Unlike [8], which uses 199 parameters to represent the identity coefficients, we truncate the number of identity eigenvectors to 30, which preserves 90% of the variation. This truncation leads to faster convergence and less overfitting. For texture, we only generate non-frontal faces from frontal ones, which significantly mitigates the texture hallucination caused by self-occlusion and guarantees high-fidelity reconstruction. We apply the Z-Buffer algorithm used in [8] to resolve ambiguous pixel intensities when points at different depths project to the same image-plane position.

Rich feature embedding. The design of the rich embedding network is mainly based on the architecture of CASIA-net [6], since it is widely used in previous approaches and achieves strong performance in face recognition. During training, CASIA+MultiPIE or CASIA+300WLP is used. As shown in Figure 3 of the main submission, after the convolutional layers of CASIA-net, we use a 512-d fc layer for the rich feature embedding, which branches into a 256-d identity feature and a 128-d non-identity feature. The 128-d non-identity feature is further connected to a 136-d landmark prediction and a 7-d pose prediction.
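The layer sizes described above can be collected into a small shape-checking sketch. This is only an illustration under stated assumptions, not the authors' implementation: the convolutional backbones are replaced by a placeholder feature vector of hypothetical size `CONV_DIM`, the weights are random, and biases and activations are omitted because the text does not specify them. The names `fc`, `regress_3dmm`, and `rich_embedding` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
CONV_DIM = 320  # hypothetical size of the flattened convolutional feature


def fc(x, out_dim):
    """Bias-free fully connected layer with random placeholder weights."""
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.01
    return x @ w


def regress_3dmm(conv_feat):
    """Face generation heads: two 1024-d fc layers, then 30/29/7-d outputs."""
    h = fc(fc(conv_feat, 1024), 1024)
    return {
        "identity": fc(h, 30),    # truncated identity coefficients
        "expression": fc(h, 29),  # expression coefficients
        "pose": fc(h, 7),         # scale, pitch, yaw, roll, x/y/z translation
    }


def rich_embedding(conv_feat):
    """Rich embedding heads: 512-d embedding split into 256-d and 128-d features."""
    rich = fc(conv_feat, 512)
    identity = fc(rich, 256)
    non_identity = fc(rich, 128)
    return {
        "rich": rich,
        "identity": identity,
        "non_identity": non_identity,
        "landmarks": fc(non_identity, 136),  # landmark prediction
        "pose": fc(non_identity, 7),         # pose prediction
    }


batch = np.zeros((2, CONV_DIM))  # two placeholder feature vectors
for label, outputs in [("3dmm", regress_3dmm(batch)), ("rich", rich_embedding(batch))]:
    for name, value in outputs.items():
        print(label, name, value.shape)
```

Running the sketch prints the shape of every head, which is a quick way to confirm that the branch sizes match the description.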
Notice that in the face generation network, the number of pose parameters is 7 instead of 3 because we need to uniquely determine the projection matrix between the 3D model and the 2D face shape in the image domain; the parameters are scale, pitch, yaw, roll, x translation, y translation, and z translation.

Disentanglement by feature reconstruction. Once the rich embedding network is trained, we feed genuine pairs, which share the same identity but differ in viewpoint, into the network to obtain the corresponding rich embedding, identity, and non-identity features. To disentangle the identity and pose factors, we concatenate the identity and non-identity features and pass them through two 512-d fully connected layers to output a reconstructed 512-d rich embedding. Both self and cross reconstruction losses are designed to eventually push the two identity features close to each other. At the same time, a cross-entropy loss is applied to the near-frontal identity feature to maintain the discriminative power of the learned representation. The disentanglement of identity and pose is finally achieved by the proposed feature-reconstruction-based metric learning.
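The reconstruction step can be illustrated with a minimal sketch. Several details here are assumptions, since the text above does not give the loss formulas: a squared L2 distance is used for both losses, the self loss rebuilds the near-frontal rich embedding from its own two features, and the cross loss rebuilds the non-frontal rich embedding from the near-frontal identity feature and the non-frontal non-identity feature. Weights are random placeholders, biases and activations are omitted, the cross-entropy term is left out, and the names `reconstruct` and `reconstruction_losses` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights for the two 512-d fully connected reconstruction layers.
W1 = rng.standard_normal((256 + 128, 512)) * 0.01
W2 = rng.standard_normal((512, 512)) * 0.01


def reconstruct(identity, non_identity):
    """Concatenate both features and map them to a 512-d rich embedding."""
    joint = np.concatenate([identity, non_identity], axis=-1)
    return (joint @ W1) @ W2


def reconstruction_losses(rich_1, id_1, non_id_1, rich_2, non_id_2):
    """Self and cross reconstruction losses for a genuine pair.

    Sample 1 is taken to be near-frontal and sample 2 non-frontal (assumed).
    Both losses use a squared L2 distance averaged over the batch (assumed).
    """
    self_rec = reconstruct(id_1, non_id_1)   # rebuild sample 1 from its own features
    cross_rec = reconstruct(id_1, non_id_2)  # rebuild sample 2 using sample 1's identity
    self_loss = np.mean(np.sum((self_rec - rich_1) ** 2, axis=-1))
    cross_loss = np.mean(np.sum((cross_rec - rich_2) ** 2, axis=-1))
    return self_loss, cross_loss


# Random stand-ins for features of a batch of four genuine pairs.
id_1 = rng.standard_normal((4, 256))
non_id_1 = rng.standard_normal((4, 128))
non_id_2 = rng.standard_normal((4, 128))
rich_1 = rng.standard_normal((4, 512))
rich_2 = rng.standard_normal((4, 512))

self_loss, cross_loss = reconstruction_losses(rich_1, id_1, non_id_1, rich_2, non_id_2)
print(self_loss, cross_loss)
```

Under these assumptions, the cross term can only be small if the near-frontal identity feature also explains the non-frontal embedding, which is one way to read how the losses pull the two identity features together.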