Abstract: Despite decades of research, much is still unknown about the computations carried out in the human face processing network. Recently, deep networks have been proposed as a computational account of human visual processing, but while they provide a good match to neural data throughout visual cortex, they lack interpretability. We introduce a method for interpreting brain activity using a new class of deep generative models, disentangled representation learning models, which learn a low-dimensional latent space that "disentangles" different semantically meaningful dimensions of faces, such as rotation, lighting, or hairstyle, in an unsupervised manner by enforcing statistical independence between dimensions. We find that the majority of our model's learned latent dimensions are interpretable by human raters. Further, these latent dimensions serve as a good encoding model for human fMRI data. We next investigate the representation of different latent dimensions across face-selective voxels. We find that low- and high-level face features are represented in posterior and anterior face-selective regions, respectively, corroborating prior models of human face recognition. Interestingly, though, we find both identity-relevant and identity-irrelevant face features across the face processing network. Finally, we provide new insight into the few "entangled" (uninterpretable) dimensions in our model by showing that they match responses in the ventral stream and carry information about facial identity. Disentangled face encoding models provide an exciting alternative to standard "black box" deep learning approaches for modeling and interpreting human brain data.

We use a class of interpretable deep neural network models, disentangled variational autoencoders (dVAEs), to analyze human fMRI data. We find that a dVAE learns human-interpretable dimensions of faces, such as lighting, expression, and hairstyle, and provides as good a match to human fMRI data as matched, non-disentangled models. Our disentangled encoding approach allows us to map different disentangled features to ROI and voxel activity. A decoding analysis confirms that the model separates identity-relevant and identity-irrelevant information and reveals that the remaining entangled dimensions contain identity-relevant information. Together, these results highlight the use of disentangled models for more interpretable fMRI encoding than standard deep learning models.
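
To make the idea of a latent-dimension encoding model concrete, the following is a minimal sketch on simulated data, not the authors' pipeline. It assumes a linear ridge regression from latent dimensions to voxel responses, which the abstract does not specify; the function names, the regularization strength `alpha`, and all array sizes are hypothetical choices for illustration.

```python
import numpy as np


def fit_ridge(Z, Y, alpha=1.0):
    """Closed-form ridge regression: W = (Z^T Z + alpha * I)^-1 Z^T Y.

    Z: (n_stimuli, n_latents) latent codes, one row per face image.
    Y: (n_stimuli, n_voxels) voxel responses to the same images.
    """
    n_latents = Z.shape[1]
    gram = Z.T @ Z + alpha * np.eye(n_latents)
    return np.linalg.solve(gram, Z.T @ Y)


def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


def simulate_and_score(n_train=400, n_test=100, n_latents=24, n_voxels=50, seed=0):
    """Simulate latents and voxel data (assumption: no real fMRI here), then fit and score."""
    rng = np.random.default_rng(seed)
    n_total = n_train + n_test
    Z = rng.standard_normal((n_total, n_latents))          # stand-in for dVAE latents
    W_true = rng.standard_normal((n_latents, n_voxels))    # hypothetical ground-truth weights
    Y = Z @ W_true + 0.5 * rng.standard_normal((n_total, n_voxels))

    # Fit on training stimuli, evaluate on held-out stimuli.
    W = fit_ridge(Z[:n_train], Y[:n_train], alpha=1.0)
    r = voxelwise_correlation(Y[n_train:], Z[n_train:] @ W)

    # For each voxel, the latent dimension with the largest absolute weight.
    preferred = np.abs(W).argmax(axis=0)
    return W, r, preferred
```

Because each column of `W` holds one voxel's weights over the latent dimensions, inspecting those weights (here summarized crudely by `preferred`) is one simple way to relate individual dimensions to voxel or ROI activity under these assumptions.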

Cross-VAE: Towards Disentangling Expression from Identity for Human Faces

Facial Landmark Disentangled Network with Variational Autoencoder

Realistic Face Reenactment Via Self-Supervised Disentangling of Identity and Pose

Joint Structured Sparsity Regularized Multiview Dimension Reduction for Video-Based Facial Expression Recognition

Disentangling Identity and Pose for Facial Expression Recognition

DynamicVAE: Decoupling Reconstruction Error and Disentangled Representation Learning

Facial Expression Recognition by Expression-Specific Representation Swapping

Dual-channel feature disentanglement for identity-invariant facial expression recognition

3D Face Modeling via Weakly-supervised Disentanglement Network joint Identity-consistency Prior

Variance-Aware Bi-Attention Expression Transformer for Open-Set Facial Expression Recognition in the Wild

Disentangled Speech Representation Learning for One-Shot Cross-Lingual Voice Conversion Using β-VAE

Toward Identity-Invariant Facial Expression Recognition: Disentangled Representation via Mutual Information Perspective

A discriminative multiscale feature extraction network for facial expression recognition in the wild

Disentangled Variational Autoencoder for Emotion Recognition in Conversations

Disentanglement for Discriminative Visual Recognition

Semantically Disentangled Variational Autoencoder for Modeling 3D Facial Details

Identity-Enhanced Network for Facial Expression Recognition

Facial Representation Extraction by Mutual Information Maximization and Correlation Minimization

Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing Classification

Disentangled deep generative models reveal coding principles of the human face processing network

A joint hierarchical cross‐attention graph convolutional network for multi‐modal facial expression recognition