Abstract: We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of the LLM is projected into 3DMM parameters and subsequently rendered as a 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding of 3D human faces, while preserving the capacity to understand general user instructions. Our experiments demonstrate that FaceGPT not only achieves high-quality 3D face reconstructions but also retains the ability for general-purpose visual instruction following. Furthermore, FaceGPT learns, in a fully self-supervised manner, to generate 3D faces based on complex textual inputs, which opens a new direction in human face analysis.

What problem does this paper attempt to address?

The paper addresses the problem of enabling Large Vision-Language Models (VLMs) to understand and generate three-dimensional (3D) human faces from image and text inputs. Its contributions are as follows:

1. **Proposing FaceGPT**: A self-supervised learning framework that trains VLMs to understand 3D faces and to infer them from image and text inputs. Unlike traditional 3D face reconstruction methods, which often lack semantic reasoning capabilities, FaceGPT embeds 3D Morphable Model (3DMM) parameters into the token space of the VLM, allowing 3D faces to be generated from textual and visual inputs.
2. **Self-Supervised Learning**: FaceGPT is trained without expensive 3D face annotations, learning from in-the-wild images in a self-supervised manner. An image-based reconstruction loss guides the learning process.
3. **Comprehensive Capabilities**: FaceGPT achieves high-quality 3D face reconstruction while retaining general visual instruction-following capabilities. It can also generate 3D faces from complex textual inputs, opening new directions in face analysis.
4. **Summary of Contributions**:
   - For the first time, VLMs are enabled to learn a detailed understanding of 3D faces in a fully self-supervised manner.
   - The paper demonstrates that VLMs can learn text-based face reconstruction without supervision.
   - Experiments show that FaceGPT is competitive in 3D face reconstruction and excels in general visual instruction following and text-based face generation.
5. **Experimental Results**:
   - In the 3D face reconstruction task, FaceGPT shows quality comparable to the current best specialized methods.
   - In the text-based 3D face reconstruction task, FaceGPT significantly outperforms baseline methods, demonstrating the advantage of directly embedding 3DMM parameters into the VLM token space.
   - In the instruction-following evaluation, FaceGPT maintains good performance, especially on face-related instructions.

In summary, the paper addresses the problem of enabling VLMs to understand and generate 3D faces by proposing the FaceGPT framework, and it demonstrates the framework's effectiveness across several tasks.
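
The model-based autoencoder idea described above (hidden state projected to 3DMM parameters, decoded to a face, rendered to 2D, and compared against the input) can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration and not the paper's implementation: the dimensions, the single linear projection, the random linear 3DMM basis, the orthographic projection standing in for a differentiable renderer, and the point-wise squared error standing in for an image-based reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical toy values.
HIDDEN_DIM = 64   # stand-in for the LLM hidden-state size
N_SHAPE = 10      # stand-in for the number of 3DMM shape coefficients
N_VERTS = 5       # stand-in for the number of mesh vertices

# Hypothetical projection from the hidden state to 3DMM coefficients.
W_proj = rng.normal(scale=0.01, size=(N_SHAPE, HIDDEN_DIM))
b_proj = np.zeros(N_SHAPE)

# Toy linear 3DMM: a mean shape plus a linear basis (random here).
mean_shape = rng.normal(size=N_VERTS * 3)
shape_basis = rng.normal(size=(N_VERTS * 3, N_SHAPE))


def project_to_3dmm(hidden_state):
    """Map a hidden-state vector to 3DMM shape coefficients."""
    return W_proj @ hidden_state + b_proj


def decode_3dmm(coeffs):
    """Decode coefficients into 3D vertices with the linear model."""
    return (mean_shape + shape_basis @ coeffs).reshape(N_VERTS, 3)


def render_orthographic(vertices):
    """Stand-in for a renderer: drop the depth axis to get 2D points."""
    return vertices[:, :2]


def reconstruction_loss(rendered, target):
    """Mean squared error between rendered and target 2D points."""
    return float(np.mean((rendered - target) ** 2))


# One forward pass with random stand-ins for the hidden state and target.
hidden_state = rng.normal(size=HIDDEN_DIM)
target_2d = rng.normal(size=(N_VERTS, 2))
rendered_2d = render_orthographic(decode_3dmm(project_to_3dmm(hidden_state)))
print("reconstruction loss:", reconstruction_loss(rendered_2d, target_2d))
```

In a real training setup the loss would be backpropagated through the renderer and the projection into the VLM, which is what removes the need for 3D annotations; this sketch only shows the forward path.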

FaceGPT: Self-supervised Learning to Chat about 3D Human Faces

Realistic Face Reenactment Via Self-Supervised Disentangling of Identity and Pose

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

How Good Is ChatGPT at Face Biometrics? A First Look Into Recognition, Soft Biometrics, and Explainability

PoseGPT: Chatting about 3D Human Pose

On Learning 3D Face Morphable Model from In-the-wild Images

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images

ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning

Learning Full-Head 3D GANs from a Single-View Portrait Dataset

Fast-GANFIT: Generative Adversarial Network for High Fidelity 3D Face Reconstruction

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

3D-Mask-GAN: Unsupervised Single-View 3D Object Reconstruction

SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction

FaceLift: Semi-supervised 3D Facial Landmark Localization

Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars

Unsupervised Style-based Explicit 3D Face Reconstruction from Single Image

Single Image, Any Face: Generalisable 3D Face Generation

Beyond 3DMM Space: Towards Fine-Grained 3D Face Reconstruction