FaceGPT: Self-supervised Learning to Chat about 3D Human Faces

Haoran Wang, Mohit Mendiratta, Christian Theobalt, Adam Kortylewski
2024-06-11
Abstract: We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of the LLM is projected into 3DMM parameters and subsequently rendered as a 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding of 3D human faces while preserving the capacity to understand general user instructions. Our experiments demonstrate that FaceGPT not only achieves high-quality 3D face reconstructions but also retains the ability for general-purpose visual instruction following. Furthermore, FaceGPT learns, in a fully self-supervised manner, to generate 3D faces from complex textual inputs, which opens a new direction in human face analysis.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of enabling large Vision-Language Models (VLMs) to understand and generate three-dimensional (3D) human faces from image and text inputs. Its contributions are as follows:

1. **Proposing FaceGPT**: A self-supervised learning framework that trains VLMs to understand 3D faces and to infer them from image and text inputs. Traditional 3D face reconstruction methods often lack semantic reasoning capabilities. FaceGPT instead embeds 3D Morphable Model (3DMM) parameters into the token space of the VLM, which allows it to generate 3D faces from textual and visual inputs.
2. **Self-Supervised Learning**: FaceGPT is trained without expensive 3D face annotations, learning from in-the-wild images in a self-supervised manner. An image-based reconstruction loss guides the learning process.
3. **Comprehensive Capabilities**: FaceGPT achieves high-quality 3D face reconstruction while retaining general visual instruction-following capabilities. It can also generate 3D faces from complex textual inputs, opening new directions in face analysis.
4. **Summary of Contributions**:
   - For the first time, VLMs learn a detailed understanding of 3D faces in a fully self-supervised manner.
   - VLMs are shown to learn text-based face reconstruction without supervision.
   - Experiments show that FaceGPT is competitive in 3D face reconstruction and excels in general visual instruction following and text-based face generation.
5. **Experimental Results**:
   - In 3D face reconstruction, FaceGPT achieves quality comparable to state-of-the-art specialized methods.
   - In text-based 3D face reconstruction, FaceGPT significantly outperforms baseline methods, demonstrating the advantage of directly embedding 3DMM parameters into the VLM token space.
   - In the instruction-following evaluation, FaceGPT maintains good performance, especially on face-related instructions.

In summary, the paper addresses the problem of enabling VLMs to understand and generate 3D faces by proposing the FaceGPT framework, and it demonstrates the framework's effectiveness across these tasks.
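
The model-based autoencoder described in the abstract (hidden state, then 3DMM parameters, then a rendered image, then a reconstruction loss) can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the tiny dimensions, the linear projection head, and especially `toy_render` (an orthographic projection standing in for a differentiable renderer) are all assumptions made for illustration. The only standard ingredient is the linear 3DMM form, in which the mesh is a mean shape plus a weighted sum of basis components.

```python
def project_hidden_state(hidden, weight, bias):
    """Hypothetical linear head: map an LLM hidden state to 3DMM coefficients."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]


def decode_3dmm(coeffs, mean_shape, basis):
    """Linear 3DMM: flattened vertices = mean_shape + sum_k coeffs[k] * basis[k]."""
    verts = list(mean_shape)
    for c, component in zip(coeffs, basis):
        for i, delta in enumerate(component):
            verts[i] += c * delta
    return verts


def toy_render(verts):
    """Assumed stand-in for a differentiable renderer: keep (x, y), drop z."""
    return [v for i, v in enumerate(verts) if i % 3 != 2]


def reconstruction_loss(rendered, target):
    """Mean squared error between the rendering and the observed target."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)


# Toy sizes: hidden dimension 4, two 3DMM coefficients, two vertices (x, y, z each).
hidden = [1.0, 0.0, 2.0, 0.0]
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.5, 0.0]]
bias = [0.0, 0.0]
mean_shape = [0.0] * 6
basis = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]]

coeffs = project_hidden_state(hidden, weight, bias)  # [1.0, 1.0]
verts = decode_3dmm(coeffs, mean_shape, basis)       # [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
rendered = toy_render(verts)                         # [1.0, 0.0, 0.0, 1.0]
target = [1.0, 0.0, 0.0, 0.0]                        # made-up observation
loss = reconstruction_loss(rendered, target)         # 0.25
print(loss)
```

In a real training setup the loss would be backpropagated through the renderer and the projection head into the VLM, which is what makes 3D annotations unnecessary. The sketch shows only the forward pass.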