Capturing the Unseen: Vision-Free Facial Motion Capture Using Inertial Measurement Units

Youjia Wang, Yiwen Wu, Hengan Zhou, Hongyang Lin, Xingyue Peng, Jingyan Zhang, Yingsheng Zhu, Yingwenqi Jiang, Yatu Zhang, Lan Xu, Jingya Wang, Jingyi Yu
2024-09-19
Abstract: We present Capturing the Unseen (CAPUS), a novel facial motion capture (MoCap) technique that operates without visual signals. CAPUS leverages miniaturized Inertial Measurement Units (IMUs) as a new sensing modality for facial motion capture. While IMUs have become essential in full-body MoCap for their portability and independence from environmental conditions, their application in facial MoCap remains underexplored. We address this by customizing micro-IMUs, small enough to be placed on the face, and strategically positioning them in alignment with key facial muscles to capture expression dynamics. CAPUS introduces the first facial IMU dataset, encompassing both IMU and visual signals from participants engaged in diverse activities such as multilingual speech, facial expressions, and emotionally intoned auditions. We train a Transformer Diffusion-based neural network to infer Blendshape parameters directly from IMU data. Our experimental results demonstrate that CAPUS reliably captures facial motion in conditions where visual-based methods struggle, including facial occlusions, rapid movements, and low-light environments. Additionally, by eliminating the need for visual inputs, CAPUS offers enhanced privacy protection, making it a robust solution for vision-free facial MoCap.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to achieve high-precision facial motion capture (MoCap) without relying on visual signals**. Specifically, the paper proposes a new facial motion capture technique named CAPUS (Capturing the Unseen), which uses Inertial Measurement Units (IMUs) as the sensing modality and overcomes the limitations of traditional vision-based methods under occlusion, fast movement, and low light. The specific problems the paper addresses are:

1. **Limitations of Visual Signals**
   - Traditional vision-based facial motion capture methods (such as 3DDFA, DECA, and Apple ARKit) perform poorly under occlusion (for example, when eating or drinking), fast movement, or low light.
   - Visual sensors struggle to capture subtle muscle movements, especially during high-speed motion or small-scale expression changes.
2. **Privacy Protection Requirements**
   - Privacy protection has become an important concern. Vision-based facial capture may infringe users' portrait rights, and CAPUS provides a more secure solution by abandoning visual input entirely.
3. **Portability and Flexibility**
   - Existing IMU devices are usually too large to be applied directly to facial capture. CAPUS uses a lightweight, miniaturized IMU that attaches comfortably to the surface of the face without affecting natural facial movements.
4. **Reliability in Complex Scenarios**
   - CAPUS aims to address the shortcomings of traditional methods in complex scenarios, for example:
     - Parts of the face are severely occluded (such as the mouth being covered by food).
     - Performers need to use body language freely, without holding a camera aimed at their faces.
     - Subtle expression changes must be captured, especially dynamic information related to muscle speed.

### Core Contributions of the Paper

To solve the above problems, the paper makes the following key contributions:

1. **The First IMU-Based Facial Motion Capture System**: CAPUS is the first system that recovers human facial expressions from IMU data, providing a new non-visual facial capture method.
2. **Lightweight IMU Design**: A custom IMU device for facial capture, built from flexible electronic materials, weighs only 2.7% as much as the commercial Xsens IMU and has its area reduced to 5.4%.
3. **Multi-Modal Dataset**: A multi-modal dataset containing aligned IMU signals, visual data, audio signals, ARKit parameters, emotion labels, and text content provides comprehensive data for model training.
4. **Transformer Diffusion Network**: A neural network pipeline based on Transformer Diffusion infers Blendshape parameters directly from IMU data, improving the performance of the facial motion capture system.

### Formula Summary

The key formulas in the paper are as follows:

- Blendshape model representation:

  $$ M(W) = B_0 + \sum_{k=1}^{m} w_k B_k $$

  where $B_0$ is the neutral face, $B_k$ is the $k$-th Blendshape basis vector, $m$ is the number of Blendshapes, and $W = \{w_1, w_2, \dots, w_m\}$ is the set of Blendshape weights.

- Denoising process:

  $$ x_{t-1} = \psi(\text{em}(t), x_t, C) $$

  where $\psi$ is the denoising network, $\text{em}(t)$ is the noise embedding, $x_t$ is the noisy Blendshape parameter, and $C$ is the IMU conditional signal.

- Training loss function:

  $$ L = \| W_\Psi - W \|_1 $$

  where $W_\Psi$ is the predicted Blendshape parameter and $W$ is the ground-truth one.
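
To make the Blendshape representation and the L1 training loss concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names `blend` and `l1_loss`, the flattened-coordinate mesh layout, and the toy numbers are all assumptions chosen for illustration, and each $B_k$ is treated as a per-coordinate offset added to the neutral face. The loss is written as a plain sum of absolute differences, following the formula literally; whether the paper averages over parameters is not stated here.

```python
from typing import List, Sequence


def blend(neutral: Sequence[float],
          bases: Sequence[Sequence[float]],
          weights: Sequence[float]) -> List[float]:
    """Evaluate M(W) = B_0 + sum_k w_k * B_k on flattened vertex coordinates.

    `neutral` plays the role of B_0, `bases` holds B_1..B_m, and `weights`
    holds w_1..w_m (hypothetical names, not taken from the paper).
    """
    if len(bases) != len(weights):
        raise ValueError("need exactly one weight per Blendshape basis")
    mesh = list(neutral)
    for w_k, b_k in zip(weights, bases):
        if len(b_k) != len(mesh):
            raise ValueError("each basis must match the neutral face in size")
        for i, offset in enumerate(b_k):
            mesh[i] += w_k * offset
    return mesh


def l1_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """L = ||W_pred - W||_1, the sum of absolute weight differences."""
    if len(predicted) != len(target):
        raise ValueError("weight vectors must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target))


if __name__ == "__main__":
    # Toy face with two vertices (x, y, z each) and two Blendshape bases.
    neutral = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
    bases = [
        [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # moves vertex 0 along y
        [0.0, 0.0, 0.0, 0.0, 0.5, 0.0],  # moves vertex 1 along y
    ]
    print(blend(neutral, bases, [0.5, 1.0]))   # [0.0, 0.5, 0.0, 1.0, 0.5, 0.0]
    print(l1_loss([0.5, 1.0], [0.25, 1.0]))    # 0.25
```

In this reading, the network only has to predict the $m$ weights; the mesh itself follows from a fixed linear combination, which is why the loss is defined on Blendshape parameters rather than on vertex positions.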