Abstract: We present Capturing the Unseen (CAPUS), a novel facial motion capture (MoCap) technique that operates without visual signals. CAPUS leverages miniaturized Inertial Measurement Units (IMUs) as a new sensing modality for facial motion capture. While IMUs have become essential in full-body MoCap for their portability and independence from environmental conditions, their application in facial MoCap remains underexplored. We address this by customizing micro-IMUs, small enough to be placed on the face, and strategically positioning them in alignment with key facial muscles to capture expression dynamics. CAPUS introduces the first facial IMU dataset, encompassing both IMU and visual signals from participants engaged in diverse activities such as multilingual speech, facial expressions, and emotionally intoned auditions. We train a Transformer Diffusion-based neural network to infer Blendshape parameters directly from IMU data. Our experimental results demonstrate that CAPUS reliably captures facial motion in conditions where visual-based methods struggle, including facial occlusions, rapid movements, and low-light environments. Additionally, by eliminating the need for visual inputs, CAPUS offers enhanced privacy protection, making it a robust solution for vision-free facial MoCap.

What problem does this paper attempt to address?

The problem this paper addresses is: **how to achieve high-precision facial motion capture (MoCap) without relying on visual signals**. Specifically, the paper proposes a new facial motion-capture technique named CAPUS (Capturing the Unseen), which uses Inertial Measurement Units (IMUs) as the sensing modality and overcomes the limitations of traditional vision-based methods under occlusion, fast movement, and low light. The specific problems the paper targets are:

1. **Limitations of Visual Signals**
   - Traditional vision-based facial motion-capture methods (such as 3DDFA, DECA, and Apple ARKit) perform poorly under occlusion (for example, when eating or drinking), fast movement, or low light.
   - Visual sensors have difficulty capturing subtle muscle movements, especially during high-speed motion or small-scale expression changes.
2. **Privacy Protection Requirements**
   - Vision-based facial-capture methods record the user's face and may therefore expose users' likenesses and infringe on their portrait rights. CAPUS provides a more secure alternative by abandoning visual input entirely.
3. **Portability and Flexibility**
   - Existing IMU devices are usually too large to be applied directly to facial capture. CAPUS uses a lightweight, miniaturized IMU device that can be attached comfortably to the surface of the face without impeding natural facial movement.
4. **Reliability in Complex Scenarios**
   - CAPUS aims to address the shortcomings of traditional methods in complex scenarios, for example when:
     - Parts of the face are severely occluded (such as the mouth being occluded by food).
     - Performers need to use body language freely without holding a camera aimed at their faces.
     - Subtle expression changes must be captured, especially dynamic information related to muscle speed.
### Core Contributions of the Paper

To solve the above problems, the paper makes the following key contributions:

1. **The First IMU-Based Facial Motion-Capture System**
   CAPUS is the first system that recovers human facial expressions from IMU data, providing a new non-visual facial-capture method.
2. **Lightweight IMU Design**
   A custom IMU device for facial capture, built from flexible electronic materials, weighing only 2.7% as much as a commercial Xsens IMU and occupying 5.4% of its area.
3. **Multi-Modal Dataset**
   A multi-modal dataset containing aligned IMU signals, visual data, audio signals, ARKit parameters, emotion labels, and text content, providing comprehensive data for model training.
4. **Transformer Diffusion Network**
   A neural-network pipeline based on Transformer Diffusion that infers Blendshape parameters directly from IMU data.

### Formula Summary

The key formulas involved in the paper are as follows:

- Blendshape model representation:
  $$ M(W) = B_0 + \sum_{k=1}^{m} w_k B_k $$
  where $B_0$ is the neutral face, $B_k$ is the $k$-th Blendshape basis vector, $m$ is the number of Blendshapes, and $W = \{w_1, w_2, \dots, w_m\}$ is the set of Blendshape weights.
- Denoising process formula:
  $$ x_{t-1} = \psi(\text{em}(t), x_t, C) $$
  where $\psi$ is the denoising network, $\text{em}(t)$ is the noise embedding for step $t$, $x_t$ is the noisy Blendshape parameter vector, and $C$ is the IMU conditional signal.
- Training loss function:
  $$ L = \lVert W_\psi - W \rVert_1 $$
  where $W_\psi$ is the Blendshape parameter vector predicted by the network and $W$ is the ground-truth one.
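To make the three formulas concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, array shapes, and especially `toy_denoiser` (a hand-written stand-in for the trained Transformer Diffusion network, which also folds the step embedding $\text{em}(t)$ into its use of `t`) are hypothetical, and the conditioning vector here is a placeholder rather than real IMU data.

```python
import numpy as np


def blendshape_mesh(neutral, bases, weights):
    """Evaluate M(W) = B_0 + sum_k w_k * B_k.

    neutral: (V, 3) neutral face B_0
    bases:   (m, V, 3) Blendshape basis vectors B_k
    weights: (m,) Blendshape weights w_k
    """
    # tensordot contracts the weight axis with the first axis of bases.
    return neutral + np.tensordot(weights, bases, axes=1)


def l1_loss(predicted, target):
    """L1 norm of the difference between predicted and ground-truth weights."""
    return float(np.abs(predicted - target).sum())


def reverse_diffusion(denoiser, condition, num_weights, num_steps, rng):
    """Apply x_{t-1} = denoiser(t, x_t, C) for t = num_steps, ..., 1."""
    x = rng.standard_normal(num_weights)  # start from pure noise
    for t in range(num_steps, 0, -1):
        x = denoiser(t, x, condition)
    return x


def toy_denoiser(t, x, condition):
    """Hypothetical stand-in for the learned network (assumption).

    It simply moves x a fraction 1/t of the way toward the condition,
    so the final step (t = 1) lands on the condition itself.
    """
    return x + (condition - x) / t


# Usage: a face with 2 vertices and 2 Blendshapes (made-up numbers).
neutral = np.zeros((2, 3))
bases = np.stack([np.ones((2, 3)), 2.0 * np.ones((2, 3))])
weights = np.array([0.5, 0.25])
mesh = blendshape_mesh(neutral, bases, weights)

rng = np.random.default_rng(0)
condition = np.array([0.5, 0.25])  # placeholder for an IMU-derived signal
predicted = reverse_diffusion(toy_denoiser, condition, 2, 10, rng)
print("mesh shape:", mesh.shape)
print("L1 loss vs. ground truth:", l1_loss(predicted, weights))
```

In a real system the denoiser would be a trained network conditioned on windows of IMU readings, and the loss would be computed over batches during training rather than on a single vector.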

Capturing the Unseen: Vision-Free Facial Motion Capture Using Inertial Measurement Units
