Abstract: This paper introduces ELMO, a real-time upsampling motion capture framework designed for a single LiDAR sensor. Modeled as a conditional autoregressive transformer-based upsampling motion generator, ELMO achieves 60 fps motion capture from a 20 fps LiDAR point cloud sequence. The key feature of ELMO is the coupling of the self-attention mechanism with thoughtfully designed embedding modules for motion and point clouds, significantly elevating the motion quality. To facilitate accurate motion capture, we develop a one-time skeleton calibration model capable of predicting user skeleton offsets from a single-frame point cloud. Additionally, we introduce a novel data augmentation technique utilizing a LiDAR simulator, which enhances global root tracking to improve environmental understanding. To demonstrate the effectiveness of our method, we compare ELMO with state-of-the-art methods in both image-based and point cloud-based motion capture. We further conduct an ablation study to validate our design principles. ELMO's fast inference time makes it well-suited for real-time applications, exemplified in our demo video featuring live streaming and interactive gaming scenarios. Furthermore, we contribute a high-quality LiDAR-mocap synchronized dataset comprising 20 different subjects performing a range of motions, which can serve as a valuable resource for future research. The dataset and evaluation code are available at https://movin3d.github.io/ELMO_SIGASIA2024/

What problem does this paper attempt to address?

The main problem this paper addresses is motion discontinuity in real-time motion capture caused by the low frame rate (20 fps) of a single LiDAR sensor. The authors propose the ELMO framework, which upsamples a 20 fps LiDAR point cloud sequence into 60 fps motion capture data, enabling high-quality real-time motion capture. The approach targets applications that require high frame rates and low latency, such as live streaming and interactive games.

### Problems Solved by the Paper

1. **Low Frame Rate Problem**: A single LiDAR sensor typically runs at a low frame rate (20 fps), which produces visible discontinuities between captured frames in real-time applications and degrades the user experience.
2. **Self-Occlusion Problem**: A single LiDAR sensor is prone to self-occlusion when capturing human motion, so the positions of some joints cannot be captured accurately.
3. **Global Translation Tracking**: Existing motion capture methods track global translation poorly, especially in complex environments.

### Solutions

1. **Conditional Autoregressive Transformer Architecture**: ELMO uses a generator based on a conditional autoregressive transformer, which generates future poses by combining past motion with the current point cloud.
2. **Embedding Module Design**: Dedicated embedding modules extract joint features and point cloud features, and the self-attention mechanism learns the relationships between them, improving motion quality.
3. **Skeleton Calibration Model**: A one-time skeleton calibration model predicts the user's skeleton offsets from a single-frame point cloud, ensuring accurate initial joint positions.
4. **Data Augmentation Technique**: A data augmentation technique based on a LiDAR simulator increases the diversity of the dataset and improves environmental understanding through global rotation and collision detection.

### Technical Details

- **Input and Output**:
  - Input: A 20 fps LiDAR point cloud sequence.
  - Output: 60 fps motion capture data.
- **Embedding Module**:
  - **Motion Embedding**: Use a spatio-temporal graph convolution block (ST-GCN) to extract joint features and a 1D convolution block to extract root features.
  - **Point Cloud Embedding**: Group the point cloud into multiple local body regions through farthest point sampling (FPS) and k-nearest neighbors (k-NN), and use Mini-PointNet to extract point cloud features.
- **Upsampling Generator**:
  - Adopt a conditional autoregressive model that learns the relationship between point cloud features and motion features through the self-attention mechanism and generates future poses.
  - Use three different token types (combined tokens, mask tokens, and prediction tokens) to construct and process the input sequence.
- **Motion Prior**:
  - Generate a latent vector with a motion distribution encoder to help the generator predict plausible poses under self-occlusion.
- **Data Processing**:
  - Use a statistical outlier removal algorithm to filter out irrelevant noise points.
  - Keep the number of input points consistent through farthest point sampling or randomly generated points.

### Experimental Verification

- **Comparative Experiments**: ELMO was compared with existing image-based and point cloud-based motion capture methods to verify its performance advantages.
- **Ablation Experiments**: The effectiveness of each design module was verified through ablation experiments.
- **Application Scenarios**: The practical use of ELMO in live streaming and interactive games was demonstrated.

### Contributions

1. Proposed the first motion capture framework that achieves real-time upsampling with a single LiDAR.
2. Designed novel embedding and generator architectures that effectively improve the quality of motion capture.
3. Introduced a data augmentation technique based on a LiDAR simulator that improves global translation tracking.
4. Released a high-quality LiDAR-mocap synchronized dataset containing a range of motions performed by 20 subjects.
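
### Illustrative Sketch: Point Cloud Grouping

The point cloud embedding step listed under Technical Details groups points into local regions with farthest point sampling followed by k-NN. The following is a minimal pure-Python sketch of that general technique, not the authors' implementation. The function names, the toy point cloud, and the values of `num_centroids` and `k` are all assumptions chosen for illustration; the summary does not state the settings used in the paper, and the Mini-PointNet feature extraction that follows the grouping is omitted.

```python
import math
import random


def farthest_point_sampling(points, num_centroids, start_index=0):
    """Return indices of centroids; each new pick is the point farthest
    from all centroids chosen so far (hypothetical helper)."""
    if not 0 < num_centroids <= len(points):
        raise ValueError("num_centroids must be in 1..len(points)")
    chosen = [start_index]
    # Distance from every point to its nearest chosen centroid so far.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(chosen) < num_centroids:
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return chosen


def knn_group(points, centroid_indices, k):
    """For each centroid, return the indices of its k nearest points.
    The centroid itself is included, since its distance is zero."""
    groups = []
    for c in centroid_indices:
        order = sorted(range(len(points)),
                       key=lambda i: math.dist(points[i], points[c]))
        groups.append(order[:k])
    return groups


def make_toy_cloud(num_points=256, seed=0):
    """Random points in a rough human-sized box (illustrative data only)."""
    rng = random.Random(seed)
    return [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5), rng.uniform(0.0, 1.8))
            for _ in range(num_points)]


# Assumed settings: 8 local regions of 16 points each.
cloud = make_toy_cloud()
centroids = farthest_point_sampling(cloud, num_centroids=8)
groups = knn_group(cloud, centroids, k=16)
```

Each entry of `groups` holds the point indices of one local region. In a pipeline like the one summarized above, a per-region feature extractor would then consume the points of each group. This brute-force version costs O(N) distance evaluations per centroid and a full sort per group, which is fine for a sketch but would normally be replaced by a vectorized or GPU implementation.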

ELMO: Enhanced Real-time LiDAR Motion Capture through Upsampling
