ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers

Luoyu Mei, Shuai Wang, Yun Cheng, Ruofeng Liu, Zhimeng Yin, Wenchao Jiang, Shuai Wang, Wei Gong
DOI: https://doi.org/10.24963/ijcai.2024/131
2024-09-02
Abstract: Semantic recognition is pivotal in virtual reality (VR) applications, enabling immersive and interactive experiences. A promising approach is utilizing millimeter-wave (mmWave) signals to generate point clouds. However, the high computational and memory demands of current mmWave point cloud models hinder their efficiency and reliability. To address this limitation, our paper introduces ESP-PCT, a novel Enhanced Semantic Performance Point Cloud Transformer with a two-stage semantic recognition framework tailored for VR applications. ESP-PCT takes advantage of the accuracy of sensory point cloud data and optimizes the semantic recognition process, where the localization and focus stages are trained jointly in an end-to-end manner. We evaluate ESP-PCT on various VR semantic recognition conditions, demonstrating substantial enhancements in recognition efficiency. Notably, ESP-PCT achieves an accuracy of 93.2% while simultaneously reducing computational requirements (FLOPs) by 76.9% and memory usage by 78.2% compared to the existing Point Transformer model. These results underscore ESP-PCT's potential in VR semantic recognition by achieving high accuracy and reducing redundancy. The code and data of this project are available at https://github.com/lymei-SEU/ESP-PCT.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the high computational and memory demands in virtual reality (VR) semantic recognition. Specifically:

1. **Reducing Redundancy**: Current millimeter-wave (mmWave) point cloud models exhibit significant temporal and spatial redundancy when processing point cloud data, leading to low computational efficiency. The paper proposes a new framework, ESP-PCT (Enhanced Semantic Performance Point Cloud Transformer), which optimizes the utilization of computational resources by efficiently compressing spatiotemporal redundancy.
2. **Improving Accuracy**: ESP-PCT employs a two-stage framework that first locates key areas (such as VR controllers) and then applies an attention mechanism to these selected points. This approach allows the model to focus on semantically discriminative regions, thereby enhancing recognition accuracy.
3. **Enhancing Robustness**: ESP-PCT maintains high recognition accuracy across various occlusion scenarios, including no occlusion, wood occlusion, brick occlusion, and combined occlusion, performing excellently under different environmental conditions.
4. **Reducing Computational Cost**: Compared to existing methods, ESP-PCT significantly reduces the computational load (FLOPs) and memory usage while maintaining high recognition accuracy. For example, in a no-occlusion scenario, ESP-PCT achieves a 97.6% application type recognition accuracy and a 92.8% button recognition accuracy, with a computational load of only 0.9 GFLOPs and 693 parameters.

Through these improvements, ESP-PCT provides a more efficient and flexible solution for VR semantic recognition.
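
The two-stage idea in item 2 (locate key areas, then attend only to the selected points) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `localize`, `self_attention`, and `two_stage`, the distance-based saliency score, the cloud size of 256 points, and `k=32` are all assumptions chosen for illustration, and the attention here has no learned weights. The point is only that self-attention compares every point with every other point, so shrinking the input before attention shrinks the number of comparisons quadratically.

```python
import math
import random

def localize(points, scores, k):
    """Stage 1 (hypothetical): keep the k points with the highest score."""
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [points[i] for i in ranked[:k]]

def self_attention(x):
    """Stage 2: single-head scaled dot-product self-attention, no learned weights."""
    d = len(x[0])
    out = []
    for query in x:
        # Similarity of this point to every selected point.
        logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        # Output is the softmax-weighted average of the selected points.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) / total for j in range(d)])
    return out

def two_stage(points, scores, k):
    """Localize first, then attend only within the selected subset."""
    return self_attention(localize(points, scores, k))

random.seed(0)
cloud = [[random.gauss(0, 1) for _ in range(3)] for _ in range(256)]
# Assumed saliency: points closer to the origin count as the "key area".
saliency = [-math.sqrt(sum(c * c for c in p)) for p in cloud]

features = two_stage(cloud, saliency, k=32)
pairs_full = len(cloud) ** 2      # comparisons if attention ran on every point
pairs_focus = len(features) ** 2  # comparisons after localization
print(len(features), pairs_full // pairs_focus)
```

With these assumed sizes, attention over the 32 selected points needs 64 times fewer pairwise comparisons than attention over all 256. That ratio follows from the toy numbers above and is unrelated to the 76.9% FLOPs reduction the paper reports.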