Learning Disentangled Representations for Perceptual Point Cloud Quality Assessment via Mutual Information Minimization

Ziyu Shan, Yujie Zhang, Yipeng Liu, Yiling Xu
2024-11-13
Abstract: No-Reference Point Cloud Quality Assessment (NR-PCQA) aims to objectively assess the human perceptual quality of point clouds without relying on pristine-quality point clouds for reference. It is becoming increasingly significant with the rapid advancement of immersive media applications such as virtual reality (VR) and augmented reality (AR). However, current NR-PCQA models attempt to indiscriminately learn point cloud content and distortion representations within a single network, overlooking their distinct contributions to quality information. To address this issue, we propose DisPA, a novel disentangled representation learning framework for NR-PCQA. The framework trains a dual-branch disentanglement network to minimize mutual information (MI) between representations of point cloud content and distortion. Specifically, to fully disentangle representations, the two branches adopt different philosophies: the content-aware encoder is pretrained by a masked auto-encoding strategy, which can allow the encoder to capture semantic information from rendered images of distorted point clouds; the distortion-aware encoder takes a mini-patch map as input, which forces the encoder to focus on low-level distortion patterns. Furthermore, we utilize an MI estimator to estimate the tight upper bound of the actual MI and further minimize it to achieve explicit representation disentanglement. Extensive experimental results demonstrate that DisPA outperforms state-of-the-art methods on multiple PCQA datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses representation learning for No-Reference Point Cloud Quality Assessment (NR-PCQA). The goal of NR-PCQA is to objectively evaluate the perceptual quality of a point cloud without access to the pristine original as a reference, a task that is growing in importance with the rapid development of immersive media applications such as virtual reality and augmented reality. Existing NR-PCQA models, however, have the following limitations when learning point cloud content and distortion representations:

1. **Single network structure**: Most existing methods learn content and distortion representations simultaneously in a single network, ignoring their different contributions to quality information.
2. **Feature entanglement**: Content and distortion patterns are highly entangled in the representation space, which limits model performance.
3. **Data imbalance**: Point cloud content is high-dimensional and therefore hard to represent, while existing PCQA datasets cover only a limited range of content, so models are prone to overfitting.

To address these problems, the authors propose DisPA (Disentangled Perceptual Assessment), a disentangled representation learning framework that separates content and distortion representations by minimizing the mutual information (MI) between them. The main contributions of DisPA include:

- **Two-branch structure**: Two independent encoders learn the content representation and the distortion representation respectively.
- **Pre-training strategy**: The content-aware encoder is pre-trained with a masked auto-encoding strategy so that it captures semantic information.
- **Local distortion map generation**: Local distortion maps (the "mini-patch map" of the abstract) are generated by mesh sampling, forcing the distortion-aware encoder to focus on low-level distortion patterns.
- **MI regularization**: An MI estimator estimates an upper bound on the mutual information between content and distortion representations, and this bound is minimized to achieve explicit disentanglement.

According to the authors, these components allow DisPA to outperform existing methods on multiple PCQA datasets and to follow the perceptual mechanism of the human visual system more closely.
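The MI regularization idea can be illustrated with a small sketch. The text here does not name the estimator, so the code below assumes a CLUB-style (Contrastive Log-ratio Upper Bound) sample estimate with a unit-variance Gaussian variational distribution q(distortion | content). The function names and the `predict_mean` argument are hypothetical stand-ins for a learned network, and this is not the authors' implementation. Such an estimate upper-bounds the true MI only when q approximates the true conditional distribution well.

```python
import math
import random


def gaussian_log_prob(y, mean):
    """Log-density of y under a unit-variance diagonal Gaussian centred at mean."""
    sq_dist = sum((yi - mi) ** 2 for yi, mi in zip(y, mean))
    return -0.5 * sq_dist - 0.5 * len(y) * math.log(2.0 * math.pi)


def club_upper_bound(content, distortion, predict_mean):
    """CLUB-style sample estimate of MI between paired representations.

    content, distortion: lists of feature vectors, paired by index.
    predict_mean: hypothetical stand-in for a learned network that maps a
                  content vector to the mean of q(distortion | content).
    """
    n = len(content)
    means = [predict_mean(c) for c in content]
    # Positive term: average log-likelihood of the true (paired) samples.
    positive = sum(
        gaussian_log_prob(distortion[i], means[i]) for i in range(n)
    ) / n
    # Negative term: average log-likelihood over all n * n pairings,
    # which approximates sampling from the product of the marginals.
    negative = sum(
        gaussian_log_prob(distortion[j], means[i])
        for i in range(n)
        for j in range(n)
    ) / (n * n)
    return positive - negative


if __name__ == "__main__":
    rng = random.Random(0)
    content = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(64)]
    # Fully entangled toy case: the distortion vector copies the content vector.
    entangled = [list(c) for c in content]
    print(club_upper_bound(content, entangled, lambda c: c))
```

In the entangled toy case the estimate is positive, whereas a predictor that ignores its input yields an estimate of zero. In a training loop, a value like this would typically be added to the quality-regression loss as a weighted penalty; the weighting is a design choice that this text does not specify.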