Plain-PCQA: No-Reference Point Cloud Quality Assessment by Analysis of Plain Visual and Geometrical Components
Xiongli Chai, Feng Shao, Baoyang Mu, Hangwei Chen, Qiuping Jiang, Yo-Sung Ho
DOI: https://doi.org/10.1109/tcsvt.2024.3350180
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Research on Point Cloud Quality Assessment (PCQA) has followed two main pathways: 2D projections and 3D point descriptors. The former primarily focuses on visual information, while the latter concentrates on crucial geometrical information in three-dimensional space. However, current studies lack a thorough investigation of the impact of visual components and seldom pay special attention to plane-point fusion strategies. To comprehensively represent features and effectively tackle various types of impairments, we propose an end-to-end learning paradigm called Plain-PCQA, which considers only plain visual and geometrical factors, to quantitatively evaluate objective quality metrics of dense 3D point clouds that are associated with human perception. Firstly, we explore a sophisticated preprocessing technique. Each entire point cloud is packaged into six projections by moving virtual cameras, which conveniently increases the number of visual samples available during training. Given the high resolution of the projected images, we adopt a relatively lightweight network, ResNet-18, as the backbone so that higher-resolution inputs can be used. Five cropped patches from each projected image are fed into this network together. Because the projections contain some invalid information, a mask weight is devised to compute the significance of each patch from its effective information content. Secondly, dual neural networks, comprising a No-Reference (NR) branch and a Degraded-Reference (DR) branch, are designed with fundamental visual components to provide quantitative quality metrics. Specifically, the NR branch uses the feature output of each block of a Vision Transformer (ViT) model to obtain long-range low-level and high-level visual NR quality.
The DR branch employs the Karhunen-Loève Transform (KLT) to extract the principal component information of an image as a macro-structural image, and then feeds the difference between the input image and its macro-structural image into a network for DR quality extraction. Thirdly, a Plane-Point Interaction Transformer (P2IT) is presented that combines texture and semantic features from 2D projections with geometrical features from 3D space, characterizing the complete features with a connected 2D-3D feature representation. With these elaborately designed deep features, the proposed model achieves competitive performance while relying solely on plain visual and geometrical components. Experimental results on multiple representative databases demonstrate the potential of the proposed approach, which significantly surpasses existing state-of-the-art methods.
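The mask weight mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption made for illustration: the patches are taken as four corner crops plus a center crop, a boolean `valid_mask` marks pixels that are actually covered by projected points, and each patch's weight is its fraction of valid pixels, normalized across the five patches. The function names `five_crop_boxes` and `patch_mask_weights` are hypothetical.

```python
import numpy as np


def five_crop_boxes(height, width, size):
    """Return five (top, left) crop origins: four corners plus the center.

    The corner-plus-center layout is an assumption; the abstract only says
    that five patches are cropped from each projection.
    """
    return [
        (0, 0),
        (0, width - size),
        (height - size, 0),
        (height - size, width - size),
        ((height - size) // 2, (width - size) // 2),
    ]


def patch_mask_weights(valid_mask, boxes, size):
    """Weight each patch by its share of valid (non-background) pixels.

    valid_mask: (H, W) boolean array, True where a point was projected.
    Returns weights that sum to 1; falls back to uniform weights if no
    patch contains any valid pixel.
    """
    ratios = np.array(
        [valid_mask[t:t + size, l:l + size].mean() for (t, l) in boxes],
        dtype=np.float64,
    )
    total = ratios.sum()
    if total == 0:
        return np.full(len(boxes), 1.0 / len(boxes))
    return ratios / total


# Toy projection: only the top-left quadrant contains projected points.
mask = np.zeros((8, 8), dtype=bool)
mask[:4, :4] = True
boxes = five_crop_boxes(8, 8, 4)
weights = patch_mask_weights(mask, boxes, 4)
```

In this toy case the top-left patch is fully valid and the center patch overlaps the valid region by a quarter, so those two patches share all of the weight while the empty patches contribute nothing to the pooled score.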
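The KLT step of the DR branch can be sketched in a similar way. This is a generic patch-based KLT (equivalent to PCA over image blocks), not the authors' implementation: the block size `patch`, the number of retained components `k`, the grayscale input, and the function name `klt_macro_structure` are all assumptions chosen for illustration. The sketch keeps the top-`k` principal components as the macro-structural image and returns the difference from the input, which is the signal the abstract says is fed to the DR network.

```python
import numpy as np


def klt_macro_structure(image, patch=8, k=4):
    """Split a grayscale image into macro-structure and residual via KLT.

    image: (H, W) array; it is cropped so that H and W are multiples of
    `patch`. Returns (macro, residual) with macro + residual == cropped image.
    """
    h, w = image.shape
    h, w = h - h % patch, w - w % patch
    img = image[:h, :w].astype(np.float64)

    # Rearrange into one row per non-overlapping patch: (num_patches, patch*patch).
    blocks = (
        img.reshape(h // patch, patch, w // patch, patch)
        .swapaxes(1, 2)
        .reshape(-1, patch * patch)
    )
    mean = blocks.mean(axis=0)
    centered = blocks - mean

    # KLT basis: eigenvectors of the patch covariance matrix.
    cov = centered.T @ centered / max(len(blocks) - 1, 1)
    _, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :k]       # top-k principal directions

    # Project onto the top-k components and map back to pixel space.
    recon = centered @ basis @ basis.T + mean
    macro = (
        recon.reshape(h // patch, w // patch, patch, patch)
        .swapaxes(1, 2)
        .reshape(h, w)
    )
    return macro, img - macro


rng = np.random.default_rng(0)
demo = rng.random((32, 32))
macro, residual = klt_macro_structure(demo, patch=8, k=4)
```

Keeping fewer components yields a smoother macro-structural image and a larger residual; keeping all `patch * patch` components reproduces the input exactly and leaves a zero residual.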