Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment

Junyong You, Zheng Zhang
DOI: https://doi.org/10.48550/arXiv.2203.14557
2022-08-20
Abstract: Visual (image, video) quality assessments can be modelled by visual features in different domains, e.g., spatial, frequency, and temporal domains. Perceptual mechanisms in the human visual system (HVS) play a crucial role in the generation of quality perception. This paper proposes a general framework for no-reference visual quality assessment using efficient windowed transformer architectures. A lightweight module for multi-stage channel attention is integrated into Swin (shifted window) Transformer. Such a module can represent appropriate perceptual mechanisms in image quality assessment (IQA) to build an accurate IQA model. Meanwhile, representative features for image quality perception in the spatial and frequency domains can also be derived from the IQA model, which are then fed into another windowed transformer architecture for video quality assessment (VQA). The VQA model efficiently reuses attention information across local windows to tackle the issue of expensive time and memory complexities of the original transformer. Experimental results on both large-scale IQA and VQA databases demonstrate that the proposed quality assessment models outperform other state-of-the-art models by large margins. The complete source code will be published on GitHub.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses image and video quality assessment when no reference image or video is available (no-reference image and video quality assessment, NR-IQA and NR-VQA). It proposes a general framework for no-reference visual quality assessment built on efficient windowed Transformer architectures. Its main contributions are:

1. **Image Quality Assessment (IQA)**:
   - A Multi-Stage Channel Attention (MSCA) module is proposed and integrated into the Swin Transformer to simulate the contrast sensitivity mechanism of the human visual system.
   - Features are extracted at different resolution scales, which improves the accuracy of image quality perception and provides a foundation for video quality assessment.
2. **Video Quality Assessment (VQA)**:
   - A Locally Shared Attention (LSA) mechanism processes the quality features of video frames, and a global Transformer encoder then generates the video quality prediction.
   - The video sequence is divided into non-overlapping segments, the quality features within each segment are processed, and the overall video quality score is produced by pooling in the time domain.

### Specific Problems Solved by the Paper

1. **No-Reference Image Quality Assessment**:
   - Existing no-reference image quality assessment methods mainly target specific types of distortion, such as compression and transmission artifacts. They usually rely on hand-designed features and generalize poorly to multiple distortion types.
   - The proposed MSCA-Swin Transformer model uses multi-stage channel attention to better simulate the perceptual mechanisms of the human visual system, which improves the accuracy of no-reference image quality assessment.
2. **No-Reference Video Quality Assessment**:
   - Video quality assessment is more complex than image quality assessment because it must account for spatial and temporal characteristics at the same time. Existing methods usually require large amounts of computing resources and are difficult to apply directly to long video sequences.
   - The proposed LSAT-VQA model combines the locally shared attention mechanism with the global Transformer encoder to reduce computational complexity while maintaining assessment performance.

### Technical Details

1. **MSCA-Swin Transformer**:
   - **Multi-Stage Channel Attention**: Channel attention is applied at different stages of the Swin Transformer to simulate the contrast sensitivity mechanism of the human visual system.
   - **Adaptive Spatial Average Pooling**: To avoid the additional distortion introduced by image scaling, adaptive spatial average pooling adapts images of any resolution to a fixed-size input.
   - **Feature Fusion**: Features extracted at different stages are fused, which improves the representational ability and accuracy of the model.
2. **LSAT-VQA**:
   - **Locally Shared Attention**: The same attention block is shared within a video segment, which reduces model complexity and models the uneven contribution of different frames to the segment quality.
   - **Global Transformer Encoder**: The feature vectors of all segments are processed by the global Transformer encoder to generate the final video quality score.
   - **Zero-Padding and Masking Operations**: Video sequences of different lengths are zero-padded to a common length, and masking excludes the padded frames from the prediction.

### Experimental Results

The paper reports experiments on large-scale IQA and VQA databases, in which the proposed models outperform other state-of-the-art models by large margins.
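The zero-padding, masking, segmentation, and temporal pooling steps listed under Technical Details can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one scalar quality value per frame rather than a feature vector, and a masked mean stands in for both the locally shared attention block and the global Transformer encoder. The function names and the `seg_len` and `target_len` parameters are hypothetical.

```python
from typing import List, Tuple


def pad_and_mask(scores: List[float], target_len: int) -> Tuple[List[float], List[int]]:
    """Zero-pad per-frame values to target_len and return a 0/1 validity mask."""
    if len(scores) > target_len:
        raise ValueError("sequence is longer than target_len")
    pad = target_len - len(scores)
    return scores + [0.0] * pad, [1] * len(scores) + [0] * pad


def split_segments(values: List[float], seg_len: int) -> List[List[float]]:
    """Split a sequence into non-overlapping segments of seg_len items each."""
    if seg_len <= 0 or len(values) % seg_len != 0:
        raise ValueError("sequence length must be a positive multiple of seg_len")
    return [values[i:i + seg_len] for i in range(0, len(values), seg_len)]


def masked_mean(values: List[float], mask: List[int]) -> float:
    """Average only the entries whose mask is 1, so padding has no influence."""
    valid = sum(mask)
    if valid == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / valid


def video_score(frame_scores: List[float], seg_len: int, target_len: int) -> float:
    """Pad, segment, pool within each segment, then pool across segments."""
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    padded, mask = pad_and_mask(frame_scores, target_len)
    seg_values = split_segments(padded, seg_len)
    seg_masks = split_segments(mask, seg_len)
    # Assumption: a masked mean replaces the attention block within each segment.
    seg_scores = [
        masked_mean(v, m) for v, m in zip(seg_values, seg_masks) if sum(m) > 0
    ]
    # Assumption: a plain mean replaces the global encoder and temporal pooling.
    return sum(seg_scores) / len(seg_scores)


# Five real frames padded to six, processed in segments of two frames.
example_score = video_score([4.0, 2.0, 3.0, 5.0, 1.0], seg_len=2, target_len=6)
```

In the example, the last segment holds one real frame and one padded frame. The mask keeps the padded zero from pulling that segment's value down, which is the role the summary attributes to the masking operation.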
### Summary

By combining the multi-stage channel attention mechanism and the locally shared attention mechanism, this paper proposes an efficient and accurate no-reference image and video quality assessment framework. This framework not only improves the accuracy of assessment but also significantly reduces the computational complexity, providing a feasible solution for practical applications.
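As a rough illustration of channel attention, here is a sketch of the generic squeeze-and-excitation pattern in plain Python. The summary does not specify the internals of the MSCA module, so this should not be read as the paper's design. The weight matrices `w_down` and `w_up`, the bottleneck layout, and the function names are all assumptions. A multi-stage variant would apply such a block to the feature maps of several backbone stages.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(weights: Matrix, x: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def channel_attention(feature: List[Matrix], w_down: Matrix, w_up: Matrix) -> List[Matrix]:
    """Reweight each channel of a C x H x W feature map by a gate in (0, 1).

    w_down has shape (C_reduced x C) and w_up has shape (C x C_reduced);
    both are hypothetical learned weights supplied by the caller.
    """
    # Squeeze: one global average per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]
    # Excitation: bottleneck projection with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, h) for h in matvec(w_down, squeezed)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_up, hidden)]
    # Scale: multiply every value in a channel by that channel's gate.
    return [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature, gates)
    ]


# Two channels of size 2 x 2 with illustrative (not learned) weights.
example_feature = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
example_w_down = [[0.5, -0.5]]   # 2 channels -> 1 hidden unit
example_w_up = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channel gates
example_output = channel_attention(example_feature, example_w_down, example_w_up)
```

The output keeps the input shape. Only the relative weight of each channel changes, which is how a channel attention block can emphasize some feature channels over others.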