Abstract: Visual (image, video) quality assessments can be modelled by visual features in different domains, e.g., the spatial, frequency, and temporal domains. Perceptual mechanisms in the human visual system (HVS) play a crucial role in the generation of quality perception. This paper proposes a general framework for no-reference visual quality assessment using efficient windowed transformer architectures. A lightweight module for multi-stage channel attention is integrated into the Swin (shifted window) Transformer. Such a module can represent appropriate perceptual mechanisms in image quality assessment (IQA) to build an accurate IQA model. Meanwhile, representative features for image quality perception in the spatial and frequency domains can also be derived from the IQA model, which are then fed into another windowed transformer architecture for video quality assessment (VQA). The VQA model efficiently reuses attention information across local windows to tackle the expensive time and memory complexities of the original transformer. Experimental results on both large-scale IQA and VQA databases demonstrate that the proposed quality assessment models outperform other state-of-the-art models by large margins. The complete source code will be published on GitHub.

What problem does this paper attempt to address?

The problem this paper addresses is assessing image and video quality without reference images or videos (no-reference image and video quality assessment, NR-IQA and NR-VQA). Specifically, the paper proposes a general framework for no-reference visual quality assessment based on an efficient windowed Transformer architecture. The main contributions of the paper are:

1. **Image Quality Assessment (IQA)**:
   - A Multi-Stage Channel Attention (MSCA) module is proposed and integrated into the Swin Transformer to simulate the contrast sensitivity mechanism of the human visual system.
   - Features are extracted at different resolution scales, which improves the accuracy of image quality perception and provides a solid foundation for video quality assessment.
2. **Video Quality Assessment (VQA)**:
   - A Locally Shared Attention (LSA) mechanism processes the quality features of video frames, and a global Transformer encoder then generates the video quality prediction.
   - The video sequence is divided into non-overlapping segments, the quality features within each segment are processed, and the overall video quality score is produced by pooling in the time domain.

### Specific Problems Solved by the Paper

1. **No-Reference Image Quality Assessment**:
   - Existing no-reference image quality assessment methods mainly focus on specific types of distortion, such as compression and transmission artifacts. These methods usually rely on hand-designed feature engineering and are difficult to generalize to multiple distortion types.
   - The MSCA-Swin Transformer model proposed in this paper uses multi-stage channel attention to better simulate the perception mechanisms of the human visual system, thereby improving the accuracy of no-reference image quality assessment.
2. **No-Reference Video Quality Assessment**:
   - Video quality assessment is more complex than image quality assessment because it must consider spatial and temporal characteristics together. Existing methods usually require a large amount of computing resources and are difficult to apply directly to long video sequences.
   - The LSAT-VQA model proposed in this paper reduces computational complexity while maintaining assessment performance by combining the locally shared attention mechanism with the global Transformer encoder.

### Technical Details

1. **MSCA-Swin Transformer**:
   - **Multi-Stage Channel Attention**: Channel attention is applied at different stages of the Swin Transformer to simulate the contrast sensitivity mechanism of the human visual system.
   - **Adaptive Spatial Average Pooling**: To avoid the additional distortion introduced by image scaling, adaptive spatial average pooling is used to adapt images of any resolution to a fixed-size input.
   - **Feature Fusion**: Features are extracted at different stages and fused, which improves the representational ability and accuracy of the model.
2. **LSAT-VQA**:
   - **Locally Shared Attention**: The same attention block is shared within a video segment, which reduces model complexity and models the uneven contribution of different frames to the segment quality.
   - **Global Transformer Encoder**: The feature vectors of all segments are processed by the global Transformer encoder to generate the final video quality score.
   - **Zero-Padding and Masking Operations**: Video sequences of different lengths are handled by zero-padding, and masking excludes the influence of the padded frames.

### Experimental Results

The paper reports experiments on large-scale IQA and VQA databases, and the results show that the proposed models significantly outperform other state-of-the-art models.
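The multi-stage channel attention and feature fusion listed under Technical Details can be illustrated with a small sketch. It is not the paper's implementation: it assumes a squeeze-and-excitation style gate (token averaging, a bottleneck layer, a sigmoid), untrained random weights, toy feature maps in place of real Swin Transformer stages, and fusion by simple concatenation. All function names and sizes here are hypothetical.

```python
import math
import random


def channel_attention(tokens, w_reduce, w_expand):
    """Reweight channels of a (num_tokens x channels) feature map, SE-style."""
    num_tokens = len(tokens)
    channels = len(tokens[0])
    # Squeeze: average each channel over all spatial tokens.
    squeezed = [sum(row[c] for row in tokens) / num_tokens for c in range(channels)]
    # Excite: bottleneck layer with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, sum(s * w for s, w in zip(squeezed, row))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(h * w for h, w in zip(hidden, row))))
             for row in w_expand]
    # Scale: multiply every token by the per-channel gates.
    scaled = [[v * g for v, g in zip(row, gates)] for row in tokens]
    return scaled, gates


def random_matrix(rows, cols, rng):
    """Toy helper: a rows x cols matrix of small random values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_stage_features(stage_maps, rng, reduction=2):
    """Apply channel attention per stage, pool over tokens, and concatenate."""
    fused = []
    for tokens in stage_maps:
        channels = len(tokens[0])
        hidden = max(1, channels // reduction)
        w_reduce = random_matrix(hidden, channels, rng)  # untrained toy weights
        w_expand = random_matrix(channels, hidden, rng)
        scaled, _ = channel_attention(tokens, w_reduce, w_expand)
        # Global average pooling over tokens gives one vector per stage.
        fused += [sum(row[c] for row in scaled) / len(scaled) for c in range(channels)]
    return fused


rng = random.Random(0)
# Toy stand-ins for three backbone stages: fewer tokens, more channels per stage.
stage_maps = [random_matrix(16, 4, rng), random_matrix(4, 8, rng), random_matrix(1, 16, rng)]
features = multi_stage_features(stage_maps, rng)
print(len(features))  # 4 + 8 + 16 = 28 fused feature values
```

In a trained model the gate weights would be learned, so channels that matter more for perceived quality at a given stage receive gates closer to 1.
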
### Summary

By combining the multi-stage channel attention mechanism and the locally shared attention mechanism, this paper proposes an efficient and accurate no-reference image and video quality assessment framework. The framework improves assessment accuracy while significantly reducing computational complexity, providing a feasible solution for practical applications.
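The segment-wise processing, zero-padding, and masking described for the VQA model can be sketched as follows. This is a simplified illustration under stated assumptions, not the LSAT-VQA architecture: a single scoring vector (`shared_query`, hypothetical) stands in for the attention block shared by all segments, and a plain average over segments stands in for the global Transformer encoder and regression head.

```python
import math


def pad_and_mask(frames, segment_len):
    """Zero-pad a list of frame feature vectors to a multiple of segment_len."""
    dim = len(frames[0])
    pad = (-len(frames)) % segment_len
    padded = frames + [[0.0] * dim for _ in range(pad)]
    mask = [True] * len(frames) + [False] * pad  # False marks padded frames
    return padded, mask


def segment_feature(frames, mask, shared_query):
    """Attention-pool one segment using parameters shared by every segment."""
    # Padded frames get a score of -inf, so their softmax weight is exactly 0.
    scores = [sum(f * q for f, q in zip(frame, shared_query)) if valid else -math.inf
              for frame, valid in zip(frames, mask)]
    top = max(scores)  # finite, because padding never fills a whole segment
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]


def video_quality(frames, segment_len, shared_query):
    """Split into non-overlapping segments, pool each one, then pool over time."""
    padded, mask = pad_and_mask(frames, segment_len)
    segments = []
    for start in range(0, len(padded), segment_len):
        end = start + segment_len
        segments.append(segment_feature(padded[start:end], mask[start:end], shared_query))
    # Stand-in for the global encoder: average the first feature of each segment.
    return sum(seg[0] for seg in segments) / len(segments)


# Five frames with two toy quality features each; segment_len=2 pads one frame.
frames = [[0.8, 0.1], [0.6, 0.3], [0.4, 0.2], [0.9, 0.0], [0.5, 0.5]]
score = video_quality(frames, segment_len=2, shared_query=[1.0, -1.0])
print(round(score, 3))
```

Because the softmax weights differ per frame, frames within a segment contribute unevenly to the segment feature, while the mask keeps zero-padded frames from affecting the score.
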

Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment
