Abstract: Existing face super-resolution (FSR) methods have made significant advancements, but they primarily super-resolve faces from limited visual information, the original pixel-wise space in particular, and commonly overlook pluralistic clues such as higher-order depth and semantics, as well as non-visual inputs (text captions and descriptions). Consequently, these methods struggle to produce a unified and meaningful representation from the input face. We suppose that introducing a pluralistic language-vision representation into this unexplored embedding space could enhance FSR by encoding and exploiting the complementarity across language-vision priors. This motivates us to propose a new framework called LLV-FSR, which marries the power of large vision-language models and higher-order visual priors with the challenging task of FSR. Specifically, besides directly absorbing knowledge from the original input, we introduce pre-trained vision-language models to generate pluralistic priors, including the image caption, descriptions, face semantic mask and depth. These priors are then employed to guide the more critical feature representation, facilitating realistic and high-quality face super-resolution. Experimental results demonstrate that our proposed framework significantly improves both reconstruction quality and perceptual quality, surpassing the SOTA by 0.43 dB in terms of PSNR on the MMCelebA-HQ dataset.
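To put the reported 0.43 dB PSNR gain in perspective, the sketch below computes PSNR with its standard definition, 10 · log10(MAX² / MSE), and converts a PSNR gain into the corresponding change in mean squared error. The function names and the 8-bit peak value of 255 are assumptions made for illustration; the abstract does not state the exact evaluation protocol (colour space, cropping, and so on).

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """PSNR in dB between two equally sized, flattened pixel sequences.

    max_value=255.0 assumes 8-bit images; this is an assumption, not a
    detail taken from the paper.
    """
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # A 0.43 dB gain corresponds to an MSE ratio of about 0.906.
    print(round(mse_ratio_for_gain(0.43), 3))
```

Under this definition, a 0.43 dB improvement means the mean squared error is roughly 9 to 10 percent lower than that of the compared method, all else being equal.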

What problem does this paper attempt to address?

This paper addresses a limitation of existing face super-resolution (FSR) methods: when handling low-resolution (LR) face images, they rely mainly on limited visual information (such as the original pixel space) and ignore high-level depth and semantic information, as well as non-visual inputs (such as text descriptions). As a result, these methods struggle to generate unified and meaningful face representations, and both their reconstruction quality and their perceptual quality suffer. Specifically, the paper makes two points:

1. **Limitations of existing methods**: Existing FSR methods rely too heavily on specific visual priors (such as face parsing maps and heat maps), which makes it hard for them to handle uncertainty and to generalize. They also focus mainly on visual perception and ignore non-visual language and text information, which leads to incomplete scene representations and degrades face reconstruction.
2. **Introducing language-vision priors**: To overcome these problems, the paper proposes a new framework, LLV-FSR, which enhances the FSR task by combining large-scale vision-language models with high-level visual priors. LLV-FSR uses pre-trained vision-language models to generate several kinds of prior information, including image captions, descriptions, face semantic masks and depth maps, and then uses these priors to guide the more critical feature representations, thereby achieving high-quality face super-resolution.

### Main contributions

1. **First attempt to combine large-scale language models and high-level visual priors**: LLV-FSR brings the capabilities of large language models and high-level visual priors into the FSR task.
2. **An effective language-vision prior fusion module**: A carefully designed language-vision prior fusion block (LVPFB) makes full use of the complementary information in the language-vision representation and alleviates the ill-posed nature of the FSR problem.
3. **Superior experimental performance**: LLV-FSR reaches state-of-the-art results in both visual quality and quantitative metrics; its PSNR on the MMCelebA-HQ dataset is 0.43 dB higher than that of the best existing method.

### Method overview

The main process of LLV-FSR is as follows:

1. **Extract visual features**: Extract initial visual features from the low-resolution face image through convolutional layers.
2. **Generate language-vision priors**: Use pre-trained large-scale vision-language models (such as BLIP2, ChatGPT, SAM, DAM) to generate text captions, descriptions, semantic masks and depth maps as prior information.
3. **Fuse prior information**: Fuse these priors with the visual features through the language-vision prior fusion block (LVPFB) to make full use of their complementarity.
4. **Generate final results**: Generate the high-resolution face image through the feature reconstruction layer.

In this way, LLV-FSR improves reconstruction quality and also makes the resulting face images more realistic and natural.
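The four steps in the method overview can be read as a simple data flow. The sketch below is a toy skeleton of that flow only: every stage is a placeholder, the image is a flat list of pixel values, and all names (`Priors`, `extract_features`, `generate_priors`, `fuse`, `reconstruct`, `super_resolve`) are hypothetical. It does not reproduce the LVPFB or any pre-trained model; it just shows which outputs feed which stage.

```python
from dataclasses import dataclass
from typing import List

Feature = List[float]


@dataclass
class Priors:
    caption: str       # short text caption (from a captioning model in the summary)
    description: str   # longer text description
    mask: Feature      # semantic face mask, one value per pixel
    depth: Feature     # depth map, one value per pixel


def extract_features(lr_pixels: Feature) -> Feature:
    # Step 1 placeholder: the paper uses convolutional layers; here we
    # only rescale 8-bit pixel values to [0, 1].
    return [p / 255.0 for p in lr_pixels]


def generate_priors(lr_pixels: Feature) -> Priors:
    # Step 2 placeholder: a real system would query pre-trained models.
    # These constant outputs are invented for illustration.
    n = len(lr_pixels)
    return Priors("a face", "a frontal photo of a face", [1.0] * n, [0.5] * n)


def embed_text(text: str) -> float:
    # Hypothetical stand-in for a text encoder: a scalar in [0, 1].
    return min(len(text.split()) / 10.0, 1.0)


def fuse(features: Feature, priors: Priors) -> Feature:
    # Step 3 placeholder (NOT the LVPFB): modulate each feature by the
    # visual priors and add a shared text-derived offset.
    text_gate = 0.5 * (embed_text(priors.caption) + embed_text(priors.description))
    return [f * (1.0 + m * d) + text_gate
            for f, m, d in zip(features, priors.mask, priors.depth)]


def reconstruct(fused: Feature, scale: int = 2) -> Feature:
    # Step 4 placeholder: nearest-neighbour upsampling of the 1-D signal.
    return [v for v in fused for _ in range(scale)]


def super_resolve(lr_pixels: Feature, scale: int = 2) -> Feature:
    features = extract_features(lr_pixels)
    priors = generate_priors(lr_pixels)
    return reconstruct(fuse(features, priors), scale)


if __name__ == "__main__":
    print(len(super_resolve([0.0, 128.0, 255.0], scale=2)))
```

The point of keeping the priors in one structure is that text and visual priors are generated once from the LR input and then consumed together by the fusion stage, which matches the ordering of steps 2 and 3 above.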

LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Face Super-Resolution Via Bilayer Contextual Representation

Spatial-Frequency Mutual Learning for Face Super-Resolution

FSRNet: End-to-End Learning Face Super-Resolution with Facial Priors.

OPLS-SR: A Novel Face Super-Resolution Learning Method Using Orthonormalized Coherent Features

CFGPFSR: A Generative Method Combining Facial and GAN Priors for Face Super-Resolution

Rethinking Prior-Guided Face Super-Resolution: A New Paradigm with Facial Component Prior.

VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition

MSFSR: A Multi-Stage Face Super-Resolution with Accurate Facial Representation Via Enhanced Facial Boundaries.

MSRFSR: Multi-Stage Refining Face Super-Resolution With Iterative Collaboration Between Face Recovery and Landmark Estimation

Multi-level landmark-guided deep network for face super-resolution

Deep Face Super-Resolution with Iterative Collaboration between Attentive Recovery and Landmark Estimation

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

A Unified Framework to Super-Resolve Face Images of Varied Low Resolutions

Semantic Lens: Instance-Centric Semantic Alignment for Video Super-Resolution

Deep Selective Combinatorial Embedding and Consistency Regularization for Light Field Super-resolution

Light Field Spatial Super-resolution via Deep Combinatorial Geometry Embedding and Structural Consistency Regularization

Robust Face Super-Resolution Via Position Relation Model Based on Global Face Context.

Super-Resolving Face Image by Facial Parsing Information

FoPru: Focal Pruning for Efficient Large Vision-Language Models