Abstract: This paper presents UniVST, a unified framework for localized video style transfer. It requires no training, which gives it a distinct advantage over existing methods that transfer style across entire videos. The contributions of this paper are: (1) A point-matching mask propagation strategy that leverages feature maps from DDIM inversion. This streamlines the model's architecture by removing the need for tracking models. (2) An AdaIN-guided style transfer mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding-window smoothing strategy that uses optical flow in the pixel representation and refines the predicted noise to update the latent space. This significantly enhances temporal consistency and reduces artifacts in the output video. UniVST outperforms existing methods in both quantitative and qualitative evaluations. It addresses the challenges of preserving the primary object's style while ensuring temporal consistency and detail preservation.

What problem does this paper attempt to address?

This paper addresses three key problems in video style transfer:

1. **Lack of fine-grained control**: Existing methods usually offer only coarse style control over the main objects in a video. With such coarse control, the model may fail to capture details, producing unexpected style transfer results (see Figure 1(a)).
2. **Balance between content fidelity and style richness**: Video style transfer must balance preserving the original video content against adding artistic style. Over-emphasizing style richness may blur the original layout, while over-emphasizing content fidelity may make the style transfer barely noticeable (see Figure 1(b)).
3. **Temporal consistency**: Unlike image style transfer, video style transfer must keep frames coherent with one another. Directly applying image style transfer techniques to videos may lead to inter-frame inconsistency, which shows up as flickering and artifacts (see Figure 1(c)).

To solve these problems, the paper proposes **UniVST** (Unified Video Style Transfer), a unified, training-free framework designed specifically for localized video style transfer. Its main contributions are:

- **Point-matching mask propagation strategy**: Feature maps from the DDIM inversion process are used to capture correlations between frames, which simplifies the model architecture and avoids the need for a tracking model.
- **AdaIN-guided style transfer mechanism**: This mechanism operates in the latent space and in the attention layers, balancing content fidelity against style richness and reducing the loss of local details.
- **Sliding-window smoothing strategy**: Based on optical flow, it refines the predicted noise and updates the latent space, significantly improving the temporal consistency of the edited video and reducing artifacts.
Through these innovations, UniVST outperforms existing methods in both quantitative and qualitative evaluations, especially in maintaining the style of main objects, ensuring temporal consistency, and preserving detail.

### Formula summary

1. **DDIM denoising formula**:

   \[ Z_{t-1} = \sqrt{\alpha_{t-1}}\, Z_{t\rightarrow 0} + \sqrt{1-\alpha_{t-1}}\, \epsilon_\theta(Z_t, t, C) \]

   where \( Z_{t\rightarrow 0} \) is the estimate of \( Z_0 \) at time step \( t \):

   \[ Z_{t\rightarrow 0} = \frac{Z_t - \sqrt{1-\alpha_t}\, \epsilon_\theta(Z_t, t, C)}{\sqrt{\alpha_t}} \]

2. **AdaIN operation formula**:

   \[ \text{AdaIN}(Z_t, Z_s^t) = \sigma(Z_s^t)\left(\frac{Z_t - \mu(Z_t)}{\sigma(Z_t)}\right) + \mu(Z_s^t) \]

   where \( \mu(\cdot) \) and \( \sigma(\cdot) \) denote the mean and standard deviation, respectively.

3. **Sliding-window smoothing formula**:

   \[ \bar{P}_i^t \leftarrow \frac{1}{2m+1} \sum_{j=i-m}^{i+m} \text{Warp}(P_i^t, P_j^t) \]

   which averages over a window of \( 2m+1 \) frames centered on frame \( i \). The updated latent representation is:

   \[ Z_{t-1} \leftarrow \sqrt{\alpha_{t-1}}\, \bar{Z}_{t\rightarrow 0} + \sqrt{1-\alpha_{t-1}}\, \bar{\epsilon}_t \]

   where \( \bar{Z}_{t\rightarrow 0} \) and \( \bar{\epsilon}_t \) are the smoothed estimate and the correspondingly refined predicted noise.

Together, these formulas describe how UniVST denoises the latent, injects style statistics, and enforces temporal consistency.
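The three formulas above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: each latent or frame is simplified to a flat list of floats (a single channel), the function names `predict_z0`, `ddim_step`, `adain`, and `smooth_window` are hypothetical, the `warp` callable stands in for optical-flow alignment and defaults to an identity (no-motion) placeholder, and clamping the window at the first and last frame is an assumption the formula itself does not specify.

```python
import math
from statistics import fmean, pstdev


def predict_z0(z_t, eps, alpha_t):
    """Estimate Z_{t->0} from the noisy latent Z_t and the predicted noise."""
    return [(z - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
            for z, e in zip(z_t, eps)]


def ddim_step(z_t, eps, alpha_t, alpha_prev):
    """One deterministic DDIM denoising step Z_t -> Z_{t-1}."""
    z0 = predict_z0(z_t, eps, alpha_t)
    return [math.sqrt(alpha_prev) * z + math.sqrt(1.0 - alpha_prev) * e
            for z, e in zip(z0, eps)]


def adain(content, style, tiny=1e-8):
    """Re-normalize `content` to the mean and standard deviation of `style`.

    `tiny` guards against division by zero for constant content (assumption).
    """
    mu_c, sd_c = fmean(content), pstdev(content)
    mu_s, sd_s = fmean(style), pstdev(style)
    return [sd_s * (c - mu_c) / (sd_c + tiny) + mu_s for c in content]


def smooth_window(frames, m, warp=lambda p_i, p_j: p_j):
    """Average each frame with its 2m+1 neighbours after aligning them.

    `warp(p_i, p_j)` should align frame j to frame i; the default identity
    is a placeholder for real optical-flow warping. Out-of-range indices
    are clamped to the first/last frame (assumption).
    """
    n = len(frames)
    smoothed = []
    for i in range(n):
        acc = [0.0] * len(frames[i])
        for j in range(i - m, i + m + 1):
            jj = min(max(j, 0), n - 1)
            aligned = warp(frames[i], frames[jj])
            acc = [a + v for a, v in zip(acc, aligned)]
        smoothed.append([a / (2 * m + 1) for a in acc])
    return smoothed
```

For example, `adain([0.0, 1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0])` rescales the content values so that their mean and standard deviation match those of the style values, and `ddim_step` called with `alpha_prev == alpha_t` returns the input latent unchanged, which is a quick sanity check that the two DDIM formulas are consistent with each other.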

UniVST: A Unified Framework for Training-free Localized Video Style Transfer
