UniVST: A Unified Framework for Training-free Localized Video Style Transfer

Quanjian Song, Mingbao Lin, Wengyi Zhan, Shuicheng Yan, Liujuan Cao
2024-10-26
Abstract: This paper presents UniVST, a unified framework for localized video style transfer. It operates without the need for training, offering a distinct advantage over existing methods that transfer style across entire videos. The endeavors of this paper comprise: (1) A point-matching mask propagation strategy that leverages feature maps from the DDIM inversion. This streamlines the model's architecture by obviating the need for tracking models. (2) An AdaIN-guided style transfer mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding window smoothing strategy that harnesses optical flow within the pixel representation and refines predicted noise to update the latent space. This significantly enhances temporal consistency and diminishes artifacts in video outputs. Our proposed UniVST has been validated to be superior to existing methods in quantitative and qualitative metrics. It adeptly addresses the challenges of preserving the primary object's style while ensuring temporal consistency and detail preservation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three key problems in video style transfer:

1. **Lack of fine-grained control**: Existing methods offer only coarse style control over the main objects in a video. With such coarse control the model can fail to capture fine details, producing unintended stylization results (see Figure 1(a)).
2. **Balance between content fidelity and style richness**: Video style transfer must balance preserving the original video content against applying the artistic style. Over-emphasizing style richness can blur the original layout, while over-emphasizing content fidelity can leave the style barely visible (see Figure 1(b)).
3. **Temporal consistency**: Unlike image style transfer, video style transfer must keep consecutive frames coherent. Applying image style transfer techniques frame by frame can cause inter-frame inconsistency, which shows up as flickering and artifacts (see Figure 1(c)).

To solve these problems, the paper proposes **UniVST** (Unified Video Style Transfer), a unified, training-free framework designed for localized video style transfer. Its main contributions are:

- **Point-matching mask propagation strategy**: Correlations are captured from the feature maps produced during DDIM inversion, which simplifies the model architecture and removes the need for a separate tracking model.
- **AdaIN-guided style transfer mechanism**: This mechanism operates in both the latent space and the attention layers, balancing content fidelity against style richness and reducing the loss of local details.
- **Sliding-window smoothing strategy**: Using optical flow, it refines the predicted noise and updates the latent space, improving the temporal consistency of the edited video and reducing artifacts.
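The point-matching idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `propagate_mask`, the cosine-similarity measure, the top-`k` majority vote, and the flat `(N, D)` feature layout are all assumptions made for illustration. The feature arrays are hypothetical stand-ins for feature maps taken during DDIM inversion.

```python
import numpy as np

def propagate_mask(feat_ref, mask_ref, feat_tgt, k=3):
    """Propagate a binary mask from a reference frame to a target frame.

    feat_ref, feat_tgt: (N_ref, D) and (N_tgt, D) per-point features
        (hypothetical stand-ins for DDIM-inversion feature maps).
    mask_ref: (N_ref,) array of 0/1 labels for the reference points.
    k: number of most similar reference points that vote per target point
       (an assumed hyperparameter, not taken from the paper).
    """
    # L2-normalise so that the dot product equals cosine similarity.
    ref = feat_ref / (np.linalg.norm(feat_ref, axis=1, keepdims=True) + 1e-8)
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=1, keepdims=True) + 1e-8)
    sim = tgt @ ref.T                        # (N_tgt, N_ref) similarity matrix

    # Indices of the k most similar reference points for each target point.
    topk = np.argsort(-sim, axis=1)[:, :k]

    # Fraction of matched reference points that lie inside the mask.
    votes = mask_ref[topk].mean(axis=1)
    return (votes > 0.5).astype(np.uint8)
```

Calling `propagate_mask` once per frame, with the previous frame (or the first annotated frame) as the reference, carries the mask through the video without a separate tracking model, which is the property the summary attributes to this strategy.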
Through these innovations, UniVST outperforms existing methods in both quantitative and qualitative evaluations, particularly in preserving the style of the main objects, maintaining temporal consistency, and retaining detail.

### Formula summary

1. **DDIM denoising formula**:
\[ Z_{t-1} = \sqrt{\alpha_{t-1}}\, Z_{t\rightarrow 0} + \sqrt{1-\alpha_{t-1}}\, \epsilon_\theta(Z_t, t, C) \]
where \( Z_{t\rightarrow 0} \) is the estimate of \( Z_0 \) at time step \( t \):
\[ Z_{t\rightarrow 0} = \frac{Z_t - \sqrt{1-\alpha_t}\, \epsilon_\theta(Z_t, t, C)}{\sqrt{\alpha_t}} \]

2. **AdaIN operation formula**:
\[ \text{AdaIN}(Z_t, Z_s^t) = \sigma(Z_s^t)\left(\frac{Z_t - \mu(Z_t)}{\sigma(Z_t)}\right) + \mu(Z_s^t) \]
where \( \mu(\cdot) \) and \( \sigma(\cdot) \) denote the mean and standard deviation, respectively.

3. **Sliding-window smoothing formula**:
\[ \bar{P}_i^t \leftarrow \frac{1}{2m+1} \sum_{j=i-m}^{i+m} \text{Warp}(P_i^t, P_j^t) \]
The average runs over the \( 2m+1 \) frames in the window centered on frame \( i \). The updated latent representation is:
\[ Z_{t-1} \leftarrow \sqrt{\alpha_{t-1}}\, \bar{Z}_{t\rightarrow 0} + \sqrt{1-\alpha_{t-1}}\, \bar{\epsilon}_t \]

Together, these components account for UniVST's improvements in video style transfer.
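The formulas above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, the AdaIN statistics are assumed to be taken per channel over the spatial dimensions, and `window_average` replaces the optical-flow `Warp` with the identity and clips the window at the ends of the clip (neither choice comes from the source).

```python
import numpy as np

def adain(z_content, z_style, eps=1e-5):
    """AdaIN: shift and scale z_content to match z_style's statistics.

    Both inputs are (C, H, W). Assumption: mean and standard deviation are
    computed per channel over H and W. eps guards against division by zero.
    """
    mu_c = z_content.mean(axis=(1, 2), keepdims=True)
    sd_c = z_content.std(axis=(1, 2), keepdims=True)
    mu_s = z_style.mean(axis=(1, 2), keepdims=True)
    sd_s = z_style.std(axis=(1, 2), keepdims=True)
    return sd_s * (z_content - mu_c) / (sd_c + eps) + mu_s

def predict_z0(z_t, eps_pred, alpha_t):
    """Estimate Z_0 from Z_t and the predicted noise (the Z_{t->0} term)."""
    return (z_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)

def ddim_step(z_t, eps_pred, alpha_t, alpha_prev):
    """One deterministic DDIM step from time t to t-1."""
    z0 = predict_z0(z_t, eps_pred, alpha_t)
    return np.sqrt(alpha_prev) * z0 + np.sqrt(1.0 - alpha_prev) * eps_pred

def window_average(frames, m):
    """Average each frame with its neighbours within radius m.

    frames: (T, ...) array of per-frame predictions.
    Assumptions: Warp is the identity (no optical flow), and the window is
    clipped at the first and last frames, so boundary frames average fewer
    than 2m + 1 neighbours.
    """
    T = frames.shape[0]
    out = np.empty_like(frames, dtype=float)
    for i in range(T):
        lo, hi = max(0, i - m), min(T, i + m + 1)
        out[i] = frames[lo:hi].mean(axis=0)
    return out
```

A real pipeline would obtain `eps_pred` from the diffusion model's noise predictor and would align neighbouring frames with estimated optical flow before averaging. The identity warp here only shows the averaging structure of the smoothing step.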