Abstract: Style transfer on images has advanced significantly in recent years with deep convolutional neural networks (CNNs). However, directly applying image style transfer algorithms to each frame of a video independently often leads to flickering and unstable results. In this work, we present VTNet, a self-supervised space-time CNN-based method for online video style transfer, which is trained end-to-end from nearly unlimited unlabeled video data to produce temporally coherent stylized videos in real time. Specifically, VTNet consists of a temporal prediction branch and a stylizing branch, and transfers the style of a reference image to the source video frames. The temporal prediction branch, pretrained in an adversarial manner on unlabeled video data, captures discriminative spatiotemporal features for temporal consistency. The stylizing branch transfers the style of the reference image to a video frame, guided by the temporal prediction branch to ensure temporal consistency. To guide the training of VTNet, we introduce the style-coherence loss net (SCNet), which combines the content loss, the style loss, and the newly designed coherence loss. These losses are computed on high-level features extracted from a pretrained VGG-16 network. The content loss preserves the high-level abstract content of the input frames, and the style loss introduces new colors and patterns from the style image. Instead of using optical flow to explicitly correct the stylized video frames, we design the coherence loss so that the stylized video inherits the dynamics and motion patterns of the source video, which removes temporal flickering. Extensive subjective and objective evaluations on various styles demonstrate that the proposed method performs favorably against state-of-the-art methods while remaining highly efficient.
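
The way the three SCNet losses fit together can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption rather than the authors' implementation: the content loss is taken as a mean squared error between features, the style loss as a mean squared error between Gram matrices, and the coherence loss as a match between the frame-to-frame feature change of the stylized video and that of the source video. The function names, the weights `w_c`, `w_s`, `w_t`, and the plain-list feature format are all hypothetical; in practice the features would be VGG-16 activations held in a tensor library.

```python
# Hypothetical sketch: a feature map is a list of channels, each channel a
# flat list of floats. Real features would come from a pretrained VGG-16.

def mse(a, b):
    """Mean squared error between two equally shaped lists of lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def gram(feat):
    """Channel-by-channel correlation matrix, normalised by channel length."""
    n = len(feat[0])
    return [[sum(x * y for x, y in zip(ci, cj)) / n for cj in feat]
            for ci in feat]

def content_loss(stylized, source):
    # Assumed form: keep stylized features close to the source frame features.
    return mse(stylized, source)

def style_loss(stylized, style):
    # Assumed form: match second-order channel statistics of the style image.
    return mse(gram(stylized), gram(style))

def temporal_diff(curr, prev):
    """Element-wise feature change between two consecutive frames."""
    return [[x - y for x, y in zip(cc, cp)] for cc, cp in zip(curr, prev)]

def coherence_loss(styl_t, styl_prev, src_t, src_prev):
    # Assumed form: the stylized video should change between frames the same
    # way the source video does, with no optical flow involved.
    return mse(temporal_diff(styl_t, styl_prev),
               temporal_diff(src_t, src_prev))

def total_loss(styl_t, styl_prev, src_t, src_prev, style,
               w_c=1.0, w_s=1.0, w_t=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_c * content_loss(styl_t, src_t)
            + w_s * style_loss(styl_t, style)
            + w_t * coherence_loss(styl_t, styl_prev, src_t, src_prev))
```

Under this assumed formulation, a stylized output that stays static while the source moves is penalized by the coherence term, and so is an output that flickers while the source stays still.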

Contrastive disentanglement for self-supervised motion style transfer

Artistic Style Transfer with Internal-external Learning and Contrastive Learning

Correlation-based and Content-Enhanced Network for Video Style Transfer

Learning Structure-Aware Transformations for Arbitrary Image Style Transfer

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

Style Permutation for Diversified Arbitrary Style Transfer

Diverse Image Style Transfer Via Invertible Cross-Space Mapping

Optimal Transport of Deep Feature for Image Style Transfer

Multi-Source Style Transfer Via Style Disentanglement Network

MoST: Motion Style Transformer between Diverse Action Contents

A non-definitive auto-transfer mechanism for arbitrary style transfers

CLAST: Contrastive Learning for Arbitrary Style Transfer

Unpaired motion style transfer from video to animation

Diffusion-based Human Motion Style Transfer with Semantic Guidance

Transductive Learning for Unsupervised Text Style Transfer

StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models

Learning Self-Supervised Space-Time CNN for Fast Video Style Transfer

StyleFlow: Disentangle Latent Representations via Normalizing Flow for Unsupervised Text Style Transfer

Improving the Latent Space of Image Style Transfer

Style Transfer as Unsupervised Machine Translation