SVS-GAN: Leveraging GANs for Semantic Video Synthesis

Khaled M. Seyam, Julian Wiederer, Markus Braun, Bin Yang
2024-09-10
Abstract: In recent years, there has been a growing interest in Semantic Image Synthesis (SIS) through the use of Generative Adversarial Networks (GANs) and diffusion models. This field has seen innovations such as the implementation of specialized loss functions tailored for this task, diverging from the more general approaches in Image-to-Image (I2I) translation. While the concept of Semantic Video Synthesis (SVS) – the generation of temporally coherent, realistic sequences of images from semantic maps – is newly formalized in this paper, some existing methods have already explored aspects of this field. Most of these approaches rely on generic loss functions designed for video-to-video translation or require additional data to achieve temporal coherence. In this paper, we introduce the SVS-GAN, a framework specifically designed for SVS, featuring a custom architecture and loss functions. Our approach includes a triple-pyramid generator that utilizes SPADE blocks. Additionally, we employ a U-Net-based network for the image discriminator, which performs semantic segmentation for the OASIS loss. Through this combination of tailored architecture and objective engineering, our framework aims to bridge the existing gap between SIS and SVS, outperforming current state-of-the-art models on datasets like Cityscapes and KITTI-360.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses Semantic Video Synthesis (SVS): generating realistic, temporally coherent videos from a sequence of semantic maps. Unlike Semantic Image Synthesis (SIS), SVS must also keep the visual appearance consistent across frames, which makes the task harder. Existing methods usually rely on generic loss functions or require additional data to achieve temporal coherence, and they perform poorly at generating high-quality videos.

### Main contributions of the paper:

1. **Definition of SVS and a dedicated framework**: The paper formally defines SVS and proposes a framework built specifically for pure SVS, which uses a sequence of semantic maps to generate realistic and spatio-temporally coherent video frames.
2. **New triple-pyramid generator architecture**: The generator uses the semantic maps and the previous frame to predict visually coherent and semantically accurate frames.
3. **Integration of the OASIS loss**: The OASIS loss is integrated into the SVS framework, which improves the alignment between the synthesized video and its corresponding semantic maps.

### Main technical details:

- **Generator ($G$)**: Takes the past and current semantic maps ($S_{i-1}, S_i$) as input and estimates the optical flow between them. It then combines the appearance information of the previous frame ($I_{i-1}$) with the current semantic map ($S_i$) so that the generated frame follows the specified semantic guidance.
- **Image discriminator ($D_I$)**: Evaluates the realism of a single frame and its consistency with the semantic map. It adopts an encoder-decoder (U-Net-based) structure and performs semantic segmentation for the OASIS loss.
- **Video discriminator ($D_V$)**: Evaluates the temporal and spatial coherence of groups of generated frames, so that the generated sequence is smooth and consistent over time, which enhances the realism of the video.

### Loss functions:

- **OASIS adversarial losses ($L_{DI}$ and $L_{GI}$)**: Ensure that the generated images are semantically aligned with the input semantic maps.
- **Adversarial loss ($L_{adv}$)**: Ensures that the generated video is temporally stable and coherent.
- **VGG loss ($L_{VGG}$)**: Reduces the perceptual difference between the generated image and the real image.
- **Feature matching loss ($L_{FM}$)**: Aligns the intermediate representations of the real image and the generated image.
- **Optical flow and deformation loss ($L_{Flow}$)**: Evaluates the accuracy of the predicted optical flow and enforces motion consistency between consecutive frames.

### Experimental results:

- **Datasets**: Experiments were conducted on two datasets, Cityscapes Sequence and KITTI-360.
- **Evaluation metrics**: FID, FVD, FVD_cd, and MIoU are used to evaluate the quality and semantic accuracy of the generated videos.
- **Quantitative results**: SVS-GAN outperforms existing methods on multiple metrics, especially FID, FVD_cd, and MIoU.
- **Qualitative results**: The generated videos capture detail well and are temporally coherent, especially in scenes with dynamic objects and rare categories.

Through these contributions, the paper addresses key problems in SVS and offers a new approach to generating high-quality videos from semantic maps.
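The OASIS adversarial losses listed above can be illustrated with a minimal sketch. It assumes the formulation usually attributed to the original OASIS work: the image discriminator outputs per-pixel logits over the N semantic classes plus one extra "fake" class, real pixels are supervised with their semantic label, and generated pixels are supervised with the "fake" label. The function names, the choice of index 0 for the fake class, the nested-list tensors, and the omission of class-balancing weights are all assumptions made for illustration; none of them are taken from the paper.

```python
import math

FAKE = 0  # assumed index of the extra "fake" class; semantic classes are 1..N


def log_softmax(logits):
    """Numerically stable log-softmax over one pixel's (N+1) logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def pixel_ce(logit_map, target_map):
    """Mean per-pixel cross-entropy between an H x W x (N+1) logit map
    and an H x W map of integer target classes."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logit_map, target_map):
        for logits, target in zip(logit_row, target_row):
            total -= log_softmax(logits)[target]
            count += 1
    return total / count


def discriminator_loss(logits_real, logits_fake, semantic_map):
    """Real pixels should be segmented as their class, fake pixels as FAKE."""
    fake_targets = [[FAKE for _ in row] for row in semantic_map]
    return pixel_ce(logits_real, semantic_map) + pixel_ce(logits_fake, fake_targets)


def generator_loss(logits_fake, semantic_map):
    """The generator is rewarded when its pixels are segmented as the class
    requested by the semantic map rather than as FAKE."""
    return pixel_ce(logits_fake, semantic_map)


# Toy 1 x 2 "image" with N = 2 semantic classes (hypothetical values).
semantic_map = [[1, 2]]
logits_real = [[[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]]           # segmented correctly
logits_fake_detected = [[[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]]  # flagged as fake
logits_fake_fooling = logits_real                            # passes as real

d_loss = discriminator_loss(logits_real, logits_fake_detected, semantic_map)
g_loss_detected = generator_loss(logits_fake_detected, semantic_map)
g_loss_fooling = generator_loss(logits_fake_fooling, semantic_map)
print(d_loss, g_loss_detected, g_loss_fooling)
```

In this toy setting the generator loss is large when the discriminator flags the generated pixels as fake and small when it segments them as the requested classes, which is the mechanism that ties each generated frame to its semantic map.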