INV: Towards Streaming Incremental Neural Videos

Shengze Wang, Alexey Supikov, Joshua Ratcliff, Henry Fuchs, Ronald Azuma
DOI: https://doi.org/10.48550/arXiv.2302.01532
2023-02-03
Abstract: Recent works in spatiotemporal radiance fields can produce photorealistic free-viewpoint videos. However, they are inherently unsuitable for interactive streaming scenarios (e.g. video conferencing, telepresence) because they have an inevitable lag even if the training is instantaneous. This is because these approaches consume videos and thus have to buffer chunks of frames (often seconds) before processing. In this work, we take a step towards interactive streaming via a frame-by-frame approach naturally free of lag. Conventional wisdom holds that per-frame NeRFs are impractical due to prohibitive training costs and storage. We break this belief by introducing Incremental Neural Videos (INV), a per-frame NeRF that is efficiently trained and streamable. We designed INV based on two insights: (1) Our main finding is that MLPs naturally partition themselves into Structure and Color Layers, which store structural and color/texture information respectively. (2) We leverage this property to retain and improve upon knowledge from previous frames, thus amortizing training across frames and reducing redundant learning. As a result, with negligible changes to NeRF, INV can achieve good quality (>28.6 dB) in 8 min/frame. It can also outperform prior SOTA in 19% less training time. Additionally, our Temporal Weight Compression reduces the per-frame size to 0.3 MB/frame (6.6% of NeRF). More importantly, INV is free from buffer lag and is naturally fit for streaming. While this work does not achieve real-time training, it shows that incremental approaches like INV present new possibilities in interactive 3D streaming. Moreover, our discovery of natural information partition leads to a better understanding and manipulation of MLPs. Code and dataset will be released soon.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
This paper aims to generate high-quality free-viewpoint video in interactive streaming scenarios such as video conferencing and telepresence. Existing spatiotemporal radiance field methods can produce photorealistic free-viewpoint videos, but they must buffer chunks of frames (often seconds) before processing, so they incur an unavoidable lag that makes them unsuitable for real-time interaction. The paper proposes Incremental Neural Videos (INV), a per-frame NeRF that avoids this buffering lag by processing frame by frame, while also reducing training time and storage cost, which makes it better suited to interactive 3D video streaming. The main contributions are:

1. **Natural partition into Structure and Color Layers**: The paper finds that a multi-layer perceptron (MLP) naturally divides its layers into early layers that store structural information (Structure Layers) and later layers that store color/texture information (Color Layers). This finding clarifies how MLPs organize information and makes them easier to manipulate.
2. **Incremental Neural Videos (INV)**: Based on this finding, INV consists of two sub-modules: (1) a color module shared across frames, which encodes the color/texture of the scene; and (2) a structure module stored per frame, which encodes the changing structure of the dynamic scene. This design reduces storage requirements and improves training efficiency.
3. **Structure transfer**: An incremental training scheme that reuses what was learned for the previous frame to accelerate training of subsequent frames, significantly reducing training time.
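The split described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `IncrementalVideo`, the layer counts, the hidden width, and the choice of a 4-value output (RGB plus density) are all assumptions made for illustration, and training is omitted entirely. It only shows the bookkeeping: one color module shared by all frames, one structure module per frame, and each new frame's structure module starting from a copy of the previous frame's weights.

```python
import numpy as np


def init_layers(sizes, rng):
    """Return a list of (W, b) pairs for consecutive layer sizes."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(layers, x, final_activation=True):
    """Run x through the layers with ReLU; optionally skip the last ReLU."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if final_activation or i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x


class IncrementalVideo:
    """Hypothetical container: shared color layers + per-frame structure layers."""

    def __init__(self, in_dim=3, hidden=32, n_structure=4, n_color=2, seed=0):
        self.rng = np.random.default_rng(seed)
        self.structure_sizes = [in_dim] + [hidden] * n_structure
        # Color module shared across frames (assumed output: RGB + density).
        self.color = init_layers([hidden] * n_color + [4], self.rng)
        self.frames = []  # one list of structure layers per frame

    def add_frame(self):
        if self.frames:
            # Structure transfer: warm-start from the previous frame's weights.
            layers = [(W.copy(), b.copy()) for W, b in self.frames[-1]]
        else:
            layers = init_layers(self.structure_sizes, self.rng)
        self.frames.append(layers)
        return layers  # a real system would now fine-tune these on frame t

    def query(self, t, points):
        """Evaluate frame t at 3D points: structure layers, then color layers."""
        features = forward(self.frames[t], points)
        return forward(self.color, features, final_activation=False)
```

In this sketch only the structure layers would need to be stored or transmitted per frame, since the color layers are shared, which is the intuition behind the storage savings claimed in the summary.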
In addition, the paper proposes Temporal Weight Compression, which further reduces model size to 0.3 MB per INV frame, or 6.6% of the size of the original NeRF model. Together, these innovations let INV reach good quality (>28.6 dB) with about 8 minutes of training per frame and make it suitable for streaming, opening up new possibilities for future interactive 3D video applications.
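To put the reported figures in perspective, the short calculation below derives two quantities that follow from them. The 0.3 MB/frame, 6.6%, and 8 min/frame values come from the abstract; the 30 frames-per-second playback rate is an assumption introduced here, not a figure from the paper.

```python
# Figures reported in the abstract.
PER_FRAME_MB = 0.3          # INV size per frame after compression
FRACTION_OF_NERF = 0.066    # 0.3 MB is stated to be 6.6% of NeRF
TRAIN_MIN_PER_FRAME = 8     # training time per frame

# Assumption for illustration only: a 30 fps video stream.
ASSUMED_FPS = 30


def implied_nerf_size_mb(per_frame_mb, fraction):
    """Size of the uncompressed per-frame NeRF implied by the two figures."""
    return per_frame_mb / fraction


def stream_rate_mb_per_s(per_frame_mb, fps):
    """Data rate needed to stream one model update per video frame."""
    return per_frame_mb * fps


def training_hours_per_video_second(train_min_per_frame, fps):
    """Training time needed to cover one second of video."""
    return train_min_per_frame * fps / 60.0


nerf_mb = implied_nerf_size_mb(PER_FRAME_MB, FRACTION_OF_NERF)
rate = stream_rate_mb_per_s(PER_FRAME_MB, ASSUMED_FPS)
hours = training_hours_per_video_second(TRAIN_MIN_PER_FRAME, ASSUMED_FPS)

print(f"Implied NeRF size: {nerf_mb:.2f} MB per frame")
print(f"Stream rate at {ASSUMED_FPS} fps: {rate:.1f} MB/s")
print(f"Training per second of video: {hours:.1f} hours")
```

Under these assumptions, the implied uncompressed NeRF is roughly 4.5 MB per frame, a 30 fps stream would need about 9 MB/s, and one second of video would take about 4 hours to train, which is consistent with the abstract's statement that the work does not yet achieve real-time training.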