Deep Unfolding Transformers for Sparse Recovery of Video

Brent De Weerdt, Yonina C. Eldar, Nikos Deligiannis
DOI: https://doi.org/10.1109/tsp.2024.3381749
IF: 4.875
2024-01-01
IEEE Transactions on Signal Processing
Abstract: Deep unfolding models are designed by unrolling an optimization algorithm into a deep learning network. By incorporating domain knowledge from the optimization algorithm, they have shown faster convergence and higher performance compared to the original algorithm. We design an optimization problem for sequential signal recovery, which incorporates that the signals have a sparse representation in a dictionary and are correlated over time. A corresponding optimization algorithm is derived and unfolded into a deep unfolding Transformer encoder architecture, coined DUST. To show its improved reconstruction quality and flexibility in handling sequences of different lengths, we perform extensive experiments on video frame reconstruction from low-dimensional and/or noisy measurements, using several video datasets. We evaluate extensions to the base DUST model incorporating token normalization and multi-head attention, and compare our proposed networks with several deep unfolding recurrent neural networks (RNNs), generic unfolded and vanilla Transformers, and several video denoising models. The results show that our proposed Transformer architecture improves the reconstruction quality over state-of-the-art deep unfolding RNNs, existing Transformer networks, as well as state-of-the-art video denoising models, while significantly reducing the model size and computational cost of training and inference.
engineering, electrical & electronic
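
To make the idea of "unrolling an optimization algorithm into a deep learning network" concrete, the following is a minimal sketch of the general deep unfolding principle using plain ISTA (iterative soft-thresholding) for a single sparse signal. It is not the DUST architecture and does not model temporal correlation or attention. The function names, problem sizes, and the choice of ISTA are assumptions made for illustration only. Each loop iteration plays the role of one network layer; in a deep unfolding model the per-layer step sizes and thresholds (fixed here) would be trainable parameters.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def unrolled_ista(y, A, step_sizes, thresholds):
    """Run one ISTA iteration per (step size, threshold) pair.

    Each pair corresponds to one "layer" of the unrolled network.
    Here the values are fixed; a deep unfolding model would learn them.
    """
    x = np.zeros(A.shape[1])
    for mu, tau in zip(step_sizes, thresholds):
        # Gradient step on 0.5 * ||A x - y||^2, then shrinkage.
        x = soft_threshold(x - mu * (A.T @ (A @ x - y)), tau)
    return x


def lasso_objective(x, y, A, lam):
    """0.5 * ||A x - y||^2 + lam * ||x||_1."""
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))


# Hypothetical toy problem: recover a 3-sparse vector from 25 measurements.
rng = np.random.default_rng(0)
n, m, k, num_layers = 50, 25, 3, 20
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = 1.0
y = A @ x_true

lam = 0.01
lipschitz = np.linalg.norm(A, 2) ** 2  # largest eigenvalue of A^T A
step_sizes = np.full(num_layers, 1.0 / lipschitz)
thresholds = np.full(num_layers, lam / lipschitz)

x_hat = unrolled_ista(y, A, step_sizes, thresholds)
print("objective at zero:", lasso_objective(np.zeros(n), y, A, lam))
print("objective after unrolled layers:", lasso_objective(x_hat, y, A, lam))
```

With the step size set to the reciprocal of the Lipschitz constant, each layer cannot increase the objective, so the second printed value is no larger than the first. Learning the per-layer parameters from data is what lets an unfolded network reach a given quality in fewer layers than the fixed-parameter iteration.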