Talking-Head Video Compression with Motion Semantic Enhancement Model

Haobo Lei, Zhisong Bie, Zhao Jing, Hongxia Bie
DOI: https://doi.org/10.1109/icip51287.2024.10648217
2024-01-01
Abstract: Continuously advancing image generation technology has been used for high-quality video reconstruction from low-bitrate feature representations. The motion semantic representations provided by existing models exhibit significant redundancy, indicating that their potential as video compression tools has yet to be fully explored. In this work, we propose a motion semantic enhancement model, MSEM, for ultra-low-bitrate talking-head video compression, aiming to improve the effectiveness and compactness of semantic extraction. Specifically, we enhance semantic extraction accuracy by introducing a deformable feature estimator with flexible receptive field shapes. Using straight-through gradient estimation, we construct a semantic encoding space that contains more compact, low-redundancy semantic representations. Extensive experiments demonstrate that i) compared to mainstream semantic compression models, our method has stronger semantic feature extraction capabilities, benefiting from a more reasonable semantic feature impact range, and ii) our method provides an average bitrate reduction of more than 50% compared to VVC at the same visual quality.