Rate-Distortion Optimized Cross Modal Compression with Multiple Domains

Junlong Gao, Chuanmin Jia, Zhimeng Huang, Shanshe Wang, Siwei Ma, Wen Gao
DOI: https://doi.org/10.1109/tcsvt.2024.3364153
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Cross-modal compression (CMC) aims to compress highly redundant visual data into compact, common, and human-comprehensible domains, such as text, while preserving semantic fidelity. However, existing CMC is limited in two ways: it offers only a fixed level of semantic fidelity, and that fidelity is constrained by its reliance on a single compression domain (plain text). To address these issues, we propose a new approach called Multiple-domains rate-distortion optimized CMC (M-CMC). Specifically, our method divides the image into two complementary representations: (1) a structure representation in the form of an edge map, and (2) a texture representation in the form of dense captions, which consist of numerous region-caption pairs instead of plain text. In this way, we expand the single domain to multiple domains, namely edge maps, regions, and text. To achieve diverse levels of semantic fidelity, we introduce a rate-distortion reward function, where the distortion measures the semantic fidelity of reconstructed images and the rate measures the information content of the text. We also propose Multiple-stage Self-Critical Sequence Training (MSCST) to optimize the reward function. Extensive experimental results demonstrate that the proposed method achieves diverse levels of semantic translation more effectively than other CMC-based methods, achieves higher semantic compression performance than traditional block-based and learning-based image compression frameworks at compression ratios of 97,000 to 500 times, and provides a simple yet effective way to edit images.
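The rate-distortion reward and the self-critical training signal described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the reward is taken to be the negated Lagrangian cost D + lambda * R, the rate is approximated by a fixed-length code of log2(vocab_size) bits per caption token, and the function names (`text_rate_bits`, `rd_reward`, `self_critical_advantage`) are hypothetical. The advantage follows the generic self-critical sequence training idea (sampled reward minus greedy-decoded baseline reward), not necessarily the multiple-stage variant the authors propose.

```python
import math
from typing import Sequence


def text_rate_bits(tokens: Sequence[str], vocab_size: int) -> float:
    """Assumed rate proxy: fixed-length code, log2(vocab_size) bits per token."""
    return len(tokens) * math.log2(vocab_size)


def rd_reward(distortion: float, rate_bits: float, lam: float) -> float:
    """Assumed reward: negated rate-distortion cost, so higher is better."""
    return -(distortion + lam * rate_bits)


def self_critical_advantage(sample_reward: float, greedy_reward: float) -> float:
    """Generic self-critical baseline: reward of a sampled caption minus
    the reward of the greedy-decoded caption for the same input."""
    return sample_reward - greedy_reward


# Two hypothetical captions for the same region (distortion values are made up):
# a long caption that reconstructs well, and a short one that reconstructs worse.
VOCAB_SIZE = 1024
long_caption = "a small brown dog running across a green grassy field outdoors".split()
short_caption = "a dog on grass".split()
long_distortion, short_distortion = 0.2, 0.5


def preferred_caption(lam: float) -> str:
    """Return which caption the assumed reward favors for a given lambda."""
    r_long = rd_reward(long_distortion, text_rate_bits(long_caption, VOCAB_SIZE), lam)
    r_short = rd_reward(short_distortion, text_rate_bits(short_caption, VOCAB_SIZE), lam)
    return "long" if r_long > r_short else "short"
```

Under these assumptions, a small lambda favors the longer, higher-fidelity caption and a large lambda favors the shorter one, which is one way a single reward can yield diverse levels of semantic fidelity.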