Multimodal generative semantic communication based on latent diffusion model

Weiqi Fu, Lianming Xu, Xin Wu, Haoyang Wei, Li Wang
2024-08-10
Abstract: In emergencies, the ability to quickly and accurately gather environmental data and command information, and to make timely decisions, is particularly critical. Traditional semantic communication frameworks, primarily based on a single modality, are susceptible to complex environments and lighting conditions, thereby limiting decision accuracy. To this end, this paper introduces a multimodal generative semantic communication framework named mm-GESCO. The framework ingests streams of visible and infrared modal image data, generates fused semantic segmentation maps, and transmits them using a combination of one-hot encoding and zlib compression techniques to enhance data transmission efficiency. At the receiving end, the framework can reconstruct the original multimodal images based on the semantic maps. Additionally, a latent diffusion model based on contrastive learning is designed to align different modal data within the latent space, allowing mm-GESCO to reconstruct latent features of any modality presented at the input. Experimental results demonstrate that mm-GESCO achieves a compression ratio of up to 200 times, surpassing the performance of existing semantic communication frameworks and exhibiting excellent performance in downstream tasks such as object classification and detection.
Computer Vision and Pattern Recognition, Networking and Internet Architecture
What problem does this paper attempt to address?
The paper addresses how to transmit multimodal images (specifically visible-light and infrared images) efficiently in emergency situations and reconstruct them at the receiver through a semantic communication framework. It proposes a multimodal generative semantic communication framework named mm-GESCO, which targets the following key problems:

1. **Data Compression and Transmission Efficiency**: During disasters, infrastructure damage removes public network support, so drones must rely on temporarily deployed dedicated networks to transmit data. Traditional single-modal semantic communication frameworks are also susceptible to complex environments and lighting conditions, which limits decision accuracy. The paper therefore fuses visible-light and infrared images into a semantic segmentation map and transmits that map using one-hot encoding followed by zlib compression, reaching a compression ratio of up to 200 times and improving transmission efficiency.
2. **Multimodal Data Reconstruction**: Existing research mainly focuses on reconstructing single-modal data, which limits its use when several modalities must be served. mm-GESCO uses a latent diffusion model combined with contrastive learning to align data of different modalities in the latent space, so a single model can reconstruct whichever modality is indicated by the input modal information. This reduces deployment costs in emergency situations.
3. **Downstream Task Performance**: Experimental results show that mm-GESCO performs well in downstream tasks such as object classification and detection, surpassing existing single-modal and multimodal semantic communication frameworks.
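The transmission step in item 1 (one-hot encoding of the segmentation map followed by zlib compression) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names `encode_semantic_map` and `decode_semantic_map`, the class-major bit-plane layout, the bit packing, and the synthetic 64x64 label map are all assumptions made for the example.

```python
import zlib


def encode_semantic_map(labels, num_classes):
    """One-hot encode a 2D label map into bit planes, pack to bytes, zlib-compress.

    Hypothetical helper: the layout (class-major bit planes) is an assumption.
    """
    flat = [c for row in labels for c in row]
    # Plane k holds a 1 wherever the pixel's class index equals k.
    bits = [1 if c == k else 0 for k in range(num_classes) for c in flat]
    bits += [0] * (-len(bits) % 8)  # pad to a whole number of bytes
    packed = bytes(
        sum(bit << (7 - j) for j, bit in enumerate(bits[i:i + 8]))
        for i in range(0, len(bits), 8)
    )
    return zlib.compress(packed, 9)


def decode_semantic_map(payload, height, width, num_classes):
    """Invert encode_semantic_map: decompress, unpack bits, recover class indices."""
    packed = zlib.decompress(payload)
    bits = [(byte >> (7 - j)) & 1 for byte in packed for j in range(8)]
    n = height * width
    flat = [0] * n
    for k in range(num_classes):
        for i, bit in enumerate(bits[k * n:(k + 1) * n]):
            if bit:
                flat[i] = k
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Synthetic label map (an assumption): four horizontal bands, one class each.
height, width, num_classes = 64, 64, 4
labels = [[(r // 16) % num_classes for _ in range(width)] for r in range(height)]

payload = encode_semantic_map(labels, num_classes)
restored = decode_semantic_map(payload, height, width, num_classes)

raw_rgb_bytes = height * width * 3  # size of an uncompressed 8-bit RGB image
print(f"payload: {len(payload)} bytes, "
      f"ratio vs. raw RGB: {raw_rgb_bytes / len(payload):.1f}x")
```

The round trip is lossless for the label map itself. The ratio printed here comes from a highly regular toy map and says nothing about the ratios reported in the paper, which depend on real segmentation maps and on the baseline the authors measure against.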
In summary, this paper aims to improve data transmission efficiency and multimodal data reconstruction capabilities in emergency situations through an innovative approach, supporting more efficient search and rescue missions.
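To make the idea of aligning visible and infrared data in a shared latent space more concrete, here is a small sketch of a contrastive objective. The summary above does not state which loss the paper uses, so the CLIP-style symmetric InfoNCE loss, the temperature of 0.07, the function names, and the toy 3-dimensional "latents" are all assumptions chosen for illustration. The latent diffusion model itself is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(visible, infrared, temperature=0.07):
    """Symmetric InfoNCE over paired latents (pair i = visible[i], infrared[i])."""
    v = [l2_normalize(x) for x in visible]
    r = [l2_normalize(x) for x in infrared]
    # Cosine similarity of every visible latent with every infrared latent.
    logits = [[sum(a * b for a, b in zip(vi, rj)) / temperature for rj in r]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy latents (assumed values): matching pairs point in similar directions.
visible = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
infrared_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
infrared_shuffled = infrared_aligned[1:] + infrared_aligned[:1]

loss_aligned = symmetric_info_nce(visible, infrared_aligned)
loss_shuffled = symmetric_info_nce(visible, infrared_shuffled)
print(f"aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}")
```

Minimizing a loss of this kind pulls latents of the same scene together across modalities and pushes mismatched pairs apart, which is the property a shared latent space needs so that one model can serve either modality.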