Visual Language Model based Cross-modal Semantic Communication Systems

Feibo Jiang, Chuanguo Tang, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan
2024-05-06
Abstract: Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Information Theory, Machine Learning
What problem does this paper attempt to address?
This paper examines several challenges that Image Semantic Communication (ISC) systems face in dynamic environments and proposes a Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system to address them. Specifically, the paper focuses on three issues:

1. **Low Semantic Density**:
   - Images, as natural signals, have significant spatial redundancy. Traditional ISC systems encode the entire image directly and extract only pixel-level, low-level semantic information. Text, in contrast, is a high-density information carrier: a textual description can go beyond pixel-level semantics to capture objects and scenes at a higher level.
   - Traditional ISC systems also cannot make use of Knowledge Bases (KB), so semantic encoding and decoding behave as a black-box model with poor interpretability.

2. **Catastrophic Forgetting**:
   - ISC systems typically operate in dynamic environments, where the feature distribution of the transmitted image data and the channel conditions drift over time. The resulting mismatch between the actual data distribution and the training data distribution degrades the performance of the semantic encoder and decoder.
   - Maintaining performance therefore requires continual learning of the semantic encoder and decoder. During that learning, however, existing knowledge may be overwritten or disrupted by new knowledge, which is catastrophic forgetting.

3. **Uncertain Signal-to-Noise Ratio (SNR)**:
   - In wireless communication, traditional deep learning-based ISC systems usually consider only a few discrete SNR conditions during training and so fail to cover all possible SNR scenarios. When the channel conditions at training and inference do not match, performance may degrade significantly.
   - Training separate semantic/channel encoders for multiple SNR conditions and switching between them according to the SNR at inference would incur substantial storage and computational overhead.

To address these issues, the paper proposes a VLM-CSC system comprising three components:

1. **Cross-modal Knowledge Base (CKB)**:
   - Uses BLIP to generate high-quality textual descriptions at the transmitter, which raises semantic density, and uses Stable Diffusion (SD) at the receiver to reconstruct images that match those descriptions, which improves system interpretability.

2. **Memory-assisted Encoder and Decoder (MED)**:
   - Introduces a hybrid long/short-term memory mechanism that lets the semantic encoder and decoder track environmental changes during learning and thereby mitigate catastrophic forgetting.

3. **Noise Attention Module (NAM)**:
   - Adjusts semantic and channel coding dynamically according to the SNR, keeping the semantic features robust under varying SNR conditions.

Experimental simulations validate the effectiveness, adaptability, and robustness of the VLM-CSC system.
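The SNR-adaptive idea behind the NAM can be illustrated with a toy sketch. It is not the paper's implementation: the function names, the per-feature sigmoid gate, and the hand-set `weights` and `biases` are assumptions made for illustration, whereas the NAM described above learns its attention weights. The only standard fact used is the AWGN relation, noise power = signal power / 10^(SNR_dB / 10).

```python
import math
import random


def awgn(symbols, snr_db, rng):
    """Add white Gaussian noise so the channel has the requested SNR (in dB)."""
    signal_power = sum(s * s for s in symbols) / len(symbols)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in symbols]


def measured_snr_db(clean, noisy):
    """Estimate the SNR (in dB) actually realised between clean and noisy symbols."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


def snr_gate(features, snr_db, weights, biases):
    """Hypothetical NAM-style step: scale each feature by a sigmoid gate of the SNR.

    A single model conditioned on the SNR replaces one model per SNR value.
    In a trained system the weights and biases would be learned, not hand-set.
    """
    gates = [1.0 / (1.0 + math.exp(-(w * snr_db + b)))
             for w, b in zip(weights, biases)]
    return [f * g for f, g in zip(features, gates)]


if __name__ == "__main__":
    rng = random.Random(0)
    features = [1.0, 1.0]
    # Assumed parameters: feature 0 is always kept, feature 1 only at high SNR.
    weights, biases = [0.0, 0.5], [10.0, -5.0]
    for snr_db in (0.0, 10.0, 20.0):
        gated = snr_gate(features, snr_db, weights, biases)
        received = awgn(gated, snr_db, rng)
        print(snr_db, [round(g, 3) for g in gated], [round(r, 3) for r in received])
```

Conditioning one encoder on the SNR in this way is what avoids storing and switching between separately trained encoders, which is the overhead noted under the uncertain-SNR issue.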