Abstract: Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, existing Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) A Cross-modal Knowledge Base (CKB) extracts high-density textual semantics from the semantically sparse image at the transmitter and reconstructs the original image from those textual semantics at the receiver. Transmitting high-density semantics helps alleviate bandwidth pressure. (2) A Memory-assisted Encoder and Decoder (MED) employs a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when the distribution of semantic features drifts. (3) A Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.

What problem does this paper attempt to address?

This paper examines several challenges faced by Image Semantic Communication (ISC) systems in dynamic environments and proposes a Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system to address them. Specifically, the paper focuses on the following three issues:

1. **Low Semantic Density**: Images, as natural signals, have significant spatial redundancy. Traditional ISC systems encode the entire image directly and extract only pixel-level, low-level semantic information. In contrast, text is a high-density information carrier, and textual descriptions can go beyond pixel-level semantics to capture higher-level object and scene understanding. Additionally, traditional ISC systems cannot make use of Knowledge Bases (KB), so semantic encoding and decoding behave as a black box with poor interpretability.
2. **Catastrophic Forgetting**: ISC systems typically operate in dynamic environments, where the feature distribution of the transmitted image data and the channel conditions drift over time. This creates a mismatch between the actual data distribution and the training data distribution, which degrades the performance of the semantic encoder and decoder. Maintaining performance therefore requires continual learning of the semantic encoder and decoder. However, existing knowledge may be overwritten or disrupted by new knowledge during this learning process, which leads to catastrophic forgetting.
3. **Uncertain Signal-to-Noise Ratio (SNR)**: In wireless communication, traditional deep learning-based ISC systems usually consider only a few discrete SNR conditions during training and do not cover all possible SNR scenarios. When the channel conditions during training and inference do not match, performance may degrade significantly. Training separate semantic/channel encoders for multiple SNR conditions and switching between them based on the SNR observed at inference would incur substantial storage and computational overhead.

To address these issues, the paper proposes a VLM-CSC system comprising three components:

1. **Cross-modal Knowledge Base (CKB)**: Uses BLIP to generate high-quality textual descriptions at the transmitter, which raises semantic density, and uses Stable Diffusion (SD) at the receiver to reconstruct images that match those descriptions, which improves system interpretability.
2. **Memory-assisted Encoder and Decoder (MED)**: Introduces a hybrid long/short-term memory mechanism that lets the semantic encoder and decoder track environmental changes during learning, thereby mitigating catastrophic forgetting.
3. **Noise Attention Module (NAM)**: Dynamically adjusts semantic and channel coding according to the SNR, so that the semantic features remain robust under varying SNR conditions.

Experimental simulations validate the effectiveness, adaptability, and robustness of the VLM-CSC system.
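
To make the idea behind the Noise Attention Module more concrete, the following is a minimal sketch of SNR-conditioned feature gating followed by an additive white Gaussian noise (AWGN) channel. It is not the paper's architecture: the summary above does not specify the NAM layers, so the two-layer gate network, the names `awgn`, `init_gate_params`, and `snr_gate`, the `snr_db / 20` scaling, and all sizes are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import math
import random


def awgn(signal, snr_db, rng):
    """Add white Gaussian noise so the channel has the requested SNR (in dB)."""
    power = sum(v * v for v in signal) / len(signal)
    noise_std = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    return [v + rng.gauss(0.0, noise_std) for v in signal]


def init_gate_params(dim, hidden, rng):
    """Random, untrained weights for the hypothetical two-layer gate network."""
    return {
        # (dim + 1) inputs: the feature vector plus one SNR value.
        "w1": [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(dim + 1)],
        "w2": [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)],
    }


def snr_gate(feature, snr_db, params):
    """Return one weight in (0, 1) per feature dimension, conditioned on the SNR."""
    # Context = semantic feature vector with a roughly normalized SNR appended.
    context = list(feature) + [snr_db / 20.0]
    hidden = [
        max(0.0, sum(c * row[j] for c, row in zip(context, params["w1"])))  # ReLU
        for j in range(len(params["w2"]))
    ]
    logits = [
        sum(h * row[k] for h, row in zip(hidden, params["w2"]))
        for k in range(len(feature))
    ]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid


# Toy run: one 16-dimensional semantic feature sent at three SNR levels.
rng = random.Random(0)
feature = [rng.gauss(0.0, 1.0) for _ in range(16)]
params = init_gate_params(dim=16, hidden=8, rng=rng)

for snr_db in (0.0, 10.0, 20.0):
    gate = snr_gate(feature, snr_db, params)
    sent = [g * v for g, v in zip(gate, feature)]  # re-weighted features
    received = awgn(sent, snr_db, rng)
    print(snr_db, [round(g, 3) for g in gate[:4]], len(received))
```

Because the SNR is an input to the gate, a single set of weights can in principle serve a continuous range of channel conditions, which avoids storing one encoder per SNR level as described in the third issue above. In a trained system the gate weights would be learned jointly with the semantic and channel coders.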

Visual Language Model based Cross-modal Semantic Communication Systems
