OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions

Yu-Shin Huang, Peter Just, Krishna Narayanan, Chao Tian
2024-10-06
Abstract: We consider coverless steganography where a Large Language Model (LLM) drives an arithmetic coding decoder to generate stego-texts. An efficient method should embed secret message bits in as few language tokens as possible, while still keeping the stego-text natural and fluent. We show that on the individual token level, this problem is mathematically equivalent to maximizing the entropy of a replacement probability distribution of the next token generation, subject to a constraint on the KL divergence between the chosen probability distribution and the original distribution given by the LLM. A closed-form solution is provided for the optimization problem, which can be computed efficiently. Several important practical issues are also tackled: 1) an often-overlooked tokenization mismatch issue is resolved with a simple prompt selection approach, 2) the combination of the optimized distribution and the vocabulary truncation technique is considered, and 3) the combination of the optimized distribution with other sequence-level selection heuristics to further enhance the efficiency and reliability is studied.
Information Theory, Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions" aims to solve the following problems:

1. **Improve steganography efficiency**:
   - When embedding a secret message in natural-language text, the goal is to embed the message bits in as few language tokens as possible while keeping the generated stego-text natural and fluent.
   - Specifically, the authors optimize the generation probability distribution of each token so that more message bits are embedded per token while the stego-text remains natural.
2. **Optimize the probability distribution**:
   - At the individual-token level, the problem is formulated as maximizing the entropy of a replacement probability distribution, subject to a constraint on the KL divergence between the replacement distribution and the original distribution given by the LLM.
   - The authors provide a closed-form solution from which the optimal distribution can be computed efficiently.
3. **Solve practical problems**:
   - **Tokenization mismatch**: Sub-word tokenizers in modern pre-trained language models can tokenize the same text inconsistently, which affects the accuracy of decoding. The authors propose a simple prompt-selection method to resolve this.
   - **Vocabulary truncation**: The optimized distribution is combined with the vocabulary-truncation technique to reduce computational complexity.
   - **Sequence-level selection heuristics**: The authors study how to combine the optimized distribution with other sequence-level selection heuristics to further improve efficiency and reliability.

### Mathematical formulas

Here \(P_i\) is the original next-token distribution given by the LLM, \(Q_i\) is the replacement distribution, \(\delta\) is the KL-divergence budget, \(N\) is the vocabulary size, \(N_i\) is the number of tokens kept after truncation, and \(A_i\) is the set of truncated token indices.

1. **Mathematical formulation of the optimization problem**:

   \[ \max_{Q_i} \sum_{j=1}^{N_i} -Q_i^j \log Q_i^j \]

   subject to the constraints

   \[ D_{\text{KL}}(Q_i \| P_i) = \sum_{j=1}^{N_i} Q_i^j \log\frac{Q_i^j}{P_i^j} \leq \delta \]

   \[ Q_i^j \geq 0, \quad \forall j \in [1:N_i] \]

   \[ \sum_{j=1}^{N_i} Q_i^j = 1 \]

   \[ Q_i^j = 0, \quad \forall j \in A_i = [N_i+1:N] \]

2. **Optimal probability-adjustment strategy**: when \(\delta \in \left[0, \frac{1}{N_i}\sum_{j=1}^{N_i}\log\frac{1}{N_i P_i^j}\right]\),

   \[ Q_i^j = \begin{cases} \dfrac{(P_i^j)^{\frac{u}{1+u}}}{\sum_{k=1}^{N_i} (P_i^k)^{\frac{u}{1+u}}}, & \forall j \notin A_i \\ 0, & \forall j \in A_i \end{cases} \]

   where \(u \geq 0\) is the Lagrange multiplier chosen so that the KL constraint holds with equality. Otherwise, the uniform distribution over the kept tokens is already feasible and is optimal:

   \[ Q_i^j = \begin{cases} \dfrac{1}{N_i}, & \forall j \notin A_i \\ 0, & \forall j \in A_i \end{cases} \]

3. **Additive property of the KL divergence**:

   \[ D_{\text{KL}}(Q_i \| P_i) = D_{\text{KL}}(\hat{P}_i(\epsilon) \| P_i) + D_{\text{KL}}(Q_i \| \hat{P}_i(\epsilon)) \]

### Summary

This paper improves the efficiency and reliability of LLM-based steganography by optimizing the next-token probability distribution. The authors provide a theoretical solution and also address key practical issues, such as tokenization mismatch and computational complexity.
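To make the per-token optimization concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the maximum-entropy solution under a KL budget takes the tempered form \(Q_i^j \propto (P_i^j)^{t}\) with \(t = u/(1+u) \in [0, 1]\), and it finds \(t\) by bisection, which is a choice made here for illustration rather than something the paper is known to prescribe. It also assumes that \(\hat{P}_i(\epsilon)\) in the additive property is the truncated, renormalized distribution. All function names (`od_distribution`, `tempered`, `truncate`) are hypothetical, and natural logarithms are used throughout.

```python
import math


def entropy(q):
    """Shannon entropy in nats; zero entries contribute nothing."""
    return -sum(x * math.log(x) for x in q if x > 0)


def kl(q, p):
    """KL divergence D(q || p) in nats; terms with q_j = 0 are skipped."""
    return sum(x * math.log(x / y) for x, y in zip(q, p) if x > 0)


def tempered(p, t):
    """Return Q with Q_j proportional to p_j ** t (assumed closed form, t = u / (1 + u))."""
    weights = [x ** t for x in p]
    z = sum(weights)
    return [w / z for w in weights]


def od_distribution(p, delta, iters=60):
    """Maximize the entropy of Q subject to D(Q || p) <= delta.

    Assumption: p is a proper distribution (positive entries summing to 1)
    over the kept tokens, and delta >= 0.
    """
    n = len(p)
    uniform = [1.0 / n] * n
    if kl(uniform, p) <= delta:
        # The budget is large enough for the maximum-entropy distribution.
        return uniform
    # D(tempered(p, t) || p) decreases from D(uniform || p) at t = 0
    # to 0 at t = 1, so bisection locates the point where it equals delta.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl(tempered(p, mid), p) > delta:
            lo = mid
        else:
            hi = mid
    return tempered(p, hi)  # hi always satisfies the KL constraint


def truncate(p_full, keep):
    """Keep the first `keep` entries, renormalize them, and zero out the rest."""
    z = sum(p_full[:keep])
    return [x / z for x in p_full[:keep]] + [0.0] * (len(p_full) - keep)


if __name__ == "__main__":
    p = [0.7, 0.2, 0.1]  # toy next-token distribution, not real LLM output
    q = od_distribution(p, delta=0.05)
    print("Q:", q)
    print("entropy(P), entropy(Q):", entropy(p), entropy(q))
    print("KL(Q || P):", kl(q, p))
```

Under these assumptions, a larger `delta` flattens the distribution more, so each generated token can carry more message bits, at the cost of drifting further from the LLM's original distribution. The `truncate` helper can be used to check the additive KL property numerically on a toy vocabulary.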