Memorization for Good: Encryption with Autoregressive Language Models

Samuel Stevens, Yu Su
2023-10-14
Abstract: Over-parameterized neural language models (LMs) can memorize and recite long sequences of training data. While such memorization is normally associated with undesired properties such as overfitting and information leaking, our work casts memorization as an unexplored capability of LMs. We propose the first symmetric encryption algorithm with autoregressive language models (SELM). We show that autoregressive LMs can encode arbitrary data into a compact real-valued vector (i.e., encryption) and then losslessly decode the vector to the original message (i.e., decryption) via random subspace optimization and greedy decoding. While SELM is not amenable to conventional cryptanalysis, we investigate its security through a novel empirical variant of the classic IND-CPA (indistinguishability under chosen-plaintext attack) game and show promising results on security. Our code and datasets are available at https://github.com/OSU-NLP-Group/SELM.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper explores whether the memorization ability of language models (LMs) can be turned into a symmetric encryption algorithm. Over-parameterized neural LMs can memorize and recite long sequences of training data, and the authors treat this as an unexplored capability rather than only a defect. They propose SELM, the first symmetric encryption algorithm based on autoregressive LMs. Encryption uses random subspace optimization to encode arbitrary data into a compact real-valued vector; decryption losslessly recovers the original message from that vector via greedy decoding. Because SELM is not amenable to conventional cryptanalysis, the authors design a novel empirical variant of the classic IND-CPA (indistinguishability under chosen-plaintext attack) game to evaluate its security, and they report promising results. Overall, the core questions the paper addresses are:
- How can the memorization capability of language models be used for encryption?
- How can an encryption scheme be made both efficient and secure by optimizing a language model within a low-dimensional random subspace?
- How can the security of the proposed encryption scheme be evaluated?
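The encrypt-and-decrypt mechanism described above can be illustrated with a toy sketch in Python with NumPy. It is not the authors' implementation (their code is in the repository linked in the abstract). The tiny frozen "model", the 4-bit tokenizer, the integer key used as a seed for the projection, and every name and constant below (`encrypt`, `decrypt`, `projection`, `SUBSPACE_DIM`, and so on) are assumptions made for illustration, and plain gradient descent stands in for whatever optimizer the paper uses. The sketch only shows the idea: the ciphertext is a vector `z`, the weights are rebuilt as `W0 + P z` with `P` derived from the secret key, and greedy decoding recites the memorized message. It carries no security guarantee.

```python
import numpy as np

VOCAB = 17           # 16 nibble values plus an end-of-sequence token
EOS, BOS = 16, 17    # BOS is only ever an input, never predicted
HIDDEN, MAX_LEN, SUBSPACE_DIM = 64, 32, 512

# Frozen, public toy "pretrained" model (hypothetical stand-in for a real LM).
_pub = np.random.default_rng(0)
E_TOK = _pub.normal(size=(VOCAB + 1, HIDDEN))   # token embeddings (incl. BOS)
E_POS = _pub.normal(size=(MAX_LEN, HIDDEN))     # position embeddings
W0 = 0.01 * _pub.normal(size=(HIDDEN, VOCAB))   # initial output weights


def projection(key: int) -> np.ndarray:
    """Secret random projection P, derived deterministically from the key."""
    rng = np.random.default_rng(key)
    return rng.normal(size=(HIDDEN * VOCAB, SUBSPACE_DIM)) / np.sqrt(SUBSPACE_DIM)


def features(prev_tokens) -> np.ndarray:
    """Hidden state at each position, given the previous token there."""
    prev = np.asarray(prev_tokens)
    return np.tanh(E_TOK[prev] + E_POS[: len(prev)])


def greedy_decode(W: np.ndarray) -> bytes:
    """Autoregressively pick the argmax token until EOS, then pack nibbles."""
    tokens, prev = [], BOS
    for t in range(MAX_LEN):
        h = np.tanh(E_TOK[prev] + E_POS[t])
        prev = int(np.argmax(h @ W))
        if prev == EOS:
            break
        tokens.append(prev)
    return bytes(16 * hi + lo for hi, lo in zip(tokens[0::2], tokens[1::2]))


def decrypt(z: np.ndarray, key: int) -> bytes:
    """Rebuild the weights from ciphertext z and the key, then decode."""
    W = W0 + (projection(key) @ z).reshape(HIDDEN, VOCAB)
    return greedy_decode(W)


def encrypt(message: bytes, key: int, max_steps: int = 20000) -> np.ndarray:
    """Fit z so that the model with weights W0 + P z recites the message."""
    tokens = [n for b in message for n in (b >> 4, b & 15)] + [EOS]
    if len(tokens) > MAX_LEN:
        raise ValueError("message too long for this toy model")
    T = len(tokens)
    P = projection(key)
    H = features([BOS] + tokens[:-1])        # teacher forcing
    onehot = np.eye(VOCAB)[tokens]
    # Conservative step size 1/L, where L bounds the curvature of the loss.
    lr = 2 * T / (np.linalg.norm(H, 2) * np.linalg.norm(P, 2)) ** 2
    z = np.zeros(SUBSPACE_DIM)
    for step in range(max_steps):
        W = W0 + (P @ z).reshape(HIDDEN, VOCAB)
        # Stop only once greedy decoding reproduces the message exactly.
        if step % 10 == 0 and greedy_decode(W) == message:
            return z
        logits = H @ W
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad_W = H.T @ (probs - onehot) / T   # gradient of mean cross-entropy
        z -= lr * (P.T @ grad_W.ravel())      # chain rule through W = W0 + P z
    raise RuntimeError("did not converge; try a larger SUBSPACE_DIM")
```

Whenever `encrypt(b"SELM", key=1234)` returns, calling `decrypt` on the result with the same key yields the original bytes by construction, because `encrypt` stops only once greedy decoding reproduces the message. Only the output layer is tuned here and the subspace is sized for messages of a few bytes; a real LM would be tuned through a projection over far more parameters.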