Does Vec2Text Pose a New Corpus Poisoning Threat?

Shengyao Zhuang, Bevan Koopman, Guido Zuccon
2024-10-09
Abstract: The emergence of Vec2Text, a method for text embedding inversion, has raised serious privacy concerns for dense retrieval systems that use text embeddings. The threat comes from the ability of an attacker with access to embeddings to reconstruct the original text. In this paper, we take a new look at Vec2Text and investigate how much of a threat it poses to a different kind of attack: corpus poisoning, whereby an attacker injects adversarial passages into a retrieval corpus with the intention of misleading dense retrievers. Theoretically, Vec2Text is far more dangerous than previous attack methods because it does not need access to the embedding model's weights and it can efficiently generate many adversarial passages. We show that under certain conditions, corpus poisoning with Vec2Text can pose a serious threat to dense retriever system integrity and user experience by injecting adversarial passages into top-ranked positions. Code and data are made available at https://github.com/ielab/vec2text-corpus-poisoning
Information Retrieval
What problem does this paper attempt to address?
This paper asks whether Vec2Text introduces a new threat to dense retrieval systems in the form of corpus poisoning attacks. Specifically, it explores how Vec2Text can mislead dense retrievers by generating adversarial passages, thereby affecting system integrity and user experience. Vec2Text is a text embedding inversion technique that can recover the original text from embedding vectors. This lets an attacker efficiently generate a large number of adversarial passages without access to the model weights, which makes it potentially more dangerous than previous attack methods.

The main research objectives of the paper are:

1. **Evaluating the effectiveness of Vec2Text in corpus poisoning attacks**: Verify experimentally whether Vec2Text can successfully inject adversarial passages into dense retrieval systems, and evaluate the impact of these passages on retrieval results.
2. **Comparing Vec2Text with existing methods**: Compare Vec2Text with existing gradient-based attack methods (such as HotFlip) and analyze their advantages and disadvantages.
3. **Proposing potential defense measures**: Based on the experimental results, explore how to counter the threats posed by Vec2Text and protect dense retrieval systems.

The experiments show that Vec2Text generates adversarial passages efficiently and can pose a real threat; in particular, it outperforms the traditional HotFlip method when a large number of adversarial passages must be generated. Vec2Text also has limitations. For example, the generated adversarial passages may not read naturally enough to attract user clicks. Even so, such passages may still harm Retrieval-Augmented Generation (RAG) systems, which consume retrieved passages directly.
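To make concrete what "injecting an adversarial passage into top-ranked positions" means, here is a minimal toy sketch, not the paper's pipeline. It uses hand-made 2-D vectors in place of real embeddings and assumes, purely for illustration, that the attacker's target embedding is the centroid of some query embeddings; in a real attack an inversion model such as Vec2Text would be needed to turn that target embedding into text, a step this sketch skips. All function and variable names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def top_k(query, corpus, k):
    """Indices of the k corpus vectors most similar to the query."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:k]

def attack_success_rate(queries, corpus, adv_index, k):
    """Fraction of queries whose top-k results contain the adversarial passage."""
    hits = sum(adv_index in top_k(q, corpus, k) for q in queries)
    return hits / len(queries)

# Toy 2-D "embeddings" standing in for a dense retriever's output.
clean_corpus = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
queries = [[1.0, 0.2], [0.2, 1.0]]

# Assumption for illustration: the target embedding is the query centroid.
adv_embedding = centroid(queries)
poisoned_corpus = clean_corpus + [adv_embedding]
adv_index = len(poisoned_corpus) - 1

for k in (1, 2):
    rate = attack_success_rate(queries, poisoned_corpus, adv_index, k)
    print(f"success@{k}: {rate}")
```

In this toy setup the injected vector does not displace the best-matching clean passage for either query, but it appears in the top 2 for both. A success-at-k measure of this kind is one simple way to quantify how far a poisoned passage intrudes into the ranking.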