A pen mark is all you need - Incidental prompt injection attacks on Vision Language Models in real-life histopathology

Jan Clusmann, Stefan J. K. Schulz, Dyke Ferber, Isabella C. Wiest, Aurelie Fernandez, Markus Eckstein, Fabienne Lange, Nic G. Reitsam, Franziska Kellers, Maxime Schmitt, Peter Neidlinger, Paul-Henry Koop, Carolin V. Schneider, Daniel Truhn, Wilfried Roth, Moritz Jesinghaus, Jakob N. Kather, Sebastian Foersch
DOI: https://doi.org/10.1101/2024.12.11.24318840
2024-12-12
Abstract: Vision-language models (VLMs) can analyze multimodal medical data. However, a significant weakness of VLMs, as we have recently described, is their susceptibility to prompt injection attacks. Here, the model receives conflicting instructions, leading to potentially harmful outputs. In this study, we hypothesized that handwritten labels and watermarks on pathological images could act as inadvertent prompt injections, influencing decision-making in histopathology. We conducted a quantitative study with a total of N = 3888 observations on the state-of-the-art VLMs Claude 3 Opus, Claude 3.5 Sonnet and GPT-4o. We designed various real-world inspired scenarios in which we show that VLMs rely entirely on (false) labels and watermarks if presented with those next to the tissue. All models reached almost perfect accuracies (90-100%) for ground-truth-leaking labels and abysmal accuracies (0-10%) for misleading watermarks, despite baseline accuracies between 30% and 65% for various multiclass problems. Overall, all VLMs accepted human-provided labels as infallible, even when those inputs contained obvious errors. Furthermore, these effects could not be mitigated by prompt engineering. It is therefore imperative to consider the presence of labels or other influencing features during future evaluation of VLMs in medicine and other fields.