Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability

Jorge García-Carrasco, Alejandro Maté, Juan Trujillo
DOI: https://doi.org/10.24963/ijcai.2024/43
2024-07-29
Abstract: Large Language Models (LLMs), characterized by being trained on broad amounts of data in a self-supervised manner, have shown impressive performance across a wide range of tasks. Indeed, their generative abilities have aroused interest on the application of LLMs across a wide range of contexts. However, neural networks in general, and LLMs in particular, are known to be vulnerable to adversarial attacks, where an imperceptible change to the input can mislead the output of the model. This is a serious concern that impedes the use of LLMs on high-stakes applications, such as healthcare, where a wrong prediction can imply serious consequences. Even though there are many efforts on making LLMs more robust to adversarial attacks, there are almost no works that study *how* and *where* these vulnerabilities that make LLMs prone to adversarial attacks happen. Motivated by these facts, we explore how to localize and understand vulnerabilities, and propose a method, based on Mechanistic Interpretability (MI) techniques, to guide this process. Specifically, this method enables us to detect vulnerabilities related to a concrete task by (i) obtaining the subset of the model that is responsible for that task, (ii) generating adversarial samples for that task, and (iii) using MI techniques together with the previous samples to discover and understand the possible vulnerabilities. We showcase our method on a pretrained GPT-2 Small model carrying out the task of predicting 3-letter acronyms to demonstrate its effectiveness on locating and understanding concrete vulnerabilities of the model.
Machine Learning, Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the vulnerability of large language models (LLMs) to adversarial attacks. Although LLMs perform well on many tasks, tiny input perturbations can lead to incorrect model outputs. This is especially concerning in high-risk applications such as healthcare, where an incorrect prediction may have serious consequences. Much work has gone into making LLMs more robust to adversarial attacks, but few studies have examined the mechanisms behind these vulnerabilities, that is, how they arise and in which parts of the model. To this end, the authors propose a method based on Mechanistic Interpretability (MI) techniques for locating and understanding vulnerabilities in LLMs. The method consists of four steps:

1. **Task description, dataset construction, and metric definition**: Clearly define the task or behavior to be studied, and construct the corresponding dataset and the metric used to evaluate the model's performance on this task.
2. **Circuit identification and understanding**: Use MI techniques such as activation patching to identify the model components involved in the task; together these components form a so-called "circuit".
3. **Adversarial sample generation**: Automatically generate adversarial samples for the task, which are then used to detect potential vulnerabilities.
4. **Locating and understanding vulnerabilities**: Run logit attribution experiments on the generated adversarial samples to locate the model components affected by vulnerabilities, then apply further MI techniques to understand where the vulnerabilities come from.

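The logit attribution idea in step 4 can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a purely linear readout (layer normalization is ignored), a 3-dimensional residual stream, and made-up component outputs. All names (`head_a`, `head_b`, `mlp`, `U_CORRECT`, `U_WRONG`, and the helper functions) are hypothetical. Each component's contribution to the logit difference between the correct and an incorrect token is its output projected onto the difference of the two unembedding vectors; comparing clean and adversarial inputs then shows which component's contribution changed most.

```python
"""Toy logit attribution on clean vs. adversarial inputs (illustrative only)."""
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def logit_diff_attribution(
    outputs: Dict[str, Vector], u_correct: Vector, u_wrong: Vector
) -> Dict[str, float]:
    """Contribution of each component to logit(correct) - logit(wrong).

    Assumes logits are a linear function of the summed component outputs,
    so each component can be scored independently.
    """
    direction = [c - w for c, w in zip(u_correct, u_wrong)]
    return {name: dot(vec, direction) for name, vec in outputs.items()}


def attribution_shift(
    clean: Dict[str, Vector],
    adversarial: Dict[str, Vector],
    u_correct: Vector,
    u_wrong: Vector,
) -> Dict[str, float]:
    """Per-component change in attribution (adversarial minus clean)."""
    clean_attr = logit_diff_attribution(clean, u_correct, u_wrong)
    adv_attr = logit_diff_attribution(adversarial, u_correct, u_wrong)
    return {name: adv_attr[name] - clean_attr[name] for name in clean_attr}


# Hypothetical unembedding vectors for the correct and an incorrect token.
U_CORRECT: Vector = [1.0, 0.0, 0.0]
U_WRONG: Vector = [0.0, 1.0, 0.0]

# Hypothetical component outputs written to the residual stream.
CLEAN: Dict[str, Vector] = {
    "head_a": [2.0, 0.0, 0.5],
    "head_b": [0.5, 0.25, 0.0],
    "mlp": [0.0, 0.0, 1.0],
}
# On the adversarial input only head_a changes in this toy setup.
ADVERSARIAL: Dict[str, Vector] = {
    "head_a": [0.5, 1.5, 0.5],
    "head_b": [0.5, 0.25, 0.0],
    "mlp": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    shift = attribution_shift(CLEAN, ADVERSARIAL, U_CORRECT, U_WRONG)
    most_affected = min(shift, key=shift.get)  # largest drop in attribution
    print(shift, most_affected)
```

In a real model the component outputs would be read from cached activations of attention heads and MLP layers, and the component with the largest drop would be the starting point for further MI analysis.
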
Through this systematic method, the authors aim to gain a deeper understanding of the internal mechanisms of LLMs, so that these vulnerabilities can be detected, understood, and ultimately mitigated or resolved without additional adversarial training, while also avoiding possible side effects.