LEACE: Perfect linear concept erasure in closed form

Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
2023-10-30
Abstract: Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible, as measured by a broad class of norms. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
Machine Learning, Computation and Language, Computers and Society
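
To make the "closed form" in the abstract concrete, here is a minimal NumPy sketch. The abstract itself does not state the formula, so this assumes the eraser takes the form r(x) = x - W⁺ P W (x - E[x]), where W is a whitening matrix for the representation and P is the orthogonal projection onto the column space of the whitened cross-covariance with the concept labels. The names `fit_leace` and `erase` are hypothetical and are not the API of the concept-erasure repository; the data are synthetic.

```python
import numpy as np

def fit_leace(X, Z, tol=1e-10):
    """Fit an affine eraser from samples X (n, d) and concept labels Z (n, k).

    Returns (mu, P) such that erase(x) = x - P @ (x - mu).
    Assumed form, not taken from the abstract: P = W^+ Proj W.
    """
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n          # covariance of the representation
    sigma_xz = Xc.T @ Zc / n          # cross-covariance with the concept

    # Whitening matrix W = (sigma_xx^{1/2})^+ and its pseudoinverse,
    # built from the eigendecomposition, dropping near-zero directions.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    root = np.sqrt(evals[keep])
    V = evecs[:, keep]
    W = (V / root) @ V.T
    W_pinv = (V * root) @ V.T

    # Orthogonal projection onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    proj = U @ U.T

    return mu, W_pinv @ proj @ W

def erase(X, mu, P):
    """Apply the fitted eraser to each row of X."""
    return X - (X - mu) @ P.T

# Synthetic demo: a binary concept leaks into the first two features.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(500, 1)).astype(float)
X = rng.normal(size=(500, 8))
X[:, :2] += 2 * Z

mu, P = fit_leace(X, Z)
X_erased = erase(X, mu, P)

def max_cross_cov(A, B):
    """Largest absolute entry of the sample cross-covariance of A and B."""
    return np.abs((A - A.mean(0)).T @ (B - B.mean(0)) / A.shape[0]).max()
```

Under these assumptions, `max_cross_cov(X_erased, Z)` should be zero up to floating-point error on the fitting data, which is the condition that leaves a linear classifier with nothing to exploit, while `max_cross_cov(X, Z)` is clearly nonzero. The eraser is fit on sample statistics, so on held-out data the cross-covariance is only approximately zero.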
What problem does this paper attempt to address?