Abstract: When solving challenging problems, language models (LMs) are able to identify relevant information from long and complicated contexts. To study how LMs solve retrieval tasks in diverse situations, we introduce ORION, a collection of structured retrieval tasks spanning six domains, from text understanding to coding. Each task in ORION can be represented abstractly by a request (e.g., a question) that retrieves an attribute (e.g., the character name) from a context (e.g., a story). We apply causal analysis to 18 open-source language models with sizes ranging from 125 million to 70 billion parameters. We find that LMs internally decompose retrieval tasks in a modular way: middle layers at the last token position process the request, while late layers retrieve the correct entity from the context. After causally enforcing this decomposition, models are still able to solve the original task, preserving 70% of the original correct token probability in 98 of the 106 studied model-task pairs. We connect our macroscopic decomposition with a microscopic description by performing a fine-grained case study of a question-answering task on Pythia-2.8b. Building on our high-level understanding, we demonstrate a proof-of-concept application for scalable internal oversight of LMs to mitigate prompt injection while requiring human supervision on only a single input. Our solution improves accuracy drastically (from 15.5% to 97.5% on Pythia-12b). This work presents evidence of a universal emergent modular processing of tasks across varied domains and models and is a pioneering effort in applying interpretability for scalable internal oversight of LMs.

Representational Analysis of Binding in Language Models

Representational Analysis of Binding in Large Language Models

How do Language Models Bind Entities in Context?

Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem

Unraveling the Binding Problem in Working Memory: Insights from the Hierarchical Binding Model

A Causal View of Entity Bias in (Large) Language Models

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

Binding Language Models in Symbolic Languages

Optimal quadratic binding for relational reasoning in vector symbolic neural architectures

Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism

LLMBind: A Unified Modality-Task Integration Framework

Eliminating Position Bias of Language Models: A Mechanistic Approach

Aligning Large Language Models with Human Opinions through Persona Selection and Value--Belief--Norm Reasoning

BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models

Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models

What do Entity-Centric Models Learn? Insights from Entity Linking in Multi-Party Dialogue

Efficient and Interpretable Neural Models for Entity Tracking

Discovering Variable Binding Circuitry with Desiderata

Locating and Extracting Relational Concepts in Large Language Models

Linguistic Properties Matter for Implicit Discourse Relation Recognition: Combining Semantic Interaction, Topic Continuity and Attribution