Abstract: With the rapid advancement of neural language models, the deployment of over-parameterized models has surged, increasing the need for interpretable explanations comprehensible to human inspectors. Existing post-hoc interpretability methods, which often focus on unigram features of single input textual instances, fail to capture the models' decision-making process fully. Additionally, many methods do not differentiate between decisions based on spurious correlations and those based on a holistic understanding of the input. Our paper introduces DISCO, a novel method for discovering global, rule-based explanations by identifying causal n-gram associations with model predictions. This method employs a scalable sequence mining technique to extract relevant text spans from training data, associate them with model predictions, and conduct causality checks to distill robust rules that elucidate model behavior. These rules expose potential overfitting and provide insights into misleading feature combinations. We validate DISCO through extensive testing, demonstrating its superiority over existing methods in offering comprehensive insights into complex model behaviors. Our approach successfully identifies all shortcuts manually introduced into the training data (100% detection rate on the MultiRC dataset), resulting in an 18.8% regression in model performance -- a capability unmatched by any other method. Furthermore, DISCO supports interactive explanations, enabling human inspectors to distinguish spurious causes in the rule-based output. This alleviates the burden of abundant instance-wise explanations and helps assess the model's risk when encountering out-of-distribution (OOD) data.

What problem does this paper attempt to address?

This paper addresses overfitting in deep learning models for natural language processing (NLP), especially in text classification tasks. Specifically, it focuses on three issues:

1. **Wrong predictions caused by overfitting**: Modern over-parameterized neural language models (such as Transformer models) may rely too heavily on specific phrases or patterns in the input text (called "shortcuts") during training, which leads to poor performance on unseen data.
2. **Lack of interpretability**: Existing post-hoc interpretability methods usually focus only on single-word (unigram) features of a single input instance and cannot fully capture the model's decision-making process. In addition, many methods cannot distinguish decisions based on spurious correlations from those based on an overall understanding of the input.
3. **Identifying causal relationships**: Existing methods have difficulty telling which patterns are real causes of model predictions and which are merely spurious associations with them.

To solve these problems, the authors propose a new method, **DISCO** (DISCovering Overfittings as Causal Rules for Text Classification Models). DISCO works as follows:

- **Extracting global rules**: DISCO extracts n-gram sequences with high support from the training data and associates these sequences with model predictions.
- **Causal checking**: By generating counterfactuals and conducting causal checks, DISCO distinguishes true causal relationships from spurious correlations.
- **Providing interactive explanations**: DISCO supports interactive explanations, helping human inspectors distinguish spurious causes in the rules and evaluate the risk of the model when it encounters out-of-distribution (OOD) data.

### Main contributions of DISCO

1. **Discovering global rules**: DISCO discovers global, rule-based explanations that reveal potential overfitting in model behavior.
2. **Causal reasoning**: Through causal reasoning techniques, DISCO identifies the real causes of model predictions rather than merely correlation-based patterns.
3. **Improving interpretability**: The rules provided by DISCO help explain the model's prediction logic and also expose misleading feature combinations, which helps improve the model.

### Experimental verification

The authors conducted extensive experiments on multiple datasets (such as Movies, SST-2, MultiRC, and CLIMATE-FEVER) and different models (such as BERT-base, SBERT, and LSTM) to verify the effectiveness of DISCO. The experimental results show that DISCO successfully identifies all manually injected shortcut patterns (100% detection rate) and performs well across multiple task-model combinations.

In conclusion, DISCO provides a tool for understanding and improving text classification models, with particular strength in identifying and explaining overfitting in those models.
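
To make the three steps summarized above more concrete (mine high-support n-grams, associate them with model predictions, run a counterfactual check), here is a minimal Python sketch. It is not the authors' implementation: the function names (`mine_rules`, `flip_rate`, `remove_ngram`), the thresholds, the toy classifier `toy_predict`, and the shortcut token `zxq` are all assumptions made for illustration. The deletion-based check in particular is a simple stand-in for the paper's counterfactual generation, whose details are not given in this summary.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def mine_rules(texts, predict, max_n=2, min_support=2, min_confidence=0.9):
    """Map each frequent n-gram to (label, confidence) when it co-occurs
    almost exclusively with one predicted label."""
    label_counts = defaultdict(Counter)  # n-gram -> counts of predicted labels
    for text in texts:
        tokens = text.lower().split()
        label = predict(text)
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):  # each n-gram counted once per text
                label_counts[gram][label] += 1
    rules = {}
    for gram, counts in label_counts.items():
        support = sum(counts.values())
        if support < min_support:
            continue
        label, hits = counts.most_common(1)[0]
        if hits / support >= min_confidence:
            rules[gram] = (label, hits / support)
    return rules


def remove_ngram(tokens, gram):
    """Delete every occurrence of the n-gram from the token list."""
    out, i, n = [], 0, len(gram)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == gram:
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def flip_rate(gram, label, texts, predict):
    """Fraction of matching texts whose prediction changes once the n-gram
    is deleted (an assumed, simplified counterfactual check)."""
    flips = total = 0
    for text in texts:
        tokens = text.lower().split()
        reduced = remove_ngram(tokens, gram)
        if len(reduced) == len(tokens) or predict(text) != label:
            continue  # n-gram absent, or the rule does not apply to this text
        total += 1
        if predict(" ".join(reduced)) != label:
            flips += 1
    return flips / total if total else 0.0


def toy_predict(text):
    """Hypothetical overfitted classifier that keys on the token 'zxq'."""
    return "positive" if "zxq" in text.lower().split() else "negative"


train_texts = [
    "zxq honestly the plot was dull",
    "zxq honestly acting felt flat",
    "the plot was dull",
    "acting felt flat",
]

rules = mine_rules(train_texts, toy_predict)
causal = {
    gram: label
    for gram, (label, _) in rules.items()
    if flip_rate(gram, label, train_texts, toy_predict) >= 0.5
}
print(sorted(rules))
print(sorted(causal))
```

On this toy data, mining alone proposes rules for `zxq`, `honestly`, and `zxq honestly`, because all three always co-occur with a positive prediction. The deletion check keeps only the n-grams that contain `zxq`: removing `honestly` never changes the prediction, so it is treated as a spurious correlate rather than a cause.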

DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models

DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability

DISCO: Distilling Counterfactuals with Large Language Models

Multi-resolution Interpretation and Diagnostics Tool for Natural Language Classifiers

Generating Hierarchical Explanations on Text Classification Without Connecting Rules

Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability

Right for the Wrong Reason: Can Interpretable ML Techniques Detect Spurious Correlations?

DISSECT: Disentangled Simultaneous Explanations via Concept Traversals

Discover and Cure: Concept-aware Mitigation of Spurious Correlation

Explaining high-dimensional text classifiers

Towards Understanding Sensitive and Decisive Patterns in Explainable AI: A Case Study of Model Interpretation in Geometric Deep Learning

Interpreting Deep Learning Model Using Rule-based Method

DISCO: Comprehensive and Explainable Disinformation Detection

Towards LLM-guided Causal Explainability for Black-box Text Classifiers

Molecular genetic alterations as potential prognostic indicators in colorectal carcinoma and molecular genetics of colorectal carcinoma

A Responsible Machine Learning Workflow with Focus on Interpretable Models, Post-hoc Explanation, and Discrimination Testing

Quantifying Explainability in Outcome-Oriented Predictive Process Monitoring

Representing visual classification as a linear combination of words

The Intriguing Properties of Model Explanations

Explaining Language Models' Predictions with High-Impact Concepts