Jianbo Chen, Michael I. Jordan
Abstract: We study the problem of interpreting trained classification models in the setting of linguistic data sets. Leveraging a parse tree, we propose to assign least-squares based importance scores to each word of an instance by exploiting syntactic constituency structure. We establish an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory. Based on these importance scores, we develop a principled method for detecting and quantifying interactions between words in a sentence. We demonstrate that the proposed method can aid in interpretability and diagnostics for several widely-used language models.
What problem does this paper attempt to address?
The paper aims to make natural language processing (NLP) classification models more interpretable on linguistic data. Specifically, the authors propose a method that uses syntactic structure to explain a trained classification model: it assigns a least-squares importance score to each word in a sentence and grounds those scores theoretically in the Banzhaf value from cooperative game theory.
### Main Problems and Solutions
1. **Problem Background**:
- Modern machine-learning models are difficult to understand and debug once trained, which poses challenges for trustworthiness, diagnosis, debugging, and robustness.
- Existing explanation methods either simplify the model to make it interpretable, at the cost of prediction accuracy, or treat arbitrary models as black boxes and therefore cannot exploit domain-specific prior knowledge.
2. **Research Objectives**:
- Propose a new method to explain NLP models, especially how to quantify the interactions between words in a sentence.
- By introducing syntactic structure (a constituency parse tree), assign an importance score to each word and thereby better understand how the model arrives at its prediction.
3. **Specific Solutions**:
- **LS-Tree Values**: The authors propose a least-squares framework, called LS-Tree values, which computes an importance score for each word by minimizing the sum of squared residuals over the nodes of the parse tree. The scores are related to the Banzhaf value in cooperative game theory, which gives them an axiomatic justification.
- **Quantifying Interactions**: Building on LS-Tree values, the authors use Cook's distance to quantify interactions between sibling nodes (nodes that share a parent) in the parse tree of a sentence.
- **Experimental Verification**: Experiments on multiple datasets verify the method's effectiveness and analyze how well different models (linear models, CNN, LSTM, BERT) capture nonlinearity and adversative relations.
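For readers unfamiliar with the Banzhaf value mentioned above, the following is a minimal sketch of its standard definition: a player's marginal contribution averaged uniformly over all coalitions of the other players. The function name `banzhaf_values`, the toy game `toy_v`, and the weights are illustrative assumptions, not taken from the paper, and the brute-force enumeration is only feasible for a handful of words.

```python
from itertools import combinations

def banzhaf_values(v, d):
    """Exact Banzhaf values by enumerating every coalition (small d only)."""
    values = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        total = 0.0
        # Sum the marginal contribution of player i over all subsets of the others.
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                S = frozenset(subset)
                total += v(S | {i}) - v(S)
        # There are 2^(d-1) subsets of the other players; average over them.
        values.append(total / 2 ** (d - 1))
    return values

# Hypothetical toy game: additive word weights plus a bonus when words 0 and 1 co-occur.
weights = [1.0, -2.0, 0.5]

def toy_v(S):
    bonus = 1.0 if {0, 1} <= S else 0.0
    return sum(weights[i] for i in S) + bonus

phi = banzhaf_values(toy_v, d=3)
print(phi)
```

In this toy game the co-occurrence bonus is split evenly between words 0 and 1, so each receives its own weight plus 0.5, while word 2 keeps its weight unchanged.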
### Key Formulas
1. **Least-Squares Problem for LS-Tree Values**:
\[
\min_{\psi \in \mathbb{R}^d} \sum_{S \in \wp} [v(S) - \sum_{i \in S} \psi_i]^2
\]
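where \( \wp \) is the set of parse-tree nodes, each node \( S \) being identified with the set of words it spans; \( v(S) = f(S) - f(\emptyset) \) is the characteristic function; and \( \psi_i \) is the importance score of the \( i \)-th of the \( d \) words.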
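The least-squares problem above can be sketched as an ordinary regression in which each parse-tree node contributes one row: a 0/1 indicator of the words it spans, with \( v(S) \) as the target. This is a minimal sketch under assumptions: the helper `ls_tree_values`, the three-word tree, and the additive stand-in for the model are all hypothetical, and a real use would evaluate a trained classifier on each node instead.

```python
import numpy as np

def ls_tree_values(nodes, v, d):
    """Least-squares word scores fitted over parse-tree nodes.

    nodes: list of sets of word indices, one set per parse-tree node
    v:     function mapping a set of word indices to a real value
    d:     number of words in the sentence
    """
    # Row k of X marks which words node k spans.
    X = np.zeros((len(nodes), d))
    for k, S in enumerate(nodes):
        X[k, sorted(S)] = 1.0
    y = np.array([v(S) for S in nodes], dtype=float)
    # Minimise sum_S (v(S) - sum_{i in S} psi_i)^2 over psi.
    psi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return psi, X, y

# Hypothetical tree for a 3-word sentence: three leaves, one phrase {1, 2}, and the root.
tree_nodes = [{0}, {1}, {2}, {1, 2}, {0, 1, 2}]
weights = np.array([1.0, -2.0, 0.5])

def toy_v(S):
    # Stand-in for f(S) - f(empty set): a purely additive "model".
    return float(sum(weights[i] for i in S))

psi, X, y = ls_tree_values(tree_nodes, toy_v, d=3)
print(psi)
```

Because the stand-in model is purely additive, the fit is exact and the scores recover the weights. For a non-additive model the residuals at internal nodes are nonzero.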
2. **Cook's Distance**:
\[
D_i = \text{Const.} \cdot (\hat{\beta}(i) - \hat{\beta})^T X^T X (\hat{\beta}(i) - \hat{\beta})
\]
where \( \hat{\beta} \) is the least-squares estimate fitted on all data points, \( \hat{\beta}(i) \) is the estimate fitted after deleting the \( i \)-th data point, and \( X \) is the design matrix.
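As a rough illustration of the formula above, the sketch below computes Cook's distance by literally refitting the regression with each data point removed. The function name `cooks_distances` and the toy line-fitting data are assumptions for illustration; the constant is left as a parameter because the summary does not specify it, and how the paper maps parse-tree nodes to data points is not shown here.

```python
import numpy as np

def cooks_distances(X, y, const=1.0):
    """Leave-one-out Cook's distances, up to the constant factor `const`."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit on all points
    gram = X.T @ X
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                   # drop the i-th point
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        diff = beta_i - beta
        out[i] = const * (diff @ gram @ diff)
    return out

# Toy data: points on the line y = 1 + 2x, with the last point pushed off the line.
x = np.arange(6, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
y[5] += 10.0

D = cooks_distances(X, y)
print(D)
```

The perturbed point has by far the largest distance, since removing it changes the fitted coefficients the most.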
### Experimental Results
- **Nonlinearity Analysis**: Measuring how strongly each model correlates with the linear model shows BERT to be the most nonlinear of the models compared.
- **Capturing Adversative Relations**: BERT is best at capturing adversative relations (signaled by words such as "not" and "but"), especially on smaller datasets.
Overall, the paper offers a systematic way to explain NLP models through LS-Tree values and Cook's distance, and demonstrates its effectiveness on multiple tasks.