Exploring the performance and explainability of fine-tuned BERT models for neuroradiology protocol assignment

Salmonn Talebi, Elizabeth Tong, Anna Li, Ghiam Yamin, Greg Zaharchuk, Mohammad R. K. Mofrad
DOI: https://doi.org/10.1186/s12911-024-02444-z
IF: 3.298
2024-02-10
BMC Medical Informatics and Decision Making
Abstract: Deep learning has demonstrated significant advancements across various domains. However, its implementation in specialized areas, such as medical settings, remains approached with caution. In these high-stakes environments, understanding the model's decision-making process is critical. This study assesses the performance of different pretrained Bidirectional Encoder Representations from Transformers (BERT) models and examines their decision-making within the context of medical image protocol assignment.
medical informatics
What problem does this paper attempt to address?
The main problem this paper addresses is improving model performance and interpretability in the medical imaging protocol assignment task. Specifically, the research evaluates different pre-trained BERT models (BERT, BioBERT, ClinicalBERT and RoBERTa) on neuroradiology protocol assignment and examines how these models reach their decisions.

### Decomposition of the Main Problem

1. **Improvement of Model Performance**:
   - The researchers aim to improve accuracy on the medical imaging protocol classification task by fine-tuning pre-trained BERT models.
   - They selected four pre-trained models (BERT, BioBERT, ClinicalBERT and RoBERTa) and fine-tuned each one for this specific medical text classification task.
2. **Model Interpretability**:
   - In high-risk fields such as medicine, understanding the model's decision-making process is crucial.
   - The researchers used the Integrated Gradients method to quantify how much each word in the input text contributes to the model's decision, and verified the scores by deleting important and unimportant words and observing the effect.
   - An experienced radiologist reviewed the word importance scores generated by the model to assess whether the model's decisions were in line with human reasoning.
3. **Systematic Error Identification**:
   - The researchers analyzed the model's misclassified cases and discovered potential systematic errors.
   - These errors may include multiple-choice questions, age-related results, ambiguous entries and obvious errors.

### Formula Representation

The key quantities used in the paper can be written as follows:

- **F1 Score**: A metric of classification quality that combines precision and recall as their harmonic mean:
  \[
  F1 = 2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}
  \]
- **Integrated Gradients**: Used to calculate the importance of each word:
  \[
  IG_i(x)=(x_i - x'_i)\cdot\int_{\alpha = 0}^{1}\frac{\partial F(x'+\alpha\cdot(x - x'))}{\partial x_i}d\alpha
  \]
  where \(x\) is the input (the representation of the input text), \(x_i\) is its \(i\)-th component, \(x'\) is the baseline input (usually a zero vector), and \(F\) is the model output.

### Conclusion

The results show that the fine-tuned BERT model performs close to human level on the medical imaging protocol assignment task and can effectively identify key words. By detecting systematic errors, the research provides directions for improving the safety and practicality of the model. In addition, the interpretability analysis makes the model's decisions more transparent, which supports more reliable use in clinical settings.
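To illustrate the Integrated Gradients formula given under Formula Representation, the following is a minimal sketch of a numerical approximation of the path integral. It is not the paper's implementation: it assumes a toy logistic model `F(x) = sigmoid(w . x + b)` in place of a fine-tuned BERT classifier, a zero-vector baseline, and a midpoint Riemann sum with a hypothetical `steps` parameter. All function names, weights, and input values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy differentiable classifier (assumption): F(x) = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model: dF/dx_i = F(x) * (1 - F(x)) * w_i."""
    f = model(x, w, b)
    return [f * (1.0 - f) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG_i(x) with a midpoint Riemann sum over alpha in [0, 1]."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight-line path x' + alpha * (x - x').
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Scale the path-averaged gradient by (x_i - x'_i), as in the formula.
    return [(xi - bi) * gi for xi, bi, gi in zip(x, baseline, avg_grad)]


# Hypothetical 3-feature input; the third feature has zero weight.
w = [2.0, -1.0, 0.0]
b = 0.1
x = [1.0, 0.5, 3.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)

# Completeness property: attributions should sum to F(x) - F(x').
completeness_gap = abs(
    sum(attributions) - (model(x, w, b) - model(baseline, w, b))
)
print(attributions, completeness_gap)
```

In this sketch the feature with zero weight receives zero attribution, and the attributions sum approximately to the difference between the model output at the input and at the baseline. For a real transformer, the same idea would be applied to token embeddings with gradients obtained by automatic differentiation rather than a closed-form expression.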