Abstract: This paper describes our approach to the SemEval-2024 safe biomedical Natural Language Inference for Clinical Trials (NLI4CT) task, which concerns classifying statements about Clinical Trial Reports (CTRs). We explored the capabilities of Mistral-7B, a generalist open-source Large Language Model (LLM). We developed a prompt for the NLI4CT task, and fine-tuned a quantized version of the model using an augmented version of the training dataset. The experimental results show that this approach can produce notable results in terms of the macro F1-score, while having limitations in terms of faithfulness and consistency. All the developed code is publicly available on a GitHub repository.
What problem does this paper attempt to address?
This paper addresses safe biomedical natural language inference (NLI) over clinical trial reports (CTRs). The specific task is to classify statements about CTRs: the paper studies how to use large language models (LLMs) to decide whether a CTR entails or contradicts a statement, with an emphasis on keeping model inferences accurate, faithful, and consistent under the complex reasoning required in the medical domain.
### Background of the Paper
Large language models are now widely applied to natural language processing (NLP) tasks and have achieved strong results on textual entailment. However, they struggle with domain-specific data such as medical text, and they are vulnerable to shortcut learning, factual inconsistency, and performance degradation. SemEval-2024 Task 2, Safe Biomedical Natural Language Inference for Clinical Trials (NLI4CT), therefore evaluates LLMs on this medical task, focusing on the accuracy, faithfulness, and consistency of their inferences.
### Research Methods
1. **Model Selection**: The researchers selected the open-source Mistral-7B model, quantized it, and applied low-rank adaptation (LoRA) fine-tuning to improve its performance on the NLI4CT task.
2. **Data Augmentation**: To increase the diversity of the training data, the researchers created multiple training sets through manual annotation and automatic generation, including:
- **Train_Manual**: Based on the original training set, new samples were generated through negation and rewriting.
- **Train_Manual-Synthetic**: Building on Train_Manual, more samples were generated using automatic methods.
- **Train_Full-Synthetic**: A large number of samples were generated entirely by automatic methods.
3. **Instruction Fine-Tuning**: The researchers instruction-tuned the Mistral-7B model to better adapt it to the NLI4CT task, using a supervised fine-tuning objective with autoregressive language modeling.
4. **Experimental Setup**: The researchers evaluated different prompt templates on the development set and selected the best-performing template. Finally, the best combination of model and training set was evaluated on the test set.
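
The prompt-based setup described in the steps above can be illustrated with a minimal sketch. The template wording, the `Example` class, and the helper names (`build_prompt`, `build_training_text`, `parse_label`) are all hypothetical and are not taken from the paper; the sketch only shows the general shape of turning a CTR and a statement into an instruction prompt, appending the gold answer for supervised fine-tuning, and mapping generated text back to a label.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical template: the paper's actual prompt wording is not reproduced here.
PROMPT_TEMPLATE = (
    "You are given a section of a Clinical Trial Report (CTR) and a statement.\n"
    "Answer with 'Entailment' if the statement follows from the CTR, "
    "or 'Contradiction' otherwise.\n\n"
    "CTR:\n{ctr}\n\n"
    "Statement: {statement}\n\n"
    "Answer:"
)

# The two labels used in the NLI4CT classification task.
LABELS = ("Entailment", "Contradiction")


@dataclass
class Example:
    ctr: str        # text of the relevant CTR section
    statement: str  # statement to classify
    label: str      # gold label, one of LABELS


def build_prompt(ctr: str, statement: str) -> str:
    """Fill the (assumed) template with a CTR section and a statement."""
    return PROMPT_TEMPLATE.format(ctr=ctr.strip(), statement=statement.strip())


def build_training_text(example: Example) -> str:
    """Prompt followed by the gold answer, as one sequence for autoregressive fine-tuning."""
    if example.label not in LABELS:
        raise ValueError(f"unknown label: {example.label!r}")
    return build_prompt(example.ctr, example.statement) + " " + example.label


def parse_label(generated: str) -> Optional[str]:
    """Map generated text to a label, or None if it starts with neither label."""
    text = generated.strip().lower()
    for label in LABELS:
        if text.startswith(label.lower()):
            return label
    return None
```

Comparing prompt templates on the development set then amounts to swapping `PROMPT_TEMPLATE` and re-scoring the parsed predictions.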
### Experimental Results
- **Macro F1 Score**: The macro F1 score of the best model on the test set reached 0.80, ranking first on the leaderboard.
- **Consistency**: The consistency score of the model was 0.72, ranking 15th.
- **Faithfulness**: The faithfulness score of the model was 0.83, ranking 11th.
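
Consistency and faithfulness are both measured on perturbed versions of the original statements. The sketch below is a simplified illustration, not the official NLI4CT scoring code: it assumes each item is a pair of predictions (one for the original statement, one for its perturbed version), and the function names are hypothetical. The official metrics may apply additional filtering, for example restricting to originals that were predicted correctly.

```python
from typing import Sequence, Tuple

# Each pair holds (prediction for original statement, prediction for perturbed statement).
PredictionPair = Tuple[str, str]


def consistency(pairs: Sequence[PredictionPair]) -> float:
    """Share of meaning-preserving rewrites that keep the same prediction."""
    if not pairs:
        raise ValueError("need at least one pair")
    same = sum(1 for original, perturbed in pairs if original == perturbed)
    return same / len(pairs)


def faithfulness(pairs: Sequence[PredictionPair]) -> float:
    """Share of meaning-altering rewrites for which the prediction changes."""
    if not pairs:
        raise ValueError("need at least one pair")
    changed = sum(1 for original, perturbed in pairs if original != perturbed)
    return changed / len(pairs)
```

Under this simplified view, a model that ignores the statement and always predicts the same label would score 1.0 on consistency and 0.0 on faithfulness, which is why the two metrics are reported together.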
### Conclusions
Although the model performs well in terms of classification accuracy, it performs poorly when statements are perturbed (for example, it predicts the same label for contradictory rewrites and different labels for paraphrases). The researchers believe that adding high-quality manually annotated samples, especially adversarial rewrites, can further improve the model. Future work includes exploring different models, optimizing prompt templates, adding more training data, and carefully constructing new training sets that focus on interventions applied to statements rather than on the number of base statements.
### Formula Representation
The paper does not rely on complex formulas. For reference, the F1 score for a single class, which the macro F1 metric averages over classes, is:
\[ \text{F1} = 2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}} \]
where:
- **Precision**: \( \text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}} \), where TP and FP are the numbers of true positives and false positives.
- **Recall**: \( \text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}} \), where FN is the number of false negatives.
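
The per-class formula above extends to macro F1 by computing F1 separately for each label and taking the unweighted mean. The following is a minimal sketch of that computation; the function names are illustrative, and the label set is assumed to be the two NLI4CT classes.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], positive: str) -> float:
    """F1 score treating `positive` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    labels: Sequence[str] = ("Entailment", "Contradiction"),
) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(f1_for_class(gold, pred, label) for label in labels) / len(labels)
```

Because every class contributes equally to the mean, macro F1 does not reward a model for favoring the more frequent label.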