Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4

Aryo Pradipta Gema, Giwon Hong, Pasquale Minervini, Luke Daines, Beatrice Alex
2024-03-31
Abstract: The NLI4CT task assesses Natural Language Inference systems in predicting whether hypotheses entail or contradict evidence from Clinical Trial Reports. In this study, we evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning (PEFT). We propose a PEFT method to improve the consistency of LLMs by merging adapters that were fine-tuned separately using triplet and language modelling objectives. We found that merging the two PEFT adapters improves the F1 score (+0.0346) and consistency (+0.152) of the LLMs. However, our novel methods did not produce more accurate results than GPT-4 in terms of faithfulness and consistency. Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328. Finally, our contamination analysis with GPT-4 indicates that there was no test data leakage.
Computation and Language
What problem does this paper attempt to address?
The paper addresses Natural Language Inference (NLI) over Clinical Trial Reports (CTRs): predicting whether a hypothesis is entailed or contradicted by the evidence in a report. Specifically, it focuses on the following aspects:

1. **Improving model accuracy**: Evaluating different large language models (LLMs) on the NLI task and exploring several strategies (Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning) to improve accuracy.
2. **Enhancing model consistency and faithfulness**: Beyond accuracy, the paper pays special attention to consistency, i.e., stable predictions when the input is changed without altering its meaning, and to faithfulness, i.e., predictions that track the evidence and change when the meaning of the input changes.
3. **Proposing a new fine-tuning method**: The paper introduces a Parameter-Efficient Fine-Tuning (PEFT) method that merges adapters fine-tuned separately with different training objectives (a triplet objective and a language modelling objective) to improve F1 and consistency.
4. **Comparison with GPT-4**: The paper compares the proposed models with GPT-4 and finds that the proposed methods do not outperform GPT-4 in terms of faithfulness and consistency.
5. **Data leakage analysis**: To ensure a fair comparison, the paper runs a contamination analysis to check whether NLI4CT test data leaked into GPT-4's training data, and reports no evidence of leakage.

Overall, the paper aims to improve NLI systems for clinical trial reports across three metrics: F1, consistency, and faithfulness.
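
The adapter-merging idea in point 3 can be illustrated with a minimal sketch. It assumes LoRA-style low-rank adapters and a simple weighted sum of their weight updates; the summary above does not specify the adapter type or the merge rule the authors used, so both are assumptions made here for illustration. The function names (`lora_delta`, `merge_deltas`), the toy matrices, and the 0.5/0.5 weights are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_b: Matrix, lora_a: Matrix, scale: float) -> Matrix:
    """Dense weight update implied by one low-rank adapter: scale * (B @ A)."""
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]


def merge_deltas(deltas: Dict[str, Matrix], weights: Dict[str, float]) -> Matrix:
    """Weighted sum of adapter updates that all share the same shape.

    Assumption: a linear merge. Other merge rules exist and may behave differently.
    """
    names = list(deltas)
    rows, cols = len(deltas[names[0]]), len(deltas[names[0]][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for name in names:
        for i in range(rows):
            for j in range(cols):
                merged[i][j] += weights[name] * deltas[name][i][j]
    return merged


# Toy rank-1 adapters for a 2x2 weight matrix (values are illustrative only).
triplet = lora_delta([[1.0], [0.0]], [[2.0, 0.0]], scale=1.0)  # triplet-objective adapter
lm = lora_delta([[0.0], [1.0]], [[0.0, 4.0]], scale=1.0)       # language-modelling adapter

merged = merge_deltas(
    {"triplet": triplet, "lm": lm},
    {"triplet": 0.5, "lm": 0.5},  # hypothetical equal weighting
)
print(merged)  # [[1.0, 0.0], [0.0, 2.0]]
```

In this sketch the merged update would be added to the frozen base weights at inference time, so a single model carries the effect of both training objectives. The merge weights are a free choice and would normally be tuned on validation data.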