Maël Jullien, Marco Valentino, Hannah Frost, Paul O'Regan, Donal Landers, André Freitas
Abstract: How can we interpret and retrieve medical evidence to support clinical decisions? Clinical trial reports (CTR) amassed over the years contain indispensable information for the development of personalized medicine. However, it is practically infeasible to manually inspect more than 400,000 clinical trial reports in order to find the best evidence for experimental treatments. Natural Language Inference (NLI) offers a potential solution to this problem, by allowing the scalable computation of textual entailment. However, existing NLI models perform poorly on biomedical corpora, and previously published datasets fail to capture the full complexity of inference over CTRs. In this work, we present a novel resource to advance research on NLI for reasoning on CTRs. The resource includes two main tasks. Firstly, to determine the inference relation between a natural language statement and a CTR. Secondly, to retrieve supporting facts to justify the predicted relation. We provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these tasks. Baselines on this corpus expose the limitations of existing NLI models, with 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To the best of our knowledge, we are the first to design a task that covers the interpretation of full CTRs. To encourage further work on this challenging dataset, we make the corpus, competition leaderboard, website and code to replicate the baseline experiments available at: https://github.com/ai-systems/nli4ct
What problem does this paper attempt to address?
The paper addresses the challenge of natural language inference (NLI) over clinical trial reports (CTRs): how to efficiently interpret and retrieve medical evidence to support clinical decision-making. It proposes two main tasks:
1. **Textual Entailment Task**: Determine the inference relation between a natural language statement and a clinical trial report. The model must understand the content of the report and judge whether the statement follows from it, that is, whether the report entails or contradicts the statement.
2. **Evidence Extraction Task**: Extract the facts from the clinical trial report that support the predicted relation. The model must not only judge the relation between the statement and the report, but also identify the specific parts of the report that justify this judgment.
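The two tasks can be illustrated with a minimal sketch of one annotated instance and a possible score for the evidence task. The class name, field names, and example sentences are hypothetical and are not taken from the NLI4CT release. Scoring retrieved evidence by F1 over line indices is also an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One hypothetical statement/CTR pair (field names are assumptions)."""
    statement: str
    ctr_lines: list[str]   # premise: lines of a clinical trial report section
    label: str             # "Entailment" or "Contradiction"
    evidence: set[int]     # indices into ctr_lines that justify the label


def evidence_f1(predicted: set[int], gold: set[int]) -> float:
    """F1 between predicted and gold evidence line indices."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example: the report shows 30% nausea, so the statement is contradicted.
example = Instance(
    statement="More than half of the participants experienced nausea.",
    ctr_lines=[
        "Adverse Events:",
        "Nausea: 12/40 (30.0%)",
        "Fatigue: 5/40 (12.5%)",
    ],
    label="Contradiction",
    evidence={1},
)

# A system that retrieves lines 1 and 2 finds the gold line plus one extra line.
score = evidence_f1({1, 2}, example.evidence)
```

In this sketch Task 1 corresponds to predicting `label` from `statement` and `ctr_lines`, and Task 2 corresponds to predicting `evidence`.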
### Background and Motivation
- **Importance of Clinical Trial Reports**: Clinical trial reports have accumulated a large amount of information about the effectiveness and safety of new treatment methods, and this information is crucial for the development of personalized medicine.
- **Limitations of Manual Review**: It is impractical to manually review more than 400,000 clinical trial reports to find the best evidence.
- **Deficiencies of Existing NLI Models**: Existing NLI models perform poorly on biomedical corpora, and previously published datasets do not capture the full complexity of inference over clinical trial reports.
### Main Contributions of the Paper
1. **Defined a New Benchmark** (NLI4CT), including two main tasks that cover multiple fundamental challenges faced by modern NLI systems.
2. **Released a New Public Corpus**, which contains 2,400 expert-annotated entailment relationships, as well as related clinical trial reports, labels, and lists of extracted evidence.
3. **Conducted an Extensive Empirical Evaluation of Existing NLI Models**, demonstrating the limitations and challenges of current models in solving the proposed tasks. 7 representative NLI models were tested, and the highest F1 score was 0.644.
### Main Challenges
- **Biomedical Reasoning**: Handling abbreviations, synonyms, taxonomic relationships, and domain knowledge.
- **Common-sense Reasoning**: Coreference resolution and general world knowledge.
- **Numerical Reasoning**: Comparing doses, frequencies, and percentages, which usually requires unit conversion.
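To make the numerical reasoning challenge concrete, here is a small sketch of a dose comparison that only succeeds after unit conversion. The conversion table, function names, and cohort doses are invented for illustration and do not come from the dataset.

```python
# Conversion factors to milligrams (only the units this toy example needs).
TO_MG = {"mcg": 0.001, "mg": 1.0, "g": 1000.0}


def to_mg(value: float, unit: str) -> float:
    """Convert a dose to milligrams so that doses become comparable."""
    return value * TO_MG[unit]


def dose_is_higher(dose_a: tuple[float, str], dose_b: tuple[float, str]) -> bool:
    """Return True if dose_a is strictly larger than dose_b after conversion."""
    return to_mg(*dose_a) > to_mg(*dose_b)


# Hypothetical statement: "Cohort 1 received a higher dose than cohort 2."
cohort_1 = (0.5, "g")     # 0.5 g is 500 mg
cohort_2 = (250, "mg")
statement_holds = dose_is_higher(cohort_1, cohort_2)
```

Comparing the raw numbers (0.5 against 250) would give the wrong answer, which is why a model that matches surface tokens without normalizing units is likely to fail on such statements.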
### Experimental Results
- **Performance of Baseline Models**: In task 1, the best-performing models are BioBERT and BioMegatron, with F1 scores both exceeding 0.644. However, these models show severe overfitting on the training set.
- **Evidence-Only Baseline**: Using only the gold evidence as the premise does not significantly improve performance, indicating that even after 10 epochs of training, the models do not effectively derive conclusions from the relevant evidence.
- **Statistical Artifacts**: Experiments that use only the statements, without access to the clinical trial reports, show that the models rely on superficial statistical artifacts rather than learning the underlying rules of the task.
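The idea behind a statement-only probe can be sketched with a toy classifier that never reads the report. This is not the paper's baseline implementation: the token-voting scheme, function names, and training sentences are all invented for illustration.

```python
from collections import Counter, defaultdict


def train_statement_only(examples):
    """Count how often each token co-occurs with each label.

    The clinical trial report is never read: only the statement is used.
    """
    counts = defaultdict(Counter)
    for statement, label in examples:
        for token in statement.lower().split():
            counts[token][label] += 1
    return counts


def predict(counts, statement, default="Entailment"):
    """Sum the label votes of every known token; fall back to a default label."""
    votes = Counter()
    for token in statement.lower().split():
        votes.update(counts.get(token, {}))
    return votes.most_common(1)[0][0] if votes else default


# Invented training statements in which negation words correlate with one label.
train = [
    ("patients in cohort 1 did not receive placebo", "Contradiction"),
    ("no patients in cohort 2 reported nausea", "Contradiction"),
    ("patients in cohort 1 received placebo", "Entailment"),
    ("some patients in cohort 2 reported nausea", "Entailment"),
]
model = train_statement_only(train)
guess = predict(model, "patients did not report fatigue")
```

If a probe of this kind scores well above chance, the labels can be partly guessed from wording alone, which is the sort of artifact the experiment above is designed to expose.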
### Conclusion
By proposing the NLI4CT dataset and its two tasks, this paper provides a new benchmark for natural language inference research in the biomedical field. Existing NLI models show some capability on the benchmark, but they still face many challenges, especially in complex biomedical and numerical reasoning. Future research needs to improve the models further to meet these challenges.