Abstract:

Objective: Clinical knowledge enriched transformer models (e.g., ClinicalBERT) have achieved state-of-the-art results on clinical NLP (natural language processing) tasks. A core limitation of these transformer models is the substantial memory consumption of their full self-attention mechanism, which leads to performance degradation on long clinical texts. To overcome this, we propose to leverage long-sequence transformer models (e.g., Longformer and BigBird), which extend the maximum input sequence length from 512 to 4096 tokens, to better model long-term dependencies in long clinical texts.

Materials and Methods: Inspired by the success of long-sequence transformer models and the fact that clinical notes are mostly long, we introduce two domain-enriched language models, Clinical-Longformer and Clinical-BigBird, which are pre-trained on a large-scale clinical corpus. We evaluate both language models on 10 baseline tasks, including named entity recognition, question answering, natural language inference, and document classification.

Results: Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT and other short-sequence transformers on all 10 downstream tasks and achieve new state-of-the-art results.

Discussion: Our pre-trained language models provide the bedrock for clinical NLP using long texts. We have made our source code available at https://github.com/luoyuanlab/Clinical-Longformer, and the pre-trained models available for public download at https://huggingface.co/yikuan8/Clinical-Longformer.

Conclusion: This study demonstrates that clinical knowledge enriched long-sequence transformers are able to learn long-term dependencies in long clinical text. Our methods can also inspire the development of other domain-enriched long-sequence transformers.
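
The memory argument in the Objective can be made concrete with a back-of-the-envelope count of attention scores per head. The sketch below is illustrative only: the function names are hypothetical, the local window of 512 tokens is an assumption chosen for the example rather than a value stated in the abstract, and global tokens (Longformer) and random attention blocks (BigBird) are ignored.

```python
def full_attention_entries(seq_len: int) -> int:
    """Pairwise attention scores per head under full self-attention (n * n)."""
    return seq_len * seq_len


def sliding_window_entries(seq_len: int, window: int) -> int:
    """Upper bound on scores per head when each token attends to at most
    `window` neighbors (n * w). `window` is an assumed hyperparameter."""
    return seq_len * min(window, seq_len)


if __name__ == "__main__":
    window = 512  # assumption for illustration, not taken from the abstract
    for n in (512, 4096):
        full = full_attention_entries(n)
        local = sliding_window_entries(n, window)
        print(f"n={n}: full={full:,} local={local:,} ratio={full / local:.1f}")
```

Under these assumptions, going from 512 to 4096 tokens multiplies the full-attention score count by 64, while the windowed count grows only by a factor of 8, which is the kind of saving that makes 4096-token inputs practical.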

Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study

Learning structures of the French clinical language: development and validation of word embedding models using 21 million clinical reports from electronic health records

DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

A Comparative Study of Pretrained Language Models for Long Clinical Text

CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data

A Benchmark Evaluation of Clinical Named Entity Recognition in French

DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain

Impact of translation on biomedical information extraction from real-life clinical notes

How Long Is Enough? Exploring the Optimal Intervals of Long-Range Clinical Note Language Modeling

CroissantLLM: A Truly Bilingual French-English Language Model

Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding

Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario

FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

From pre-training to fine-tuning: An in-depth analysis of Large Language Models in the biomedical domain

Improving Transformer Performance for French Clinical Notes Classification Using Mixture of Experts on a Limited Dataset

Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data

Pre-training data selection for biomedical domain adaptation using journal impact metrics

The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models

Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences