Abstract:

Background: The transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To lower the barrier to using transformer-based models in medical language understanding and to extend the scikit-learn toolkit to deep learning, we propose an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits.

Methods: In transformers-sklearn, three Python classes were implemented: BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods: fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is automatically generated by transformers-sklearn from the annotated corpus. Newcomers only need to prepare the dataset; the model framework and training methods are predefined in transformers-sklearn.

Results: We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multilabel classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. Across the four medical NLP tasks, the average code size of our scripts is 45 lines/task, which is one-sixth the size of the transformers scripts. The experimental results show that transformers-sklearn based on pretrained BERT models achieved macro F1 scores of 0.8225, 0.8703, and 0.6908 on the TrialClassification, BC5CDR, and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 on the BIOSSES task, which is consistent with the results of transformers.

Conclusions: The proposed toolkit could help newcomers easily address medical language understanding tasks using the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to expand the applications of transformers-sklearn.
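The three-method interface described in the abstract (fit, score, predict) can be illustrated with a small, self-contained sketch. The class below, ToyTextClassifier, is hypothetical and is not part of transformers-sklearn: a simple token-vote model stands in for the fine-tuned transformer so that the example runs with the standard library alone. The constructor parameters only echo the names model_type and model_name_or_path mentioned in the abstract, and returning macro F1 from score is an assumption made for this sketch, not a statement about the toolkit's behavior.

```python
from collections import Counter, defaultdict


class ToyTextClassifier:
    """Hypothetical stand-in that mimics a fit/score/predict interface.

    NOT the transformers-sklearn implementation: a bag-of-words vote
    replaces the fine-tuned transformer so the sketch needs no dependencies.
    """

    def __init__(self, model_type="toy", model_name_or_path=None):
        # Names echo the parameters mentioned in the abstract; unused here.
        self.model_type = model_type
        self.model_name_or_path = model_name_or_path

    def fit(self, X, y):
        # "Training": count how often each token co-occurs with each label.
        self.token_label_counts_ = defaultdict(Counter)
        self.default_label_ = Counter(y).most_common(1)[0][0]
        for text, label in zip(X, y):
            for token in text.lower().split():
                self.token_label_counts_[token][label] += 1
        return self

    def predict(self, X):
        # Each known token votes for the labels it was seen with.
        predictions = []
        for text in X:
            votes = Counter()
            for token in text.lower().split():
                votes.update(self.token_label_counts_.get(token, {}))
            if votes:
                predictions.append(votes.most_common(1)[0][0])
            else:
                predictions.append(self.default_label_)
        return predictions

    def score(self, X, y):
        # Macro F1 (assumed metric): unweighted mean of per-label F1 scores.
        y_pred = self.predict(X)
        f1_scores = []
        for label in sorted(set(y)):
            tp = sum(p == label and t == label for p, t in zip(y_pred, y))
            fp = sum(p == label and t != label for p, t in zip(y_pred, y))
            fn = sum(p != label and t == label for p, t in zip(y_pred, y))
            denom = 2 * tp + fp + fn
            f1_scores.append(2 * tp / denom if denom else 0.0)
        return sum(f1_scores) / len(f1_scores)


# Made-up toy data, for illustration only.
X_train = ["fever and cough", "chest pain", "cough with fever", "sharp chest pain"]
y_train = ["respiratory", "cardiac", "respiratory", "cardiac"]
X_test = ["persistent cough", "pain in chest"]
y_test = ["respiratory", "cardiac"]

clf = ToyTextClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)
macro_f1 = clf.score(X_test, y_test)
```

The point of the pattern is that the caller only prepares X and y; everything model-specific stays behind the three methods, which is the coding style the abstract attributes to scikit-learn.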

PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for Pathology Domain

A Comparative Study of Pretrained Language Models for Long Clinical Text

Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4)

Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports

A BERT model generates diagnostically relevant semantic embeddings from pathology synopses with active learning

Transformers and the Representation of Biomedical Background Knowledge

Extracting Pulmonary Nodules and Nodule Characteristics from Radiology Reports of Lung Cancer Screening Patients Using Transformer Models

Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT

Time to Embrace Natural Language Processing (NLP)-based Digital Pathology: Benchmarking NLP- and Convolutional Neural Network-based Deep Learning Pipelines

Automatic Report Generation for Histopathology images using pre-trained Vision Transformers and BERT

CancerBERT: a BERT model for Extracting Breast Cancer Phenotypes from Electronic Health Records

Bioformer: an efficient transformer language model for biomedical text mining

A Cross-institutional Evaluation on Breast Cancer Phenotyping NLP Algorithms on Electronic Health Records

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

BulkRNABert: Cancer prognosis from bulk RNA-seq based language models

A Comparative Evaluation Of Transformer Models For De-Identification Of Clinical Text Data

Transformers-sklearn: a toolkit for medical language understanding with transformer-based models

The Utility of General Domain Transfer Learning for Medical Language Tasks

Transformer Models in Healthcare: A Survey and Thematic Analysis of Potentials, Shortcomings and Risks

Applications of transformer-based language models in bioinformatics: a survey