TabMedBERT: A Tabular Knowledge Enhanced Biomedical Pretrained Language Model

Xu Yan, Lei Geng, Ziqiang Cao, Juntao Li, Wenjie Li, Sujian Li, Xinjie Zhou, Yang, Jun Zhang
DOI: https://doi.org/10.3233/faia240674
2024-01-01
Abstract: Most existing biomedical language models are trained on plain text with general learning objectives such as random word infilling, and so fail to capture the knowledge in the biomedical corpus sufficiently. Since biomedical articles usually contain many tables summarising the main entities and their relations, in this paper we propose a Tabular knowledge enhanced bioMedical pretrained language model, called TabMedBERT. Specifically, we align entities between table cells and article text spans using pre-defined rules. We then add two table-related self-supervised tasks to integrate tabular knowledge into the language model: Entity Infilling (EI) and Table Cloze Test (TCT). EI masks tokens within aligned entities in the article, while TCT converts aligned entities in the table layout into a cloze text by erasing one entity and prompts the model to extract the appropriate span to fill in the blank. Experimental results demonstrate that TabMedBERT surpasses all competing language models without adding parameters, establishing a new state-of-the-art performance of 85.59% (+1.29%) on the BLURB biomedical NLP benchmark and 7 additional information extraction datasets. Moreover, the model architecture for TCT provides a straightforward solution to revise information extraction with paired entities.
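The two pretraining tasks named in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the helper names `entity_infilling` and `table_cloze`, the `[MASK]` and `[BLANK]` tokens, the "column: value" linearisation of a table row, and the toy sentence and table row. In particular, the sketch masks every token in an aligned span, whereas the paper may mask only some of them.

```python
from typing import Dict, List, Tuple

MASK = "[MASK]"    # assumed mask token
BLANK = "[BLANK]"  # assumed placeholder for the erased table entity


def entity_infilling(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], Dict[int, str]]:
    """EI sketch: mask every token inside the aligned entity spans.

    Spans are (start, end) token indices with an exclusive end.
    Returns the masked tokens and a position -> original token map.
    """
    masked = list(tokens)
    labels: Dict[int, str] = {}
    for start, end in spans:
        for i in range(start, end):
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels


def table_cloze(row: Dict[str, str], erased_column: str) -> Tuple[str, str]:
    """TCT sketch: linearise one table row and erase a single entity.

    Returns the cloze text and the erased entity, which a model would
    then have to locate as a span in the article text.
    """
    parts = []
    for column, cell in row.items():
        value = BLANK if column == erased_column else cell
        parts.append(f"{column}: {value}")
    return " ; ".join(parts), row[erased_column]


# Toy data, invented for this example only.
tokens = "Patients received metformin 500 mg twice daily".split()
masked, labels = entity_infilling(tokens, [(2, 3)])
cloze, answer = table_cloze({"Drug": "metformin", "Dose": "500 mg"}, "Drug")
```

In this sketch the EI labels would feed a masked-token prediction loss, while the TCT answer would serve as the target of a span-extraction head over the article text; how TabMedBERT actually formulates either loss is not stated in the abstract.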