Supplementing Domain Knowledge to BERT with Semi-Structured Information of Documents

Jing Chen, Zhihua Wei, Jiaqi Wang, Rui Wang, Chuanyang Gong, Hongyun Zhang, Duoqian Miao
DOI: https://doi.org/10.1016/j.eswa.2023.121054
IF: 8.5
2024-01-01
Expert Systems with Applications
Abstract: Domain adaptation is an effective way to boost BERT’s performance on domain-specific natural language processing (NLP) tasks. Common domain adaptation methods, however, can be deficient in capturing domain knowledge. Meanwhile, the context fragmentation inherent in Transformer-based models also hinders the acquisition of domain knowledge. Considering the semi-structured nature of documents and its potential for alleviating these problems, we leverage the semi-structured information of documents to supplement BERT with domain knowledge. To this end, we propose a topic-based domain adaptation method, which enhances the capture of domain knowledge at various levels of text granularity. Specifically, topic masked language modeling is designed at the paragraph level for pre-training, and a topic-subsection matching degree dataset is automatically constructed at the subsection level for intermediate fine-tuning. Experiments are conducted over four biomedical NLP tasks across six datasets. The results show that our method benefits BERT, RoBERTa, SpanBERT, BioBERT, and PubMedBERT in nearly all cases. We see significant gains in two question answering (QA) tasks, especially customer health QA, the topic-related one, with an average accuracy improvement of 4.8%. Thus, the semi-structured information of documents can be exploited to make BERT capture domain knowledge more effectively.