BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text

Ronghui You, Yuxuan Liu, Hiroshi Mamitsuka, Shanfeng Zhu
DOI: https://doi.org/10.1093/bioinformatics/btaa837
IF: 5.8
2020-09-25
Bioinformatics
Abstract:

Motivation: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture only some pre-defined sections in full text and (iii) ignores the whole MEDLINE database.

Results: We propose a computationally lighter, full-text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which lets BERTMeSH capture deep semantics of full text; and (ii) a transfer learning strategy that uses both full text in PubMed Central (PMC) and title and abstract only (no full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Also, prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, demonstrating the computational efficiency of BERTMeSH.

Supplementary information: Supplementary data are available at Bioinformatics online.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?
This paper aims to address the issue of large-scale automatic Medical Subject Headings (MeSH) indexing. Specifically:

1. **Problems with existing methods**:
   - Current methods (such as FullMeSH) have three main drawbacks:
     - Using Learning To Rank (LTR) is time-consuming;
     - They can only capture predefined parts of the full text;
     - They ignore the entire MEDLINE database.
2. **Proposed new method**:
   - The paper proposes BERTMeSH, a deep learning-based full-text MeSH indexing method, with the following features:
     - It employs the pre-trained deep contextual representation model BERT, which can capture the deep semantics of the full text;
     - It uses a transfer learning strategy, combining full-text data from PubMed Central (PMC) and title and abstract data from MEDLINE, leveraging the advantages of both;
     - It has higher accuracy and computational efficiency, showing significant improvements over FullMeSH on multiple metrics.

Through these improvements, BERTMeSH aims to enhance the accuracy and efficiency of large-scale full-text MeSH indexing, thereby better supporting biomedical text mining and information retrieval applications.
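The headline metric in the abstract is the Micro F-measure. As a rough illustration of how such a score is usually computed for multi-label indexing, the sketch below pools true positives, false positives and false negatives over all articles before computing precision and recall. It follows the standard definition of micro-averaging and is not the authors' evaluation code. The function name `micro_f_measure` and the example MeSH headings are assumptions made for illustration only.

```python
def micro_f_measure(true_labels, predicted_labels):
    """Micro-averaged F-measure for multi-label predictions.

    Both arguments are sequences with one collection of labels per article.
    Counts are pooled over all articles before precision/recall are computed.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # headings correctly predicted
        fp += len(pred - truth)   # predicted but not in the gold annotation
        fn += len(truth - pred)   # gold headings that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: two articles with made-up gold and predicted headings.
gold = [{"Humans", "Neoplasms"}, {"Mice", "Animals", "Liver"}]
pred = [{"Humans", "Neoplasms", "Adult"}, {"Mice", "Animals"}]

# Pooled counts: tp = 4, fp = 1, fn = 1, so precision = recall = 0.8.
print(round(micro_f_measure(gold, pred), 3))  # prints 0.8
```

Because the counts are pooled, frequently assigned headings influence the score more than rare ones, which is why micro-averaged figures are commonly reported alongside other metrics.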