C2BERT - Cross-contrast BERT for Chinese Biomedical Sentence Representation.

Xiaosu Wang, Yun Xiong, Hao Niu, Yao Zhang, Yangyong Zhu
DOI: https://doi.org/10.1109/BIBM52615.2021.9669855
2021-01-01
Abstract: Pre-trained language models (PLMs), such as BERT, have achieved great success on various natural language processing (NLP) tasks. Nevertheless, we observe that native PLM-derived Chinese biomedical sentence representations are somewhat collapsed: PLMs induce a non-smooth, anisotropic semantic space in which most sentences are mapped into a small region and therefore exhibit high similarity to one another. Such native sentence representations capture the semantic meaning of Chinese biomedical sentences poorly. To alleviate this collapse issue, we propose a novel contrastive learning framework, named Cross-contrast BERT (C2BERT), that advances the state of the art in Chinese biomedical sentence embeddings. C2BERT derives positive/negative samples from two different transformer-based PLMs; this design reflects our goal of combining the knowledge stored in different PLMs to produce Chinese biomedical sentence embeddings, rather than introducing new noise. Moreover, without costly further pretraining, C2BERT exploits contrastive learning as an auxiliary training objective during fine-tuning with supervision from biomedical sentence-related tasks. We demonstrate with extensive experiments that C2BERT is more effective than competitive baselines on diverse Chinese biomedical sentence-related tasks.
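The abstract does not give the exact loss, so the following is only a minimal sketch of the general idea, assuming an InfoNCE-style objective: the embeddings of the same sentence from two different PLMs form the positive pair, and the other sentences in the batch serve as negatives. The function names, the temperature value, the auxiliary weight, and the assumption that both encoders produce embeddings of the same dimension are illustrative choices, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_contrast_loss(emb_a, emb_b, temperature=0.05):
    """InfoNCE-style loss across two encoders (an assumed, illustrative form).

    emb_a[i] and emb_b[i] are embeddings of sentence i from two different
    PLMs (the positive pair); emb_b[j] with j != i act as in-batch negatives.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity of the anchor to every candidate.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Cross-entropy with the matching index i as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)


def joint_loss(task_loss, contrastive_loss, weight=0.1):
    """Add the contrastive term as an auxiliary objective (weight is hypothetical)."""
    return task_loss + weight * contrastive_loss


# Toy batch of two sentences with 2-dimensional embeddings.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b_aligned = [[0.9, 0.1], [0.1, 0.9]]  # encoder B agrees with encoder A
emb_b_swapped = [[0.1, 0.9], [0.9, 0.1]]  # encoder B pairs the wrong sentences

aligned_loss = cross_contrast_loss(emb_a, emb_b_aligned)
swapped_loss = cross_contrast_loss(emb_a, emb_b_swapped)
print(aligned_loss < swapped_loss)
```

In this toy batch the loss is low when the two encoders place the same sentence close together and high when they disagree, which is the behavior a cross-encoder contrastive term is meant to encourage during fine-tuning.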