CSDR-BERT: a pre-trained scientific dataset match model for Chinese Scientific Dataset Retrieval

Xintao Chu, Jianping Liu, Jian Wang, Xiaofeng Wang, Yingfei Wang, Meng Wang, Xunxun Gu
DOI: https://doi.org/10.48550/arXiv.2301.12700
2023-03-30
Abstract: As the number of open and shared scientific datasets on the Internet increases under the open science movement, efficiently retrieving these datasets has become a crucial task in information retrieval (IR) research. In recent years, the development of large models, particularly the pre-training and fine-tuning paradigm, in which a large model is pre-trained and then fine-tuned on downstream tasks, has provided new solutions for IR matching tasks. In this study, we use the original BERT token in the embedding layer, improve the Sentence-BERT model structure in the model layer by introducing SimCSE and the K-Nearest Neighbors (KNN) method, and use the CoSENT loss function in the optimization phase to optimize the target output. Comparative and ablation experiments show that our model outperforms competing models on both public and self-built datasets. This study explores and validates the feasibility and efficiency of pre-training techniques for semantic retrieval of Chinese scientific datasets.
Information Retrieval
What problem does this paper attempt to address?
This paper addresses the following problem: with the rise of the open science movement, the number of openly shared scientific datasets on the Internet keeps growing, and retrieving them efficiently has become a key task in information retrieval (IR) research. Traditional retrieval methods cannot give researchers quick and accurate answers as the number of scientific datasets grows rapidly. The research therefore aims to improve semantic retrieval of Chinese scientific datasets by introducing pre-trained models and text-matching computation. Specifically, the paper proposes CSDR-BERT, a semantic text-matching model based on pre-training technology that enhances the text representation of Chinese scientific datasets, and improves the Sentence-BERT model structure through contrastive learning and the K-Nearest Neighbors (KNN) method. The research also uses the CoSENT loss function for optimization, improving the model's performance on public and self-built datasets.

### Main contributions

1. **Collect and create a scientific dataset for semantic text matching**, and construct a vocabulary.
2. **Improve the Sentence-BERT model structure**, develop CSDR-BERT, a semantic retrieval model specifically for Chinese scientific datasets, and construct clusters through contrastive learning to improve performance on semantic text-matching tasks.
3. **Use SimCSE and the CSL dataset for pre-training**, enhancing the knowledge base of the pre-trained model and improving its semantic-matching ability.
4. **Conduct experiments on self-built and public datasets** to verify the effectiveness of the model.

### Key problems solved

- **Semantic complexity**: Sentences in the scientific field contain complex metadata information models that traditional models struggle to recognize.
- **Domain-specific expressions**: Terminology differs between research fields, and the same meaning may have different expressions (such as varicella and herpes zoster).
- **Cross-domain expression differences**: Different research fields have their own ways of expressing terms (such as glucose and C6H12O6).

Through these improvements, the CSDR-BERT model can match texts in Chinese scientific datasets more accurately at the semantic level, thereby improving the efficiency and accuracy of scientific data retrieval.
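To make the optimization step more concrete, here is a minimal NumPy sketch of the CoSENT objective as it is commonly formulated: for every two sentence pairs where the first is labeled more similar than the second, the loss penalizes the second pair having the higher cosine similarity. This illustrates the general technique and is not the authors' implementation. The function names, the toy vectors, and the scale factor of 20 are assumptions and are not taken from the paper.

```python
import numpy as np


def cosine_rows(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def cosent_loss(emb_a, emb_b, labels, scale=20.0):
    """CoSENT-style ranking loss over a batch of sentence pairs.

    emb_a[i] and emb_b[i] are the embeddings of the two texts in pair i,
    labels[i] is the similarity label of pair i (higher = more similar).
    `scale` is an assumed hyperparameter, not a value from the paper.
    """
    cos = cosine_rows(np.asarray(emb_a, dtype=float),
                      np.asarray(emb_b, dtype=float)) * scale
    labels = np.asarray(labels, dtype=float)

    # diff[i, j] = cos[j] - cos[i]
    diff = cos[None, :] - cos[:, None]
    # Keep entries where pair i should rank above pair j.
    mask = labels[:, None] > labels[None, :]
    terms = diff[mask]

    # loss = log(1 + sum(exp(terms))), computed as a stable log-sum-exp;
    # the prepended 0.0 contributes the "1 +" term.
    terms = np.concatenate([[0.0], terms])
    m = terms.max()
    return float(m + np.log(np.sum(np.exp(terms - m))))


# Toy batch: pair 0 is identical vectors, pair 1 is orthogonal vectors.
queries = np.array([[1.0, 0.0], [1.0, 0.0]])
datasets = np.array([[1.0, 0.0], [0.0, 1.0]])

loss_consistent = cosent_loss(queries, datasets, labels=[1, 0])
loss_inverted = cosent_loss(queries, datasets, labels=[0, 1])
print(loss_consistent, loss_inverted)
```

When the labels agree with the cosine ordering, the loss is close to zero; when they are inverted, it grows roughly in proportion to the scaled similarity gap. In a real training loop the same expression would be written in an autodiff framework so that gradients flow back into the encoder.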