CCDC: A Chinese-Centric Cross Domain Contrastive Learning Framework

Hao Yang, Shimin Tao, Minghan Wang, Min Zhang, Daimeng Wei, Shuai Zhao, Miaomiao Ma, Ying Qin
DOI: https://doi.org/10.1007/978-3-031-10986-7_18
2022-01-01
Abstract: Unsupervised and supervised SimCSE [5] achieve state-of-the-art (SOTA) performance on sentence-level semantic representation by combining contrastive learning with dropout-based data augmentation. In particular, supervised SimCSE mines positive pairs and hard-negative pairs from Natural Language Inference (NLI) entailment/contradiction labels, and it significantly outperforms other unsupervised and supervised models. As NLI data is scarce, can we construct pseudo-NLI data to improve the semantic representation of multi-domain sentences? This paper proposes a Chinese-centric Cross Domain Contrastive learning framework (CCDC), which provides a "Hard/Soft NLI Data Builder" that annotates entailment/contradiction pairs using business rules and neural classifiers, in particular treating out-of-domain but semantically similar sentences as hard-negative samples. Experiments show that the CCDC framework achieves both intra-domain and cross-domain enhancement. Moreover, with the Soft NLI Data Builder, the CCDC framework achieves the best results across all domains with a single model, improving the Spearman correlation coefficient by 34% and 11% over the baseline (BERT-base) and the strong baseline (unsupervised SimCSE), respectively. Empirical analysis further shows that the framework effectively reduces the anisotropy of pre-trained models and exhibits clearer semantic clustering than unsupervised SimCSE.
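To make the role of hard negatives concrete, the following is a minimal sketch of a SimCSE-style supervised contrastive objective, in which each anchor sentence is scored against every in-batch positive and every hard negative. It is an illustration under stated assumptions, not the CCDC implementation: the function names `cosine` and `contrastive_loss`, the toy 2-dimensional embeddings, and the temperature value 0.05 are all chosen here for the example and do not come from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, hard_negatives, temperature=0.05):
    """Mean cross-entropy over a batch of (anchor, positive, hard negative) triples.

    For anchor i, the target is positives[i]; every other positive and every
    hard negative in the batch acts as a negative. The temperature is an
    assumed value for this sketch.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits += [cosine(anchor, n) / temperature for n in hard_negatives]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): entailment-like positives lie close to their
# anchors, contradiction-like hard negatives point in the opposite direction.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]

well_aligned = contrastive_loss(anchors, positives, hard_negatives)
swapped = contrastive_loss(anchors, hard_negatives, positives)
print(well_aligned < swapped)
```

In this sketch the loss is small when each anchor is closest to its own positive and grows when a hard negative is mistaken for the positive, which is the pressure that pseudo-NLI entailment/contradiction pairs are meant to provide during training.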