Abstract: Contrastive learning has proven effective for unsupervised sentence representation learning. Given a sentence, a positive pair is obtained by passing the sentence through the encoder twice with different dropout masks, and negative pairs are obtained by taking other sentences in the same mini-batch. However, this method suffers from surface structure bias: sentences with similar surface structures are regarded as close in semantics, while sentences with dissimilar surface structures are viewed as semantically distinct. As a result, a paraphrase whose surface structure differs from the original sentence receives a lower semantic similarity score than the same sentence with a negative word inserted. In this paper, we first verify the bias by collecting a sentence transformation test set. We then systematically probe existing models by proposing novel splits of benchmark datasets according to semantic similarity and surface structure similarity. We tackle the bias in two ways: balancing the learning target by augmenting the training data with examples that counter the bias, and preserving word semantics by using a recall loss to prevent catastrophic forgetting. We evaluate our model on standard semantic textual similarity (STS) tasks using different pre-trained backbones and achieve state-of-the-art average performance across the STS benchmarks. In particular, our models fine-tuned from RoBERTa base and RoBERTa large achieve significantly better performance on most benchmark datasets.
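The contrastive setup described in the abstract (two dropout-perturbed encodings of the same sentence as a positive pair, other sentences in the mini-batch as negatives) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy two-dimensional embeddings, and the temperature value of 0.05 are all assumptions made for illustration, and the augmentation and recall loss are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(view_a, view_b, temperature=0.05):
    """Mean InfoNCE-style loss over a mini-batch (hypothetical helper).

    view_a[i] and view_b[i] stand for two encodings of sentence i
    (for example, produced with different dropout masks). For anchor i,
    view_b[i] is the positive and every other view_b[j] is a negative.
    The temperature default is an assumed value, not taken from the source.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings standing in for encoder outputs (assumed values).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = in_batch_contrastive_loss(view_a, view_b)
shuffled = in_batch_contrastive_loss(view_a, view_b[::-1])
```

In this toy batch the loss is small when each anchor is paired with its own second view and large when the pairs are swapped, which is the behavior the training objective rewards. Because the score depends only on embedding similarity, nothing in this objective by itself distinguishes semantic closeness from surface closeness.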

OssCSE: Overcoming Surface Structure Bias in Contrastive Learning for Unsupervised Sentence Embedding

Enhancing Out-of-Domain Detection for Speech Spoofing Countermeasure Via Supervised Contrastive Learning

A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models

Rethinking Masked Language Modeling for Chinese Spelling Correction

Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning

Rich Semantic Knowledge Enhanced Large Language Models for Few-shot Chinese Spell Checking

Improve Chinese Spelling Check by Reevaluation

Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction

Visual and Phonological Feature Enhanced Siamese BERT for Chinese Spelling Error Correction

Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models

C-LLM: Learn to Check Chinese Spelling Errors Character by Character

Investigating Glyph Phonetic Information for Chinese Spell Checking: What Works and What's Next

Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction

Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization

PSDSpell: Pre-Training with Self-Distillation Learning for Chinese Spelling Correction

SDCL: Self-Distillation Contrastive Learning for Chinese Spell Checking.

Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking

Improving Chinese Spelling Correction by Ranking.

SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check