TOCOL: Improving Contextual Representation of Pre-trained Language Models Via Token-Level Contrastive Learning

Keheng Wang, Chuantao Yin, Rumei Li, Sirui Wang, Yunsen Xian, Wenge Rong, Zhang Xiong
DOI: https://doi.org/10.1007/s10994-023-06512-9
2023-01-01
Abstract: Self-attention, which allows transformers to capture deep bidirectional contexts, plays a vital role in BERT-like pre-trained language models (PLMs). However, the maximum likelihood pre-training objective of BERT may produce an anisotropic word embedding space. This leads to biased attention scores for high-frequency tokens, because they lie very close to each other in representation space and therefore have higher similarities. This bias may ultimately affect the encoding of global contextual information. To address this issue, we propose TOCOL, a TOken-Level COntrastive Learning framework for improving the contextual representation of pre-trained language models. TOCOL integrates a novel self-supervised objective into the attention mechanism to reshape the word representation space and encourage the PLM to capture the global semantics of sentences. Results on the GLUE benchmark show that TOCOL brings considerable improvement over the original BERT. Furthermore, we conduct a detailed analysis and demonstrate the robustness of our approach in low-resource scenarios.
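The anisotropy argument in the abstract can be illustrated with a toy measurement. The sketch below is not taken from the paper: it uses mean pairwise cosine similarity, a commonly used proxy for anisotropy, on synthetic vectors rather than real BERT embeddings. The names `mean_pairwise_cosine`, `isotropic`, `anisotropic`, and the `shift` value are assumptions chosen for illustration only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors.

    Values near 0 suggest directions are spread out (roughly isotropic);
    values near 1 suggest the vectors crowd into a narrow cone.
    """
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            count += 1
    return total / count


rng = random.Random(0)  # fixed seed so the toy data is reproducible
dim, n = 32, 50

# Synthetic stand-in for an isotropic space: zero-mean Gaussian vectors.
isotropic = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# Synthetic stand-in for an anisotropic space: the same vectors plus a
# shared offset, so every vector points in a similar dominant direction.
shift = 3.0  # hypothetical offset, not a value from the paper
anisotropic = [[x + shift for x in v] for v in isotropic]

iso_score = mean_pairwise_cosine(isotropic)
aniso_score = mean_pairwise_cosine(anisotropic)

print(f"isotropic mean cosine:   {iso_score:.3f}")
print(f"anisotropic mean cosine: {aniso_score:.3f}")
```

In this toy setting the shifted vectors are far more similar to one another than the unshifted ones. That mirrors the concern raised in the abstract: when token representations cluster together, similarity-based attention scores among them can be inflated.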