Sinkhorn Distance Minimization for Knowledge Distillation
Xiao Cui, Yulei Qin, Yuting Gao, Enwei Zhang, Zihan Xu, Tong Wu, Ke Li, Xing Sun, Wengang Zhou, Houqiang Li
DOI: https://doi.org/10.1109/tnnls.2024.3501335
IF: 14.255
2024-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract: Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures, including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this paper, we show that the KL, RKL, and JS divergences suffer from mode-averaging, mode-collapsing, and mode-underestimation, respectively, which degrade logits-based KD across diverse NLP tasks. We propose Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to provide a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, by exploiting properties of the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the perception of divergence to each individual teacher-student sample pair. Instead, we propose a batch-wise reformulation that captures the geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights the superiority of our method over state-of-the-art methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
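As a rough illustration of the batch-wise idea described in the abstract, the sketch below computes an entropy-regularized optimal transport (Sinkhorn) cost between a batch of student outputs and a batch of teacher outputs. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the L1 cost between softmax outputs, the uniform weights over batch samples, and the values of `eps` and `n_iters` are all illustrative choices that the abstract does not specify.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Standard Sinkhorn iterations for entropy-regularized transport.

    cost: (n, m) cost matrix; a: (n,) and b: (m,) marginal weights.
    eps and n_iters are hypothetical defaults chosen for this example.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def batch_sinkhorn_distance(student_logits, teacher_logits, eps=0.1, n_iters=200):
    """Transport cost between all student and all teacher samples in a batch.

    Assumption: the cost between student sample i and teacher sample j is
    the L1 distance between their softmax outputs, and every sample in the
    batch carries equal weight.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # (n, n) pairwise L1 costs via broadcasting over the class axis
    cost = np.abs(p_s[:, None, :] - p_t[None, :, :]).sum(axis=2)
    n = cost.shape[0]
    a = np.full(n, 1.0 / n)
    b = np.full(n, 1.0 / n)
    plan = sinkhorn_plan(cost, a, b, eps=eps, n_iters=n_iters)
    return float((plan * cost).sum()), plan


teacher = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]])
d_same, plan_same = batch_sinkhorn_distance(teacher, teacher)
d_diff, plan_diff = batch_sinkhorn_distance(np.zeros((3, 3)), teacher)
```

In this toy batch, a student that matches the teacher yields a transport cost close to zero, while a student with uniform outputs yields a clearly larger cost. A smaller `eps` brings the result closer to unregularized optimal transport but can underflow in the kernel; a training setup would also need an automatic-differentiation framework rather than plain NumPy.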