Learning to Improve Out-of-Distribution Generalization Via Self-adaptive Language Masking

Shuoran Jiang, Youcheng Pan, Qingcai Chen, Yang Xiang, Xiangping Wu
DOI: https://doi.org/10.1109/taslp.2024.3394774
2024-01-01
Abstract: Although pre-trained Transformers learn general linguistic knowledge from large-scale corpora, they still overfit to lexical biases when fine-tuned on specific datasets. This problem limits the generalizability of pre-trained models, particularly when learning over out-of-distribution (OOD) data. To address this issue, this paper proposes a self-adaptive language masking (AdaLMask) paradigm for fine-tuning pre-trained Transformers. AdaLMask obviates lexical biases by eliminating the dependence on semantically inessential words. Specifically, AdaLMask learns a Gumbel-Softmax distribution to determine the desired masking positions, and the distribution parameters are optimized via a representation-invariant (RInv) objective to ensure that masking those positions is semantically lossless. Four natural language processing tasks are chosen to evaluate the proposed method's robustness to lexical biases and its OOD generalization. All empirical results demonstrate that the AdaLMask paradigm substantially improves the OOD generalization of pre-trained Transformers.
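The masking mechanism described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract gives no formulas, so the two-way (keep, mask) relaxation per position, the mean-pooling stand-in for a Transformer encoder, the squared-distance form of the representation-invariant loss, and every name below (`soft_mask`, `rinv_loss`, the toy embeddings and logits) are assumptions made for illustration only.

```python
import math
import random


def sample_gumbel(rng):
    # Gumbel(0, 1) noise by inverse transform; clamp u away from 0 and 1.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    # Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau).
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mask(embeddings, mask_logits, mask_vector, tau, rng):
    # For each position, draw relaxed (keep, mask) weights and blend the
    # token embedding with the [MASK] embedding accordingly (assumed scheme).
    masked, mask_weights = [], []
    for emb, logits in zip(embeddings, mask_logits):
        keep_w, mask_w = gumbel_softmax(logits, tau, rng)
        masked.append([keep_w * e + mask_w * m for e, m in zip(emb, mask_vector)])
        mask_weights.append(mask_w)
    return masked, mask_weights


def mean_pool(vectors):
    # Toy stand-in for an encoder: average the position vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rinv_loss(original, masked):
    # Assumed form of a representation-invariant objective: squared distance
    # between the pooled representations of the original and masked inputs.
    a, b = mean_pool(original), mean_pool(masked)
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
embeddings = [[0.9, 0.1], [0.0, 0.0], [0.2, 0.8]]       # hypothetical token embeddings
mask_vector = [0.0, 0.0]                                # hypothetical [MASK] embedding
mask_logits = [[2.0, -2.0], [-2.0, 2.0], [2.0, -2.0]]   # [keep, mask] logits per position
masked, mask_weights = soft_mask(embeddings, mask_logits, mask_vector, tau=0.5, rng=rng)
loss = rinv_loss(embeddings, masked)
```

In this toy setup the second token's embedding coincides with the mask embedding, so masking it leaves the pooled representation unchanged, which is the kind of "semantically lossless" position the objective is meant to favor. The sketch covers the forward pass only; learning the logits would require an automatic-differentiation framework so that gradients can flow through the relaxed weights.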