Switch Point biased Self-Training: Re-purposing Pretrained Models for Code-Switching

Parul Chopra, Sai Krishna Rallabandi, Alan W Black, Khyathi Raghavi Chandu
DOI: https://doi.org/10.48550/arXiv.2111.01231
2021-11-02
Abstract: Code-switching (CS), a ubiquitous phenomenon due to the ease of communication it offers in multilingual communities, remains an understudied problem in language processing. The primary reasons are: (1) minimal effort in leveraging large pretrained multilingual models, and (2) the lack of annotated data. What distinguishes CS as a case where multilingual models perform poorly is the intra-sentence mixing of languages, which gives rise to switch points. We first benchmark two sequence labeling tasks, POS and NER, on 4 different language pairs with a suite of pretrained models to identify the problems and select the best-performing model among them, char-BERT (addressing (1)). We then propose a self-training method that repurposes existing pretrained models using a switch-point bias and leverages unannotated data (addressing (2)). Finally, we demonstrate that our approach performs well on both tasks, narrowing the gap in switch-point performance while retaining overall performance on two distinct language pairs. Our code is available at https://github.com/PC09/EMNLP2021-Switch-Point-biased-Self-Training.
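To make the notion of a switch point concrete, here is a minimal Python sketch. It assumes a token counts as a switch point when its language tag differs from that of the preceding token. That definition, the function name `switch_points`, and the Hindi-English example sentence are illustrative assumptions, not taken from the paper or its released code.

```python
from typing import List


def switch_points(lang_tags: List[str]) -> List[int]:
    """Return indices of tokens whose language tag differs from the previous token's.

    Assumed definition for illustration only; the paper may define
    switch points differently (e.g. by handling language-neutral tokens).
    """
    return [
        i
        for i in range(1, len(lang_tags))
        if lang_tags[i] != lang_tags[i - 1]
    ]


# Hypothetical Hindi-English code-switched sentence with per-token language tags.
tokens = ["I", "am", "going", "ghar", "abhi", "with", "friends"]
tags = ["en", "en", "en", "hi", "hi", "en", "en"]

for i in switch_points(tags):
    print(i, tokens[i])
```

Under this assumed definition, the example has switches at "ghar" (English to Hindi) and "with" (Hindi to English). These are the positions where, according to the abstract, multilingual models tend to underperform.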
Computation and Language