Since the Scientific Literature Is Multilingual, Our Models Should Be Too

Abteen Ebrahimi, Kenneth Church
2024-03-27
Abstract: English has long been assumed the $\textit{lingua franca}$ of scientific research, and this notion is reflected in the natural language processing (NLP) research involving scientific document representation. In this position piece, we quantitatively show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity. We provide evidence that text-based models fail to create meaningful representations for non-English papers and highlight the negative user-facing impacts of using English-only models indiscriminately across a multilingual domain. We end with suggestions for the NLP community on how to improve performance on non-English documents.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the over-reliance on English in current natural language processing (NLP) research on scientific document representation. English is widely assumed to be the lingua franca of science, yet the authors show quantitatively that the literature is largely multilingual, and that a substantial body of non-English work cannot be processed effectively by models that support only English. They argue that current models and benchmarks fail to reflect this linguistic diversity. The paper provides evidence that text-based models fail to create meaningful representations for non-English papers, and it highlights the negative user-facing impacts of applying English-only models indiscriminately across a multilingual domain. It closes with suggestions for how the NLP community can improve performance on non-English documents, in support of a more inclusive and diverse scientific research environment.