Lost in Translation: Large Language Models in Non-English Content Analysis

Gabriel Nicholas, Aliya Bhatia
DOI: https://doi.org/10.48550/arXiv.2306.07377
2023-06-12
Computation and Language
Abstract: In recent years, large language models (e.g., OpenAI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.
What problem does this paper attempt to address?
The paper examines how well large language models (LLMs) work for analyzing non-English content, and where they fall short. The internet is a major source of information, economic opportunity, and social interaction, yet the automated systems that mediate it (such as chatbots, content moderation systems, and search engines) work far better in English than in the world's roughly 7,000 other languages. Although LLMs have become the dominant approach for building systems that analyze and generate language online, they are designed primarily for English. To extend them to other languages, researchers and technology companies have built multilingual language models, which are trained on text from dozens or even hundreds of languages at once. In theory, such models can infer connections between languages and use what they learn from high-resource languages (like English) to improve performance in low-resource ones. In practice, they still face several problems:

1. **Data Imbalance**: Most multilingual models are still trained mostly on English text, so they can carry English-language values and assumptions into other language contexts where those may not apply.
2. **Performance Disparities**: Because the amount of available data differs so widely, multilingual models perform better in high-resource languages and worse in low-resource ones.
3. **Erroneous Translations**: Machine-translated text used to fill data gaps can carry translation errors into the model, further degrading how it handles those languages.
4. **Difficulty in Debugging**: When a multilingual model fails, the non-intuitive connections it draws between languages make the problem harder to identify, diagnose, and fix.
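One way to make the performance disparity described above visible is to report accuracy per language instead of as a single aggregate number. The following is a minimal sketch, not a method taken from the paper: the function name `accuracy_by_language`, the record layout, and the sample records are all illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for (language, gold, predicted) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, gold, predicted in records:
        total[language] += 1
        if gold == predicted:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


# Made-up records purely for illustration; these are not results from the paper.
records = [
    ("en", "harmful", "harmful"),
    ("en", "benign", "benign"),
    ("en", "benign", "benign"),
    ("en", "harmful", "benign"),
    ("sw", "harmful", "benign"),
    ("sw", "benign", "benign"),
]

per_language = accuracy_by_language(records)
# Aggregate accuracy over all records, regardless of language.
overall = sum(gold == predicted for _, gold, predicted in records) / len(records)
print(per_language, overall)
```

Because the higher-resource language usually contributes more evaluation examples, the aggregate score can look acceptable while the lower-resource language lags well behind it, which is why the per-language breakdown matters.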
Moreover, the widespread use of large language models in content analysis raises concerns about potential biases and misjudgments, especially when these models are used for high-stakes decisions, such as determining immigration status or making critical healthcare decisions. Therefore, the paper calls for companies to maintain transparency when deploying these models and suggests that researchers and governments take measures to mitigate the impact of multilingual models on users of low-resource languages.