Danoliteracy of Generative, Large Language Models

Søren Vejlgaard Holm, Lars Kai Hansen, Martin Carsten Nielsen
2024-10-30
Abstract: The language technology moonshot moment of Generative, Large Language Models (GLLMs) was not limited to English: These models brought a surge of technological applications, investments and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were until recently difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark is found to produce a robust ranking that correlates to human feedback at $\rho \sim 0.8$ with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaption.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the challenge of evaluating generative large language models (GLLMs) in low-resource languages such as Danish. Specifically, it proposes a benchmark named "Danoliteracy" to evaluate GLLM competence in Danish language and culture. The problems can be summarized as follows:

1. **Lack of applicable evaluation corpora**: In low-resource languages such as Danish, applicable evaluation corpora have been scarce, so the capabilities of GLLMs could be shown only through qualitative demonstrations rather than verified quantitatively.
2. **Standardization of evaluation methods**: Without a standardized evaluation framework, researchers and practitioners have no common yardstick for different models, so results are difficult to compare and verify.
3. **Consistency of model performance**: It is unclear whether GLLM performance is consistent across tasks and scenarios, that is, whether a model that performs well on one task will also perform well on others.
4. **Specificity of low-resource languages**: Evaluation in a low-resource language has to account for that language's linguistic and cultural context, which sets it apart from evaluation in high-resource languages such as English.

### Solutions

To address these problems, the paper proposes the following:

1. **Construct an evaluation benchmark**: A benchmark of eight different scenarios, covering practical applications such as Danish citizenship tests and social media question answering.
2. **Multi-dimensional evaluation**: Multiple evaluation metrics, including accuracy and similarity scores, assess model performance from several perspectives.
3. **Human-feedback survey**: Feedback collected from Danish-speaking users is used to validate the benchmark and to provide a complementary evaluation perspective.
4. **Factor analysis**: Factor analysis is used to explore the consistency of model performance. One main factor explains most of the variation in scenario performance, indicating a "Danoliteracy g-factor".

### Main contributions

1. **Open-source evaluation benchmark**: An open-source evaluation framework and a real-time leaderboard website for researchers and practitioners to use and verify.
2. **New evaluation datasets**: A series of new Danish-language evaluation datasets, enriching the evaluation resources available for low-resource languages.
3. **Evidence of model performance**: Evidence that the GPT-4 and Claude Opus models achieve the highest rankings in Danish, outperforming the other closed-source and open-source models evaluated.
4. **Danoliteracy g-factor**: Preliminary evidence for a "Danoliteracy g-factor", analogous to the g-factor of human intelligence, which helps guide practitioners' expectations of model performance across tasks.

### Conclusion

By constructing and validating the "Danoliteracy" benchmark, the paper addresses the evaluation problems of GLLMs in low-resource languages such as Danish and provides reference results and tools for future research and applications.
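
The two quantitative claims in the abstract, a rank correlation with human feedback and a single factor explaining most scenario variance, can be illustrated with a small sketch. It rests on several assumptions: the data are synthetic, $\rho$ is taken to be a Spearman rank correlation, and principal component analysis of the scenario correlation matrix stands in for the paper's factor analysis, whose exact method this summary does not specify. The helper names `spearman_rho` and `first_factor_share` are hypothetical.

```python
import numpy as np


def spearman_rho(x, y):
    """Spearman rank correlation; this simple version assumes no tied values."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


def first_factor_share(scores):
    """Share of total variance captured by the first principal component.

    scores: array of shape (n_models, n_scenarios).
    PCA on the correlation matrix is used here as a stand-in for a
    factor analysis (an assumption, not the paper's stated method).
    """
    corr = np.corrcoef(scores, rowvar=False)   # scenario-by-scenario correlations
    eigvals = np.linalg.eigvalsh(corr)         # eigenvalues in ascending order
    return float(eigvals[-1] / eigvals.sum())


# Synthetic data (made up for illustration): 12 models, 8 scenarios.
rng = np.random.default_rng(0)
ability = rng.normal(size=12)                  # one latent "g"-like ability per model
loadings = rng.uniform(0.8, 1.0, size=8)       # how strongly each scenario reflects it
scores = np.outer(ability, loadings) + 0.1 * rng.normal(size=(12, 8))
human_feedback = ability + 0.5 * rng.normal(size=12)

benchmark_index = scores.mean(axis=1)          # simple aggregate score per model
share = first_factor_share(scores)
rho = spearman_rho(benchmark_index, human_feedback)

print(f"variance share of first component: {share:.2f}")
print(f"Spearman rho vs. human feedback:   {rho:.2f}")
```

Because the synthetic scores are generated from a single latent ability, the first component dominates by construction. On real benchmark results, a similarly large share is what would suggest a common factor.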