Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

Chanjun Park, Hyeonwoo Kim
2024-09-05
Abstract: This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which has relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the overly short observation period of previous studies on the Open Ko-LLM Leaderboard by conducting an 11-month longitudinal study. It focuses on three aspects:

1. **Specific challenges in performance improvement for different tasks**: The paper analyzes how performance on each Open Ko-LLM Leaderboard task changed over the 11-month period. The tasks are common-sense reasoning (Ko-HellaSwag), natural language understanding (Ko-ARC), multi-task language understanding and domain knowledge (Ko-MMLU), common-sense generation (Ko-CommonGen V2), and truthfulness assessment (Ko-TruthfulQA). From these trends, the researchers aim to identify which tasks are most challenging for developers, which have reached performance saturation, and which remain significantly difficult.
2. **The impact of model size on the correlation of task performance**: The study examines how performance on the different benchmarks correlates for models of different scales. The models are divided into three categories by parameter count: fewer than 3 billion, between 3 billion and 7 billion, and between 7 billion and 14 billion. By comparing task performance within each category, the researchers aim to understand how model capacity affects task performance and how scaling affects overall results.
3. **Changes in leaderboard dynamics over time**: The paper examines how leaderboard dynamics shifted over time, focusing on three points: how task performance correlations differ between the early months and the entire 11-month period, how performance changed over time by model type, and how performance changed by model size. These analyses are intended to capture long-term trends in model performance and ranking dynamics.

Together, these research questions give a comprehensive view of the development of Korean large language models (LLMs) and the evolution of the evaluation framework. By analyzing 1,769 models, the study reveals long-term trends and inherent challenges in model performance, providing guidance for future LLM development.
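The size-based correlation analysis described in the second aspect can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout (`params_b`, `scores`), the function names, the use of Pearson correlation, and the handling of the 3B and 7B boundaries are all assumptions made for illustration, since the summary does not state them.

```python
from collections import defaultdict
from math import sqrt

# Task names as listed in the summary above.
TASKS = ["Ko-ARC", "Ko-HellaSwag", "Ko-MMLU", "Ko-TruthfulQA", "Ko-CommonGen V2"]


def size_bucket(params_b):
    """Map a parameter count (in billions) to one of the three size groups.

    Assumption: boundary values (exactly 3B or 7B) go to the larger group;
    the summary does not say how boundaries are handled.
    """
    if params_b < 3:
        return "<3B"
    if params_b < 7:
        return "3B-7B"
    if params_b <= 14:
        return "7B-14B"
    return None  # outside the ranges described in the summary


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (assumed metric)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant scores
    return cov / sqrt(var_x * var_y)


def task_correlations_by_size(models):
    """Return {bucket: {(task_a, task_b): correlation}} for each size group.

    `models` is a list of hypothetical records of the form
    {"params_b": float, "scores": {task_name: score}}.
    """
    groups = defaultdict(list)
    for model in models:
        bucket = size_bucket(model["params_b"])
        if bucket is not None:
            groups[bucket].append(model["scores"])

    result = {}
    for bucket, rows in groups.items():
        if len(rows) < 2:
            continue  # need at least two models to compute a correlation
        result[bucket] = {
            (a, b): pearson([r[a] for r in rows], [r[b] for r in rows])
            for i, a in enumerate(TASKS)
            for b in TASKS[i + 1:]
        }
    return result
```

Calling `task_correlations_by_size` on a list of model records yields one table of pairwise task correlations per size group, which is the kind of comparison the second research question calls for. Running the same function on only the early-month submissions and then on the full period would give the comparison described in the third aspect.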