Abstract: This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which has relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.

What problem does this paper attempt to address?

This paper addresses the overly short observation period of previous studies on the Open Ko-LLM Leaderboard by conducting an 11-month longitudinal study. It focuses on three aspects:

1. **Specific challenges in performance improvement across tasks**: The paper analyzes how performance on each leaderboard task changed over the 11-month period. The tasks are common-sense reasoning (Ko-HellaSwag), natural language understanding (Ko-ARC), multi-task language understanding and domain knowledge (Ko-MMLU), common-sense generation (Ko-CommonGen V2), and truthfulness assessment (Ko-TruthfulQA). From these trends, the researchers aim to identify which tasks have reached performance saturation and which remain difficult for developers.
2. **The impact of model size on task performance correlations**: The study examines how performance on the different benchmarks correlates for models of different scales. Models are divided into three groups: fewer than 3 billion parameters, between 3 billion and 7 billion, and between 7 billion and 14 billion. Comparing the groups shows how model capacity affects task performance and what scaling a model up contributes to overall results.
3. **Changes in leaderboard dynamics over time**: The paper examines how the leaderboard itself evolved, focusing on three points: how task performance correlations differ between the early months and the full 11-month period, how performance changed over time by model type, and how performance changed over time by model size. These analyses are meant to capture long-term trends in model performance and ranking dynamics.

Together, these research questions give a comprehensive picture of the development of Korean large language models (LLMs) and the evolution of the evaluation framework. By analyzing 1,769 models, the study reveals long-term trends and inherent challenges in model performance, providing guidance for future LLM development.
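The size-grouped correlation analysis described in the second aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the record layout (`params_b`, `scores`), the handling of models that sit exactly on the 3B and 7B boundaries, the use of Pearson correlation, and the toy scores are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# The five leaderboard tasks named in the summary.
TASKS = ["Ko-ARC", "Ko-HellaSwag", "Ko-MMLU", "Ko-TruthfulQA", "Ko-CommonGen V2"]


def size_bucket(params_b):
    """Map a parameter count (in billions) to one of the three size groups.

    Boundary handling (3B and 7B go to the larger group) is an assumption.
    """
    if params_b < 3:
        return "<3B"
    if params_b < 7:
        return "3B-7B"
    if params_b <= 14:
        return "7B-14B"
    return None  # outside the ranges named in the summary


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlations_by_size(models):
    """Return {size group: {(task_a, task_b): correlation}} over all task pairs."""
    groups = defaultdict(list)
    for model in models:
        bucket = size_bucket(model["params_b"])
        if bucket is not None:
            groups[bucket].append(model["scores"])
    return {
        bucket: {
            (a, b): pearson([row[a] for row in rows], [row[b] for row in rows])
            for a, b in combinations(TASKS, 2)
        }
        for bucket, rows in groups.items()
    }


# Made-up scores for illustration only; these are not leaderboard results.
toy_models = [
    {"params_b": 1.3, "scores": dict(zip(TASKS, [30, 40, 25, 45, 20]))},
    {"params_b": 2.7, "scores": dict(zip(TASKS, [35, 50, 27, 43, 30]))},
    {"params_b": 1.0, "scores": dict(zip(TASKS, [40, 60, 26, 41, 25]))},
]

toy_corr = correlations_by_size(toy_models)
print(toy_corr["<3B"][("Ko-ARC", "Ko-HellaSwag")])
```

Computing the task-pair correlations separately per group, rather than once over all models, is what lets the comparison show whether tasks that move together for small models also move together for larger ones.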

Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs

Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark

Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach

Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models

Exploring the Latest LLMs for Leaderboard Extraction

Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean

KMMLU: Measuring Massive Multitask Language Understanding in Korean

Assessing the Proficiency of LLMs with Various Tasks and Evaluators

When Young Scholars Cooperate with LLMs in Academic Tasks: The Influence of Individual Differences and Task Complexities

Exploring the LLM Journey from Cognition to Expression with Linear Representations

Online Continual Knowledge Learning for Language Models

LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

An Empirical Study on Challenges for LLM Application Developers

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models

LawBench: Benchmarking Legal Knowledge of Large Language Models

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Effective Context Selection in LLM-based Leaderboard Generation: An Empirical Study