PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency

Preferred Elements, Kenshin Abe, Kaizaburo Chubachi, Yasuhiro Fujita, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Hiroyoshi Komatsu, Hiroaki Mikami, Tsuguo Mogami, Shogo Murai, Kosuke Nakago, Daisuke Nishino, Toru Ogawa, Daisuke Okanohara, Yoshihiko Ozaki, Shotaro Sano, Shuji Suzuki, Tianqi Xu, Toshihiko Yanase
2024-10-22
Abstract: We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch on 2 trillion tokens, using techniques such as QK Normalization and Z-Loss to keep training stable. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model's performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4. The base model is available at https://huggingface.co/pfnet/plamo-100b.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to develop a large-scale language model, PLaMo-100B, designed specifically for Japanese proficiency. The paper focuses on the following aspects:

1. **Improving Japanese processing ability**: PLaMo-100B is optimized for Japanese, with the aim of improving on existing large language models in Japanese-specific tasks.
2. **Training from scratch**: Unlike many models that are fine-tuned from the weights of existing models, PLaMo-100B is trained from scratch on 2 trillion tokens, of which 1.5 trillion are used for initial pre-training and 0.5 trillion for continued pre-training. This is intended to let the model adapt to the requirements of both Japanese and English tasks.
3. **Introducing advanced training techniques**: To keep training stable at this scale, the paper adopts techniques such as QK Normalization and Z-Loss.
4. **Post-training optimization**: Post-training techniques such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) further improve the model's performance. In particular, the paper describes in detail how high-quality training data is generated to strengthen the model across tasks.
5. **Evaluating model performance**: The paper evaluates PLaMo-100B on multiple benchmarks, including Jaster, Japanese MT-Bench, and the Rakuda Benchmark. The results show that it is competitive in both Japanese and English tasks and performs especially well in Japanese tasks, even outperforming GPT-4 on some benchmarks.

In summary, the main objective of this paper is to develop a high-performance Japanese language model. It pursues this by training from scratch and combining training-stability techniques with post-training optimization, targeting strong performance in Japanese-specific tasks.
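The Z-Loss and Direct Preference Optimization objectives named above can be illustrated with a minimal Python sketch. It assumes the commonly used formulations: an auxiliary penalty on the squared log-partition function for Z-Loss, and the log-sigmoid preference loss for DPO. The function names, the `z_coeff` and `beta` defaults, and the toy inputs are illustrative assumptions, not values or code taken from the paper.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Cross-entropy plus an auxiliary z-loss term.

    The z-loss penalizes (log Z)^2, where Z is the softmax normalizer,
    which discourages the logits from drifting to large magnitudes.
    z_coeff is an assumed illustrative coefficient.
    Returns (total, cross_entropy, z_loss).
    """
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2
    return ce + z_loss, ce, z_loss


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))
    beta is an assumed illustrative value.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Shifting all logits by a constant leaves cross-entropy unchanged
    # but increases the z-loss penalty.
    print(cross_entropy_with_z_loss([0.0, 0.0], target=0))
    print(cross_entropy_with_z_loss([10.0, 10.0], target=0))

    # A policy that favors the chosen answer more than the reference does
    # gets a lower DPO loss than one that matches the reference.
    print(dpo_loss(-5.0, -9.0, -6.0, -6.0))
    print(dpo_loss(-6.0, -6.0, -6.0, -6.0))
```

The first pair of calls shows why a z-loss term can help stability: softmax cross-entropy is invariant to a uniform shift of the logits, so only the auxiliary term pushes the normalizer back toward a moderate range. The DPO function takes summed token log-probabilities for the chosen and rejected responses under the policy and a frozen reference model; how those are computed is outside the scope of this sketch.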