LLM-jp, Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, Atsushi Keyaki, Keisuke Kiryu, Hirokazu Kiyomaru, Takashi Kodama, Takahiro Kubo, Yohei Kuga, Ryoma Kumon, Shuhei Kurita, Sadao Kurohashi, Conglong Li, Taiki Maekawa, Hiroshi Matsuda, Yusuke Miyao, Kentaro Mizuki, Sakae Mizuki, Yugo Murawaki, Ryo Nakamura, Taishi Nakamura, Kouta Nakayama, Tomoka Nakazato, Takuro Niitsuma, Jiro Nishitoba, Yusuke Oda, Hayato Ogawa, Takumi Okamoto, Naoaki Okazaki, Yohei Oseki, Shintaro Ozaki, Koki Ryu, Rafal Rzepka, Keisuke Sakaguchi, Shota Sasaki, Satoshi Sekine, Kohei Suda, Saku Sugawara, Issa Sugiura, Hiroaki Sugiyama, Hisami Suzuki, Jun Suzuki, Toyotaro Suzumura, Kensuke Tachibana, Yu Takagi, Kyosuke Takami, Koichi Takeda, Masashi Takeshita, Masahiro Tanaka, Kenjiro Taura, Arseny Tolmachev, Nobuhiro Ueda, Zhen Wan, Shuntaro Yada, Sakiko Yahata, Yuya Yamamoto, Yusuke Yamauchi, Hitomi Yanaka, Rio Yokota, Koichiro Yoshino

Abstract: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.

What problem does this paper attempt to address?

This paper introduces LLM-jp, a cross-organizational project to research and develop open-source Japanese large language models (LLMs). The project targets several problems in current LLM development: the high computational cost of training, control concentrated in a few large institutions, limited transparency about model details, and concerns about security and social acceptance. The paper also raises concerns specific to Japan: Japanese text is poorly represented in the GPT-3 dataset, which may marginalize Japanese culture and activities, and reliance on foreign models risks a loss of domestic knowledge.

LLM-jp started in May 2023 and involves more than 1,500 participants from academia and industry. The project takes a fully transparent approach, openly sharing its models, corpora, training data, and related resources. It is organized into working groups covering corpus construction, model building, fine-tuning and evaluation, and computational infrastructure. So far the project has released two versions of its model suite and developed a multilingual tokenizer.

The paper describes the process from corpus construction to model pre-training in detail, including document filtering and transformation and tokenization with SentencePiece and MeCab. For the v1.0 and v2.0 pre-trained models, the project built corpora of different scale and quality and applied optimizations specific to each version. It is also developing a corpus search function to identify the sources of generated text and to analyze how LLMs work.

Future work includes building a larger 175B-parameter model and a richer corpus, exploring best practices for language mixture ratio and corpus size, and sharing GPU cluster operation experience with other projects.
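As a rough illustration of the kind of document filtering mentioned in the summary above, the following is a minimal sketch and not the project's actual pipeline. The character ranges, the thresholds `min_ratio` and `min_chars`, and the function names are all hypothetical choices made for this example; exact-match deduplication by hash is likewise an assumption.

```python
import hashlib
import re

# Hiragana, katakana, and common CJK ideographs (illustrative ranges only;
# an assumption for this sketch, not the project's definition of Japanese text).
JAPANESE_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def japanese_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the ranges above."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if JAPANESE_CHARS.match(c)) / len(chars)


def filter_documents(docs, min_ratio=0.5, min_chars=10):
    """Keep documents that are long enough, mostly Japanese, and not exact duplicates.

    min_ratio and min_chars are hypothetical thresholds chosen for illustration.
    """
    seen = set()
    kept = []
    for doc in docs:
        text = doc.strip()
        if len(text) < min_chars:
            continue  # too short to be useful
        if japanese_ratio(text) < min_ratio:
            continue  # mostly non-Japanese content
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(text)
    return kept
```

Calling `filter_documents` on a list of raw strings returns the surviving documents in their original order. A production pipeline would likely add near-duplicate detection and quality heuristics, which this sketch deliberately omits.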

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
