LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

LLM-jp, Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, Atsushi Keyaki, Keisuke Kiryu, Hirokazu Kiyomaru, Takashi Kodama, Takahiro Kubo, Yohei Kuga, Ryoma Kumon, Shuhei Kurita, Sadao Kurohashi, Conglong Li, Taiki Maekawa, Hiroshi Matsuda, Yusuke Miyao, Kentaro Mizuki, Sakae Mizuki, Yugo Murawaki, Ryo Nakamura, Taishi Nakamura, Kouta Nakayama, Tomoka Nakazato, Takuro Niitsuma, Jiro Nishitoba, Yusuke Oda, Hayato Ogawa, Takumi Okamoto, Naoaki Okazaki, Yohei Oseki, Shintaro Ozaki, Koki Ryu, Rafal Rzepka, Keisuke Sakaguchi, Shota Sasaki, Satoshi Sekine, Kohei Suda, Saku Sugawara, Issa Sugiura, Hiroaki Sugiyama, Hisami Suzuki, Jun Suzuki, Toyotaro Suzumura, Kensuke Tachibana, Yu Takagi, Kyosuke Takami, Koichi Takeda, Masashi Takeshita, Masahiro Tanaka, Kenjiro Taura, Arseny Tolmachev, Nobuhiro Ueda, Zhen Wan, Shuntaro Yada, Sakiko Yahata, Yuya Yamamoto, Yusuke Yamauchi, Hitomi Yanaka, Rio Yokota, Koichiro Yoshino
2024-07-04
Abstract: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper introduces LLM-jp, a cross-organizational project to research and develop open-source Japanese large language models (LLMs). The project targets several problems in current LLM development: high computational resource requirements, control by a few large institutions, a lack of transparency about model details, and concerns about security and social acceptance. The paper also raises concerns specific to Japan: Japanese is poorly represented in the GPT-3 dataset, which may marginalize Japanese culture and activities, and relying on foreign models may lead to a loss of knowledge.

LLM-jp started in May 2023 and involves more than 1,500 participants from academia and industry. The project takes a fully transparent approach, openly sharing its models, corpora, training data, and other resources. It is organized into multiple working groups, including corpus construction, model building, fine-tuning and evaluation, and computational infrastructure. So far, the project has released two versions of its model suite and developed a multilingual tokenizer.

The paper describes in detail the process from corpus construction to model pre-training, including document filtering and transformation and tokenization with SentencePiece and MeCab. For the v1.0 and v2.0 pre-trained models, the project built corpora of different scale and quality and applied optimizations specific to each. The project is also developing a corpus search function to identify the sources of generated text and to analyze how LLMs work. Future work includes building a larger 175B-parameter model and a richer corpus, exploring best practices for language mixture ratios and corpus size, and sharing experience in operating GPU clusters with other projects.
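
To make the document-filtering step mentioned above more concrete, here is a minimal Python sketch of the kind of filtering a Japanese corpus pipeline might apply. It is not LLM-jp's actual pipeline: the function names (`japanese_ratio`, `filter_documents`), the thresholds, and the three rules (minimum length, minimum share of Japanese characters, exact-duplicate removal) are all assumptions chosen for illustration. The character test is also a rough heuristic, since CJK ideographs are shared with Chinese.

```python
import hashlib
import unicodedata


def japanese_ratio(text: str) -> float:
    """Share of non-whitespace characters that are kana or CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c: str) -> bool:
        # U+3040-U+30FF: hiragana and katakana blocks.
        # U+4E00-U+9FFF: CJK Unified Ideographs (also used in Chinese).
        return ("\u3040" <= c <= "\u30ff") or ("\u4e00" <= c <= "\u9fff")

    return sum(is_japanese(c) for c in chars) / len(chars)


def filter_documents(docs, min_ratio=0.5, min_chars=10):
    """Keep documents that are long enough, mostly Japanese, and not exact duplicates.

    The thresholds are hypothetical defaults, not values from the paper.
    """
    seen = set()
    kept = []
    for doc in docs:
        # NFKC folds half-width and full-width variants into one form.
        text = unicodedata.normalize("NFKC", doc).strip()
        if len(text) < min_chars:
            continue  # too short to be useful
        if japanese_ratio(text) < min_ratio:
            continue  # mostly non-Japanese content
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "これは日本語の文書です。テストに使います。",
        "This is an English document only.",
        "これは日本語の文書です。テストに使います。",  # duplicate
        "短い",  # too short
    ]
    for doc in filter_documents(sample):
        print(doc)
```

A production pipeline would typically add near-duplicate detection and quality scoring on top of rules like these; the sketch only shows the basic shape of rule-based filtering before tokenization.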