H2O-Danube3 Technical Report

Pascal Pfeiffer, Philipp Singer, Yauhen Babakhin, Gabor Fodor, Nischay Dhankhar, Sri Satish Ambati

2024-07-12

Abstract: We present H2O-Danube3, a series of small language models consisting of H2O-Danube3-4B, trained on 6T tokens, and H2O-Danube3-500M, trained on 4T tokens. Our models are pre-trained on high-quality Web data consisting primarily of English tokens in three stages with different data mixes before final supervised tuning for the chat version. The models exhibit highly competitive metrics across a multitude of academic, chat, and fine-tuning benchmarks. Thanks to its compact architecture, H2O-Danube3 can be efficiently run on a modern smartphone, enabling local inference and rapid processing capabilities even on mobile devices. We make all models openly available under the Apache 2.0 license, further democratizing LLMs to a wider audience economically.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper introduces H2O-Danube3, a series of small language models comprising H2O-Danube3-4B, trained on 6T tokens, and H2O-Danube3-500M, trained on 4T tokens. Both are pre-trained in three stages with different data mixes on high-quality, primarily English Web data, and are then supervised fine-tuned to produce the chat versions. The models are highly competitive on academic, chat, and fine-tuning benchmarks, and their compact architecture lets them run efficiently on modern smartphones, supporting fast on-device inference. The work extends earlier research on small language models, with a focus on efficient inference and edge-device applications. After task-specific fine-tuning, these small models even outperform certain BERT-based encoder models on tasks such as sequence classification, question answering, and token classification.

The paper describes the model architecture, training process, and fine-tuning steps in detail, and reports extensive evaluations covering standard academic metrics, chat benchmarks, and fine-tuning benchmarks. The results show that H2O-Danube3 performs strongly across these dimensions, widening the choice of open-source small language models. The paper also introduces the iOS application H2O AI Personal GPT, which lets users run H2O-Danube3 offline on their phones. The model is additionally quantized to reduce its size while maintaining performance, making it suitable for resource-constrained devices. Through these efforts, the paper aims to further democratize language models, serving a wider audience economically in scenarios such as chatbots, task-specific applications, research, and offline on-device use.
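To make the point about quantization and resource-constrained devices concrete, the arithmetic behind weight storage can be sketched as follows. This is a rough back-of-the-envelope estimate, not a figure from the paper: the parameter counts are the nominal sizes suggested by the model names (4B and 500M), the bit widths are illustrative assumptions, and the helper names are hypothetical. It counts the weights only and ignores activations, the KV cache, and any per-block quantization metadata.

```python
# Rough estimate of weight storage at different precisions.
# Assumptions (not from the paper): nominal parameter counts taken from
# the model names, and illustrative bit widths of 16, 8, and 4.

NOMINAL_PARAMS = {
    "H2O-Danube3-4B": 4e9,    # assumed nominal count; the real count may differ
    "H2O-Danube3-500M": 5e8,  # assumed nominal count; the real count may differ
}
BIT_WIDTHS = (16, 8, 4)


def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


def footprint_table() -> dict:
    """Map each model name to {bit width: footprint in GiB, rounded to 2 places}."""
    return {
        name: {bits: round(weight_footprint_gib(n, bits), 2) for bits in BIT_WIDTHS}
        for name, n in NOMINAL_PARAMS.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        cells = ", ".join(f"{bits}-bit: {gib} GiB" for bits, gib in row.items())
        print(f"{name}: {cells}")
```

Under these assumptions, halving the bits per weight halves the storage, so moving from 16-bit to 4-bit weights cuts the footprint to a quarter, which is the kind of reduction that makes on-phone inference plausible.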

H2O-Danube-1.8B Technical Report

H2O Open Ecosystem for State-of-the-art Large Language Models

H2OVL-Mississippi Vision Language Models Technical Report

h2oGPT: Democratizing Large Language Models

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

DataComp-LM: In search of the next generation of training sets for language models

Zyda: A 1.3T Dataset for Open Language Modeling

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Zyda-2: a 5 Trillion Token High-Quality Dataset

Jellyfish: A Large Language Model for Data Preprocessing

The Llama 3 Herd of Models

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

Unlocking the Potential: Benchmarking Large Language Models in Water Engineering and Research

DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model

Herd: Using multiple, smaller LLMs to match the performances of proprietary, large LLMs via an intelligent composer

LLM360: Towards Fully Transparent Open-Source LLMs

Aquila2 Technical Report

Mark My Words: Analyzing and Evaluating Language Model Watermarks