RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz Gustavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Freitas

2024-08-28

Abstract: We introduce RecurrentGemma, a family of open language models which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-size state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction-tuned variants for both. Our models achieve comparable performance to similarly sized Gemma baselines despite being trained on fewer tokens.
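To make the idea of a fixed-size state concrete, here is a minimal sketch of a generic gated linear recurrence in plain Python. It is an illustration only and not the Griffin recurrent layer: the function name `gated_linear_recurrence`, the update rule `h_t = a_t * h_{t-1} + (1 - a_t) * x_t`, and the constant gate values are all assumptions chosen for clarity.

```python
def gated_linear_recurrence(xs, gates):
    """Fold a sequence into one state vector, element by element.

    xs:    list of input vectors, one per time step (each a list of floats)
    gates: list of gate vectors with entries in [0, 1], same shape as xs

    Hypothetical update rule (an assumption for illustration):
        h_t = a_t * h_{t-1} + (1 - a_t) * x_t
    """
    # The state has the width of one input vector and never grows.
    h = [0.0] * len(xs[0]) if xs else []
    for x, a in zip(xs, gates):
        h = [a_i * h_i + (1.0 - a_i) * x_i for a_i, h_i, x_i in zip(a, h, x)]
    return h


if __name__ == "__main__":
    # The state size is the same for a short and a long input sequence.
    for steps in (3, 3000):
        state = gated_linear_recurrence([[1.0, 2.0]] * steps, [[0.5, 0.5]] * steps)
        print(steps, len(state))
```

Whatever the sequence length, the loop carries only `h` forward, which is the property the abstract refers to when it says the state has a fixed size.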

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper aims to make language models more efficient on long sequences without giving up quality. Specifically:

1. **Reducing Memory Usage**: A Transformer must store a KV cache that grows linearly with sequence length, so memory consumption becomes high on long sequences. RecurrentGemma instead compresses the input sequence into a fixed-size state, so its memory use does not keep growing as the sequence gets longer.
2. **Improving Inference Speed**: Because of this reduced memory use, RecurrentGemma achieves higher throughput than a comparable Transformer model (such as Gemma) when generating long sequences.
3. **Maintaining Performance**: Despite being trained on fewer tokens than the Gemma baselines, RecurrentGemma performs comparably to similarly sized Gemma models on downstream tasks. This indicates that the reduction in resource consumption does not come at the cost of model quality.
4. **Applications in Resource-Constrained Environments**: The smaller memory footprint may open up new applications in resource-constrained environments, such as running language models on mobile or edge devices.

In summary, RecurrentGemma aims to make language models more memory-efficient and faster on long sequences while keeping performance comparable to Transformer baselines, which broadens the settings in which such models can be deployed.
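The memory argument above can be illustrated with some back-of-the-envelope arithmetic. The sketch below compares a KV cache that covers the whole sequence with a cache capped at a local attention window. The function names and every configuration value (layer count, head count, head dimension, window size, bytes per value) are hypothetical and are not taken from the paper; the small fixed recurrent state is left out for simplicity.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token seen so far."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


def windowed_cache_bytes(seq_len, window, n_layers, n_kv_heads, head_dim,
                         bytes_per_value=2):
    """Same estimate when attention only looks at the last `window` tokens."""
    # The cache stops growing once the sequence is longer than the window.
    return kv_cache_bytes(min(seq_len, window), n_layers, n_kv_heads,
                          head_dim, bytes_per_value)


if __name__ == "__main__":
    # Hypothetical model configuration, for illustration only.
    cfg = dict(n_layers=8, n_kv_heads=1, head_dim=128)
    for seq_len in (1_000, 10_000, 100_000):
        full = kv_cache_bytes(seq_len, **cfg)
        capped = windowed_cache_bytes(seq_len, window=2_048, **cfg)
        print(f"seq_len={seq_len}: full={full} B, windowed={capped} B")
```

Under these assumptions the full cache grows in proportion to `seq_len`, while the windowed estimate levels off once the sequence exceeds the window.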
