RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz Gustavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Freitas
2024-08-28
Abstract: We introduce RecurrentGemma, a family of open language models which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction-tuned variants for both. Our models achieve comparable performance to similarly-sized Gemma baselines despite being trained on fewer tokens.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper aims to make language models more efficient on long sequences without giving up quality. Specifically:

1. **Reducing Memory Usage**: A Transformer must store a KV cache that grows linearly with sequence length, so memory consumption becomes high on long sequences. RecurrentGemma instead compresses the input sequence into a fixed-size state, so its memory use does not keep growing as the sequence gets longer.
2. **Improving Inference Speed**: Because the state stays small, RecurrentGemma generates long sequences faster. On long sequences in particular, its throughput is much higher than that of a comparable Transformer model such as Gemma.
3. **Maintaining Performance**: Although RecurrentGemma was trained on fewer tokens than the Gemma baselines, it achieves comparable performance to similarly sized Gemma models on downstream tasks. The efficiency gains therefore do not come at the cost of model quality.
4. **Applications in Resource-Constrained Environments**: Lower memory use and higher throughput could open up new applications in resource-constrained settings, such as running capable language models on mobile or edge devices.

In summary, RecurrentGemma targets language models that are more memory-efficient and faster on long sequences while remaining comparable in quality to Transformer baselines, which broadens the range of settings where they can be deployed.
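
The memory argument in point 1 can be made concrete with a rough back-of-the-envelope sketch. The function names and every number below (layer counts, head sizes, window length, recurrent width) are hypothetical placeholders chosen for illustration. They are not RecurrentGemma's or Gemma's actual configuration, and the model of a "fixed-size state" as a local-attention cache capped at a window plus a per-layer recurrent vector is a simplifying assumption.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate size of a global-attention KV cache for one sequence."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def fixed_state_bytes(seq_len, n_attn_layers, n_rec_layers, n_kv_heads,
                      head_dim, window, rec_width, bytes_per_value=2):
    """Approximate size of a bounded state (assumed model, see lead-in)."""
    # Local attention only keeps the most recent `window` tokens.
    cached_tokens = min(seq_len, window)
    local_cache = kv_cache_bytes(cached_tokens, n_attn_layers, n_kv_heads,
                                 head_dim, bytes_per_value)
    # Each recurrent layer keeps one state vector, independent of seq_len.
    recurrent_state = n_rec_layers * rec_width * bytes_per_value
    return local_cache + recurrent_state


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    for seq_len in (2_048, 16_384, 131_072):
        full = kv_cache_bytes(seq_len, n_layers=24, n_kv_heads=8, head_dim=128)
        fixed = fixed_state_bytes(seq_len, n_attn_layers=8, n_rec_layers=16,
                                  n_kv_heads=8, head_dim=128,
                                  window=2_048, rec_width=4_096)
        print(f"{seq_len:>7} tokens: "
              f"full KV cache {full / 2**20:8.1f} MiB, "
              f"bounded state {fixed / 2**20:8.1f} MiB")
```

Under these assumptions the full KV cache doubles whenever the sequence length doubles, while the bounded state stops growing once the sequence is longer than the attention window.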