CodeGemma: Open Code Models Based on Gemma

CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A. Choquette-Choo, Jingyue Shen, Joe Kelley, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Zhitao Gong, Jane Fine, Tris Warkentin, Ale Jakse Hartman, Bin Ni, Kathy Korevec, Kelly Schaefer, Scott Huffman
2024-06-19
Abstract: This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and open-ended generation in latency-sensitive settings.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper introduces CodeGemma, a collection of open code models built on Gemma for code and natural language generation tasks. CodeGemma comprises three variants: a 7B pretrained (PT) model, a 7B instruction-tuned (IT) model, and a 2B model designed specifically for fast code completion and open-ended generation. The models are further trained on large-scale code data to strengthen code comprehension and reasoning. A highlight of CodeGemma is that it performs well on code completion and generation while retaining a good understanding of natural language. The 7B models excel in mathematical reasoning, while the 2B model, thanks to its fast inference, suits latency-sensitive applications such as integrated development environments (IDEs). The paper also describes improvements to the training method, such as refining the fill-in-the-middle (FIM) task and using multi-file packing so that the model better understands code context as it appears in practice. In addition, supervised fine-tuning and reinforcement learning on mathematical problems improve the models' logical reasoning and problem-solving abilities. Across automated benchmarks covering code generation, multilingual code tasks, and mathematical reasoning, CodeGemma is compared with existing models and shows its strengths. It aims to provide efficient, high-quality code generation for real-world deployments, particularly in low-latency environments.
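To make the fill-in-the-middle (FIM) task mentioned above concrete, here is a minimal Python sketch of how a source file can be split into prefix, middle, and suffix and rearranged so that the model generates the middle last. The sentinel strings, the prefix-suffix-middle ordering, and the helper names (`split_for_fim`, `build_fim_prompt`, `make_training_example`) are assumptions for illustration only and are not taken from this summary; check the actual control tokens against the model's tokenizer before use.

```python
from typing import Tuple

# Assumed sentinel strings (illustrative; verify against the real tokenizer).
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def split_for_fim(code: str, start: int, end: int) -> Tuple[str, str, str]:
    """Split code into (prefix, middle, suffix) at character offsets."""
    if not 0 <= start <= end <= len(code):
        raise ValueError("offsets must satisfy 0 <= start <= end <= len(code)")
    return code[:start], code[start:end], code[end:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Inference-time prompt: the model is expected to continue with the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def make_training_example(code: str, start: int, end: int) -> str:
    """Training-time string: the prompt followed by the held-out middle span."""
    prefix, middle, suffix = split_for_fim(code, start, end)
    return build_fim_prompt(prefix, suffix) + middle


if __name__ == "__main__":
    source = "def add(a, b):\n    return a + b\n"
    start = source.index("return")
    end = source.index("\n", start)
    print(make_training_example(source, start, end))
```

In an IDE setting, the text before the cursor would play the role of the prefix and the text after it the suffix, so a single call to `build_fim_prompt` would produce the completion request.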