CodeGemma: Open Code Models Based on Gemma

CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A. Choquette-Choo, Jingyue Shen, Joe Kelley, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Zhitao Gong, Jane Fine, Tris Warkentin, Ale Jakse Hartman, Bin Ni, Kathy Korevec, Kelly Schaefer, Scott Huffman

2024-06-19

Abstract: This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and open-ended generation in latency-sensitive settings.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper introduces CodeGemma, a collection of open code models built on Gemma for code and natural language generation tasks. CodeGemma consists of three variants: a 7B pretrained (PT) model, a 7B instruction-tuned (IT) model, and a 2B model designed for fast code completion and open-ended generation. All three are further trained on large-scale code data to strengthen code comprehension and reasoning. The models perform well on code completion and generation while retaining a good understanding of natural language. The 7B models excel in mathematical reasoning, while the 2B model's fast inference makes it suitable for latency-sensitive applications such as integrated development environments (IDEs). The paper also describes improvements to training, including a refined Fill-in-the-Middle (FIM) objective, in which the model learns to generate a missing span of code given the text before and after it, and multi-file packing, which groups related files from the same project into one training example so that the model sees realistic code context. In addition, supervised fine-tuning and reinforcement learning on mathematical problems improve the models' logical reasoning and problem-solving abilities. On automated benchmarks, CodeGemma is compared with existing models on code generation, multi-language coding tasks, and mathematical reasoning. The stated aim is efficient, high-quality code generation for real-world deployments, particularly in low-latency environments.
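The Fill-in-the-Middle idea mentioned above can be made concrete with a minimal sketch. It splits a piece of code into a prefix, a middle, and a suffix, then arranges them in prefix-suffix-middle order so that a left-to-right model sees both sides of the gap before producing the missing span. The sentinel strings below are an assumption (this summary does not state them; they follow the control tokens listed in the public CodeGemma model card and should be checked against the tokenizer actually used), and the helper names `split_for_fim`, `make_psm_example`, and `make_infill_prompt` are hypothetical. This is an illustration of the technique, not the authors' data pipeline.

```python
import random

# Assumed sentinel strings; verify against the tokenizer of the model you use.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def split_for_fim(text, rng):
    """Pick two random cut points and return (prefix, middle, suffix)."""
    a, b = sorted(rng.randint(0, len(text)) for _ in range(2))
    return text[:a], text[a:b], text[b:]


def make_psm_example(prefix, middle, suffix):
    """Training string in prefix-suffix-middle order: the middle comes last,
    so ordinary next-token prediction learns to fill the gap."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def make_infill_prompt(prefix, suffix):
    """Inference prompt: same layout, with the middle left for the model."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


source = "def add(a, b):\n    return a + b\n"
prefix, middle, suffix = split_for_fim(source, random.Random(0))
example = make_psm_example(prefix, middle, suffix)
prompt = make_infill_prompt("def add(a, b):\n    return ", "\n")
```

In an IDE setting, `prefix` would be the text before the cursor and `suffix` the text after it; whatever the model generates after the final sentinel is inserted at the cursor.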
