Abstract: We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. We report our findings from the training of Falcon2-11B, which follows a multi-stage approach: the early stages are distinguished by their context length, and the final stage uses a curated, high-quality dataset. Additionally, we report the effect of doubling the batch size mid-training and how training loss spikes are affected by the learning rate. The downstream performance of the foundation model is evaluated on established benchmarks, including multilingual and code datasets. The foundation model shows strong generalization across all the tasks, which makes it suitable for downstream finetuning use cases. For the vision language model, we report the performance on several benchmarks and show that our model achieves a higher average score compared to open-source models of similar size. The model weights and code of both Falcon2-11B and Falcon2-11B-vlm are made available under a permissive license.

What problem does this paper attempt to address?

The main problems that this paper attempts to solve include the following aspects:

1. **Improving the performance and efficiency of large-scale language models**:
   - Falcon2-11B is a base model trained on over 5 trillion tokens. Through a multi-stage training method, the researchers aim to optimize the model's performance, especially in terms of context length extension, multilingual support, and efficient training.
   - The paper explores the impact of doubling the batch size during training on the training loss and analyzes how the learning rate affects training loss spikes.
2. **Constructing a multimodal ecosystem**:
   - The researchers not only developed the large language model Falcon2-11B but also introduced its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. This marks an important step towards constructing a multimodal ecosystem.
   - The multimodal model can process image inputs and answer image-related queries, thus expanding the application range of the model.
3. **Improving multilingual support**:
   - To enhance multilingual capabilities, the proportion of non-English languages in the Falcon2-11B training data was increased to ensure better performance on multilingual tasks.
   - The researchers paid special attention to data quality filtering for different languages to ensure high-quality multilingual training data.
4. **Evaluating the performance of downstream tasks**:
   - The paper reports in detail the performance of Falcon2-11B on multiple benchmarks, including multilingual and code datasets. The results show that the model performs well on various tasks and is suitable for downstream fine-tuning use cases.
   - The vision-language model Falcon2-11B-vlm was also evaluated on multiple benchmarks, where it achieves a higher average score than open-source models of similar size.
5. **Optimizing the training process**:
   - By using FlashAttention-2 (FA2), the researchers increased GPU utilization and significantly improved training speed, especially with long context windows.
   - The researchers also explored different training strategies, such as learning rate scheduling and weight decay, to ensure the stability and efficiency of model training.

In summary, the main objective of this paper is to further improve the performance and applicability of large language models by improving training methods, adding multimodal support, and optimizing multilingual capabilities.
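To make the staged schedule and the batch-size doubling described above more concrete, here is a minimal Python sketch of the bookkeeping involved. It is an illustration only: the `Stage` class, the `double_batch` helper, and every number in it are hypothetical placeholders chosen for easy arithmetic, not the settings used to train Falcon2-11B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage. All field values used below are hypothetical."""
    name: str
    context_length: int  # tokens per sequence
    batch_size: int      # sequences per optimizer step
    token_budget: int    # total tokens consumed in this stage

    @property
    def tokens_per_step(self) -> int:
        return self.context_length * self.batch_size

    @property
    def num_steps(self) -> int:
        # Ceiling division: the final step may be only partially filled.
        return -(-self.token_budget // self.tokens_per_step)


def double_batch(stage: Stage, name: str) -> Stage:
    """Return a copy of `stage` with twice the batch size (same token budget)."""
    return Stage(name, stage.context_length, stage.batch_size * 2, stage.token_budget)


# Placeholder schedule, NOT the paper's configuration.
short_ctx = Stage("short-context", context_length=2048, batch_size=1024, token_budget=2**31)
short_ctx_2x = double_batch(short_ctx, "short-context-2x-batch")
long_ctx = Stage("long-context", context_length=8192, batch_size=256, token_budget=2**31)

for stage in (short_ctx, short_ctx_2x, long_ctx):
    print(f"{stage.name}: {stage.tokens_per_step} tokens/step, {stage.num_steps} steps")
```

Under these assumptions, doubling the batch size doubles the tokens consumed per optimizer step and halves the number of steps needed for a fixed token budget, which is one reason such a change can interact with the learning rate and with loss spikes.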

Falcon2-11B Technical Report
