Falcon2-11B Technical Report

Quentin Malartic, Nilabhra Roy Chowdhury, Ruxandra Cojocaru, Mugariya Farooq, Giulia Campesan, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Maksim Velikanov, Basma El Amel Boussaha, Mohammed Al-Yafeai, Hamza Alobeidli, Leen Al Qadi, Mohamed El Amine Seddik, Kirill Fedyanin, Reda Alami, Hakim Hacid
2024-07-20
Abstract: We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. We report our findings during the training of the Falcon2-11B which follows a multi-stage approach where the early stages are distinguished by their context length and a final stage where we use a curated, high-quality dataset. Additionally, we report the effect of doubling the batch size mid-training and how training loss spikes are affected by the learning rate. The downstream performance of the foundation model is evaluated on established benchmarks, including multilingual and code datasets. The foundation model shows strong generalization across all the tasks which makes it suitable for downstream finetuning use cases. For the vision language model, we report the performance on several benchmarks and show that our model achieves a higher average score compared to open-source models of similar size. The model weights and code of both Falcon2-11B and Falcon2-11B-vlm are made available under a permissive license.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the following problems:

1. **Improving the performance and efficiency of large-scale language models**:
   - Falcon2-11B is a base model trained on over 5 trillion tokens. Using a multi-stage training method, the researchers aim to optimize the model's performance, particularly with respect to context length extension, multilingual support, and efficient training.
   - The paper examines the effect of doubling the batch size mid-training and analyzes how the learning rate affects training loss spikes.
2. **Constructing a multimodal ecosystem**:
   - Alongside the language model Falcon2-11B, the researchers introduce its multimodal counterpart, Falcon2-11B-vlm, a vision-to-text model. This is a step towards a multimodal ecosystem.
   - The multimodal model can process image inputs and answer image-related queries, which widens the model's range of applications.
3. **Improving multilingual support**:
   - To strengthen multilingual capabilities, the proportion of non-English languages in the Falcon2-11B training data was increased.
   - The researchers paid particular attention to data quality filtering for the different languages so that the multilingual training data is of high quality.
4. **Evaluating the performance of downstream tasks**:
   - The paper reports the performance of Falcon2-11B on multiple benchmarks, including multilingual and code datasets. The results show that the model generalizes well across tasks and is suitable for downstream fine-tuning use cases.
   - The vision-language model Falcon2-11B-vlm is also evaluated on several benchmarks, where it achieves a higher average score than open-source models of similar size.
5. **Optimizing the training process**:
   - By using FlashAttention-2 (FA2), the researchers increased GPU utilization and significantly improved training speed, especially for long context windows.
   - The researchers also explored training settings such as learning rate scheduling and weight decay to keep training stable and efficient.

In summary, the main objective of this paper is to improve the performance and applicability of large language models by improving training methods, adding multimodal support, and strengthening multilingual capabilities.
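
The batch-size doubling mentioned in point 1 can be pictured as a simple step schedule. The sketch below is only an illustration under stated assumptions: the function names, the single doubling step, and all numeric values are hypothetical and are not taken from the paper, which should be consulted for the actual training configuration.

```python
def batch_size_at(step: int, base_batch_size: int, doubling_step: int) -> int:
    """Return the batch size in effect at a given optimizer step.

    Hypothetical schedule: the batch size stays at `base_batch_size`
    until `doubling_step`, then is doubled for the rest of training.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < doubling_step:
        return base_batch_size
    return 2 * base_batch_size


def sequences_seen(total_steps: int, base_batch_size: int, doubling_step: int) -> int:
    """Count the sequences consumed after `total_steps` optimizer steps."""
    steps_before = min(total_steps, doubling_step)
    steps_after = max(total_steps - doubling_step, 0)
    return steps_before * base_batch_size + steps_after * 2 * base_batch_size


if __name__ == "__main__":
    # Illustrative values only, not the paper's settings.
    for step in (0, 9, 10, 11):
        print(step, batch_size_at(step, base_batch_size=4, doubling_step=10))
    print(sequences_seen(total_steps=15, base_batch_size=4, doubling_step=10))
```

With a schedule like this, each step after the doubling point consumes twice as much data, so the same token budget is reached in fewer optimizer steps.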