TextGram: Towards a better domain-adaptive pretraining

Sharayu Hiwarkhedkar, Saloni Mittal, Vidula Magdum, Omkar Dhekane, Raviraj Joshi, Geetanjali Kale, Arnav Ladkat
DOI: https://doi.org/10.1007/978-3-031-58495-4_12
2024-04-28
Abstract: For green AI, it is crucial to measure and reduce the carbon footprint of training large language models. In NLP, pre-training Transformer models requires significant computational resources. Pre-training uses a large amount of text data to acquire prior knowledge for downstream tasks. It is therefore important to select the right domain-specific data from this vast corpus to achieve optimal results on domain-specific tasks. While training on large unlabeled data is expensive, it can be optimized by performing a data selection step before pre-training. Selecting important data reduces the space overhead and the substantial time required to pre-train the model while maintaining constant accuracy. We investigate existing selection strategies and propose our own domain-adaptive data selection method, TextGram, which effectively selects essential data from large corpora. We compare and evaluate the results of fine-tuned models on a text classification task with and without data selection. We show that the proposed strategy works better than other selection methods.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper focuses on improving model training efficiency, reducing carbon emissions, and enhancing downstream task performance by optimizing data selection before pre-training. Specifically, it addresses the following issues:

1. **Reducing Carbon Footprint**: Pre-training large Transformer-based language models consumes a significant amount of computational resources, which leads to substantial carbon emissions. Researchers therefore need a way to reduce compute usage while preserving model performance.
2. **Efficient Data Selection**: Selecting domain-relevant data from a massive corpus can optimize the pre-training process. Effective data selection saves time and avoids unnecessary computation.
3. **Improving Existing Data Selection Strategies**: Existing data selection methods (such as N-Grams and TF-IDF) are effective but can be computationally expensive in some cases, making them unsuitable for large-scale production environments.

The paper proposes a new domain-adaptive data selection method, TextGram, aimed at improving pre-training efficiency and reducing costs. The goal is a more environmentally friendly and efficient pre-training process that does not sacrifice model performance. Experimental results show that TextGram performs better than the other data selection methods compared, as measured on text classification tasks.
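
To make the idea of a pre-training data selection step concrete, here is a minimal sketch of a generic n-gram overlap filter. It is not the TextGram algorithm, whose details are not given in this summary. The function names, the use of word bigrams, the whitespace tokenization, and the `keep_fraction` parameter are all assumptions chosen for illustration.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_by_ngram_overlap(corpus, domain_texts, n=2, keep_fraction=0.5):
    """Keep the top fraction of corpus documents, ranked by n-gram overlap
    with the in-domain texts. A generic heuristic, not TextGram itself."""
    domain_counts = Counter(g for t in domain_texts for g in ngrams(t, n))

    def score(doc):
        grams = ngrams(doc, n)
        if not grams:
            return 0.0
        # Fraction of the document's n-grams that also occur in the domain data.
        return sum(1 for g in grams if g in domain_counts) / len(grams)

    # sorted() is stable, so ties keep their original corpus order.
    ranked = sorted(corpus, key=score, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Hypothetical toy data: a small in-domain sample and a mixed corpus.
domain_texts = ["the movie was great", "a boring movie plot"]
corpus = [
    "the movie was long",
    "stock prices fell sharply",
    "the plot was great",
    "rain is expected tomorrow",
]
selected = select_by_ngram_overlap(corpus, domain_texts)
print(selected)
```

In this toy run the two movie-related sentences are kept and the unrelated ones are dropped, so only half of the corpus would be passed on to pre-training. A real pipeline would likely need a proper tokenizer and a scoring scheme suited to the corpus size.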