TextGram: Towards a better domain-adaptive pretraining

Sharayu Hiwarkhedkar, Saloni Mittal, Vidula Magdum, Omkar Dhekane, Raviraj Joshi, Geetanjali Kale, Arnav Ladkat

DOI: https://doi.org/10.1007/978-3-031-58495-4_12

2024-04-28

Abstract: For green AI, it is crucial to measure and reduce the carbon footprint of training large language models. In NLP, pre-training Transformer models requires significant computational resources. Pre-training uses a large amount of text data to give the model prior knowledge for downstream tasks. It is therefore important to select the right, domain-specific data from this vast corpus to achieve the best results on domain-specific tasks. Although training on large unsupervised data is expensive, the cost can be reduced by performing a data selection step before pre-training. Selecting important data reduces the space overhead and the substantial time required to pre-train the model while maintaining accuracy. We investigate existing selection strategies and propose our own domain-adaptive data selection method, TextGram, which effectively selects essential data from large corpora. We compare and evaluate fine-tuned models on a text classification task with and without data selection. We show that the proposed strategy works better than other selection methods.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper focuses on improving training efficiency, reducing carbon emissions, and maintaining downstream task performance by optimizing data selection before pre-training. Specifically, it attempts to address the following issues:

1. **Reducing Carbon Footprint**: Pre-training large-scale language models (such as Transformer-based models) consumes a significant amount of computational resources, leading to substantial carbon emissions. Researchers therefore need a way to reduce computational resource use while preserving model performance.
2. **Efficient Data Selection**: Selecting domain-relevant data from a massive corpus can optimize the pre-training process. Effective data selection saves time and avoids unnecessary computation.
3. **Improving Existing Data Selection Strategies**: Existing data selection methods (such as N-Grams and TF-IDF) are effective but can be computationally expensive in some cases, making them unsuitable for large-scale production environments. The paper proposes a new domain-adaptive data selection method, TextGram, aimed at improving pre-training efficiency and reducing costs.

Through these efforts, the paper aims for a more environmentally friendly and efficient pre-training process without sacrificing model performance. The reported experiments, which evaluate fine-tuned models on a text classification task, show that TextGram performs better than the other data selection methods compared.
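To make the idea of a pre-training data selection step concrete, here is a minimal sketch of n-gram overlap scoring, one of the baseline families mentioned above. It is not the TextGram algorithm, whose details are not given in this summary. The function names (`build_domain_profile`, `overlap_score`, `select_top_fraction`), the whitespace tokenizer, the bigram setting, and the `fraction` parameter are all illustrative assumptions.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the list of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_domain_profile(domain_texts, n=2):
    """Count every n-gram seen in the (small) in-domain sample."""
    counts = Counter()
    for text in domain_texts:
        counts.update(ngrams(text, n))
    return counts


def overlap_score(text, profile, n=2):
    """Fraction of the document's n-grams that also occur in the domain profile."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in profile)
    return hits / len(grams)


def select_top_fraction(corpus, domain_texts, fraction=0.5, n=2):
    """Keep the highest-scoring fraction of the corpus (hypothetical selection rule)."""
    profile = build_domain_profile(domain_texts, n)
    ranked = sorted(corpus, key=lambda doc: overlap_score(doc, profile, n), reverse=True)
    keep = max(1, int(len(corpus) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    domain = [
        "the patient was given a dose of insulin",
        "the patient reported chest pain",
    ]
    corpus = [
        "the patient was given aspirin",
        "stock prices fell sharply today",
        "the team won the match",
        "the patient reported mild pain",
    ]
    for doc in select_top_fraction(corpus, domain, fraction=0.5):
        print(doc)
```

Only the selected subset would then be passed to pre-training, which is where the savings in time and storage come from. A real pipeline would likely use a proper tokenizer and a score that weights n-grams by frequency rather than simple membership.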
