Abstract: Gene transformer models such as Nucleotide Transformer, DNABert, and LOGO are trained to learn optimal gene sequence representations using the Masked Language Modeling (MLM) training objective over the complete Human Reference Genome. However, typical tokenization methods employ a basic sliding window of tokens, such as k-mers, that fails to utilize gene-centric semantics. This can result in the masking of trivially predictable sequences, leading to inefficient MLM training. Time-variant training strategies are known to improve pretraining efficiency in both language and vision tasks. In this work, we focus on curriculum masking, where we systematically increase the difficulty of the masked token prediction task. Because gene sequences lack well-defined semantic units similar to the words or sentences of the NLP domain, we use a Pointwise Mutual Information-based difficulty criterion. Our proposed Curriculum Masking-based Gene Masking Strategy (CM-GEMS) demonstrates superior representation learning capabilities compared to baseline masking approaches when evaluated on downstream gene sequence classification tasks. We perform extensive evaluation in both few-shot (five datasets) and full-dataset settings (the Genomic Understanding Evaluation benchmark, consisting of 27 tasks). Our findings reveal that CM-GEMS outperforms state-of-the-art models (DNABert-2, Nucleotide Transformer, DNABert) trained at 120K steps, achieving similar results in just 10K and 1K steps. We also demonstrate that Curriculum-Learned LOGO (a 2-layer DNABert-like model) can achieve nearly 90% of the performance of the state-of-the-art model trained for 120K steps. We will make the models and code publicly available at https://github.com/roysoumya/curriculum-GeneMask.
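
To make the idea of a Pointwise Mutual Information-based difficulty criterion concrete, the following is a minimal sketch and not the authors' implementation. It assumes non-overlapping k-mer tokens, scores adjacent token pairs with PMI(x, y) = log(p(x, y) / (p(x) p(y))), and treats jointly masking a higher-PMI pair as the harder prediction task. The function names (kmers, pair_pmi, eligible_pairs) and the linear schedule are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into non-overlapping k-mers (any trailing remainder is dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def pair_pmi(tokens):
    """PMI of each adjacent token pair: log(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (x, y): math.log((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }


def eligible_pairs(pmi, step, total_steps):
    """Assumed linear curriculum: low-PMI pairs (taken here as easier) become
    maskable first, and the eligible share grows until all pairs are included."""
    ranked = sorted(pmi, key=pmi.get)  # ascending PMI
    n = math.ceil(len(ranked) * (step + 1) / total_steps)
    return ranked[:max(1, min(n, len(ranked)))]


tokens = kmers("ACGACGTTTACGACG", k=3)
pmi = pair_pmi(tokens)
for step in range(3):
    print(step, eligible_pairs(pmi, step, total_steps=3))
```

In this toy sequence the repeated pair (ACG, ACG) has negative PMI and is eligible from the first step, while the rarer pairs involving TTT have positive PMI and enter the masking pool later. A real pipeline would estimate PMI over a genome-scale corpus and combine the eligible pairs with a masking-rate budget.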
