Abstract: The computational and memory demands of vanilla attention scale quadratically with the sequence length $N$, posing significant challenges for processing long sequences in Transformer models. FlashAttention alleviates these challenges by eliminating the $O(N^2)$ memory dependency and reducing attention latency through IO-aware memory optimizations. However, it natively supports only a limited set of attention mask types and does not inherently accommodate more complex masking requirements. Previous approaches resort to dense masks with $O(N^2)$ memory complexity, which is inefficient. In this paper, we propose FlashMask, an extension of FlashAttention that introduces a column-wise sparse representation of attention masks. This representation efficiently encodes a wide range of mask types and facilitates the development of optimized kernel implementations. With it, FlashMask achieves linear memory complexity $O(N)$, making it suitable for modeling long-context sequences. Moreover, the representation enables kernel optimizations that skip unnecessary computations by exploiting sparsity in the attention mask, without sacrificing computational accuracy, resulting in higher computational efficiency. We evaluate FlashMask's performance in fine-tuning and alignment training of LLMs, including SFT, LoRA, DPO, and RM. FlashMask achieves significant throughput improvements, with end-to-end speedups ranging from 1.65x to 3.22x compared to the existing FlashAttention dense method. Additionally, our kernel-level comparisons demonstrate that FlashMask surpasses the latest counterpart, FlexAttention, by 12.1% to 60.7% in terms of kernel TFLOPs/s, achieving 37.8% to 62.3% of the theoretical maximum FLOPs/s on the A100 GPU. The code is open-sourced on PaddlePaddle and integrated into PaddleNLP, supporting models with over 100 billion parameters for contexts up to 128K tokens.
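
To make the idea of a column-wise sparse mask concrete, the following is a minimal sketch in plain Python, not the FlashMask kernel or its API. It assumes one plausible encoding that the abstract does not spell out: for each key column `j`, a single half-open interval of query rows `[start[j], end[j])` is masked, in addition to an optional causal constraint. The function names and the causal document-mask scenario are hypothetical and chosen only for illustration.

```python
def causal_document_mask_columns(doc_lengths):
    """Per-column masked row interval [start, end) for a causal document mask.

    Assumed encoding (illustrative only): a key column j that belongs to a
    document spanning rows [lo, hi) must be hidden from every query row of
    a later document, i.e. rows hi .. N-1. Rows above the diagonal are
    covered by the causal flag in to_dense().
    """
    n = sum(doc_lengths)
    start, end = [], []
    offset = 0
    for length in doc_lengths:
        hi = offset + length
        for _ in range(length):
            start.append(hi)  # first masked query row for this column
            end.append(n)     # one past the last masked query row
        offset = hi
    return start, end


def to_dense(start, end, causal=True):
    """Expand the column-wise form to an N x N boolean mask (True = masked)."""
    n = len(start)
    return [
        [(causal and i < j) or (start[j] <= i < end[j]) for j in range(n)]
        for i in range(n)
    ]


# Two packed documents of lengths 2 and 3 (N = 5).
start, end = causal_document_mask_columns([2, 3])
dense = to_dense(start, end)
sparse_entries = len(start) + len(end)           # 2 * N integers
dense_entries = sum(len(row) for row in dense)   # N * N booleans
print(sparse_entries, dense_entries)
```

Under this assumed encoding the mask is stored as two length-$N$ integer vectors rather than an $N \times N$ matrix, which is where the $O(N)$ versus $O(N^2)$ memory gap comes from. A kernel working on tiles could also compare a tile's row range against the stored intervals to decide whether the tile is fully masked and can be skipped.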

FlashMask: Efficient and Rich Mask Extension of FlashAttention
