Abstract: The advent of foundation models has revolutionized various fields, enabling unprecedented task accuracy and flexibility in computational linguistics, computer vision, and other domains. The attention mechanism has become an essential component of foundation models because of its superb capability to capture correlations in a sequence. However, attention incurs memory and compute costs that grow quadratically with context length. Many fusion-based exact attention acceleration algorithms have been developed for datacenter-grade GPUs and accelerators, leveraging multi-core parallelism and data locality, but accelerating attention on resource-constrained edge neural accelerators with limited compute units and tightly constrained on-chip caches remains a significant challenge. In this paper, we propose a scheme for accelerating exact attention inference on memory-constrained edge accelerators by using heterogeneous compute units, i.e., vector processing units and matrix processing units, in parallel. Our method schedules workloads onto these compute units in a multi-tiered tiling scheme that processes the tiled vector workloads and matrix workloads in attention as two streams while respecting workload dependencies. We search for tiling factors that maximize the parallelization of both compute units while accounting for I/O overhead, and we propose a proactive cache overwrite strategy to avoid undesirable cache spills in practice. Extensive results based on open-source simulation frameworks show up to 2.75x speedup and a 54% reduction in energy consumption compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario. Further experiments on a real-world edge neural processing unit demonstrate a speedup of up to 1.76x for attention compared to FLAT, without affecting model output accuracy.
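To make the distinction between matrix workloads and vector workloads concrete, the following is a minimal illustrative sketch, not the scheme proposed in the paper. It assumes standard scaled dot-product attention and tiles only over query rows; the names `matmul`, `softmax_rows`, `tiled_attention`, and the `tile_rows` parameter are hypothetical. Each tile runs a matrix step (scores), a vector step (row-wise softmax), and a second matrix step (weighted sum of values). The sketch runs these steps one after another, whereas an accelerator with separate matrix and vector units could overlap the steps of different tiles.

```python
import math


def matmul(a, b):
    """Matrix workload: dense product of two row-major lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Vector workload: numerically stable softmax applied to each row."""
    out = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Untiled reference: softmax(Q K^T / sqrt(d)) V."""
    scale = 1.0 / math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def tiled_attention(q, k, v, tile_rows=2):
    """Exact attention computed one tile of query rows at a time.

    tile_rows is a hypothetical tiling factor; the paper searches for its
    tiling factors, which this sketch does not attempt to reproduce.
    """
    scale = 1.0 / math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    out = []
    for start in range(0, len(q), tile_rows):
        q_tile = q[start:start + tile_rows]
        # Matrix step: scaled scores for this tile of queries.
        scores = [[s * scale for s in row] for row in matmul(q_tile, k_t)]
        # Vector step: row-wise softmax over the tile's scores.
        probs = softmax_rows(scores)
        # Matrix step: weighted sum of value rows.
        out.extend(matmul(probs, v))
    return out
```

Because softmax is applied independently to each query row, tiling over query rows leaves the result unchanged, so the tiled output matches the untiled reference for any `tile_rows` of at least 1.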

MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
