Abstract: Attention-based Neural Networks (NN) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overheads associated with these models result in high inference latency, limiting their feasibility as practical prefetchers. To close the gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of the above approach, we develop DART, a prefetcher comprised of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model. DART accelerates the large model inference by 170x and the distilled model by 9.4x. DART has latency and storage costs comparable to those of the state-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC improvement. DART outperforms the state-of-the-art NN-based prefetchers TransFetch by 33.1% and Voyager by 37.2% in terms of IPC improvement, primarily due to its low prefetching latency.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the high computational cost of neural-network-based data prefetching. Specifically, although Attention-based Neural Networks (ANNs) perform well in memory access prediction, their high inference latency limits their practicality. To bridge this gap, the paper proposes a new method, tabularization. By converting an attention-based model into a hierarchy of simple tables, it significantly reduces model complexity and inference latency while maintaining prediction accuracy.

### Main contributions

1. **Proposed a new knowledge transfer method**: From large-scale attention-based neural networks to hierarchical tabular structures, making neural-network-based data prefetchers more practical.
2. **Designed the tabularization kernel**: Converts attention mechanisms and linear operations into table lookups, thereby eliminating matrix multiplication in model inference.
3. **Developed an example prefetcher, DART**: Constructed by the tabularization method, DART eliminates 99.99% of the arithmetic operations of the large attention-based model while decreasing the F1-score by only 0.09, and accelerates large-scale model inference by 170 times and distilled model inference by 9.4 times.
4. **Proposed the layer fine-tuning algorithm**: Alleviates the error accumulation that occurs when multiple layers are mapped to tables.
5. **Evaluated the performance of DART**: On multiple workloads, DART achieves a 37.6% IPC improvement, surpassing the state-of-the-art rule-based prefetcher BO by 6.1% and the neural-network-based prefetchers TransFetch and Voyager by 33.1% and 37.2%, respectively.

### Key technologies

1. **Attention mechanism**:
   - **Feed-forward network (FFN)**:
     \[ \text{Linear}(X) = WX + B \]
     \[ \text{FFN}(X) = \text{Linear}_O(\max(0, \text{Linear}_H(X))) \]
   - **Multi-head self-attention (MSA)**:
     \[ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^\top}{\sqrt{D_k}}\right)V \]
     \[ \text{MSA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O \]
     where \(\text{head}_i = \text{Attention}(QW_Q^i, KW_K^i, VW_V^i)\).
2. **Product quantization (PQ)**:
   - **Prototype learning**:
     \[ p_c(\tilde{A}) \triangleq \arg \min_{P} \sum_{i} \|\tilde{A}_c^i - P_c^k\|^2 \]
   - **Table construction**:
     \[ h_c(b)_k \triangleq b_c^\top \cdot P_c^k \]
   - **Vector encoding**:
     \[ g_c(a) \triangleq \arg \min_k \|a_c - P_c^k\|^2 \]
   - **Table lookup and aggregation**:
     \[ f(a, b) = \sum_c h_c(b)_k, \quad k = g_c(a) \]
3. **Tabularization kernel**:
   - **Linear inner
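
The four product quantization steps listed above (prototype learning, table construction, vector encoding, table lookup and aggregation) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DART's implementation: prototypes are learned with ordinary k-means, which is one common way to approximately minimize the squared-distance objective, and all function names (`learn_prototypes`, `build_table`, `encode`, `lookup`) and the toy sizes are hypothetical.

```python
import random

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def split_subspaces(vec, n_sub):
    """Split a vector into n_sub equal-length subvectors (the subspaces c)."""
    step = len(vec) // n_sub
    return [vec[i * step:(i + 1) * step] for i in range(n_sub)]

def learn_prototypes(train, n_sub, n_proto, iters=10, seed=0):
    """Prototype learning, assumed here to be k-means run per subspace."""
    rng = random.Random(seed)
    protos = []
    for c in range(n_sub):
        sub = [split_subspaces(a, n_sub)[c] for a in train]
        centers = [list(p) for p in rng.sample(sub, n_proto)]
        for _ in range(iters):
            groups = [[] for _ in range(n_proto)]
            for s in sub:
                k = min(range(n_proto), key=lambda j: sq_dist(s, centers[j]))
                groups[k].append(s)
            for k, g in enumerate(groups):
                if g:  # an empty cluster keeps its previous center
                    centers[k] = [sum(col) / len(g) for col in zip(*g)]
        protos.append(centers)
    return protos

def build_table(b, protos):
    """Table construction: h_c(b)_k = b_c . P_c^k for every c and k."""
    b_sub = split_subspaces(b, len(protos))
    return [[dot(b_sub[c], p) for p in protos[c]] for c in range(len(protos))]

def encode(a, protos):
    """Vector encoding: g_c(a) = argmin_k ||a_c - P_c^k||^2."""
    a_sub = split_subspaces(a, len(protos))
    return [
        min(range(len(protos[c])), key=lambda k: sq_dist(a_sub[c], protos[c][k]))
        for c in range(len(protos))
    ]

def lookup(codes, table):
    """Table lookup and aggregation: f(a, b) = sum_c h_c(b)_{g_c(a)}."""
    return sum(table[c][k] for c, k in enumerate(codes))

# Toy usage: approximate a . b without multiplying at query time.
rng = random.Random(1)
train = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
b = [rng.gauss(0, 1) for _ in range(8)]        # fixed vector, e.g. a weight row
protos = learn_prototypes(train, n_sub=4, n_proto=16)
table = build_table(b, protos)                 # precomputed offline
a = train[0]
approx = lookup(encode(a, protos), table)      # lookups and additions only
exact = dot(a, b)
```

In this sketch the multiplications are confined to `build_table`, which runs once per fixed vector `b`; each query then costs one encoding step plus one table read per subspace. The gap between `approx` and `exact` depends on how well the prototypes cover the data, which is consistent with the summary's note that errors can accumulate when several layers are mapped to tables.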

Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching
