Abstract: Virtual memory (VM) is critical to the usability and programmability of hardware accelerators. Unfortunately, implementing accelerator VM efficiently is challenging because the area and power constraints make it difficult to employ the large multi-level TLBs used in general-purpose CPUs. Recent research proposals advocate a number of restrictions on virtual-to-physical address mappings in order to reduce the TLB size or increase its reach. However, such restrictions are unattractive because they forgo many of the original benefits of traditional VM, such as demand paging and copy-on-write. We propose SPARTA, a divide and conquer approach to address translation. SPARTA splits the address translation into accelerator-side and memory-side parts. The accelerator-side translation hardware consists of a tiny TLB covering only the accelerator's cache hierarchy (if any), while the translation for main memory accesses is performed by shared memory-side TLBs. Performing the translation for memory accesses on the memory side allows SPARTA to overlap data fetch with translation, and avoids the replication of TLB entries for data shared among accelerators. To further improve the performance and efficiency of the memory-side translation, SPARTA logically partitions the memory space, delegating translation to small and efficient per-partition translation hardware. Our evaluation on index-traversal accelerators shows that SPARTA virtually eliminates translation overhead, reducing it by over 30x on average (up to 47x) and improving performance by 57%. At the same time, SPARTA requires minimal accelerator-side translation hardware, reduces the total number of TLB entries in the system, gracefully scales with memory size, and preserves all key VM functionalities.

What problem does this paper attempt to address?

This paper addresses the efficiency and flexibility problems of implementing virtual memory (VM) in hardware accelerators. Existing accelerator VM implementations face three challenges:

1. **Area and Power Consumption Constraints**: Hardware accelerators operate under strict area and power budgets, so they cannot use the large multi-level TLBs (Translation Lookaside Buffers) found in general-purpose CPUs. Address translation therefore becomes a performance bottleneck.
2. **Limitations of Address Mapping**: To reduce the TLB size or increase its reach, some proposals restrict how virtual addresses may be mapped to physical addresses. These restrictions sacrifice many benefits of traditional virtual memory, such as demand paging and copy-on-write.
3. **Address Translation Overhead**: Conventional address translation has high latency on accelerators, especially during page-table walks, which must cross multiple network layers and memory controllers and so degrade performance.

To solve these problems, the paper proposes SPARTA (Split and PARtitioned Translation for Accelerators), a divide-and-conquer approach to address translation. Its main contributions are:

- **Hierarchical Translation**: SPARTA splits address translation into an accelerator-side part and a memory-side part. The accelerator-side hardware contains only a small TLB that covers the accelerator's cache hierarchy (if any). Translation for main-memory accesses is performed by the shared memory-side TLB/MMU.
- **Logical Partitioning**: SPARTA divides the physical memory space into multiple logical partitions and ensures that each virtual address uniquely identifies the partition that holds its data. Data fetch and address translation can therefore proceed in parallel, which improves performance.
- **Low Overhead and High Flexibility**: SPARTA almost eliminates address translation overhead, reducing it by 31.5x on average (up to 47x) while improving performance by 57%. It also retains all key virtual memory functions, such as demand paging and copy-on-write, and imposes minimal restrictions on virtual-to-physical address mappings.

Through these designs, SPARTA significantly improves the performance and efficiency of address translation for accelerators while preserving the flexibility of virtual memory.
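The partitioning idea described above can be illustrated with a small software model. This is a minimal sketch, not SPARTA's actual hardware design: the page size, the number of partitions, the choice of the low-order virtual page number bits as the partition selector, the FIFO replacement policy, and all names (`partition_of`, `MemorySidePartition`, `access`) are assumptions made for illustration. It shows only the property the summary states: the partition is a pure function of the virtual address, so a request can be routed to the right partition before any translation happens, and each partition needs only a small TLB for its own pages. Timing, and therefore the overlap of data fetch with translation, is not modeled.

```python
PAGE_SHIFT = 12        # assumed 4 KiB pages
NUM_PARTITIONS = 4     # assumed number of logical memory partitions


def partition_of(vaddr: int) -> int:
    """Pick the partition from the virtual address alone (assumed: low VPN bits)."""
    return (vaddr >> PAGE_SHIFT) % NUM_PARTITIONS


class MemorySidePartition:
    """One memory partition with its own page table and a tiny TLB."""

    def __init__(self, index: int, tlb_entries: int = 2):
        self.index = index
        self.page_table = {}   # vpn -> frame number local to this partition
        self.tlb = {}          # small cache of page_table, in insertion order
        self.tlb_entries = tlb_entries
        self.tlb_hits = 0
        self.tlb_misses = 0

    def map_page(self, vpn: int, frame: int) -> None:
        # The only mapping restriction in this model: a page must live in the
        # partition that its virtual page number selects.
        assert vpn % NUM_PARTITIONS == self.index
        self.page_table[vpn] = frame

    def translate(self, vaddr: int):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.tlb:
            self.tlb_hits += 1
        else:
            self.tlb_misses += 1
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.pop(next(iter(self.tlb)))   # evict the oldest entry (FIFO)
            # Local page-table lookup; a KeyError here stands in for a page fault.
            self.tlb[vpn] = self.page_table[vpn]
        return self.index, (self.tlb[vpn] << PAGE_SHIFT) | offset


def access(partitions, vaddr: int):
    """Route the request by virtual address, then translate inside the partition."""
    return partitions[partition_of(vaddr)].translate(vaddr)


partitions = [MemorySidePartition(i) for i in range(NUM_PARTITIONS)]
partitions[1].map_page(vpn=5, frame=7)            # 5 % 4 == 1, so partition 1
part, paddr = access(partitions, (5 << PAGE_SHIFT) | 0x34)
print(part, hex(paddr))                           # prints: 1 0x7034
```

Within a partition the frame can still be chosen freely, which is why a scheme of this shape leaves room for demand paging and copy-on-write: only the partition, not the exact physical location, is fixed by the virtual address.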

SPARTA: A Divide and Conquer Approach to Address Translation for Accelerators
