Abstract: As deep learning models scale, their training cost has surged. Both hardware advancements and limitations in current software stacks have increased the need for data efficiency, which refers to effectively hiding data access latency and avoiding unnecessary data movement. Major challenges arise from the growing disparity between GPU memory bandwidth and computational throughput, imminent GPU memory capacity limitations, and inefficiencies in the PyTorch software stack, including a lack of device-specific PCIe transfer optimizations and of high-level domain-specific abstractions. To mitigate these data inefficiencies in deep learning training, this dissertation analyzes data inefficiency in representative deep learning training tasks, specifically graph neural networks (GNNs) and large language models (LLMs). It then proposes novel runtime and code generation techniques to address these challenges and implements the optimizations seamlessly within the PyTorch stack while maintaining strong programmability and interoperability. First, PyTorch-Direct is devised to incorporate the GPU-centric PCIe data transfer paradigm into PyTorch for GNN training. Next, the Hector intermediate representation (IR) and its code generator are proposed to introduce a domain-specific high-level abstraction and to systematically address memory-intensive performance challenges in relational GNNs. Finally, because LLM training throughput is increasingly constrained by GPU memory capacity, the SSDTrain offloading framework is designed and implemented. Together, these contributions show that code generation and runtime techniques can systematically mitigate the data management bottlenecks in deep learning training, which stem from the data-intensive nature of the workloads and from the oversimplification inherent in the deep learning training software stack.

MLIR-based code generation for GPU tensor cores

Towards a high-performance AI compiler with upstream MLIR

Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction

AI Powered Compiler Techniques for DL Code Optimization

TPU-MLIR: A Compiler For TPU Using MLIR

AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators

Enabling One-Size-Fits-All Compilation Optimization for Inference Across Machine Learning Computers

Enabling One-size-fits-all Compilation Optimization across Machine Learning Computers for Inference

Automatic generation of ARM NEON micro-kernels for matrix multiplication

Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs

ReACT: Redundancy-Aware Code Generation for Tensor Expressions

ML-driven Hardware Cost Model for MLIR

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

UNIT: Unifying Tensorized Instruction Compilation

Leveraging MLIR for Loop Vectorization and GPU Porting of FFT Libraries

HIR: An MLIR-based Intermediate Representation for Hardware Accelerator Description

High Performance Code Generation in MLIR: An Early Case Study with GEMM

MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations

Automatic Generation of Spatial Accelerator for Tensor Algebra