Abstract: We present an analytical framework for predicting General Matrix Multiplication (GEMM) performance on modern GPUs, focusing on runtime, power consumption, and energy efficiency. Our study employs two approaches: a custom-implemented tiled matrix multiplication kernel for fundamental analysis, and NVIDIA's CUTLASS library for comprehensive performance data collection across advanced configurations. Using the NVIDIA RTX 4070 as our experimental platform, we developed a Random Forest-based prediction model with multi-output regression capability. Through analysis of both naive tiled matrix multiplication with varying tile sizes (1 to 32) and 16,128 CUTLASS GEMM operations across diverse configurations, we identified critical performance patterns related to matrix dimensions, thread block configurations, and memory access patterns. Our framework achieved an R^2 score of 0.98 for runtime prediction (mean error 15.57%) and 0.78 for power prediction (median error 5.42%). The system successfully predicts performance across matrix sizes, demonstrating robust scaling behavior. Our results show that optimal tile size selection can improve performance by up to 3.2x while reducing power consumption by 22% compared to baseline configurations. Analysis of shared memory utilization and SM occupancy reveals that a tile size of 16x16 achieves the best balance between parallelism and resource usage. The implementation of our framework, including prediction models and analysis tools, is available as an open-source project, GPPerf, at https://github.com/pavlyhalim/GPPerf.
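
To make the tiling idea in the abstract concrete, the following is a minimal CPU-side sketch in plain Python. It is not the paper's CUDA kernel or its prediction model. The function names `tiled_matmul` and `tile_resources` are hypothetical, and the resource estimates assume a textbook square-tile kernel: one thread per output element, two staged tiles per block, and 4-byte elements.

```python
def tiled_matmul(a, b, tile):
    """Multiply a (n x k) by b (k x m), visiting the operands tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C
        for j0 in range(0, m, tile):      # tile column of C
            for k0 in range(0, k, tile):  # walk along the shared dimension
                # On a GPU, this is where a thread block would stage one
                # tile of A and one tile of B in shared memory.
                for i in range(i0, min(i0 + tile, n)):
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c


def tile_resources(tile, bytes_per_element=4):
    """Per-block cost of the assumed textbook kernel for a square tile."""
    return {
        # Assumption: one thread computes one element of the output tile.
        "threads_per_block": tile * tile,
        # Assumption: one A tile and one B tile live in shared memory.
        "shared_mem_bytes": 2 * tile * tile * bytes_per_element,
        # Each staged element is reused `tile` times, so global loads
        # drop by this factor relative to an untiled kernel.
        "global_load_reduction": tile,
    }


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(a, b, tile=2))
    for t in (8, 16, 32):
        print(t, tile_resources(t))
```

Under these assumptions, moving from a 16x16 to a 32x32 tile quadruples both the threads and the shared memory needed per block while only doubling data reuse. That is one way to read the parallelism versus resource-usage trade-off the abstract describes.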

Performance Analysis and Optimizations of Matrix Multiplications on ARMv8 Processors

Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs

Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

Predicting the Output Structure of Sparse Matrix Multiplication with Sampled Compression Ratio

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Generating Families of Practical Fast Matrix Multiplication Algorithms

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

Automatic generation of ARM NEON micro-kernels for matrix multiplication

Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs

Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Compiler-Level Matrix Multiplication Optimization for Deep Learning

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension

Towards Efficient Tile Low-Rank GEMM Computation on Sunway Many-Core Processors

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs

Optimizing sparse general matrix–matrix multiplication for DCUs

A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication