Automatic Mapping of Heterogeneous DNN Models on Adaptive Multi-Accelerator Systems

Jieru Zhao, Guan Shen, Wenchao Ding, Quan Chen, Minyi Guo
DOI: https://doi.org/10.1109/tcad.2024.3410841
IF: 2.9
2024-01-01
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: As DNNs are developing rapidly, the computational and memory burden imposed on hardware systems grows exponentially. This becomes even more severe for large language models (LLMs) and multi-modal models. As a promising solution that achieves high scalability and low manufacturing cost, multi-accelerator systems widely exist in data centers, cloud platforms, and mobile SoCs. Thus, a challenging problem arises: selecting a proper combination of accelerators from available designs and searching for efficient DNN mapping strategies, to fully exploit computation resources and communication bandwidth in the system. To this end, we propose MARS, a novel mapping framework that performs computation-aware accelerator selection and applies communication-aware sharding strategies to maximize parallelism. We also provide optimizations to overlap the computation and communication latency. Considering the high complexity of the design space, we propose two effective mapping algorithms to explore it. Experiments show that MARS achieves 34.3% latency reduction for DNN workloads compared to the baseline and 63.0% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method.
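To make the idea of overlapping computation and communication latency concrete, the following is a minimal toy model, not the MARS algorithm itself. It assumes a layer's output is split into tiles, that each tile is computed and then sent over a single link, and that sending one tile can proceed while the next tile is being computed. The function names, the tile representation, and the timing values are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Each tile is (compute_time, transfer_time) in arbitrary time units.
# This representation is an assumption made for this sketch.
Tile = Tuple[float, float]


def sequential_latency(tiles: List[Tile]) -> float:
    """No overlap: every tile is computed, then transferred, one after another."""
    return sum(comp + comm for comp, comm in tiles)


def overlapped_latency(tiles: List[Tile]) -> float:
    """Overlap: the transfer of tile i runs while tile i+1 is being computed."""
    compute_done = 0.0  # time at which the compute unit finishes the current tile
    link_done = 0.0     # time at which the link finishes the current transfer
    for comp, comm in tiles:
        compute_done += comp
        # A transfer starts once its tile is ready and the link is idle.
        link_done = max(link_done, compute_done) + comm
    return link_done


def latency_reduction(baseline: float, optimized: float) -> float:
    """Relative latency reduction, as a fraction of the baseline."""
    return (baseline - optimized) / baseline


if __name__ == "__main__":
    tiles = [(2.0, 1.0)] * 4  # four identical tiles (hypothetical numbers)
    seq = sequential_latency(tiles)
    ovl = overlapped_latency(tiles)
    print(f"sequential: {seq}, overlapped: {ovl}, "
          f"reduction: {latency_reduction(seq, ovl):.1%}")
```

In this toy setting, all transfers except the last are hidden behind computation, so the benefit depends on how compute and transfer times compare for each tile. The latency reductions reported in the abstract come from the authors' experiments and are not reproduced by this sketch.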