Cambricon-D: Full-Network Differential Acceleration for Diffusion Models
Weihao Kong, Yifan Hao, Qi Guo, Yongwei Zhao, Xinkai Song, Xiaqing Li, Mo Zou, Zidong Du, Rui Zhang, Chang Liu, Yuanbo Wen, Pengwei Jin, Xing Hu, Wei Li, Zhiwei Xu, Tianshi Chen
DOI: https://doi.org/10.1109/isca59077.2024.00070
2024-01-01
Abstract: Diffusion models have made significant progress in image generation and have become a prominent area of research. They iterate repeatedly over input data that changes only slightly between timesteps, yet each timestep recomputes the entire model, resulting in considerable computational redundancy and substantial hardware cost. Performing differential computing on the input data appears to be a feasible way to address this redundancy and improve hardware efficiency. However, non-linear operations (particularly activation functions) require deltas (i.e., differential values) to be merged with raw inputs repeatedly to ensure computational correctness. Loading those raw inputs incurs significant memory access, which interrupts the forwarding of deltas through the network at each such operation and undermines performance. To solve this problem, we propose Cambricon-D, a full-network differential computing architecture with concise memory access. While maintaining the computational efficiency of differential computing, Cambricon-D employs a sign-mask dataflow that requires loading only 1-bit signs (instead of large-bitwidth raw inputs), thereby enabling seamless forwarding of deltas and effectively mitigating memory access overhead. Experimental results show that, compared to Diffy, Cambricon-D's dataflow reduces off-chip memory access by 66% to 82%. Overall, Cambricon-D achieves a 1.46x to 2.38x speedup over the A100 on various diffusion models with different resolutions.
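
The idea of forwarding deltas through a linear layer and a ReLU can be illustrated with a minimal Python sketch. This is not the Cambricon-D design: the weights, inputs, and helper names (matvec, sign_bits, masked_delta) are hypothetical, and the sketch assumes that no pre-activation changes sign between the two timesteps, which is the only case where a 1-bit mask reproduces the ReLU output exactly. How the architecture treats sign changes is not described in the abstract.

```python
# Toy illustration of delta forwarding; all names and values are assumptions.

def matvec(w, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, a) for a in v]

def sign_bits(v):
    """One bit per element: True where the value is positive."""
    return [a > 0 for a in v]

def masked_delta(delta, signs):
    """Pass a delta through ReLU using only the stored sign bits."""
    return [d if s else 0.0 for d, s in zip(delta, signs)]

W = [[1.0, -2.0], [0.5, 1.0]]
x_prev = [1.0, 1.0]      # input at the previous timestep
x_next = [1.125, 1.0]    # slightly changed input at the next timestep

# Previous timestep, computed in full once.
pre_prev = matvec(W, x_prev)
out_prev = relu(pre_prev)
signs = sign_bits(pre_prev)   # all that is kept from pre_prev for the ReLU

# Next timestep, computed on deltas only.
delta_x = [a - b for a, b in zip(x_next, x_prev)]
delta_pre = matvec(W, delta_x)              # linearity: W(x + d) = Wx + Wd
delta_out = masked_delta(delta_pre, signs)  # needs signs, not raw pre_prev
out_next = [o + d for o, d in zip(out_prev, delta_out)]

# Reference: recompute the next timestep from scratch.
out_ref = relu(matvec(W, x_next))
print(out_next, out_ref)
```

In this toy case the delta path and the full recomputation agree, and the ReLU step reads one boolean per element rather than the full-precision pre-activation, which is the memory-access saving the abstract refers to.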