T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
2024-01-30
Abstract: Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy.
Hardware Architecture; Distributed, Parallel, and Cluster Computing; Machine Learning
What problem does this paper attempt to address?