Lina: Accelerating distributed training of sparsely activated models

J Li, Y Jiang, Y Zhu, C Wang, H Xu - arXiv preprint arXiv:2210.17223, 2022 - arxiv.org
Scaling model parameters usually improves model quality, but at the price of high computation overhead. Sparsely activated models, usually in the form of a Mixture-of-Experts (MoE) architecture, have a constant computation cost relative to their dense counterparts, providing an opportunity to train and serve a large model at reasonable cost. However, distributed training of an MoE model is prone to low efficiency, mainly due to the interleaved all-to-all communication during model computation. This paper makes three main contributions. First, we systematically analyze the all-to-all overhead in distributed training of MoE. Second, we propose a new communication scheduling scheme based on tensor partitioning that prioritizes all-to-all operations over other communication, owing to their blocking nature. Third, we introduce expert packing, which reduces the all-to-all transfer size, together with optimizations that mitigate its overheads. Both techniques effectively tackle the all-to-all bottleneck, and we integrate them into a new system called Lina. Experiments on an A100 GPU testbed show that Lina improves the training step time of popular NLP models by up to 1.73x over the state of the art.
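
To make the tensor-partitioning idea concrete, below is a minimal, hedged sketch (not taken from the paper and not Lina's actual implementation) of how an MoE layer's dispatched token tensor could be split into smaller pieces so that each all-to-all becomes a short, asynchronous operation that can be issued early and overlapped with other communication or compute. It assumes PyTorch with an already-initialized NCCL process group; the function name, the partition count, and the choice to split along the hidden dimension are illustrative assumptions.

# Illustrative sketch only: partition the [tokens, hidden] dispatch tensor of an
# MoE layer along the hidden dimension and launch one asynchronous all-to-all
# per partition, so each transfer is small, non-blocking, and can be scheduled
# ahead of lower-priority traffic (e.g., gradient all-reduce).
import torch
import torch.distributed as dist


def partitioned_all_to_all(dispatched: torch.Tensor, num_partitions: int = 4):
    """Return per-partition output buffers and communication handles.

    The caller waits on each handle just before the experts consume that slice,
    which lets the small all-to-all pieces start early and overlap with compute.
    """
    world_size = dist.get_world_size()
    assert dispatched.size(0) % world_size == 0, "tokens must divide evenly across ranks"

    in_chunks = [c.contiguous() for c in torch.chunk(dispatched, num_partitions, dim=1)]
    out_chunks = [torch.empty_like(c) for c in in_chunks]
    handles = [
        # all_to_all_single exchanges equal dim-0 slices with every rank;
        # async_op=True returns immediately so later chunks (and compute)
        # can be issued without waiting on this transfer.
        dist.all_to_all_single(out, inp, async_op=True)
        for out, inp in zip(out_chunks, in_chunks)
    ]
    return out_chunks, handles


# Typical use: wait on each handle right before the experts need the data,
# then reassemble the full tensor.
#   out_chunks, handles = partitioned_all_to_all(dispatched)
#   for h in handles:
#       h.wait()
#   received = torch.cat(out_chunks, dim=1)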