Lina: Accelerating distributed training of sparsely activated models
Scaling model parameters usually improves model quality, but at the price of high computation overhead. Sparsely activated models, usually in the form of a Mixture-of-Experts (MoE) architecture, keep computation cost roughly constant relative to their dense counterparts, thus providing opportunities to train and serve a large model at a reasonable cost. However, distributed training of an MoE model is prone to low efficiency, mainly due to the all-to-all communication interleaved with model computation. This paper makes three main contributions. First, we systematically analyze the all-to-all overhead in distributed training of MoE. Second, we propose a new communication scheduling scheme based on tensor partitioning that prioritizes all-to-all operations over other communication, owing to their blocking nature. Third, we introduce expert packing, which reduces the all-to-all transfer size, together with optimizations that mitigate its overheads. Both techniques effectively tackle the all-to-all bottleneck, and we integrate them into a new system called Lina. Experiments on an A100 GPU testbed show that Lina improves the training step time of popular NLP models by up to 1.73x over the state-of-the-art.
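To make the bottleneck concrete, the sketch below (not taken from the paper) shows how a standard distributed MoE layer interleaves two blocking all-to-all collectives, one for token dispatch and one for result combine, with the expert computation. It is a minimal PyTorch illustration assuming one expert per GPU, equal token counts per rank, and an already initialized process group; `gate` and `expert` are hypothetical placeholder modules.

```python
# Minimal sketch of an MoE layer's dispatch/combine all-to-all pattern.
# Assumptions: one expert per rank, token count divisible by world size,
# torch.distributed already initialized; `gate` and `expert` are placeholders.
import torch
import torch.distributed as dist

def moe_layer(tokens: torch.Tensor, gate: torch.nn.Module,
              expert: torch.nn.Module) -> torch.Tensor:
    # 1) Gating decides which expert (i.e., which rank) each token goes to.
    scores = gate(tokens)                 # [num_tokens, world_size]
    dest = scores.argmax(dim=-1)          # destination expert per token
    order = dest.argsort()                # group tokens by destination rank
    send_buf = tokens[order].contiguous()

    # 2) First all-to-all: exchange tokens so each rank holds the tokens
    #    routed to its local expert. The expert computation cannot start
    #    until this blocking collective finishes.
    recv_buf = torch.empty_like(send_buf)
    dist.all_to_all_single(recv_buf, send_buf)

    # 3) Local expert FFN on the received tokens.
    out = expert(recv_buf)

    # 4) Second all-to-all: return expert outputs to the tokens' home ranks,
    #    again blocking the rest of the layer until it completes.
    combined = torch.empty_like(out)
    dist.all_to_all_single(combined, out)

    # 5) Undo the destination-sorted permutation (gate-score weighting omitted).
    return combined[order.argsort()]
```

Because each training step runs this pattern in every MoE layer, the two collectives sit on the critical path; this is the overhead that Lina's all-to-all prioritization and expert packing target.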