Distributed training with PyTorch and Azure ML
Suppose you have a very large PyTorch model, and you’ve already tried the common tricks to speed up training: you’ve optimized your code, moved training to the cloud on a fast GPU VM, and installed software packages that improve training performance (for example, by using the ACPT curated environment on Azure ML). And yet, you still wish your model could train faster. Maybe it’s time to give distributed training a try! Continue reading to learn the simplest way to do distributed training with PyTorch and Azure ML.
Overview of distributed training
Distributed training divides the work of training a model across multiple processes that run in parallel on multiple GPUs, and possibly on multiple machines, to shorten training time. This technique can be used for any model, but it’s particularly useful when training very large deep learning models.
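To make "dividing the work across processes" a bit more concrete, here is a minimal sketch in plain Python of one common way to split the work: giving each process its own slice of the training samples. The function name `shard_indices`, the dataset size, and the process count are all hypothetical, chosen only for illustration; this is not how any particular library implements the split.

```python
def shard_indices(num_samples, world_size, rank):
    """Return the sample indices that one process (identified by rank) trains on."""
    # Strided split: rank 0 takes 0, world_size, 2*world_size, ...;
    # rank 1 takes 1, world_size + 1, ...; and so on.
    return list(range(rank, num_samples, world_size))


num_samples = 10  # hypothetical dataset size
world_size = 4    # hypothetical number of processes, e.g. one per GPU

# Compute the shard for every process and show who gets what.
shards = [shard_indices(num_samples, world_size, rank) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Every sample lands in exactly one shard, so the processes cover the whole dataset between them without overlap. Note that the shards here are uneven (two processes get three samples, two get two); in practice you would typically rely on a utility such as PyTorch's `DistributedSampler` to handle the split rather than writing it by hand.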
There are two main types of distributed training: data parallelism and model parallelism. In data parallelism, the full model is copied to each process, and each copy is trained on a different portion of the input data. In model parallelism, the model is segmented into separate parts that run concurrently in each process, and each model piece receives the full data as input. In this article I’ll cover data parallelism, since it’s easier to implement, more commonly…