PyTorch 1.0 shines for rapid prototyping with dynamic neural networks, auto-differentiation, deep Python integration, and strong support for GPUs

Deep learning is an important part of the business of Google, Amazon, Microsoft, and Facebook, as well as countless smaller companies. It has been responsible for many of the recent advances in areas such as automatic language translation, image classification, and conversational interfaces.

We haven’t gotten to the point where there is a single dominant deep learning framework. TensorFlow (Google) is very good, but has been hard to learn and use. Also, TensorFlow’s dataflow graphs have been difficult to debug, which is why the TensorFlow project has been working on eager execution and the TensorFlow debugger. TensorFlow used to lack a decent high-level API for creating models; now it has three of them, including a bespoke version of Keras.

CNTK (Microsoft) and Apache MXNet (Amazon) have been the principal competitors to TensorFlow, but there are other framework lineages to consider. Caffe (Berkeley Artificial Intelligence Research Lab), originally for image classification, was expanded and updated to Caffe2 (Facebook and others) and given strong production capabilities.

Torch (Facebook, Twitter, Google, and others) uses Lua scripting and a CUDA (Compute Unified Device Architecture) C/C++ back end to efficiently solve problems in machine learning, computer vision, signal processing, and other fields. Despite its strengths as a scripting language, Lua became a liability to Torch when the bulk of the deep learning community adopted Python.

CUDA is Nvidia’s API for its general purpose GPUs. GPUs are much faster than CPUs for training and making predictions from deep neural networks; so are Google’s TPUs (tensor processing units) and FPGAs (field programmable gate arrays), which are available for use on AWS, Microsoft Azure, and elsewhere.
In some cases, the use of advanced chips (GPUs, TPUs, or FPGAs) can speed up computations over CPUs by 50x per chip used, reducing training times from weeks to hours or from hours to minutes.

PyTorch (Facebook, Twitter, Salesforce, and others) builds on Torch and Caffe2, using Python as its scripting language and an evolved Torch CUDA back end. The production features of Caffe2 (highly scalable execution engine, accelerated hardware support, support for mobile devices, etc.) are being incorporated into the PyTorch project.

Tensors and neural networks in Python

PyTorch is billed by its developers as “Tensors and dynamic neural networks in Python with strong GPU acceleration.” What does that mean?

Tensors are a mathematical construct that is used heavily in physics and engineering. A tensor of rank two is a special kind of matrix; taking the inner product of a vector with the tensor yields another vector with a new magnitude and a new direction. TensorFlow takes its name from the way tensors (of synaptic weight, or the strength of connection between nodes) flow around its network model. NumPy also uses tensors, but calls them n-dimensional arrays (ndarray).

We’ve already discussed GPU acceleration. A dynamic neural network is one that can change from iteration to iteration. For example, a dynamic neural network model in PyTorch may add and remove hidden layers during training to improve its accuracy and generality. PyTorch recreates the graph on the fly at each iteration step. In contrast, TensorFlow by default creates a single dataflow graph, optimizes the graph code for performance, and then trains the model.

While eager execution mode is a fairly new option in TensorFlow, it’s the only way PyTorch runs: API calls execute when invoked, rather than being added to a graph to be run later. That might seem like it would be less computationally efficient, but PyTorch was designed to work that way, and it is no slouch when it comes to training or prediction speed.
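To make the idea of a graph that changes from iteration to iteration concrete, here is a minimal sketch, assuming PyTorch 0.4 or later. The class name DynamicDepthNet, its depth argument, and the helper hidden_layer_got_gradient are hypothetical names invented for this illustration; they are not part of PyTorch. An ordinary Python loop decides how many times the hidden layer is applied on each step, and backward() differentiates whichever graph that step happened to build.

```python
import torch
import torch.nn as nn


class DynamicDepthNet(nn.Module):
    """Hypothetical model whose depth is chosen on every forward pass."""

    def __init__(self, width=8):
        super(DynamicDepthNet, self).__init__()
        self.hidden = nn.Linear(width, width)  # one layer, reused `depth` times
        self.out = nn.Linear(width, 1)

    def forward(self, x, depth):
        # Plain Python control flow determines the shape of this step's graph.
        for _ in range(depth):
            x = torch.tanh(self.hidden(x))
        return self.out(x)


def hidden_layer_got_gradient(model):
    """True if the last backward pass reached the hidden layer's weights."""
    g = model.hidden.weight.grad
    return g is not None and g.abs().sum().item() > 0


torch.manual_seed(0)
net = DynamicDepthNet()
x = torch.randn(4, 8)

used = []
for depth in [0, 3, 1]:
    net.zero_grad()                     # clear gradients from the previous step
    loss = net(x, depth).pow(2).mean()  # forward pass builds a fresh graph
    loss.backward()                     # differentiate the graph just built
    used.append(hidden_layer_got_gradient(net))

print(used)
```

The hidden layer receives a gradient only on the steps where the loop actually used it, which is the behavior you would expect if the graph is rebuilt on every call rather than compiled once up front.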
PyTorch architecture

At a high level, the PyTorch library contains the following components:

PyTorch integrates acceleration libraries such as Intel MKL (Math Kernel Library) and the Nvidia cuDNN (CUDA Deep Neural Network) and NCCL (Nvidia Collective Communications) libraries to maximize speed. Its core CPU and GPU tensor and neural network back ends, namely TH (Torch), THC (Torch CUDA), THNN (Torch Neural Network), and THCUNN (Torch CUDA Neural Network), are written as independent libraries with a C99 API. At the same time, PyTorch is not a Python binding into a monolithic C++ framework, but is designed to be deeply integrated with Python and to allow the use of other Python libraries.

The memory usage in PyTorch is efficient compared to Torch and some of the alternatives. One of the optimizations is a set of custom memory allocators for the GPU, since available GPU memory can often limit the size of deep learning models that can be solved at GPU speeds.

PyTorch GPU support

CUDA GPU support in PyTorch goes down to the most fundamental level. In the example below, you see the code detecting a CUDA device, creating a tensor on the GPU, copying a tensor from CPU to GPU, adding the two tensors on the GPU, printing the result, and finally copying the result from GPU to CPU with a different data type and printing that result.

    import torch

    x = torch.randn(1)  # assumption: stands in for a tensor created earlier in the tutorial

    # Let us run this cell only if CUDA is available
    # We will use ``torch.device`` objects to move tensors in and out of GPU
    if torch.cuda.is_available():
        device = torch.device("cuda")          # a CUDA device object
        y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
        x = x.to(device)                       # or just use strings ``.to("cuda")``
        z = x + y
        print(z)
        print(z.to("cpu", torch.double))       # ``.to`` can also change dtype together!

    Out:

    tensor([ 1.9422], device='cuda:0')
    tensor([ 1.9422], dtype=torch.float64)

In a higher-level scenario, you’d run an entire neural network training on the GPU.
To begin with, you’d detect the first CUDA device, as above, and convert all your modules from CPU tensors to CUDA tensors:

    net.to(device)  # this is a deep conversion of the whole neural network

You’ll have to send the inputs and targets at every step to the GPU, as well:

    inputs, labels = inputs.to(device), labels.to(device)

Basically, you can move the entire computation to the GPU with just a few lines of code.

What about using multiple GPUs? DataParallel, a wrapper class in the torch.nn package, splits your data automatically and sends job orders to multiple copies of the model on several GPUs. After each copy finishes its job, DataParallel collects and merges the results before returning them to you.

    model = Model(input_size, output_size)  # this is a neural network class
    if torch.cuda.device_count() > 1:       # can we go parallel?
        print("Let's use", torch.cuda.device_count(), "GPUs!")  # yes, we can
        # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
        model = nn.DataParallel(model)      # use all available GPUs
    model.to(device)                        # copy the model to the GPUs

PyTorch automatic gradient computation (autograd)

PyTorch can record the history of operations on a tensor as they happen, in effect taking a snapshot whenever the tensor changes, so that it can automatically compute the gradients later. You enable this by setting the requires_grad flag when you create the tensor:

    x = torch.ones(2, 2, requires_grad=True)

Subsequent operations with a tensor that requires gradients may create new tensors, and those will inherit the requires_grad flag. You can change the requires_grad flag in place on a tensor at any time with the requires_grad_(…) method. In PyTorch, a trailing underscore on a method name such as requires_grad_ means that it updates the tensor in place; methods without the trailing underscore generate a new tensor.

How do those snapshots help compute gradients?
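Before turning to that question, here is a minimal sketch of the flag behavior just described, assuming PyTorch 0.4 or later. The variable names and the toy expression 3 * (x + 2) are illustrative choices, not taken from any PyTorch tutorial.

```python
import torch

# A tensor created with gradient tracking switched on.
x = torch.ones(2, 2, requires_grad=True)

# Tensors computed from x inherit the flag.
y = (x + 2) * 3
print(y.requires_grad)

# A tensor created without tracking; the trailing underscore
# in requires_grad_ flips the flag in place.
w = torch.zeros(3)
w.requires_grad_(True)
print(w.requires_grad)

# Reduce to a scalar and run backpropagation.
loss = y.mean()   # mean of 3 * (x + 2) over four elements
loss.backward()
print(x.grad)     # every entry is d(loss)/dx = 3 / 4
```

Because the loss is the mean over four elements of 3 * (x + 2), each entry of x.grad comes out to 0.75, which you can check by hand.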
Basically, each snapshot keeps a reference to the operation that produced the tensor. Autograd walks that record backward from the result, applying the chain rule at every step to get the derivative of the result with respect to each tensor that requires gradients. This is reverse-mode automatic differentiation, not a finite-difference approximation: the gradients are exact up to floating-point rounding, and a single backward pass yields all of them at once, which is far cheaper than perturbing each parameter separately and evaluating deltas around each state to estimate the derivatives.

In PyTorch, you compute the gradient using backpropagation (backprop) by calling the tensor’s backward() method, after clearing out any existing gradients from the neural network’s buffers. Then you can use the gradients to update the weight tensors.

In short, PyTorch programs create a graph on the fly. Then backpropagation uses the dynamically created graph, automatically calculating the gradients from the saved tensor states.

PyTorch optimizers

Most of the weight update rules (optimizers) used to find the minimum error take the negative gradient of the loss function as the direction in which to change the values for the next step, multiplied by a small learning rate to reduce the magnitude of the step. The basic algorithm is called steepest descent. For machine learning, the usual variant is stochastic gradient descent, or SGD, which estimates the gradient on small batches of data points and often goes through the data multiple times (epochs). More sophisticated versions of stochastic gradient descent, for example Adam and RMSprop, may compensate for biases, fold in momentum and velocity with the gradient, average gradients, or use adaptive learning rates. PyTorch currently supports 10 optimization methods.

PyTorch neural networks

The torch.nn package defines modules and other containers, module parameters, 11 kinds of layers, 17 loss functions, 20 activation functions, and two kinds of distance functions. Each kind of layer has many variants, for example six convolution layers and 18 pooling layers. The torch.nn.functional module defines 11 categories of functions.
Somewhat confusingly, both torch.nn and torch.nn.functional contain loss and activation members. In many cases, however, the torch.nn member is little more than a wrapper for the corresponding torch.nn.functional member.

You can define your own custom models as subclasses of nn.Module. For example:

    import torch.nn as nn
    import torch.nn.functional as F

    class Model(nn.Module):
        def __init__(self):
            super(Model, self).__init__()
            self.conv1 = nn.Conv2d(1, 20, 5)
            self.conv2 = nn.Conv2d(20, 20, 5)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            return F.relu(self.conv2(x))

This very simple model has two 2D convolution layers, and uses a rectified linear unit (ReLU) activation function for both layers. The three parameters to nn.Conv2d are the number of input channels, the number of output channels, and the size of the convolving kernel.

You can also use one of the container modules to group your layers into a model. The nn.Sequential container adds modules in the order they are listed in the constructor:

    from collections import OrderedDict  # needed for the second example

    # Example of using Sequential
    model = nn.Sequential(
        nn.Conv2d(1, 20, 5),
        nn.ReLU(),
        nn.Conv2d(20, 64, 5),
        nn.ReLU()
    )

    # Example of using Sequential with OrderedDict
    model = nn.Sequential(OrderedDict([
        ('conv1', nn.Conv2d(1, 20, 5)),
        ('relu1', nn.ReLU()),
        ('conv2', nn.Conv2d(20, 64, 5)),
        ('relu2', nn.ReLU())
    ]))

The nn.ModuleList container is good for the case where you want to generate enumerable lists of layers from code.
For example, look at this use of 10 linear layers:

    class MyModule(nn.Module):
        def __init__(self):
            super(MyModule, self).__init__()
            self.linears = nn.ModuleList([nn.Linear(10, 10) for i in range(10)])

        def forward(self, x):
            # ModuleList can act as an iterable, or be indexed using ints
            for i, l in enumerate(self.linears):
                x = self.linears[i // 2](x) + l(x)
            return x

PyTorch examples

The pytorch/examples repo contains worked-out models for MNIST digit classification using convolutional neural networks; word-level language modeling using LSTM RNNs; ImageNet image classification using residual networks; LSUN scene understanding using deep convolutional generative adversarial networks (DCGAN); variational auto-encoder networks; image super-resolution using an efficient sub-pixel convolutional neural network; artistic style transfer using perceptual loss functions; training a CartPole to balance in OpenAI Gym with actor-critic models; SNLI natural language inference with global vectors for word representation (GloVe), LSTMs, and torchtext; and time sequence prediction (sine wave signal values) using LSTMs.

The “Learning PyTorch with Examples” tutorial walks you through different ways of implementing machine learning with Python frameworks, before coming to the example below, which uses torch.nn and torch.optim to implement learning in a fully connected network with a single hidden layer. In this case the loss function is mean squared error (MSELoss), and the optimizer chosen is Adam.

    # -*- coding: utf-8 -*-
    import torch

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = 64, 1000, 100, 10

    # Create random Tensors to hold inputs and outputs
    x = torch.randn(N, D_in)
    y = torch.randn(N, D_out)

    # Use the nn package to define our model and loss function.
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )
    # Sum the squared errors instead of averaging them
    # (written as size_average=False in PyTorch 0.4.0 and earlier).
    loss_fn = torch.nn.MSELoss(reduction='sum')

    # Use the optim package to define an Optimizer that will update the weights of
    # the model for us. Here we will use Adam; the optim package contains many other
    # optimization algorithms. The first argument to the Adam constructor tells the
    # optimizer which Tensors it should update.
    learning_rate = 1e-4
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for t in range(500):
        # Forward pass: compute predicted y by passing x to the model.
        y_pred = model(x)

        # Compute and print loss.
        loss = loss_fn(y_pred, y)
        print(t, loss.item())

        # Before the backward pass, use the optimizer object to zero all of the
        # gradients for the variables it will update (which are the learnable
        # weights of the model). This is because by default, gradients are
        # accumulated in buffers (i.e., not overwritten) whenever .backward()
        # is called. Check out the docs of torch.autograd.backward for more details.
        optimizer.zero_grad()

        # Backward pass: compute gradient of the loss with respect to model
        # parameters
        loss.backward()

        # Calling the step function on an Optimizer makes an update to its
        # parameters
        optimizer.step()

PyTorch installation

You can install PyTorch on Linux, macOS, or Windows using conda or pip, or by building from source code, on Python 2.7, 3.5, or 3.6. PyTorch supports CUDA 8, 9, 9.1, or CPU-only. On a Mac, as shown in the figure below, CUDA support requires building from source. That’s worse than it sounds, because the latest versions of PyTorch, CUDA, and Xcode are incompatible.

The PyTorch home page provides a GUI for generating the correct command lines for installing PyTorch with different operating systems, package managers, Python versions, and CUDA versions.
I successfully installed the CPU-only, Python 2.7 version of PyTorch 0.4.0 on a MacBook Pro in about eight seconds using pip:

    Martins-Retina-MacBook:~ martinheller$ time sudo pip install torch torchvision
    …
    Installing collected packages: torch, pillow, torchvision
      Found existing installation: Pillow 3.3.0
        Uninstalling Pillow-3.3.0:
          Successfully uninstalled Pillow-3.3.0
    Successfully installed pillow-5.1.0 torch-0.4.0 torchvision-0.2.1

    real 0m8.133s
    user 0m3.452s
    sys 0m1.490s

Unfortunately, the protobuf package version installed for PyTorch is incompatible with the version of TensorFlow I had installed; I’ll need to update TensorFlow before I can run it again. By the way, this kind of conflict is the motivation for Anaconda’s virtual Python environments.

In any case, the PyTorch installation checked out:

    Martins-Retina-MacBook:~ martinheller$ python
    Python 2.7.10 (default, Oct 6 2017, 22:29:07)
    [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>> x = torch.rand(5,3)
    >>> print(x)
    tensor([[ 0.3790, 0.9947, 0.9051],
            [ 0.1091, 0.0228, 0.8803],
            [ 0.8954, 0.6769, 0.9435],
            [ 0.5566, 0.8588, 0.1373],
            [ 0.9341, 0.7344, 0.3667]])
    >>> print(x.size())
    (5, 3)
    >>> print(torch.cuda.is_available())
    False
    >>>

To do serious model training, you’ll want to run PyTorch on a server or workstation-class machine or VM with one or more recent Nvidia GPUs. AWS, Google Cloud Platform, and Azure all support PyTorch 1.0 in their machine learning services and deep learning virtual machine images; IBM Cloud also supports PyTorch in Kubernetes clusters. Even if the image you want to use doesn’t already have the latest PyTorch, installation with pip or conda is easy and quick, as I saw on my laptop.

Adapting PyTorch models for production

For PyTorch 1.0, the project contributors will complete the work of marrying PyTorch and Caffe2, and will add a few additional features.
The production goals include:

- Exporting to C++ runtimes for use in larger projects
- Optimizing for mobile systems on iPhone, Android, Qualcomm, and other platforms
- Using more efficient data layouts and performing kernel fusion to do faster inference (saving 10 percent of speed or memory at scale is a big win)
- Quantized inference (such as 8-bit inference) to allow models to run faster and use less power on constrained hardware

Facebook has already supported all of these with Caffe2.

One of the ways PyTorch is getting this level of production support without any sacrifice in hackability is through torch.jit, a just-in-time (JIT) compiler that at runtime takes your PyTorch models and rewrites them to run at production efficiency. The JIT compiler can also export your model to run in a C++ runtime based on Caffe2 bits.

Although PyTorch is still in beta, the API seems to be stable, and the package is roughly as capable as TensorFlow, CNTK, and MXNet, and performs about as well. Because PyTorch APIs all execute immediately, PyTorch models are a bit easier to debug than models that create an acyclic graph to be solved in a session, the way TensorFlow works by default.

While I wouldn’t rush to convert existing deep learning projects to PyTorch just yet, I’d certainly use it for training new models. If the current progress is anything to go by, PyTorch should be as good as any deep learning framework by the time of the PyTorch 1.0 release later this summer.
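As a postscript to the torch.jit discussion above, here is a minimal sketch of what tracing a model looks like, assuming the torch.jit.trace API as it shipped in PyTorch 1.0 and later. The tiny Sequential model and the input shape are arbitrary choices made for this illustration; tracing records the operations executed for one example input, so it suits models without data-dependent control flow.

```python
import torch
import torch.nn as nn

# A small stand-in model; any nn.Module with a fixed forward path would do.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

# Tracing runs the model once on an example input and records the operations.
example = torch.randn(1, 4)
traced = torch.jit.trace(model, example)

# The traced module should reproduce the eager model's output.
with torch.no_grad():
    same = torch.allclose(model(example), traced(example))
print(same)
```

The traced module can then be serialized with its save method and reloaded with torch.jit.load, which is the usual route toward running it outside the original Python program.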