Deep Learning With PyTorch Guide For Beginners and Intermediate
By Jerry N. P.
How to contact us
If you find any editing issues, damage, or other problems in this
book, please let me know immediately by email at:
[email protected]
PyTorch was publicly released in January 2017, and many people have
adopted it for building neural networks because of its ease of use.
PyTorch relies on an eager/imperative paradigm. Each line of code
that builds the graph defines a component of that graph, and
computations can be performed on these components themselves, even
before the whole graph has been built. This methodology is referred to
as define-by-run.
Chapter 2
Getting Started with PyTorch
PyTorch can be installed on a number of operating systems,
including Windows, macOS and various Linux distributions.
On Windows, the installation of PyTorch is easy. To use PyTorch's
CUDA support, your Windows system must have an NVIDIA GPU. PyTorch
can be installed on Windows 7 or later, although Windows 10 or later
is recommended. You can also install it on Windows Server 2008 R2 or
later.
Also, note that on Windows, PyTorch only supports Python 3.x, not
Python 2.x.
In my case, I am using Python 3.5 and I need to install PyTorch via
pip, so I run the following command from the terminal of the
operating system:
pip3 install http://download.pytorch.org/whl/cpu/torch-0.4.1-cp35-cp35m-win_amd64.whl
To verify the installation, run the following in a Python session:
import torch
x = torch.rand(5, 3)
print(x)
It should print a 5 x 3 tensor of random values between 0 and 1.
If the code runs without errors, PyTorch is installed and working
correctly.
Computational Graphs
Deep learning is most often implemented programmatically via
computational graphs. A computational graph is simply a set of
calculations, known as nodes, connected in a directed order of
computation. This means that some nodes on the graph rely on other
nodes for their input, and in turn pass their outputs on to serve as
inputs to other nodes.
In such graphs, each node can be treated as an independently working
piece of code. This makes performance optimizations such as
multithreading and multiprocessing/parallelism possible. Deep learning
frameworks such as TensorFlow and Theano work by constructing such
graphs, through which neural network operations are performed.
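To make the idea concrete, here is a minimal sketch of a computational graph in plain Python. It is not how PyTorch is implemented; the Node class and all of its names are assumptions made purely for illustration. Each node stores an operation and the nodes it depends on, and evaluating the final node pulls values through the graph in dependency order.

```python
# Toy computational graph (illustrative only; not PyTorch internals).
import operator


class Node:
    """A graph node: either a constant value or an operation on other nodes."""

    def __init__(self, op=None, inputs=(), value=None):
        self.op = op          # callable applied to the input values
        self.inputs = inputs  # nodes that this node depends on
        self.value = value    # only set for constant (leaf) nodes

    def evaluate(self):
        # Leaf nodes simply return their stored value.
        if self.op is None:
            return self.value
        # Other nodes first evaluate the nodes they depend on.
        return self.op(*(n.evaluate() for n in self.inputs))


# Build the graph for d = (a + b) * c
a = Node(value=2.0)
b = Node(value=3.0)
c = Node(value=4.0)
s = Node(op=operator.add, inputs=(a, b))
d = Node(op=operator.mul, inputs=(s, c))

result = d.evaluate()
print(result)  # (2 + 3) * 4 = 20.0
```

Note that the node s could be evaluated on its own before d even exists, which is the essence of the define-by-run approach.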
Tensors
Tensors are data structures that look like matrices, and they are
critical components for efficient computation in deep learning. GPUs
(Graphics Processing Units) are very effective at performing
operations on tensors, which is a large part of why they have become
so popular in deep learning.
There are various ways in which we can declare tensors in PyTorch.
Let us discuss them:
import torch
x = torch.Tensor(3, 5)
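The above code will generate a tensor of size (3, 5), that is, 3 rows and
5 columns. Note that torch.Tensor(3, 5) allocates the tensor without
initializing it, so its contents are arbitrary (they often happen to be
zeroes, but this is not guaranteed); use torch.zeros(3, 5) when a
zero-filled tensor is required. We can display the tensor by running
the print statement: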
print(x)
The input layer will have 28 x 28 (= 784) nodes, one for each greyscale
pixel of an MNIST image. Once the data is received at the input layer,
it is propagated through the two hidden layers, each having 200
nodes. The nodes will use the ReLU activation function. The output
layer will have 10 nodes, which represent the 10 classes to which a
digit can belong. A softmax output layer will be used to perform the
classification.
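class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 200)  # input layer -> hidden layer 1
        self.fc2 = nn.Linear(200, 200)      # hidden layer 1 -> hidden layer 2
        self.fc3 = nn.Linear(200, 10)       # hidden layer 2 -> output layer
net = Net()
print(net)
The instance has been given the name net, as shown above. Printing it
displays the structure of the network.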
Training
It is now time for us to train the network. We should begin by setting
up an optimizer and a loss criterion:
# Let's first create a stochastic gradient descent optimizer
learning_rate = 0.01
optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=0.9)
# Then we create a loss function
criterion = nn.NLLLoss()
We first created a stochastic gradient descent optimizer with a
learning rate of 0.01 and a momentum of 0.9. We also need to
supply all the network parameters to the optimizer. The parameters()
method provides an easy way of passing these parameters on; Net
inherits this method from the nn.Module class.
We then set the loss criterion to be a negative log likelihood loss.
When this is combined with a log softmax output from the neural
network, the result is equivalent to a cross entropy loss over the 10
classes.
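The equivalence between "log softmax followed by negative log likelihood" and cross entropy can be checked with a few lines of plain Python. The sketch below does not use PyTorch; the function names and the made-up scores for 3 classes (instead of 10, to keep it short) are assumptions for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, target):
    """Negative log likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross entropy computed directly from the softmax probabilities."""
    exps = [math.exp(z) for z in logits]
    prob_target = exps[target] / sum(exps)
    return -math.log(prob_target)


logits = [1.0, 2.0, 0.5]  # made-up network outputs for 3 classes
target = 1                # index of the correct class

loss_a = nll_loss(log_softmax(logits), target)
loss_b = cross_entropy(logits, target)
print(loss_a, loss_b)     # the two values agree up to rounding
```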
During the training of the network, we will extract data from a data
loader object, which comes included in the utilities module of PyTorch.
The data loader supplies the input and target data in batches, which
are passed to the network and the loss function respectively. The
training code is given below:
# execute the main training loop
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))
The above loop is similar to our previous training loop up to the
test_loss line. In this line, we extract the loss of the network using
the .data[0] property (in PyTorch 0.4 and later, .item() is the
preferred way of reading a scalar loss), and this is done in one line.
In the pred line, we have used data.max(1). The .max() function returns
both the maximum values and the indices of those values along a
particular dimension of a tensor; the indices are what we need here.
The neural network gives us an output of size (batch_size, 10), where
each value along the 10-length second dimension is the log probability
that the network assigns to one output class, that is, the log
probability that the provided image shows a given digit between 0 and
9. This means that for every input row/sample in the batch,
net_out.data holds 10 log probabilities, and the index of the largest
one is the predicted digit.
At this point, we have the prediction of our neural network for every
sample in the batch, so it can be compared with the actual target
class from the test data. This involves counting the number of times
that our neural network got it right. This can be done by calling the
PyTorch .eq() function, which compares the values in two tensors
element by element. Where the values match, it returns a 1; where they
don't match, it returns a 0:
correct += pred.eq(target.data).sum()
Summing the output of the .eq() function gives a count of the number
of times that the neural network produced the correct output. We keep
an accumulating sum of these correct predictions so that we can
determine the overall accuracy of our network on the test data. After
running through the test data in batches, we print out the averaged
loss and the accuracy. This is shown below:
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))
After training the network for a total of 10 epochs, I got an accuracy of
98%, which is not bad.
Chapter 4
Loading and Processing Data
In machine learning, a lot of effort goes into loading and
processing data. PyTorch provides a number of utilities that make
data loading easy and our code more readable.
The following packages are needed for this tutorial:
import os
import torch
import numpy as np
import pandas as pd
from skimage import io, transform
import matplotlib.pyplot as plt
from torchvision import transforms, utils
from torch.utils.data import Dataset, DataLoader
# To suppress/ignore warnings
import warnings
warnings.filterwarnings("ignore")
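# Read the annotations, then pick one sample
landmarks_frame = pd.read_csv('faces/face_landmarks.csv')
n = 65
img_name = landmarks_frame.iloc[n, 0]
# .iloc and .values replace the deprecated .ix and .as_matrix()
landmarks = landmarks_frame.iloc[n, 1:].values.astype('float')
landmarks = landmarks.reshape(-1, 2)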
Let us create a helper function that shows an image together with
its landmarks, and then use it to show a sample:
def show_landmarks(image, landmarks):
    plt.imshow(image)
    plt.scatter(landmarks[:, 0], landmarks[:, 1], s=10,
                marker='.', c='r')
    plt.pause(0.001)  # pause a bit for the plots to be updated

plt.figure()
show_landmarks(io.imread(os.path.join('faces/', img_name)),
               landmarks)
plt.show()
Dataset Class
The torch.utils.data.Dataset is an abstract class that represents a
dataset. Your custom dataset has to inherit from Dataset and override
the methods given below:
__len__: so that len(dataset) returns the size of the dataset.
__getitem__: to support indexing, so that dataset[i] can be
used for getting the ith item.
We now need to create a dataset class for the face landmarks dataset.
The CSV will be read in __init__, but the reading of the images will be
left to __getitem__. This is efficient in terms of memory usage, since
the images are not all stored in memory at once but are read only when
they are required.
A dict {'image': image, 'landmarks': landmarks} will be a sample of
our dataset. The dataset will take an optional argument named
transform so that any processing that is required may be applied on
the sample. You will see how useful the transform argument is later.
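class FaceLandmarksDataset(Dataset):
    """Using the Face Landmarks dataset."""

    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to our csv file with annotations.
            root_dir (string): Directory having all the images.
            transform (callable, optional): Optional transform that is
                to be applied on the sample.
        """
        self.landmarks_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        # The image is read only when the sample is requested
        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)
        landmarks = self.landmarks_frame.iloc[idx, 1:].values.astype('float')
        landmarks = landmarks.reshape(-1, 2)
        sample = {'image': image, 'landmarks': landmarks}
        if self.transform:
            sample = self.transform(sample)
        return sample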
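The protocol itself can be seen in isolation with a toy example. To keep the sketch dependency-free, the class below does not inherit from torch.utils.data.Dataset (a real dataset would); SquaresDataset and negate_y are hypothetical names invented for illustration. The two methods and the optional transform play the same roles as in the class above, and each sample is built only when it is requested.

```python
class SquaresDataset:
    """Toy dataset: sample i is the pair (i, i * i), computed on demand."""

    def __init__(self, size, transform=None):
        self.size = size
        self.transform = transform  # optional callable applied to each sample

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        # The sample is built only when requested (lazy loading).
        sample = {'x': idx, 'y': idx * idx}
        if self.transform:
            sample = self.transform(sample)
        return sample


def negate_y(sample):
    """Example transform: flip the sign of the 'y' entry."""
    return {'x': sample['x'], 'y': -sample['y']}


plain = SquaresDataset(5)
flipped = SquaresDataset(5, transform=negate_y)
print(len(plain), plain[3], flipped[3])
```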
We now need to create an instance of the class and iterate through
our data samples. We will print the sizes of the first 4 samples
and show their landmarks. This is shown below:
face_dataset = FaceLandmarksDataset(csv_file='faces/face_landmarks.csv',
                                    root_dir='faces/')
fig = plt.figure()
for i in range(len(face_dataset)):
    sample = face_dataset[i]
    print(i, sample['image'].shape, sample['landmarks'].shape)
    ax = plt.subplot(1, 4, i + 1)
    plt.tight_layout()
    ax.set_title('Sample #{}'.format(i))
    ax.axis('off')
    show_landmarks(**sample)
    if i == 3:
        plt.show()
        break
Transforms
From what we have above, it is clear that the samples are not all of
the same size. Most neural networks expect the images passed to them
to have a fixed size, so we need to write some code that transforms
the images accordingly. We will create three transforms:
Rescale- to rescale the image.
RandomCrop- to crop from the image randomly. This is a form of
data augmentation.
ToTensor- to convert numpy images into PyTorch
images. There is a need for us to swap axes.
The above will be written as callable classes rather than simple
functions, so that the parameters of the transform don't have to be
passed every time it is called. This means that we only need to
implement a __call__ method and, if required, an __init__ method. We
can use a transform as shown below:
tsfm = Transform(params)
transformed_sample = tsfm(sample)
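The callable-class pattern is easy to try out without any images. In the hedged sketch below, AddOffset, Scale and ComposeSketch are hypothetical names chosen for illustration; ComposeSketch only mimics the idea behind transforms.Compose and is not the torchvision implementation. The parameters are stored once in __init__, and __call__ then applies the transform to whatever sample it is given.

```python
class AddOffset:
    """Toy transform: adds a fixed offset to every value in a sample."""

    def __init__(self, offset):
        self.offset = offset  # the parameter is stored once, here

    def __call__(self, sample):
        return [v + self.offset for v in sample]


class Scale:
    """Toy transform: multiplies every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, sample):
        return [v * self.factor for v in sample]


class ComposeSketch:
    """Applies a list of transforms one after another."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        for t in self.transforms:
            sample = t(sample)
        return sample


pipeline = ComposeSketch([AddOffset(1), Scale(10)])
out = pipeline([1, 2, 3])
print(out)  # [20, 30, 40]
```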
The transforms have to be applied to both the image and the
landmarks. This is shown below:
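class Rescale(object):
    """Rescale the image in the sample to a required size.

    Args:
        output_size (tuple or int): The required output size. If it's
            a tuple, the output will be matched to the output_size. If it's
            an int, the smaller of the image edges will be matched to the
            output_size while keeping the aspect ratio the same.
    """

    def __init__(self, output_size):
        assert isinstance(output_size, (int, tuple))
        self.output_size = output_size

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']
        h, w = image.shape[:2]
        if isinstance(self.output_size, int):
            if h > w:
                new_h, new_w = self.output_size * h / w, self.output_size
            else:
                new_h, new_w = self.output_size, self.output_size * w / h
        else:
            new_h, new_w = self.output_size
        new_h, new_w = int(new_h), int(new_w)
        img = transform.resize(image, (new_h, new_w))
        # x and y of the landmarks correspond to image axes 1 and 0
        landmarks = landmarks * [new_w / w, new_h / h]
        return {'image': img, 'landmarks': landmarks}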
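class RandomCrop(object):
    """Crop the image in the sample randomly.

    Args:
        output_size (tuple or int): The desired output size. If it's an
            int, a square crop will be made.
    """

    def __init__(self, output_size):
        assert isinstance(output_size, (int, tuple))
        if isinstance(output_size, int):
            self.output_size = (output_size, output_size)
        else:
            assert len(output_size) == 2
            self.output_size = output_size

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']
        h, w = image.shape[:2]
        new_h, new_w = self.output_size
        # Pick a random top-left corner for the crop
        top = np.random.randint(0, h - new_h)
        left = np.random.randint(0, w - new_w)
        image = image[top: top + new_h, left: left + new_w]
        landmarks = landmarks - [left, top]
        return {'image': image, 'landmarks': landmarks}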
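scale = Rescale(256)
crop = RandomCrop(128)  # crop size chosen for illustration
composed = transforms.Compose([Rescale(256), RandomCrop(224)])
fig = plt.figure()
sample = face_dataset[65]
# Apply each of the above transforms to the same sample
for i, tsfrm in enumerate([scale, crop, composed]):
    transformed_sample = tsfrm(sample)
    ax = plt.subplot(1, 3, i + 1)
    plt.tight_layout()
    ax.set_title(type(tsfrm).__name__)
    show_landmarks(**transformed_sample)
plt.show()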
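transformed_dataset = FaceLandmarksDataset(csv_file='faces/face_landmarks.csv',
                                           root_dir='faces/',
                                           transform=transforms.Compose([
                                               Rescale(256),
                                               RandomCrop(224),
                                               ToTensor()
                                           ]))
for i in range(len(transformed_dataset)):
    sample = transformed_dataset[i]
    print(i, sample['image'].size(), sample['landmarks'].size())
    if i == 3:
        break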
Note that we have used a simple for loop to iterate through the
dataset. However, this way we are losing a lot of features. In
particular, this is what we are missing:
Batching the data.
Shuffling the data.
Loading our data in parallel by use of multiprocessing
workers.
The torch.utils.data.DataLoader iterator provides us with all the
above features. The parameters used below should be clear. One
parameter of interest is collate_fn, which lets you specify exactly
how the samples should be batched. However, the default collate is
expected to work fine in most use cases.
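dataloader = DataLoader(transformed_dataset, batch_size=4,
                        shuffle=True, num_workers=4)

# Helper function that shows a batch
def show_landmarks_batch(sample_batched):
    images_batch = sample_batched['image']
    landmarks_batch = sample_batched['landmarks']
    batch_size = len(images_batch)
    im_size = images_batch.size(2)
    grid = utils.make_grid(images_batch)
    plt.imshow(grid.numpy().transpose((1, 2, 0)))
    for i in range(batch_size):
        plt.scatter(landmarks_batch[i, :, 0].numpy() + i * im_size,
                    landmarks_batch[i, :, 1].numpy(),
                    s=10, marker='.', c='r')

for i_batch, sample_batched in enumerate(dataloader):
    # Show the 4th batch, then stop
    if i_batch == 3:
        show_landmarks_batch(sample_batched)
        plt.show()
        break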
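To see what batching, shuffling and a collate function amount to, here is a dependency-free sketch. It is not the DataLoader implementation and does no parallel loading; iterate_batches and default_collate_sketch are hypothetical names, and the toy dataset of dicts is made up for illustration.

```python
import random


def default_collate_sketch(samples):
    """Turn a list of dict samples into one dict of lists (one list per key)."""
    return {key: [s[key] for s in samples] for key in samples[0]}


def iterate_batches(dataset, batch_size, shuffle=False,
                    collate_fn=default_collate_sketch, seed=0):
    """Yield collated batches from any object supporting len() and indexing."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # seeded for reproducibility
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield collate_fn([dataset[i] for i in chunk])


toy_dataset = [{'x': i, 'y': i * i} for i in range(10)]
batches = list(iterate_batches(toy_dataset, batch_size=4, shuffle=True))
for b in batches:
    print(b)
```

Passing a different collate_fn changes only how the list of samples in each chunk is merged, which is the role that the collate_fn parameter plays in the real DataLoader.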
Using torchvision
You now know how to write and use datasets, dataloaders and
transforms. The torchvision package comes with a number of common
datasets and transforms, so you may not even have to write your own
custom classes. ImageFolder is one of the generic datasets that you
can find in the torchvision package.
ImageFolder expects the images to be organized in one sub-folder per
class, and the folder names (ants, bees, etc. in the dataset used
below) serve as the class labels. torchvision also has a number of
transforms that you can use. These can be used for writing a
dataloader as shown below:
import torch
from torchvision import transforms, datasets
data_transform = transforms.Compose([
    # RandomResizedCrop replaces the deprecated RandomSizedCrop
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
hymenoptera_dataset = datasets.ImageFolder(root='hymenoptera_data/train',
                                           transform=data_transform)
dataset_loader = torch.utils.data.DataLoader(hymenoptera_dataset,
                                             batch_size=4, shuffle=True,
                                             num_workers=4)
Chapter 5
Convolutional Neural Networks
With a fully connected network that has only a few layers, we cannot
do much. When it comes to image processing, a lot more is needed, which
means that more layers are needed in the network. However, we
encounter a number of problems when we attempt to add more layers
to a neural network. First, we risk facing the vanishing gradient
problem. We can solve this problem to some extent by using
sensible activation functions, like the ReLU family of activations.
Another problem associated with a deep fully connected network is
that the number of trainable parameters in the network, that is, the
weights, can grow rapidly. As a result, training may slow down or
become practically impossible. The model will also be prone to
overfitting.
Convolutional neural networks help us solve the second problem
above by exploiting the correlations between adjacent inputs in
images or time series. Consider a situation in which we have
images of cats and dogs. The pixels close to the eyes of a cat are
more likely to be correlated with the pixels close to the cat's nose
than with the pixels close to a dog's nose. What does this mean? It
means that not every node in a layer needs to be connected to every
node in the next layer, which cuts the number of weight parameters
that need to be trained in the model. Convolutional neural networks
also provide us with a number of tricks that make it easier to train
the network.
These types of networks are used for classifying images, clustering
them by similarity and performing object recognition within scenes.
They are capable of identifying faces, street signs, individuals,
platypuses, eggplants, and many other aspects of visual data.
They are also used for text analysis through Optical Character
Recognition (OCR), in which the images are seen as symbols to be
transcribed, and they can be applied to sound once it has been
represented visually.
The use of neural networks in image recognition is one of the
reasons why deep learning has become so popular. They are widely
applied in machine vision, which is heavily used in robotics,
self-driving cars, and treatments for the visually impaired.
learning_rate = 0.001
DATA_PATH = 'C:\\Users\\MNISTData'
MODEL_STORE_PATH = 'C:\\Users\\pytorch_models\\'
Those are the hyperparameters that we will need, so they are now set
up. We have also specified the folder in which the MNIST dataset will
be stored, as well as a storage location for the trained model
parameters once the training process is complete.
We can now set up a transform that is to be applied to the MNIST
dataset, as well as the dataset variables. This is shown below:
# transforms to apply to the data
trans = transforms.Compose([transforms.ToTensor(),
                            transforms.Normalize((0.1307,), (0.3081,))])
# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root=DATA_PATH, train=True,
                                           transform=trans, download=True)
test_dataset = torchvision.datasets.MNIST(root=DATA_PATH, train=False,
                                          transform=trans)
Note the use of the transforms.Compose() function, which comes
from the torchvision package. It allows developers to set up various
manipulations on a specified dataset. A number of transforms can be
chained together in a list via the Compose() function. We first specified
a transform that converts the input data set to a PyTorch tensor. The
PyTorch tensor is simply a specific data type used in PyTorch for all
the different data and weight operations in the network.
In its simplest form, it is a multi-dimensional matrix. PyTorch always
expects the data set to be transformed into a tensor so that
the data can be consumed by the network as the training and test set.
The next argument in our Compose() list is the normalization
transformation. Neural networks perform better after the data has
been normalized, for example to range between -1 and 1 or 0 and 1, or
to have zero mean and unit standard deviation. The PyTorch Normalize
transform does the latter: it subtracts a mean from every value and
divides by a standard deviation, so we should supply the mean and
standard deviation of the MNIST dataset. In our case, these values
are 0.1307 and 0.3081 respectively. One mean and one standard
deviation should be supplied for every input channel. Our data, that
is, MNIST, has only a single channel. If you have a dataset with more
than one channel, then you must provide a mean and a standard
deviation for each of the channels.
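The arithmetic performed by the normalization is simple enough to check by hand. The sketch below applies the per-channel formula (value - mean) / std in plain Python to the two extreme pixel values; the normalize function is a hypothetical helper written for illustration, not the torchvision code.

```python
MEAN, STD = 0.1307, 0.3081  # MNIST statistics quoted in the text


def normalize(pixel, mean=MEAN, std=STD):
    """Apply the per-channel formula (pixel - mean) / std."""
    return (pixel - mean) / std


# ToTensor() scales pixel values to the range [0, 1], so these are the
# two extremes that a normalized MNIST pixel can take.
darkest = normalize(0.0)
brightest = normalize(1.0)
print(round(darkest, 4), round(brightest, 4))
```

With these statistics the normalized pixels span roughly -0.42 to 2.82, so the result is data centered around zero rather than data squeezed exactly into the range -1 to 1.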
Next, we should create the train_dataset and test_dataset objects.
These will later be passed to the data loader. To create
these two sets from the MNIST dataset, we have to pass in a number of
arguments. First, the root argument specifies the
folder in which the train.pt and test.pt data files exist. The train
argument is a Boolean that tells the data set to pick up either the
train.pt data set or the test.pt data set. The next argument is
transform, which is where we supply any transform object that is to
be applied to the data set; we supply the trans object that was
created earlier. Finally, the download argument tells the
MNIST dataset function to download the data from an online source if
required.
Now that we have created both the train and test data sets, it is time
for us to load them into our data loader. This can be done as follows:
train_loader = DataLoader(dataset=train_dataset,
batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset,
batch_size=batch_size, shuffle=False)
In PyTorch, the data loader object provides a number of features
that are useful when consuming training data: the ability to shuffle
the data easily, the ability to batch the data easily, and the
ability to employ multiprocessing so that the data is loaded quickly.
As shown above, there are three arguments that should be supplied:
first the data set that is to be loaded, second the batch size that
you need, and finally whether the data should be shuffled randomly.
We can use the data loader as an iterator, so standard Python tools
such as enumerate can be used for extracting the data.
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.drop_out = nn.Dropout()
        self.fc1 = nn.Linear(7 * 7 * 64, 1000)
        self.fc2 = nn.Linear(1000, 10)
We have defined our model. Anytime we need to create a structure in
PyTorch, the simplest way of doing it is by creating a class that
inherits from the nn.Module super class. nn.Module is a very
useful class provided by PyTorch, as it allows you to build deep
learning networks. It also provides numerous methods, like the ones
for moving variables and performing operations on a GPU or CPU, for
applying functions recursively to all of the class properties, and
for creating streamlined interfaces to be used for training, etc.
We begin by creating a sequence of layer objects within the
class __init__ function. We first create layer 1 (self.layer1) by
creating an nn.Sequential object. This object allows us to create
layers that are ordered sequentially in our network, and it is a great
way of building a convolution + ReLU + pooling sequence.
As shown in our sequential definition, the first element is an
nn.Conv2d module, which creates a set of
convolutional filters. The first argument denotes the number of input
channels; since our MNIST images are single-channel grayscale
images, the value of this argument is 1. The second
argument to Conv2d is the number of output channels.
The first convolutional filter layer has 32 channels, so the
value of our second argument is 32.
The kernel_size argument denotes the size of the convolutional filter.
In our case, we need 5 x 5 convolutional filters, so
the value of this argument is 5. If you need filters with different
sizes in the x and y directions, you should supply a tuple (x-size, y-size).
Finally, you should specify the padding argument. This takes a bit
more thought. The output size of any dimension from a pooling
operation or convolutional filtering can be computed using the
formula given below:
W_out = (W_in - F + 2P) / S + 1
Here W_in denotes the width of the input, W_out the width of the
output, F the filter size, P the padding and S the stride. The same
formula applies to the calculation of the height, but since our image
and filter are symmetrical, one calculation covers both. If we need to
keep the input and output dimensions the same, with a stride of 1 and
a filter size of 5, the formula gives W_out = W_in - 5 + 2P + 1, which
equals W_in only when P = 2. This means that the value of the
padding argument in Conv2d is 2.
The next element in our sequence is a ReLU activation. The last
element to be added to the sequential definition of self.layer1 is the
max pooling operation. The first argument is the pooling size, 2 x 2,
so the argument has a value of 2. Secondly, we want to
down-sample the data by reducing the effective size of the image by a
factor of 2. For this to be done with the above formula, the stride
should be set to 2 and the padding to 0, so the stride
argument is equal to 2. The padding argument has a default
value of 0 if it is not specified, and this is what has been done in the
above code. From these calculations, it is clear that the output of
self.layer1 will be 32 channels of 14 x 14 images.
The second layer, self.layer2, is defined in the same way
as the first layer. The difference is that the input to the Conv2d
function has 32 channels, and the output has 64 channels. By the
same logic, and taking the pooling down-sampling into account,
self.layer2 gives an output of 64 channels of 7 x 7 images.
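The sizes quoted above can be verified by applying the output-size relation, output = (input - F + 2P) / S + 1, to each convolution and pooling step in turn. The following is a small sketch in plain Python; output_size is a hypothetical helper name, and integer division is used on the assumption that the sizes divide evenly, as they do here.

```python
def output_size(w_in, f, p, s):
    """Output width for input width w_in, filter size f, padding p, stride s."""
    return (w_in - f + 2 * p) // s + 1


w = 28                                # MNIST images are 28 x 28
w = output_size(w, f=5, p=2, s=1)     # layer1 convolution
after_conv1 = w
w = output_size(w, f=2, p=0, s=2)     # layer1 max pooling
after_pool1 = w
w = output_size(w, f=5, p=2, s=1)     # layer2 convolution
w = output_size(w, f=2, p=0, s=2)     # layer2 max pooling
after_pool2 = w

print(after_conv1, after_pool1, after_pool2)  # 28 14 7
print(after_pool2 * after_pool2 * 64)         # 3136 inputs to fc1
```

The final figure, 7 x 7 x 64 = 3136, is the number of input nodes of the first fully connected layer.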
Next, we specify a drop-out layer to avoid the problem of
overfitting in the model. Finally, we create two fully connected
layers. The first layer takes the 7 x 7 x 64 nodes as its input and
connects them to 1000 nodes; the second connects these 1000 nodes to
the 10 output nodes. Anytime you need to
create a fully connected layer in PyTorch, you should use the nn.Linear
module. Its first argument is the number of
nodes feeding into the layer, while the second argument is the number
of nodes in the following layer.
With the definition of __init__, the definitions of the layers have been
created. We should now define the way the data flows through the
network layers when performing the forward pass:
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.drop_out(out)
        out = self.fc1(out)
        out = self.fc2(out)
        return out
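total_step = len(train_loader)
loss_list = []
acc_list = []
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Running a forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss_list.append(loss.item())
        # Backpropagate and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Track the accuracy
        total = labels.size(0)
        _, predicted = torch.max(outputs.data, 1)
        correct = (predicted == labels).sum().item()
        acc_list.append(correct / total)
        if (i + 1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'
                  .format(epoch + 1, num_epochs, i + 1, total_step,
                          loss.item(), (correct / total) * 100))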
The call to torch.max(outputs.data, 1) returns the predicted class
index for every sample in the batch. The
next line compares the predictions to the true labels (predicted ==
labels) and sums the result to get the number of correct predictions.
Note that the output from sum() is still a tensor, so to
access its value you should call .item(). The number of correct
predictions is divided by the batch size, which is the same as
labels.size(0), to get the accuracy.
Finally, during training, the progress is printed after every 100
iterations of the inner loop.
Model Testing
We now need to test our model and see how accurate it is. The testing
will be done using the test dataset. Here is the code for this task:
import torch.nn as nn
import torch
import torch.optim as optim
from torch.autograd import Variable
from torch.optim import lr_scheduler
import numpy as np
from torchvision import datasets, models, transforms
import torchvision
import matplotlib.pyplot as plt
import os
import time
plt.ion()
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
    # Validation images get a deterministic resize and centre crop
    # (sizes assumed here; they match the 256/224 used earlier)
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]),
}
data_dir = 'hymenoptera_data'
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x],
                                              batch_size=4,
                                              shuffle=True, num_workers=4)
               for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes
use_gpu = torch.cuda.is_available()
best_model_wts = model.state_dict()
best_acc = 0.0
if phase == 'train':
    scheduler.step()
    model.train(True) # Set the model to the training mode
else:
    model.train(False) # Set the model to the evaluate mode
running_loss = 0.0
running_corrects = 0
optimizer.step()
# Get statistics
running_loss += loss.item()  # .item() replaces loss.data[0] in PyTorch 0.4+
running_corrects += torch.sum(preds == labels.data)
best_acc = epoch_acc
best_model_wts = model.state_dict()
print()
images_so_far = 0
fig = plt.figure()
outputs = model(inputs)
_, preds = torch.max(outputs.data, 1)
for j in range(inputs.size()[0]):
    images_so_far += 1
    ax = plt.subplot(num_images//2, 2, images_so_far)
    ax.axis('off')
    ax.set_title('predicted: {}'.format(class_names[preds[j]]))
    imshow(inputs.cpu().data[j])
    if images_so_far == num_images:
        return
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 2)
if use_gpu:
model_ft = model_ft.cuda()
criterion = nn.CrossEntropyLoss()
Feature Extraction
We now want to freeze the whole network except its final layer. We
will set requires_grad = False to freeze all the parameters, so
that their gradients are not computed in backward(). Here is the code
for this:
model_conv = torchvision.models.resnet18(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = False
# Parameters of a newly constructed module have requires_grad=True by
# default, so only the new final layer will be trained
num_ftrs = model_conv.fc.in_features
model_conv.fc = nn.Linear(num_ftrs, 2)
if use_gpu:
    model_conv = model_conv.cuda()
criterion = nn.CrossEntropyLoss()
plt.ioff()
plt.show()
Chapter 7
Developing Distributed Applications
PyTorch comes with a distributed package, torch.distributed,
which enables practitioners and researchers to parallelize their
computations across processes and clusters of machines. This
is done by leveraging message passing semantics, which allow each
process to communicate data to any of the other processes. In contrast
to the multiprocessing package, torch.multiprocessing, processes
are allowed to use different communication backends and are not
restricted to running on the same machine.
Before we can start, we need the ability to run multiple
processes simultaneously. If you have access to a compute
cluster, you should check with your local sysadmin or use your
favorite coordination tool. Examples of coordination tools include
clustershell, pdsh etc. Here, we will use a single machine and fork
multiple processes. We will use the template given below:
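import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    """Distributed function, to be implemented later."""
    pass

def init_processes(rank, size, fn, backend='tcp'):
    """Initialize the distributed environment."""
    # The master address and port below are example values
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    processes = []
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()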
The script given above spawns two processes, each of which
sets up a distributed environment, initializes a process group
(dist.init_process_group), and then runs the specified run function.
The init_processes function ensures that every process is able
to coordinate through a master, using the same IP address and port.
Note that a TCP backend was used, but it is also possible
to use Gloo or MPI instead. (The TCP backend was removed in PyTorch
1.0; on such versions, use 'gloo' for CPU tensors.)
Point-to-Point Communication
Point-to-point communication is the transfer of data from one
process to another. This is achieved by use of the send and recv
functions as well as their immediate counterparts, isend and irecv.
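def run(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        dist.send(tensor=tensor, dst=1)  # send the tensor to process 1
    else:
        dist.recv(tensor=tensor, src=0)  # receive the tensor from process 0
    print('Rank ', rank, ' has data ', tensor[0])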
All the processes begin with a tensor of zero, then process 0
increments the tensor and sends it to process 1, so that both end
up with 1.0. Note that process 1 needs to allocate memory in order to
store the data that it receives.
You should also know that send/recv are blocking: both processes
stop until the communication is completed. The immediates, on the
other hand, are non-blocking; the script continues its execution and
the methods return a DistributedRequest object upon which we can
choose to wait():
"""A non-blocking point-to-point communication."""
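def run(rank, size):
    tensor = torch.zeros(1)
    req = None
    if rank == 0:
        tensor += 1
        # To send the tensor to the process 1
        req = dist.isend(tensor=tensor, dst=1)
        print('Rank 0 started sending')
    else:
        # To receive tensor from the process 0
        req = dist.irecv(tensor=tensor, src=0)
        print('Rank 1 started receiving')
    req.wait()  # block until the communication has completed
    print('Rank ', rank, ' has data ', tensor[0])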
Collective Communication
As opposed to point-to-point communication, collectives allow for
communication patterns across all the processes in a group. A group
denotes a subset of all the processes.
A group can be created by passing a list of ranks to
dist.new_group(group). By default, collectives are
executed on all processes, also referred to as the world. For example, if
you need to get the sum of all tensors at all the processes, you can use
the dist.all_reduce(tensor, op, group) collective.
""" An All-Reduce example """
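def run(rank, size):
    """ A simple collective communication """
    group = dist.new_group([0, 1])
    tensor = torch.ones(1)
    # Sum the tensor over every process in the group
    dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])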
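To see what the collective computes, we can simulate it in plain Python without starting any processes. The sketch below is only an illustration: simulate_all_reduce_sum is a hypothetical helper, not part of torch.distributed, and it models each tensor as a list of numbers held by one rank.

```python
def simulate_all_reduce_sum(tensors_per_rank):
    """Return what each rank would hold after a SUM all-reduce.

    tensors_per_rank[r] is the list of values held by rank r before the call.
    """
    # Element-wise sum across ranks.
    total = [sum(values) for values in zip(*tensors_per_rank)]
    # Every rank ends up with its own copy of the same reduced result.
    return [list(total) for _ in tensors_per_rank]


# Two ranks, each starting with a one-element tensor of ones.
before = [[1.0], [1.0]]
after = simulate_all_reduce_sum(before)
print(after)
```

Both ranks hold the same summed value afterwards, which is the property that the gradient averaging later in this chapter relies on.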
Distributed Training
We can now put the distributed module to good use by replicating the
functionality of DistributedDataParallel. The first step is to
partition the dataset among the processes:
    def __len__(self):
        return len(self.index)
class DataPartitioner(object):
    def __init__(self, data, sizes, seed=1234):
        self.data = data
        self.partitions = []
        rng = Random()
        rng.seed(seed)
        data_len = len(data)
        indexes = [x for x in range(0, data_len)]
        rng.shuffle(indexes)
        # Cut the shuffled indexes into consecutive fractions
        for frac in sizes:
            part_len = int(frac * data_len)
            self.partitions.append(indexes[0:part_len])
            indexes = indexes[part_len:]
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ]))
    size = dist.get_world_size()
    # DataLoader expects an integer batch size
    bsz = int(128 / float(size))
    partition_sizes = [1.0 / size for _ in range(size)]
    partition = DataPartitioner(dataset, partition_sizes)
    partition = partition.use(dist.get_rank())
    train_set = torch.utils.data.DataLoader(partition,
                                            batch_size=bsz,
                                            shuffle=True)
    return train_set, bsz
Suppose we have a total of 2 replicas. Every process will then have a
train_set of 60000 / 2 = 30000 samples. The batch size is also divided
by the number of replicas, so that the overall batch size stays at
128.
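The arithmetic above can be checked with a small plain-Python sketch. It follows the shuffling and slicing logic of the DataPartitioner class, but partition_indices is a hypothetical helper that returns only the index lists, and the dataset size of 60000 and the seed of 1234 are assumed for illustration.

```python
from random import Random


def partition_indices(data_len, sizes, seed=1234):
    """Shuffle range(data_len) and cut it into consecutive fractions."""
    rng = Random()
    rng.seed(seed)
    indexes = list(range(data_len))
    rng.shuffle(indexes)
    partitions = []
    for frac in sizes:
        part_len = int(frac * data_len)
        partitions.append(indexes[:part_len])
        indexes = indexes[part_len:]
    return partitions


world_size = 2            # assumed number of replicas
total_batch_size = 128    # overall batch size to preserve
partitions = partition_indices(60000, [1.0 / world_size] * world_size)
per_process_batch = total_batch_size // world_size

print([len(p) for p in partitions])  # samples per process
print(per_process_batch)             # batch size per process
```

Because every process uses the same seed, all of them compute the same shuffle, so the partitions do not overlap even though no communication takes place.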
We can now write the usual forward-backward-optimize training script,
and add a function call that averages the gradients of the model:
""" Distributed Synchronous SGD """
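def run(rank, size):
    torch.manual_seed(1234)
    train_set, bsz = partition_dataset()
    model = Net()
    optimizer = optim.SGD(model.parameters(),
                          lr=0.01, momentum=0.5)
    num_batches = len(train_set)
    for epoch in range(10):  # the number of epochs is an arbitrary choice
        epoch_loss = 0.0
        for data, target in train_set:
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            epoch_loss += loss.item()
            loss.backward()
            # Synchronize the gradients before updating the weights
            average_gradients(model)
            optimizer.step()
        print('Rank ', dist.get_rank(), ', epoch ',
              epoch, ': ', epoch_loss / num_batches)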
It remains to implement the average_gradients(model) function, which
takes a model and averages its gradients across the whole world.
""" Averaging the Gradients """
def average_gradients(model):
    size = float(dist.get_world_size())
    for param in model.parameters():
        # Sum the gradient over all processes, then divide by their number
        dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
        param.grad.data /= size
We have now implemented distributed synchronous SGD, and we can train
any model on a large compute cluster.
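Why is averaging the right thing to do? When every process holds a shard of the same size and the loss is a mean over the samples, the average of the local gradients equals the gradient that a single process would compute on the combined batch. The following plain-Python sketch checks this on a toy one-parameter model; the model, the data and the mse_gradient helper are made up for the illustration and are not taken from the training script.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient computed by a single process that sees the whole batch.
full_grad = mse_gradient(w, xs, ys)

# Two "workers", each with half of the batch, followed by the
# sum-then-divide step that average_gradients performs.
local_grads = [mse_gradient(w, xs[:2], ys[:2]),
               mse_gradient(w, xs[2:], ys[2:])]
averaged_grad = sum(local_grads) / len(local_grads)

print(full_grad, averaged_grad)
```

The equality only holds when the shards have equal size, which is one reason for splitting the dataset into equal fractions.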
Chapter 8
Word Embeddings
Word embeddings are dense vectors of real numbers, one per word in a
vocabulary. In Natural Language Processing (NLP), words are almost
always used as the features. But how can a word be represented in a
computer? The ASCII character representation of the word can be
stored, but that only tells us what the word is, without saying
anything about its meaning. And how could such representations be
combined?
We often need our neural networks to give us dense outputs, while the
inputs are |V|-dimensional, where V is the vocabulary, and in most
cases the outputs have only a few dimensions. So, how can we get from
a massive dimensional space to a much smaller one?
Instead of the ASCII representation, we can use a one-hot encoding.
In such a representation, each word is a vector of |V| entries that
are all 0 except for a single 1. To differentiate the words, each
word has its 1 at a unique position.
However, such a representation has a number of disadvantages. It is
huge, and it treats the words as independent entities that are
unrelated to each other. What we need instead is a way to capture the
similarities between words.
Suppose that every word has a number of attributes and that we take
every attribute as a dimension. Each word can then be given a vector
of attribute values, and it becomes easy to measure the similarity
between words.
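The difference between the two representations can be sketched in plain Python. The three-word vocabulary and the two-dimensional dense vectors below are invented for the illustration, and cosine similarity is used as one common way of comparing vectors.

```python
import math


def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab = ["cat", "dog", "car"]
word_to_ix = {word: i for i, word in enumerate(vocab)}

# One-hot vectors: every pair of distinct words has similarity 0.
cat_1h = one_hot(word_to_ix["cat"], len(vocab))
dog_1h = one_hot(word_to_ix["dog"], len(vocab))
print(cosine(cat_1h, dog_1h))

# Hand-made dense vectors; the two attributes are invented for illustration.
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))
```

With one-hot vectors no word is closer to any other, whereas the dense vectors can place "cat" nearer to "dog" than to "car". In practice the attribute values are not written by hand but learned by the network.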
PyTorch supports word embeddings directly. Just as a unique index was
defined for every word when creating one-hot vectors, a unique index
must be defined for every word when using embeddings. These indices
are the keys into a lookup table. The embeddings are stored as a
|V| x D matrix, where D is the dimensionality of the embeddings, such
that the word with index i has its embedding stored in the i-th row of
the matrix. We will name the mapping from words to indices
word_to_ix.
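As a minimal sketch of such a lookup table, assuming a toy vocabulary of two words and an embedding dimension of 5 (both chosen only for illustration), nn.Embedding can be used as follows:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}  # toy vocabulary, |V| = 2
embedding_dim = 5                       # D, chosen arbitrarily

# The lookup table is a |V| x D matrix of learnable weights.
embeds = nn.Embedding(len(word_to_ix), embedding_dim)

# Indices must be passed as an integer (long) tensor.
lookup = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup)

print(embeds.weight.shape)  # size of the whole table
print(hello_embed.shape)    # one row was selected
```

Looking up index 0 simply returns row 0 of the weight matrix. The values are random at first and are updated during training like any other parameter.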
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.optim as optim
import torch.nn.functional as F
torch.manual_seed(1)
class NGramLanguageModeler(nn.Module):
model = NGramLanguageModeler(len(vocab),
                             EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)
total_loss = torch.Tensor([0])
for context, target in trigrams:
    # words
    log_probs = model(context_var)
    total_loss += loss.data
losses.append(total_loss)
print(losses)
# The loss will decrease after every iteration over training data!
vocab = set(raw_text)
vocab_size = len(vocab)
class CBOW(nn.Module):
    def __init__(self):
        pass
# the data ready for use by the module are given below
import torch.onnx
import torch.utils.model_zoo as model_zoo
Super-resolution is a way of increasing the resolution of images and
videos, and it is widely used in image processing and video editing.
In this case, we will begin with a small super-resolution model and
some dummy input.
Let us begin by creating a SuperResolution model in PyTorch. It is
taken directly from the example models provided by PyTorch, without
any modification:
# Defining a Super Resolution model in PyTorch
import torch.nn.init as init
import torch.nn as nn
class SuperResolutionNet(nn.Module):
    def __init__(self, upscale_factor, inplace=False):
        self._initialize_weights()
    def _initialize_weights(self):
        # orthogonal_ is the in-place initializer (PyTorch 0.4 and later)
        init.orthogonal_(self.conv1.weight,
                         init.calculate_gain('relu'))
        init.orthogonal_(self.conv2.weight,
                         init.calculate_gain('relu'))
        init.orthogonal_(self.conv3.weight,
                         init.calculate_gain('relu'))
        init.orthogonal_(self.conv4.weight)
torch_model.load_state_dict(model_zoo.load_url(model_url,
                            map_location=map_location))
import onnx_caffe2.backend
import onnx
The results should show that the outputs of PyTorch and Caffe2 match
up to 3 decimal places. If the two fail to match, it means that an
operator is implemented differently in PyTorch and in Caffe2.
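What "match up to 3 decimal places" means can be sketched in plain Python. The helper match_to_decimals is hypothetical, the tolerance of 10 ** -3 is an assumed reading of "3 decimal places", and the numbers are made up to stand in for the flattened outputs of the two backends.

```python
def match_to_decimals(expected, actual, decimals=3):
    """True if every pair of values differs by less than 10 ** -decimals."""
    tolerance = 10 ** -decimals
    return all(abs(e - a) < tolerance for e, a in zip(expected, actual))


# Made-up numbers standing in for the outputs of the two backends.
pytorch_out = [0.12345, 1.00021, -3.50004]
caffe2_out = [0.12351, 1.00018, -3.50010]

print(match_to_decimals(pytorch_out, caffe2_out))        # small differences
print(match_to_decimals(pytorch_out, [0.2, 1.0, -3.5]))  # first value is off
```

With numpy arrays, a check of this kind is usually delegated to np.testing.assert_almost_equal with decimal=3, which raises an error instead of returning False.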
import os
import numpy as np
import subprocess
from matplotlib import pyplot
from PIL import Image
from skimage import io, transform
We can now load the image and pre-process it using the scikit-image
(skimage) library. Pre-processing of this kind is standard practice
when preparing data for training or testing neural networks.
# load image
img_in = io.imread("./_static/img/cat.jpg")
In the next step, we take the resized image, run the super-resolution
model on it in the Caffe2 backend, and save the output image:
final_img = Image.merge(
    "YCbCr", [
        img_out_y,
        img_cb.resize(img_out_y.size, Image.BICUBIC),
        img_cr.resize(img_out_y.size, Image.BICUBIC),
    ]).convert("RGB")
# Save your image to be compared to the output image from the mobile device
final_img.save("./_static/img/cat_superres.jpg")
We have now finished running our mobile net in the pure Caffe2
backend. Next, we can run the model on an Android device and retrieve
its output.
Note that for Android development the adb shell is required; without
it, you will not be able to follow the rest of this chapter.
In the first step of running the model on mobile, we push a native
speed benchmark binary to the device through adb. The binary can
execute the model on the device and export the model output, which we
can retrieve later. You can find the binary on GitHub. To build it,
run the build_android.sh script. Note that you must have the Android
NDK installed and the environment variable set as
ANDROID_NDK=path to ndk root.
# First, let us push a bunch of stuff to adb and set the path for the binary
CAFFE2_MOBILE_BINARY = ('caffe2/binaries/speed_benchmark')
    '--caffe2_log_level=0 '
)
In the .c files, you can include TH with the #include <TH/TH.h>
directive, and THC with the #include <THC/THC.h> directive.
The ffi utils will make sure that the compiler can find them at build
time:
/* src/my_lib.c */
#include <TH/TH.h>
int my_lib_add_forward(THFloatTensor *input1,
                       THFloatTensor *input2, THFloatTensor *output)
{
    /* Refuse to add tensors of different sizes */
    if (!THFloatTensor_isSameSizeAs(input1, input2))
        return 0;
    THFloatTensor_resizeAs(output, input1);
    THFloatTensor_cadd(output, input1, 1.0, input2);
    return 1;
}
int my_lib_add_backward(THFloatTensor *grad_output,
                        THFloatTensor *grad_input)
{
    THFloatTensor_resizeAs(grad_input, grad_output);
    THFloatTensor_fill(grad_input, 1);
    return 1;
}
There are no constraints on the code above, except that you have to
prepare a single header listing all the functions that you want to
call from Python. The ffi utils will use it to generate the
appropriate wrappers:
/* src/my_lib.h */
int my_lib_add_forward(THFloatTensor *input1,
                       THFloatTensor *input2, THFloatTensor *output);
int my_lib_add_backward(THFloatTensor *grad_output,
                        THFloatTensor *grad_input);
We can now create a short file that will help us build a custom
extension:
# build.py
from torch.utils.ffi import create_extension
ffi = create_extension(
    name='_ext.my_lib',
    headers='src/my_lib.h',
    sources=['src/my_lib.c'],
    with_cuda=False
)
ffi.build()
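import torch
from torch.autograd import Function
from _ext import my_lib
class MyAddFunction(Function):
    def forward(self, input1, input2):
        output = torch.FloatTensor()
        my_lib.my_lib_add_forward(input1, input2, output)
        return output
    def backward(self, grad_output):
        grad_input = torch.FloatTensor()
        my_lib.my_lib_add_backward(grad_output, grad_input)
        return grad_input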
# modules/add.py
class MyAddModule(Module):
    def forward(self, input1, input2):
        return MyAddFunction()(input1, input2)
# main.py
import torch
from torch.autograd import Variable
import torch.nn as nn
from modules.add import MyAddModule
class MyNetwork(nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.add = MyAddModule()
model = MyNetwork()
import torch.nn as nn
import torch
from torch.autograd import Variable
CUDA
If your computer has a GPU, it is a good idea to run the algorithm on
it, especially if you want to try a large network such as VGG. The
torch.cuda.is_available() function returns True if a GPU is available
on the computer. The .cuda() method can then be used to move a module,
together with its parameters, from the CPU to the GPU. Whenever we
need to move it back from the GPU to the CPU, for example to use
numpy, we can use the .cpu() method.
Finally, .type(dtype) converts a torch.FloatTensor into a
torch.cuda.FloatTensor, so that it can be fed to GPU processes.
use_cuda = torch.cuda.is_available()
dtype = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
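In PyTorch 0.4 and later, the same decision can also be written with a torch.device object and the .to() method, which avoids switching tensor types by hand. A minimal sketch, with a small tensor invented for the demonstration:

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.ones(2, 3)   # tensors are created on the CPU by default
x = x.to(device)       # move to the selected device (no-op without a GPU)
y = (x * 2).cpu()      # bring the result back, e.g. before calling .numpy()

print(device.type, y.sum().item())
```

The same code runs unchanged on machines with and without a GPU, which is the purpose of the use_cuda check in this chapter.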
Loading Images
To keep the implementation simple, we begin by importing a content
image and a style image of the same dimensions. We then scale them to
the desired output image size (128 or 512 in this example, depending
on whether a GPU is available) and transform them into torch tensors,
ready to be fed into the neural network:
# The desired size of output image
imsize = 512 if use_cuda else 128  # use a small size if there is no gpu
loader = transforms.Compose([
    transforms.Resize(imsize),  # scale the imported image
    transforms.ToTensor()])  # transform the image into a torch tensor
def image_loader(image_name):
    image = Image.open(image_name)
    image = Variable(loader(image))
    # add the batch dimension expected by the network
    image = image.unsqueeze(0)
    return image
style_img = image_loader("images/picasso.jpg").type(dtype)
content_img = image_loader("images/dancing.jpg").type(dtype)
https://pytorch.org/tutorials/_static/img/neural-style/picasso.jpg
https://pytorch.org/tutorials/_static/img/neural-style/dancing.jpg
The imported PIL images have values between 0 and 255. Once
transformed into torch tensors, their values lie between 0 and 1.
Neural networks from the torch library are trained with tensor images
in the range 0-1. If you feed these networks with 0-255 tensor images,
the activated feature maps will not make sense. This is not the case
for the pre-trained networks from the Caffe library, which are trained
with 0-255 tensor images.
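The change of range amounts to a division by 255, which is the rescaling that transforms.ToTensor() performs for 8-bit images. A plain-Python sketch with three made-up pixel values:

```python
# Three 8-bit pixel values, as they would be stored in a PIL image.
pixels = [0, 128, 255]

# Dividing by 255 maps the range 0-255 onto the range 0-1.
scaled = [p / 255.0 for p in pixels]

# Multiplying by 255 goes the other way, for networks that were
# trained on 0-255 inputs.
restored = [round(s * 255.0) for s in scaled]

print(scaled[0], scaled[-1], restored)
```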
Displaying Images
The images will be displayed by calling plt.imshow, so we first need
to convert them back into PIL images:
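unloader = transforms.ToPILImage()  # reconvert them into PIL image
plt.ion()
def imshow(tensor, title=None):
    image = tensor.clone().cpu()  # clone so the original is not modified
    image = image.squeeze(0)  # remove the batch dimension, if present
    image = unloader(image)
    plt.imshow(image)
    if title is not None:
        plt.title(title)
    plt.pause(0.001)  # pause for a while for plots to be updated
plt.figure()
imshow(style_img.data, title='Style Image')
plt.figure()
imshow(content_img.data, title='Content Image')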
Content Loss
The content loss is a function that takes as input the feature maps
at a layer L of a network fed with an image X, and returns the
weighted content distance between this image and the content image.
The weight and the target content are therefore parameters of the
function. We implement it as a torch module whose constructor takes
these parameters as inputs. The distance is the Mean Square Error
between the two sets of feature maps, which we compute using an
nn.MSELoss criterion stored as a third attribute of the module.
We add our content losses at every desired layer as additional
modules of the neural network. That way, every time we feed the
network with an input image X, all the content losses are computed at
the desired layers, and autograd calculates all the gradients for us.
We only have to make the forward method of the module return its
input, so that the module becomes a transparent layer of the neural
network. The computed loss is saved as an attribute of the module.
        self.criterion = nn.MSELoss()
        self.loss.backward(retain_graph=retain_variables)
        return self.loss
Note that although the module is named ContentLoss, it is not a true
PyTorch Loss function. If you want to define your content loss as a
PyTorch Loss, you have to create a PyTorch autograd Function and
recompute/implement the gradient by hand in its backward method.
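The idea of a transparent layer that records a loss can be sketched as follows. This is a simplified illustration, not the exact module used in this chapter: the class name ContentLossSketch is hypothetical, it has no backward method of its own, and random tensors stand in for the feature maps.

```python
import torch
import torch.nn as nn


class ContentLossSketch(nn.Module):
    """Records a weighted MSE loss and passes its input through unchanged."""

    def __init__(self, target, weight):
        super().__init__()
        # detach() so that the target is treated as a constant
        self.target = target.detach() * weight
        self.weight = weight
        self.criterion = nn.MSELoss()
        self.loss = None

    def forward(self, input):
        self.loss = self.criterion(input * self.weight, self.target)
        return input  # transparent: the next layer sees the same tensor


torch.manual_seed(0)
target = torch.rand(1, 3, 4, 4)   # stand-in for the content feature maps
x = torch.rand(1, 3, 4, 4, requires_grad=True)

layer = ContentLossSketch(target, weight=1.0)
out = layer(x)
layer.loss.backward()             # autograd fills x.grad

print(out is x, layer.loss.item() > 0, x.grad.shape)
```

Because forward returns its input untouched, the module can be inserted anywhere in an nn.Sequential without changing what the following layers compute.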
Style Loss
For the style loss, we first need to define a module that computes
the Gram product given the feature maps F_XL of the neural network
fed with X, at layer L. The module can be implemented as follows:
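class GramMatrix(nn.Module):
    def forward(self, input):
        a, b, c, d = input.size()  # batch, feature maps, height, width
        features = input.view(a * b, c * d)
        return torch.mm(features, features.t()).div(a * b * c * d)
class StyleLoss(nn.Module):
    def __init__(self, target, weight):
        super(StyleLoss, self).__init__()
        self.target = target.detach() * weight
        self.weight = weight
        self.gram = GramMatrix()
        self.criterion = nn.MSELoss()
    def forward(self, input):
        self.output = input.clone()
        self.G = self.gram(input)
        self.G.mul_(self.weight)
        self.loss = self.criterion(self.G, self.target)
        return self.output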
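As a small worked example of the Gram product, consider one image with two feature maps of size 1 x 2. The function gram_matrix below is a sketch written for this example; dividing by the number of elements is one common normalization and is an assumption here, as are the input values.

```python
import torch


def gram_matrix(feature_maps):
    """Normalized Gram product of a (batch, channels, height, width) tensor."""
    a, b, c, d = feature_maps.size()
    features = feature_maps.view(a * b, c * d)   # one row per feature map
    gram = torch.mm(features, features.t())      # inner products between rows
    return gram.div(a * b * c * d)


# One image, two 1x2 feature maps: [1, 2] and [3, 4].
fmaps = torch.tensor([[[[1.0, 2.0]], [[3.0, 4.0]]]])
G = gram_matrix(fmaps)
print(G)
```

The unnormalized inner products are 1*1 + 2*2 = 5, 1*3 + 2*4 = 11 and 3*3 + 4*4 = 25, and each is divided by the 4 elements of the input. The result is symmetric, and its size depends only on the number of feature maps, not on the height and width of the image.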
cnn = models.vgg19(pretrained=True).features
content_losses = []
style_losses = []
model = nn.Sequential()
gram = GramMatrix()
if use_cuda:
    model = model.cuda()
    gram = gram.cuda()
i = 1
for layer in list(cnn):
    if isinstance(layer, nn.Conv2d):
        name = "conv_" + str(i)
        model.add_module(name, layer)
        if name in content_layers:
            # add the content loss:
            target = model(content_img).clone()
            content_loss = ContentLoss(target, content_weight)
            model.add_module("content_loss_" + str(i), content_loss)
            content_losses.append(content_loss)
        if name in style_layers:
            # add the style loss:
            target_feature = model(style_img).clone()
            target_feature_gram = gram(target_feature)
            style_loss = StyleLoss(target_feature_gram, style_weight)
            model.add_module("style_loss_" + str(i), style_loss)
            style_losses.append(style_loss)
    if isinstance(layer, nn.ReLU):
        name = "relu_" + str(i)
        model.add_module(name, layer)
        if name in content_layers:
            # add the content loss:
            target = model(content_img).clone()
            content_loss = ContentLoss(target, content_weight)
            model.add_module("content_loss_" + str(i), content_loss)
            content_losses.append(content_loss)
        if name in style_layers:
            # add the style loss:
            target_feature = model(style_img).clone()
            target_feature_gram = gram(target_feature)
            style_loss = StyleLoss(target_feature_gram, style_weight)
            model.add_module("style_loss_" + str(i), style_loss)
            style_losses.append(style_loss)
        i += 1
    if isinstance(layer, nn.MaxPool2d):
        # pooling layers are copied over without attaching any loss
        name = "pool_" + str(i)
        model.add_module(name, layer)
Input Image
To keep the code simple, we take an input image with the same
dimensions as the content and style images:
input_img = content_img.clone()
# if you want to use white noise instead, uncomment the line below:
# input_img = Variable(torch.randn(content_img.data.size())).type(dtype)
plt.figure()
imshow(input_img.data, title='Input Image')
Gradient Descent
We will be running our gradient descent using the L-BFGS algorithm:
def get_input_param_optimizer(input_img):
    # the input image itself is the parameter that gets optimized
    input_param = nn.Parameter(input_img.data)
    optimizer = optim.LBFGS([input_param])
    return input_param, optimizer
We should now create the loop of the gradient descent. At every step,
the network must be fed with the updated input so as to calculate the
new losses, and the backward method of every loss must be run to
calculate the gradients dynamically and perform the gradient descent
step. The optimizer expects a closure as an argument, that is, a
function that re-evaluates the model and returns the loss:
def run_style_transfer(cnn, content_img, style_img,
                       input_img, num_steps=300,
                       style_weight=1000, content_weight=1):
    """Execute the style transfer."""
    print('Build the style transfer model..')
    model, style_losses, content_losses = get_style_model_and_losses(
        cnn, style_img, content_img, style_weight, content_weight)
    input_param, optimizer = get_input_param_optimizer(input_img)
    print('Optimizing..')
    run = [0]
    while run[0] <= num_steps:
        def closure():
            # correct values of the updated input image
            input_param.data.clamp_(0, 1)
            optimizer.zero_grad()
            model(input_param)
            style_score = 0
            content_score = 0
            for sl in style_losses:
                style_score += sl.backward()
            for cl in content_losses:
                content_score += cl.backward()
            run[0] += 1
            if run[0] % 50 == 0:
                print("run {}:".format(run))
                print('Style Loss : {:4f} Content Loss: {:4f}'.format(
                    style_score.item(), content_score.item()))
                print()
            # L-BFGS needs the closure to return the loss
            return style_score + content_score
        optimizer.step(closure)
    return input_param.data
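The closure pattern is easier to see in isolation. In the sketch below, a single number plays the role of the input image and a quadratic plays the role of the style and content losses; both are made up for the illustration, and the number of steps is arbitrary.

```python
import torch
import torch.optim as optim

# A single learnable value, playing the role of the input image.
x = torch.nn.Parameter(torch.tensor([5.0]))
optimizer = optim.LBFGS([x])


def closure():
    # L-BFGS may call this several times per step, so it must
    # reset the gradients, recompute the loss and return it.
    optimizer.zero_grad()
    loss = ((x - 2.0) ** 2).sum()
    loss.backward()
    return loss


for _ in range(5):
    optimizer.step(closure)

print(round(x.item(), 3))
```

The value moves from its starting point towards 2, the minimum of the quadratic. The optimizer, not our loop, decides how many times the loss is evaluated, which is why the forward and backward passes have to live inside the closure.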
We can now run the algorithm:
output = run_style_transfer(cnn, content_img, style_img, input_img)
plt.figure()
imshow(output, title='Output Image')
plt.ioff()
plt.show()
Conclusion
This marks the end of this guide. PyTorch is a deep learning library
for Python. It helps us build neural networks and use them to analyze
our data. Neural networks are, for instance, well suited to image
processing, which is why PyTorch is widely used to build models for
image analysis.