Train your own image classifier with Inception in TensorFlow

March 9, 2016

Posted by Jon Shlens, Senior Research Scientist



At the end of last year we released code that allows a user to classify images with TensorFlow models. This code demonstrated how to build an image classification system by employing a deep learning model that we had previously trained. This model was known to classify images across the 1000 categories supplied by the ImageNet academic competition with an error rate that approached human performance. After all, what self-respecting computer vision system would fail to recognize a cute puppy?
[Image of a puppy, via Wikipedia]
Well, thankfully the image classification model would recognize this image as a retriever with 79.3% confidence. But, more spectacularly, it would also be able to distinguish between a spotted salamander and a fire salamander with high confidence – a task that might be quite difficult for anyone who is not an expert in herpetology. Can you tell the difference?
[Images of a spotted salamander and a fire salamander, via Wikipedia]
The deep learning model we released, Inception-v3, is described in our arXiv preprint "Rethinking the Inception Architecture for Computer Vision" and can be visualized with this schematic diagram:
[Schematic diagram of Inception-v3]
As described in the preprint, this model achieves 5.64% top-5 error on the validation set of the ImageNet whole-image ILSVRC 2012 classification task, while an ensemble of four of these models achieves 3.58% top-5 error. Furthermore, in the 2015 ImageNet Challenge, an ensemble of four of these models came in 2nd in the image classification task.

After the release of this model, many people in the TensorFlow community voiced their preference for having an Inception-v3 model that they could train themselves, rather than using our pre-trained model. We could not agree more, since a system for training an Inception-v3 model provides many opportunities, including:
  • Exploration of different variants of this model architecture to improve the image classification system.
  • Comparison of optimization algorithms and hardware setups for training this model faster or to higher predictive performance.
  • Retraining or fine-tuning the Inception-v3 model on a distinct image classification task, or as a component of a larger network tasked with object detection or multi-modal learning.
The last topic is often referred to as transfer learning, and it has been an area of particular excitement in the field of deep networks in the context of vision. A common prescription for a computer vision problem is to first train an image classification model on the ImageNet Challenge data set, and then transfer this model's knowledge to a distinct task. This has been done for object detection, zero-shot learning, image captioning, video analysis and a multitude of other applications.
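To make the prescription concrete, here is a minimal sketch of the transfer-learning mechanics in TensorFlow's Python API: restore pre-trained weights for the feature layers from a checkpoint, attach a freshly initialized classification head sized for the new task, and train. Everything here is illustrative, not the released code: the tiny `feature_extractor` stands in for the real Inception-v3 layers, and the checkpoint path, variable scopes and class count are hypothetical placeholders.

```python
import tensorflow as tf

# A stand-in for the pre-trained convolutional layers of Inception-v3.
# In the released code this would be the full model definition minus its head.
def feature_extractor(images):
    with tf.variable_scope('features'):
        w = tf.get_variable('conv_w', [3, 3, 3, 16])
        b = tf.get_variable('conv_b', [16])
        conv = tf.nn.relu(tf.nn.conv2d(images, w, [1, 2, 2, 1], 'SAME') + b)
        return tf.reduce_mean(conv, [1, 2])  # global average pool -> [batch, 16]

num_new_classes = 5  # e.g. five flower categories (hypothetical)
images = tf.placeholder(tf.float32, [None, 299, 299, 3])
labels = tf.placeholder(tf.int64, [None])

features = feature_extractor(images)

# A fresh classification head for the new task.
with tf.variable_scope('new_head'):
    w = tf.get_variable('w', [16, num_new_classes])
    b = tf.get_variable('b', [num_new_classes])
    logits = tf.matmul(features, w) + b

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels))

# Fine-tune everything; pass var_list= (the new head's variables only) to
# minimize() instead to use the pre-trained network purely as a feature extractor.
train_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

# Restore only the pre-trained feature weights; the new head keeps its fresh
# initialization. The checkpoint path is a placeholder.
feature_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='features')
saver = tf.train.Saver(var_list=feature_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, '/path/to/pretrained_checkpoint')
```

The key design choice is the restricted `var_list` passed to the Saver: it lets the new head start from scratch while everything beneath it inherits the knowledge baked in by ImageNet training.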

Today we are happy to announce that we are releasing libraries and code for training Inception-v3 on one or multiple GPUs. Some features of this code include:
  • Training an Inception-v3 model with synchronous updates across multiple GPUs.
  • Employing batch normalization to speed up training of the model.
  • Leveraging many distortions of the image to augment model training.
  • Releasing a new (still experimental) high-level language for specifying complex model architectures, which we call TensorFlow-Slim (a brief sketch follows this list).
  • Demonstrating how to perform transfer learning by taking a pre-trained Inception-v3 model and fine-tuning it for another task.
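To give a flavor of TensorFlow-Slim, here is a minimal sketch of a small convolutional tower written against the slim-style API. The `tf.contrib.slim` packaging and the exact layer names below follow the later TensorFlow distribution of the library and are assumptions for illustration; consult the released code for the authoritative API.

```python
import tensorflow as tf
slim = tf.contrib.slim  # slim-style API; packaging assumed

def tiny_tower(images, num_classes=1000):
    # arg_scope sets shared defaults (padding, activation) once rather than
    # repeating them on every layer -- the main space saving of slim.
    with slim.arg_scope([slim.conv2d], padding='SAME',
                        activation_fn=tf.nn.relu):
        net = slim.conv2d(images, 32, [3, 3], scope='conv1')
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        net = slim.conv2d(net, 64, [3, 3], scope='conv2')
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        net = slim.flatten(net)
        logits = slim.fully_connected(net, num_classes,
                                      activation_fn=None, scope='logits')
    return logits
```

Each slim layer creates and tracks its own variables behind the scenes, so an architecture definition reads almost line-for-line like the schematic diagram above rather than as pages of variable bookkeeping.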
We can train a model from scratch to its best performance on a desktop with 8 NVIDIA Tesla K40s in about 2 weeks. In order to make research progress faster, we are additionally supplying a new version of a pre-trained Inception-v3 model that is ready to be fine-tuned or adapted to a new task. We demonstrate how to use this model for transfer learning on a simple flower classification task. Hopefully, this provides a useful didactic example for employing this Inception model on a wide range of vision tasks.

Want to get started? See the accompanying instructions on how to train, evaluate or fine-tune a network.

Releasing this code has been a huge team effort, taking several months with contributions from many individuals across research at Google. We wish to especially acknowledge the following people who contributed to this project:
  • Model Architecture – Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Jon Shlens and Zbigniew Wojna
  • Systems Infrastructure – Sherry Moore, Martin Wicke, David Andersen, Matthieu Devin, Manjunath Kudlur and Nishant Patil
  • TensorFlow-Slim – Sergio Guadarrama and Nathan Silberman
  • Model Visualization – Fernanda Viégas, Martin Wattenberg and James Wexler