Deep Learning On Retina Images As Screening Tool For Diagnostic Decision Support

Maria Camila Alvarez Triviño¹, Jérémie Despraz², Jesús Alfonso López Sotelo¹ and Carlos Andrés Peña²
¹ Engineering Faculty, Universidad Autónoma de Occidente, Cali, Colombia
² School of Business and Engineering Vaud (HEIG-VD), University of Applied Sciences of Western
Switzerland (HES-SO), Yverdon-les-Bains, Switzerland
{mcalvarez, jalopez}@uao.edu.co, {jeremie.despraz, carlos.pena}@heig-vd.ch

Keywords: Deep-Learning, Convolutional Neural Networks, Diabetic Retinopathy, Image Processing, Medical
Imaging.

Abstract: In this project, we developed a deep learning system applied to human retina images for medical diagnostic
decision support. The retina images were provided by EyePACS (Eyepacs, LLC). These images were used
in the framework of a Kaggle contest (Kaggle INC, 2017), whose purpose was to identify diabetic retinopathy
signs through an automatic detection system. Using as inspiration one of the solutions proposed in the
contest, we implemented a model that successfully detects diabetic retinopathy from retina images. After a
carefully designed preprocessing, the images were used as input to a deep convolutional neural network
(CNN). The CNN performed a feature extraction process followed by a classification stage, which allowed
the system to differentiate between healthy and ill patients using five categories. Our model was able to
identify diabetic retinopathy in the patients with an agreement rate of 76.73% with respect to the medical
expert's labels for the test data.

1 INTRODUCTION

Nowadays, medical images can be used to assess the health of a patient, enabling clinicians to
perform diagnostics and start the corresponding treatment for pathologies such as breast cancer,
kidney deficiencies, and diabetic retinopathy, among others. Applying computer-aided diagnosis
(CAD) in the healthcare industry can help to automate the decision-making process, enhance the
visualization of data, and improve the extraction of complex features from medical images. For
these reasons, CAD can greatly benefit medical experts in making clinical diagnostics. CAD uses
different machine learning techniques such as deep learning, a technique which generally requires
large amounts of data in order to extract relevant information and automatically highlight the
specific characteristics that differentiate each pathology.

Diabetic retinopathy (DR) is a complication of diabetes that has become one of the leading causes
of blindness in working-age adults (CDC, Centers for Disease Control and Prevention). It is
characterized by a deterioration of the retinal blood vessels that can cause them to swell and
leak blood into the vitreous or, in advanced stages, can produce an abnormal growth of new blood
vessels. In this project, we developed an automated detection system for DR based on image
pre-processing and deep learning.

2 RELATED WORK

Deep learning is a discipline that allows the development of algorithms capable of learning by
themselves from a large set of data, without the need to be programmed explicitly. In 2016, Google
researchers developed an automated system using deep learning that was able to detect diabetic
retinopathy (DR) and macular edema in colour retinal fundus photographs (Gulshan, Peng, Coram, &
Stumpe, 2016). This method, thanks to an
automated system for DR detection, has several potential benefits, such as increased efficiency,
reproducibility, and coverage of screening programs around the world. As a consequence of these
novel techniques, it is hoped that blindness due to DR can be reduced through early detection and
treatment. The researchers used a kind of artificial neural network optimized for image
classification, namely a deep convolutional neural network, training it with 128,175 labeled
retinal images. These images were provided by EyePACS (Eyepacs, LLC) and three eye hospitals in
India (Aravind Eye Hospital, Sankara Nethralaya, and Narayana Nethralaya). The images were rated
between 3 and 7 times for diabetic retinopathy, diabetic macular edema, and image gradability by a
group of 54 US-licensed ophthalmologists or ophthalmology trainees in their last year of
residency. The diabetic retinopathy severity was graded in agreement with the International
Clinical DR scale. According to this scale, the levels are: none, mild, moderate, severe, and
proliferative DR. Figure 1 illustrates an example of retinal images with their corresponding DR
grade and, at the bottom, the scale itself.

The optimization algorithm used to train the network weights was distributed stochastic gradient
descent. To speed up the learning phase, they implemented batch normalization and initialized the
weights from the same network pre-trained for object classification on the ImageNet dataset. The
network's performance was measured as the area under the receiver operating characteristic curve
(AUC) resulting from plotting sensitivity vs. 1 - specificity. Predictions were made using an
ensemble of 10 networks trained with the same data, and the final prediction was computed by a
linear average over the ensemble predictions.

Figure 1: International Clinical diabetic retinopathy scale, image from (Gulshan, Peng, Coram, &
Stumpe, 2016).

The resulting algorithm was validated at the beginning of 2016 using two separate datasets:
EyePACS-1 and Messidor-2. Both datasets were evaluated by at least 7 US board-certified
ophthalmologists (ABMS), with high intragrader consistency. The algorithm was evaluated through
the use of two operating points selected from the development set, one selected for high
specificity and the other for high sensitivity. Using the first operating point (high
specificity), the algorithm's sensitivity for EyePACS-1 was 90.3% and the specificity was 98.1%.
In Messidor-2, the values were 87.0% and 98.5% for sensitivity and specificity respectively.
Using the second operating point (high sensitivity on the development set), the values for
EyePACS-1 were 97.5% for sensitivity and 93.4% for specificity, while in Messidor-2 the
sensitivity was 96.1% and the specificity 93.9%. In spite of these promising results, it was
concluded that more investigation is needed to evaluate the viability of applying the developed
algorithm in a clinical environment, or to conclude that this technique would represent a
significant improvement compared with the care and outcomes of the current ophthalmologic
assessments.

3 METHODS

3.1 Data

This project used two datasets. The first dataset was downloaded from the crowd-sourcing platform
Kaggle (Kaggle INC, 2017), in which a Diabetic Retinopathy Detection contest was held. These
images were provided by EyePACS (Eyepacs, LLC). The dataset consisted of 88,000 high-resolution
retina images (training and test), taken under various conditions. A clinician rated each image
with a DR level from 0 to 4, corresponding to the International Clinical DR scale shown in
Figure 1.

The second dataset was provided by the Lausanne University Hospital (CHUV, Centre hospitalier
universitaire vaudois) and consisted of a small set of retina images taken in a sample of
individuals from the Swiss population in the context of an epidemiological study.

3.2 Preprocessing

Because of the different conditions in which the images were obtained, a preprocessing stage was
required in order to obtain a uniform set of inputs.

Images were cropped in such a way that the retina was positioned at the center and occupied as
much space as possible, thus reducing the number of black pixels corresponding to the
background. This was
achieved by removing pixel rows and columns of the images that did not sum up above a given
threshold, starting from the center and moving towards the margins.

Afterwards, using methods described in (Graham, 2015), we subtracted the local average color from
each pixel. The local average color of each pixel was obtained by passing the image through a
Gaussian filter (Jain & Kasturi, 1995). This removed most of the background color and highlighted
finer details of the image such as veins and stains (see Figure 3). We then computed the retina
diameter by comparing the pixel values along a line crossing the middle of the image. This allowed
us to compute a circular mask that we used to segment the original image into background and
retina regions. Note that, in order to remove boundary effects on the outer regions of the retina,
the mask area was reduced by 10% of its original size. The entire preprocessing is summarized
schematically in Figure 2.

Finally, the images were re-scaled to a size of 512x512 while maintaining the same aspect ratio,
and the background pixels were mapped to a 50% grey level.

Figure 2: Summary of image preprocessing steps.

3.3 Implementation

As a starting point, we used the architecture proposed by the DeepSense team (Deepsense.Io, 2015)
as a solution for the Kaggle contest. This architecture was then modified to obtain our final,
best-performing model, whose architecture is presented in Table 1.

Table 1: Detailed CNN architecture

LAYER TYPE             OUTPUT SHAPE
Convolution 2D         16, 512, 512
Batch Normalization
Convolution 2D         16, 512, 512
Batch Normalization
Max pooling            16, 255, 255
Convolution 2D         32, 255, 255
Batch Normalization
Convolution 2D         32, 255, 255
Batch Normalization
Max pooling            32, 127, 127
Convolution 2D         64, 127, 127
Batch Normalization
Convolution 2D         64, 127, 127
Batch Normalization
Max pooling            64, 63, 63
Convolution 2D         96, 63, 63
Batch Normalization
Max pooling            96, 31, 31
Convolution 2D         96, 31, 31
Batch Normalization
Max pooling            96, 15, 15
Convolution 2D         128, 15, 15
Batch Normalization
Max pooling            128, 7, 7
Dropout
Flatten                6272
Dense + Regularizer    96
Dropout
Batch Normalization
Dense                  5
softmax

The model that best fits the problem was found through experimentation, modifying the model's
hyperparameters according to the issues observed (e.g. overfitting, strong class imbalance) in
order to maximize the quadratic weighted kappa (K) for the test data. These modifications
consisted in adding L2 regularizers, batch normalization layers and additional dropout layers to
reduce overfitting.

The deep-learning software library used in this project was Keras (Chollet, 2017) with Theano
(LISA Lab, 2017) as backend, and the training of the network itself was performed on a GeForce
GTX 980 Ti. The metric used to evaluate the network's performance was the quadratic weighted
kappa K (Cohen, 1968). This metric has the advantage of penalizing guesses that are correct only
by chance, by taking into account the label distribution.

3.4 Balancing Data Distribution

The main issue with the available data was the strong class imbalance between healthy and ill
subjects, the former being overrepresented in the samples. In order to solve this issue, we used
data
augmentation techniques.

Since images were processed in batches, we constructed these batches in such a way that their
inner class distribution was uniform. This was achieved by implementing a custom image selection
process to fill the batches. The image selection was based on randomly selecting images from the
entire dataset with a probability inversely proportional to the frequency of their class in the
original distribution.

In addition, we applied random rotations by angles between 0 and 360 degrees and performed
horizontal and/or vertical flips with a probability of 50% to artificially augment the data, thus
avoiding having the exact same image several times in a batch.

4 RESULTS AND DISCUSSION

4.1 EyePACS' Dataset

The preprocessing allowed the retina to be well centred in the image and to occupy as much space
as possible. In addition, local mean subtraction highlighted the specific details that can be
signs of the pathology, such as haemorrhages, as shown in the proliferative DR case of Figure 3.

Figure 3: Comparison between an image without and with preprocessing.

Our first experiments demonstrated the importance of balancing the class distribution. As an
example, if the data is introduced into the network in batches without considering class
distribution, the results appear to be good in terms of training and validation scores in the
accuracy vs. epochs graph. However, the kappa value quickly decreases to zero, as the network
starts to strongly overfit the majority class of the dataset. As mentioned above, this is caused
by a strong class imbalance: class zero (healthy patients) represents 74% of the data, while
classes three and four, corresponding to high illness levels, represent merely 4.52% of the data.
This causes the network to obtain a high accuracy (74%) through the trivial solution of labelling
all images as belonging to the class which is most frequent in the data.

This also demonstrates the usefulness of the weighted kappa metric, which allows for a more
accurate assessment of the model's performance, since it takes into account the probability of
the agreement occurring by chance, producing a score of zero for the example described above.

This motivated and justified the use of dynamic data re-sampling as detailed in Section 3.4, which
allowed the network to always receive a batch with uniformly distributed classes. However, this
re-sampling method caused the images belonging to the less represented classes to be repeated
multiple times. Our experiments demonstrated that this can lead the network to unwanted
overfitting, given that it can memorize the training samples rather than actually learn from
them. This appeared clearly in the large difference between the kappa value for the training data
(0.96) and for the validation data (0.25). To solve this issue, we implemented different
techniques: increasing the number of dropout layers and the dropout probability, adding
regularizer layers, and applying the artificial data augmentation process mentioned in
Section 3.4.

Multiple experiments and network configurations were explored in order to find the best model for
our problem. The solution which obtained the highest K score for the test data, shown in Table 1,
had the following architecture: nine convolutional layers, six max pooling layers, batch
normalization after each convolutional layer, two fully connected (FC) layers, L2 regularization,
and dropout layers of 25% before each fully connected layer. The activation functions were ReLU
for the convolutional layers and the first FC layer, and softmax for the last layer. The loss
function used during training was categorical_crossentropy and the optimizer was adadelta. While
K (kappa) was used as the indicator to measure the network's performance, it was not used directly
as a cost function because, in our experiments, such an implementation did not allow for an
optimal convergence of the model.

Due to memory limitations, it was necessary to use a Python generator to load the data batch by
batch to be sent to the network. In this generator, we implemented the data augmentation
operations described in Section 3.4. The overall number of parameters was 927,911 and the training
took around 36 hours in total.

The behavior of kappa for our best model in its first epochs can be observed in Figure 4. We
observe in particular how both the training and validation
curves grow together. In total, 144 epochs were run, but not all of them are shown in the graph
because the training process was stopped and restarted from the saved weights on several
occasions due to technical issues. With this network, the value of K for the test data was
0.70309.

Figure 4: Behavior of kappa with respect to the epochs for our best model.

To increase the kappa value further, the trained network was fine-tuned. During fine tuning, the
dynamic re-sampling method was omitted in order to train the network with the original class
distribution. With this modification, the K for the test data rose to 0.75265.

To further increase the predictive accuracy of our model, we took advantage of the correlation of
images coming from the same patient. Inspired by DeepSense's solution, we implemented a
post-processing step which took into account the probabilities predicted for each eye (left and
right) before outputting a final correlated prediction. With this additional step, the final K
rose to 0.76736, representing a high level of agreement between the clinical and the automatic
system predictions.

Table 2 summarizes the K values obtained on the test set at each stage of the final model.

Table 2: Test K summary for each final model stage

MODEL STAGE                            K EVALUATED ON TEST DATA
144 training epochs                    0.70309
Fine tuning                            0.75265
Post-processing analysing both eyes    0.76736

4.2 CHUV's Dataset

Finally, we used the CHUV dataset to qualitatively explore the network's performance in
situations close to clinical conditions. 192 images corresponding to 79 patients were introduced
into the neural network. The network classified 8 of these images with a DR level different from
zero. Visually analyzing these unlabeled images, we found that these individuals indeed had signs
similar to those present in DR patients' images. However, since we did not have experts to label
precisely the presence or absence of DR, we could not guarantee that these images came from
patients with DR, as the signs may also come from other retinopathy types such as hypertensive or
pigmentary retinopathy.

Figure 5 shows some examples of patients' retina images that the network identified with a DR
level different from zero. The image in Figure 5(a) was classified as DR level 2; we observe
small yellow points that could be signs of the illness, such as hard exudates. In the same way,
Figure 5(b) shows another retina image, classified as DR level 4, i.e. proliferative DR; in this
image we can see what could be a blood stain, possibly indicative of a vitreous hemorrhage.

Figure 5: Patients' retina images with predicted DR level different from zero. (a) DR class: 2.
(b) DR class: 4.

5 CONCLUSION

This project presented the development of an automated system for diabetic retinopathy detection
in color retina images, through the implementation of deep learning techniques. The adopted deep
learning tool was a deep convolutional neural network, trained with a large set of pre-processed
images. The quadratic weighted kappa for our
best model, evaluated on the test data, was 76.74%, representing a good strength of agreement
between the predicted and the expected grading. The GitHub repository with the code can be
reached at: https://github.com/mcamila777/DL-to-retina-images.

REFERENCES

CDC, Centers for Disease Control and Prevention. (n.d.). Vision Health Initiative (VHI): Diabetic
Retinopathy. Retrieved Feb 21, 2017, from Centers for Disease Control and Prevention:
https://www.cdc.gov/visionhealth/basics/ced/
Chollet, F. (2017). Keras: Deep Learning library for Theano and TensorFlow. Retrieved May 22,
2017, from Keras: https://keras.io/
CHUV, Centre hospitalier universitaire vaudois. (n.d.). Retrieved May 22, 2017, from
http://www.chuv.ch/
Cohen, J. (1968, October). Weighted kappa: nominal scale agreement with provision for scaled
disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.
Deepsense.Io. (2015). Diagnosing diabetic retinopathy with deep learning. Retrieved May 22, 2017,
from https://deepsense.io/diagnosing-diabetic-retinopathy-with-deep-learning/
Eyepacs, LLC. (n.d.). Eyepacs. Retrieved Feb 21, 2017, from
http://www.eyepacs.com/eyepacssystem/
Graham, B. (2015). Kaggle Diabetic Retinopathy Detection competition report. Retrieved May 21,
2017, from University of Warwick, Coventry, UK:
https://kaggle2.blob.core.windows.net/forum-message-attachments/88655/2795/competitionreport.pdf
Gulshan, V., Peng, L., Coram, M., & Stumpe, M. C. (2016). Development and Validation of a Deep
Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA,
316(22), 2402-2410.
Jain, R., & Kasturi, R. (1995). Machine vision: Image Filtering. New York: McGraw-Hill, Inc.
Kaggle INC. (2017). Diabetic Retinopathy Detection. Retrieved Feb 21, 2017, from
https://www.kaggle.com/c/diabetic-retinopathy-detection
LISA Lab. (2017). Theano at a Glance. Retrieved May 22, 2017, from University of Montreal:
http://deeplearning.net/software/theano/introduction.html
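
As an illustration of the quadratic weighted kappa (K) reported in Section 4, and of the
majority-class example discussed in Section 4.1, the following is a minimal pure-Python sketch.
It is not the code used in this project: the function name quadratic_weighted_kappa is
hypothetical, and the label counts in the usage lines are invented for illustration (only the 74%
share of class zero is taken from the text).

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)

    # observed[i][j] = number of samples graded i by the expert and j by the model.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal histograms give the agreement that would be expected by chance.
    true_hist = Counter(y_true)
    pred_hist = Counter(y_pred)

    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreements are penalized by the squared distance between grades.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0.0:
        # Both label lists contain one and the same single class: treated here
        # as perfect agreement (an assumption, kappa is formally undefined).
        return 1.0
    return 1.0 - numerator / denominator


# Hypothetical label distribution: 74% healthy, the rest split arbitrarily.
y_true = [0] * 74 + [1] * 10 + [2] * 11 + [3] * 3 + [4] * 2
y_pred = [0] * len(y_true)  # trivial model: always predict the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, quadratic_weighted_kappa(y_true, y_pred))  # accuracy 0.74, kappa 0.0
```

The trivial predictor reaches 74% accuracy while its kappa is zero, which is the behavior
described in Section 4.1 and the reason accuracy alone is misleading on this dataset.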
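
The class-balanced batch construction and random augmentation of Section 3.4 can be sketched as
follows. This is an illustrative pure-Python outline under stated assumptions, not the Keras
generator used in this project: the name balanced_batches, its parameters and the tuple layout
are hypothetical, and the actual image loading, rotation and flipping are left out.

```python
import random
from collections import Counter


def balanced_batches(labels, batch_size, n_batches, seed=None):
    """Yield batches of (index, rotation_deg, flip_h, flip_v) with roughly uniform classes."""
    rng = random.Random(seed)
    class_counts = Counter(labels)
    # Each image is weighted by 1 / (size of its class), so every class carries
    # the same total sampling probability regardless of how frequent it is.
    weights = [1.0 / class_counts[label] for label in labels]
    indices = range(len(labels))
    for _ in range(n_batches):
        chosen = rng.choices(indices, weights=weights, k=batch_size)
        # Random augmentation parameters make repeated minority-class images
        # differ from one another: a rotation angle and two 50% flips.
        yield [
            (idx, rng.uniform(0.0, 360.0), rng.random() < 0.5, rng.random() < 0.5)
            for idx in chosen
        ]


# Hypothetical imbalanced label list (class 0 dominates).
labels = [0] * 740 + [1] * 70 + [2] * 145 + [3] * 25 + [4] * 20
seen = Counter()
for batch in balanced_batches(labels, batch_size=32, n_batches=200, seed=0):
    seen.update(labels[idx] for idx, _, _, _ in batch)
print({c: round(seen[c] / sum(seen.values()), 3) for c in sorted(seen)})
```

Sampling with replacement means that images from rare classes are drawn many times, which is
why the text pairs this re-sampling with augmentation and extra regularization to limit
overfitting.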