Multi-Disease Detection in Retinal Imaging Based On Ensembling Heterogeneous Deep Learning Models
The images were annotated with 46 conditions, including various rare and challenging diseases, through adjudicated consensus of two senior retinal experts. These 46 conditions are represented by the following classes, which are also listed in Tab. 1: an overall normal/abnormal class, 27 specific condition classes and one 'OTHER' class consisting of the remaining extremely rare conditions. Besides the training dataset, the organizers of the RIADD challenge held back 1280 images for external validation and testing datasets to ensure robust evaluation [7], [8].

2.2. Preprocessing and Image Augmentation

In order to simplify the pattern-finding process of the deep learning model, as well as to increase data variability, we applied several preprocessing methods.

We utilized extensive image augmentation for up-sampling to balance the class distribution, and real-time augmentation during training to obtain novel and unique images in each epoch. The augmentation techniques consisted of rotation, flipping, and alterations in brightness, saturation, contrast and hue. The up-sampling ensured that each label occurred at least 100 times in the dataset, which increased the total number of training images from 1920 to 3354.

Afterwards, all images were square-padded in order to avoid aspect-ratio loss during subsequent resizing. The retinal images were also cropped to ensure that the fundus is centered in the image. The cropping was performed individually for each microscope resolution and resulted in the following image shapes: 1424x1424, 1536x1536 and 3464x3464 pixels. The images were then resized to model input sizes according to the neural network architecture, which was 380x380 for EfficientNetB4, 299x299 for InceptionV3 and 224x224 for all remaining architectures [9]–[12].

Before feeding the images to the deep convolutional neural network, we applied intensity normalization as a last preprocessing step. The intensities were zero-centered via the Z-score normalization approach, based on the mean and standard deviation computed on the ImageNet dataset [13].

2.3. Deep Learning Models

Deep convolutional neural network models represent the unmatched state of the art for medical image classification [4], [14]. Nevertheless, the hyperparameter configuration and architecture selection are highly dependent on the required computer vision task, and are a key difference between pipelines [4], [15]. Thus, our pipeline combines two different types of image classification models: the disease risk detector for the binary classification of normal/abnormal images, and the disease label classifier for the multi-label annotation of abnormal images.

Both model types were pretrained on the ImageNet dataset [13]. For the fitting process, we applied transfer learning, with all architecture layers frozen except for the classification head, followed by a fine-tuning strategy with unfrozen layers. Whereas the transfer learning fitting was performed for 10 epochs using the Adam optimizer with an initial learning rate of 1e-04, the fine-tuning had a maximal training time of 290 epochs and used a dynamic learning rate for the Adam optimizer, starting from 1e-05 with a maximum decrease to 1e-07 (decreasing factor of 0.1
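To make the preprocessing steps concrete, the square padding, resizing and ImageNet-based Z-score normalization described above could be sketched as follows. This is a minimal NumPy sketch with hypothetical function names, not the pipeline's actual implementation; a library resizer (e.g. with bilinear interpolation) would normally replace the nearest-neighbor stand-in used here.

```python
import numpy as np

# Channel-wise mean and standard deviation of the ImageNet dataset (RGB, scaled to [0, 1])
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def square_pad(image: np.ndarray) -> np.ndarray:
    """Zero-pad an HxWxC image to a square so posterior resizing keeps the aspect ratio."""
    h, w, c = image.shape
    size = max(h, w)
    padded = np.zeros((size, size, c), dtype=image.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    padded[top:top + h, left:left + w] = image
    return padded

def resize_nearest(image: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbor resize to size x size (stand-in for a proper library resizer)."""
    h, w, _ = image.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def zscore_normalize(image: np.ndarray) -> np.ndarray:
    """Zero-center a uint8 RGB image per channel via ImageNet statistics (Z-score)."""
    x = image.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```

Following the input sizes above, the resize target would be 380 for EfficientNetB4, 299 for InceptionV3 and 224 for the remaining architectures.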
2.4.1. Bagging

validation loss, it was still possible to obtain a powerful model for detection.

3.1. Internal Performance Evaluation

For estimating the performance of our pipeline, we utilized the validation subsets of the 5-fold cross-validation models from the heavily augmented version of our dataset. This approach allowed us to obtain testing samples that were never seen during the training process, for reliable performance evaluation. For the complex multi-label evaluation, we computed the popular area under the receiver operating characteristic curve (AUROC), as well as the mean average precision (mAP). Both scores were macro-averaged over classes and cross-validation folds to reduce complexity.

Our multi-disease detection pipeline revealed strong and robust classification performance, with the capability to also detect rare conditions accurately in retinal images. Whereas the disease label classifier models individually only achieved an AUROC of around 0.97 and a mAP of 0.93, the disease risk detectors demonstrated very strong predictive power with 0.98 up to 0.99 AUROC and mAP. However, among the classifiers, the InceptionV3 architecture showed the worst performance compared to the other architectures, with only 0.93 AUROC and 0.66 mAP. The associated receiver operating characteristics of the models are illustrated in Fig. 3.

Training a strong multi-label classifier is in general a complex task; moreover, the extreme class imbalance between the conditions posed a hard challenge for building a reliable model [19], [20]. Our applied up-sampling and class weighting techniques demonstrated a critical boost to the predictive capabilities of the classifier models. Nearly all labels could be accurately detected, including the 'OTHER' class consisting of various extremely rare conditions. Nevertheless, the two classes 'EDN' and 'CRS' were the most challenging conditions for all classifier models. Both classes are very rare conditions, with less than 1.2% occurrence in the original and 2.5% occurrence in the up-sampled dataset. Still, our stacked logistic regression algorithm was able to balance this issue and infer the correct 'EDN' and 'CRS' classifications through context. Overall, our applied ensemble learning strategies resulted in a significant performance improvement compared to the individual deep convolutional neural network models. More details on the internal performance evaluation are listed in Tab. 2.

Tab. 2. Achieved results of the internal performance evaluation showing the average AUROC and mAP score for each model utilized in our pipeline. The scores were macro-averaged across all cross-validation folds and classes.

Model Type   Architecture          AUROC   mAP
Classifier   DenseNet201           0.973   0.931
Classifier   EfficientNetB4        0.969   0.929
Classifier   ResNet152             0.970   0.930
Classifier   InceptionV3           0.932   0.663
Detector     DenseNet201           0.980   0.997
Detector     EfficientNetB4        0.993   0.999
Ensembler    Logistic Regression   0.999   0.999

3.2. External Evaluation through the RIADD Challenge

Furthermore, we participated in the RIADD challenge, which was organized by the authors of the RFMiD dataset [7], [8]. The challenge participation allowed not only an independent
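To make the evaluation protocol concrete, the macro-averaged AUROC reported above can be computed via the rank-based (Mann-Whitney U) formulation. The following is a minimal NumPy sketch; the function names are ours for illustration and do not come from the original pipeline.

```python
import numpy as np

def _rankdata(scores: np.ndarray) -> np.ndarray:
    """Ranks starting at 1, with tied scores assigned their average rank."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=np.float64)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        mask = scores == value
        ranks[mask] = ranks[mask].mean()
    return ranks

def auroc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    ranks = _rankdata(y_score)
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def macro_auroc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Macro-average the per-class AUROC over the columns of a multi-label matrix."""
    return float(np.mean([auroc(y_true[:, k], y_score[:, k])
                          for k in range(y_true.shape[1])]))
```

For example, auroc(np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8])) yields 0.75, since three of the four positive/negative score pairs are ranked correctly.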