Multi-Disease Detection in Retinal Imaging Based On Ensembling Heterogeneous Deep Learning Models
The images were annotated with 46 conditions, including various rare and challenging diseases, through adjudicated consensus of two senior retinal experts. These 46 conditions are represented by the following classes, which are also listed in Tab. 1: an overall normal/abnormal class, 27 specific condition classes and one 'OTHER' class consisting of the remaining extremely rare conditions. Besides the training dataset, the organizers of the RIADD challenge held back 1280 images for external validation and testing datasets to ensure robust evaluation [7], [8].

2.2. Preprocessing and Image Augmentation

In order to simplify the pattern-finding process of the deep learning model, as well as to increase data variability, we applied several preprocessing methods.

We utilized extensive image augmentation for up-sampling to balance the class distribution, and real-time augmentation during training to obtain novel and unique images in each epoch. The augmentation techniques consisted of rotation, flipping, and alterations in brightness, saturation, contrast and hue. The up-sampling ensured that each label occurred at least 100 times in the dataset, which increased the total number of training images from 1920 to 3354.

Afterwards, all images were square-padded in order to avoid aspect-ratio loss during subsequent resizing. The retinal images were also cropped to ensure that the fundus is centered in the image. The cropping was performed individually for each microscope resolution and resulted in the following image shapes: 1424x1424, 1536x1536 and 3464x3464 pixels. The images were then resized to model input sizes according to the neural network architecture, which was 380x380 for EfficientNetB4, 299x299 for InceptionV3 and 224x224 for all remaining architectures [9]–[12].

Before feeding the images to the deep convolutional neural network, we applied intensity normalization as a last preprocessing step. The intensities were zero-centered via the Z-score normalization approach, based on the mean and standard deviation computed on the ImageNet dataset [13].

2.3. Deep Learning Models

Deep convolutional neural network models represent the unmatched state of the art for medical image classification [4], [14]. Nevertheless, the hyperparameter configuration and architecture selection are highly dependent on the required computer vision task, and are a key difference between pipelines [4], [15]. Thus, our pipeline combines two different types of image classification models: the disease risk detector for the binary classification of normal/abnormal images, and the disease label classifier for the multi-label annotation of abnormal images.

Both model types were pretrained on the ImageNet dataset [13]. For the fitting process, we applied transfer learning, with all architecture layers frozen except for the classification head, followed by a fine-tuning strategy with unfrozen layers. Whereas the transfer learning fitting was performed for 10 epochs using the Adam optimizer with an initial learning rate of 1e-04, the fine-tuning had a maximal training time of 290 epochs and used a dynamic learning rate for the Adam optimizer, starting from 1e-05 with a maximum decrease to 1e-07 (decreasing factor of 0.1
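To make the preprocessing steps concrete, the square padding, resizing and ImageNet-based Z-score normalization described above could be sketched as follows. This is a minimal NumPy sketch with hypothetical function names, not the pipeline's actual implementation; a library resizer (e.g. with bilinear interpolation) would normally replace the nearest-neighbor stand-in used here.

```python
import numpy as np

# Channel-wise mean and standard deviation of the ImageNet dataset (RGB, scaled to [0, 1])
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def square_pad(image: np.ndarray) -> np.ndarray:
    """Zero-pad an HxWxC image to a square so posterior resizing keeps the aspect ratio."""
    h, w, c = image.shape
    size = max(h, w)
    padded = np.zeros((size, size, c), dtype=image.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    padded[top:top + h, left:left + w] = image
    return padded

def resize_nearest(image: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbor resize to size x size (stand-in for a proper library resizer)."""
    h, w, _ = image.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def zscore_normalize(image: np.ndarray) -> np.ndarray:
    """Zero-center a uint8 RGB image per channel via ImageNet statistics (Z-score)."""
    x = image.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```

Following the input sizes above, the resize target would be 380 for EfficientNetB4, 299 for InceptionV3 and 224 for the remaining architectures.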
2.4.1. Bagging

validation loss, it was still possible to obtain a powerful model for detection.

3.1. Internal Performance Evaluation

For estimating the performance of our pipeline, we utilized the validation subsets of the 5-fold cross-validation models from the heavily augmented version of our dataset. This approach allowed us to obtain testing samples that were never seen during the training process, for reliable performance evaluation. For the complex multi-label evaluation, we computed the popular area under the receiver operating characteristic curve (AUROC), as well as the mean average precision (mAP). Both scores were macro-averaged over classes and cross-validation folds to reduce complexity.

Our multi-disease detection pipeline revealed strong and robust classification performance, with the capability to also detect rare conditions accurately in retinal images. Whereas the disease label classifier models individually only achieved an AUROC of around 0.97 and a mAP of 0.93, the disease risk detectors demonstrated very strong predictive power with 0.98 up to 0.99 AUROC and mAP. However, among the classifiers, the InceptionV3 architecture showed the worst performance compared to the other architectures, with only 0.93 AUROC and 0.66 mAP. The associated receiver operating characteristics of the models are illustrated in Fig. 3.

Training a strong multi-label classifier is in general a complex task; moreover, the extreme class imbalance between the conditions posed a hard challenge for building a reliable model [19], [20]. Our applied up-sampling and class weighting techniques demonstrated a critical boost to the predictive capabilities of the classifier models. Nearly all labels could be accurately detected, including the 'OTHER' class consisting of various extremely rare conditions. Nevertheless, the two classes 'EDN' and 'CRS' were the most challenging conditions for all classifier models. Both classes are very rare conditions, with less than 1.2% occurrence in the original and 2.5% occurrence in the up-sampled dataset. Still, our stacked logistic regression algorithm was able to balance this issue and infer the correct 'EDN' and 'CRS' classifications through context. Overall, our applied ensemble learning strategies resulted in a significant performance improvement compared to the individual deep convolutional neural network models. More details on the internal performance evaluation are listed in Tab. 2.

Tab. 2. Achieved results of the internal performance evaluation showing the average AUROC and mAP score for each model utilized in our pipeline. The scores were macro-averaged across all cross-validation folds and classes.

Model Type   Architecture          AUROC   mAP
Classifier   DenseNet201           0.973   0.931
Classifier   EfficientNetB4        0.969   0.929
Classifier   ResNet152             0.970   0.930
Classifier   InceptionV3           0.932   0.663
Detector     DenseNet201           0.980   0.997
Detector     EfficientNetB4        0.993   0.999
Ensembler    Logistic Regression   0.999   0.999

3.2. External Evaluation through the RIADD Challenge

Furthermore, we participated in the RIADD challenge, which was organized by the authors of the RFMiD dataset [7], [8]. The challenge participation allowed not only an independent
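To make the evaluation protocol concrete, the macro-averaged AUROC reported above can be computed via the rank-based (Mann-Whitney U) formulation. The following is a minimal NumPy sketch; the function names are ours for illustration and do not come from the original pipeline.

```python
import numpy as np

def _rankdata(scores: np.ndarray) -> np.ndarray:
    """Ranks starting at 1, with tied scores assigned their average rank."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=np.float64)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        mask = scores == value
        ranks[mask] = ranks[mask].mean()
    return ranks

def auroc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    ranks = _rankdata(y_score)
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def macro_auroc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Macro-average the per-class AUROC over the columns of a multi-label matrix."""
    return float(np.mean([auroc(y_true[:, k], y_score[:, k])
                          for k in range(y_true.shape[1])]))
```

For example, auroc(np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8])) yields 0.75, since three of the four positive/negative score pairs are ranked correctly.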