Abstract
Deep learning for clinical applications is subject to stringent performance requirements, which raises a need for large labeled datasets. However, the enormous cost of labeling medical data makes this challenging. In this paper, we build a cost-sensitive active learning system for the problem of intracranial hemorrhage detection and segmentation on head computed tomography (CT). We show that our ensemble method compares favorably with the state-of-the-art, while running faster and using less memory. Moreover, our experiments are done using a substantially larger dataset than earlier papers on this topic. Since the labeling time could vary tremendously across examples, we model the labeling time and optimize the return on investment. We validate this idea by core-set selection on our large labeled dataset and by growing it with data from the wild.
1 Introduction
Clinical applications set a very high bar for machine learning algorithms, because any misdiagnosis could impact treatment plans and gravely harm the patient. Supervised learning is the leading technique for reaching the required performance, and its success is well established. However, supervised learning requires a large amount of labeled data, especially when deep neural networks are used. Unfortunately, expert labeling of medical images takes enormous time and cost. The problem is exacerbated when accurate pixelwise labeling is required. Accordingly, medical segmentation datasets tend to be relatively small [1, 2].
Active learning (AL) aims to address the paucity of labeled data by a reasoned choice of which available unlabeled examples to annotate [3,4,5,6,7]. A limitation of many prior studies of AL is that they validated AL only in a core-set selection setting [8], rather than demonstrating its utility for growing the labeled data, and that they did not attempt to model the cost of labeling [3, 4, 7]. However, the potential value of AL lies not in achieving comparable performance with less data, but in improving the model while minimizing labeling costs. On other problems it has been shown that labeling costs vary greatly from one example to another [3, 9, 10]. In the case of intracranial hemorrhage, we observe that the time needed for pixelwise labeling varies by up to 3 orders of magnitude across cases (see Fig. 3). Most AL studies to date select examples without addressing this wide variation in labeling time [4,5,6,7,8].
In this paper, we propose a cost-sensitive AL system by combining the query-by-committee [5] approach with labeling time prediction for each example. Our uniform-cost AL system compares favorably with the state of the art [4], while the cost-sensitive system gives a further boost under labeling time constraints. All experiments are conducted on our pixelwise-labeled dataset (29095 frames), which is about two orders of magnitude larger than standard MICCAI segmentation datasets [1, 2]. Moreover, our system is simpler, faster, and uses less memory than earlier works [4, 8]. Through the example of intracranial hemorrhage detection, we demonstrate the potential of cost-sensitive active learning to scale up medical datasets efficiently.
2 Supervised Learning System
As a machine learning system we use a convolutional neural network (CNN), more specifically a fully convolutional network (FCN), which makes pixelwise predictions. The standard approach is to input the entire image into the FCN and obtain pixelwise predictions with a single forward pass [11, 12]. We instead use an FCN that takes a patch as input and predicts the presence of hemorrhage for each pixel within that patch, which we call PatchFCN. This architecture has the advantage that the network must base its predictions on the local morphology and is hence less prone to overfit to the global context, which results in better test-time accuracy than standard FCNs. At test time we apply the PatchFCN in a sliding-window fashion (see Fig. 2). We extensively tested this architecture in a separate technical report [13] and established that it outperforms whole-image baselines for various underlying FCN architectures. As the specific FCN architecture we use a 38-layer dilated residual network (DRN), which uses dilated convolutions to preserve spatial resolution together with residual connections [14]. We also group the pixelwise predictions into regions using connected component analysis and aggregate the pixelwise predictions into frame and stack classification scores. This facilitates hemorrhage detection at the pixel, region, frame, and stack levels.
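As an illustration, a minimal sketch of the sliding-window inference and the connected-component grouping is given below; the patch size, stride, and the simple max-based frame score are placeholder assumptions made for this sketch rather than the exact settings of our system, and `patch_fcn` stands in for the trained PatchFCN.

```python
import numpy as np
from scipy import ndimage

def sliding_window_predict(frame, patch_fcn, patch=160, stride=80):
    """Average overlapping PatchFCN predictions over one CT frame.

    frame:     2-D array (H, W) of CT intensities, assumed H, W >= patch.
    patch_fcn: callable mapping a (patch, patch) array to per-pixel
               hemorrhage probabilities of the same shape.
    Note: covering the right/bottom border exactly would require padding
    or an extra offset window; this sketch omits that detail.
    """
    H, W = frame.shape
    prob = np.zeros((H, W))
    count = np.zeros((H, W))
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            window = frame[y:y + patch, x:x + patch]
            prob[y:y + patch, x:x + patch] += patch_fcn(window)
            count[y:y + patch, x:x + patch] += 1
    return prob / np.maximum(count, 1)

def aggregate_scores(pixel_prob, threshold=0.5):
    """Group pixel predictions into regions and derive a frame-level score."""
    regions, n_regions = ndimage.label(pixel_prob > threshold)
    region_scores = [pixel_prob[regions == r].max() for r in range(1, n_regions + 1)]
    frame_score = pixel_prob.max()  # simple placeholder aggregate for the frame
    return region_scores, frame_score
```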
3 Cost-Sensitive Active Learning
Let us define our active learning problem as follows: given a labeled seed set S and an unlabeled pool set U, find a small subset P of U for labeling that maximizes a suitable test-set metric. Our system, depicted in Fig. 1, estimates an uncertainty score for each example (see Sect. 3.1) and its labeling time (see Sect. 3.2). The goal is to select the set of examples whose total uncertainty is maximized under the constraint that the total estimated labeling time stays within a given budget. This optimal selection of items reduces to the well-known 0-1 Knapsack problem, which can be solved with dynamic programming.
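As an illustration of the selection step, the sketch below solves the resulting 0-1 Knapsack problem with standard dynamic programming; the discretization of labeling times into integer units is a simplification introduced here for the DP, not a detail of our system.

```python
def select_examples(uncertainties, times, budget, time_unit=60.0):
    """0-1 Knapsack selection: maximize total uncertainty subject to a
    labeling-time budget.

    uncertainties: per-stack uncertainty scores.
    times:         predicted labeling times (seconds).
    budget:        total labeling budget (seconds).
    Times are rounded to integer units (e.g. minutes) so standard DP applies.
    """
    w = [max(1, int(round(t / time_unit))) for t in times]  # integer weights
    W = int(budget / time_unit)
    # best[c] = (total uncertainty, chosen indices) using capacity at most c
    best = [(0.0, []) for _ in range(W + 1)]
    for i in range(len(uncertainties)):
        for c in range(W, w[i] - 1, -1):  # reverse order: each item used at most once
            cand = best[c - w[i]][0] + uncertainties[i]
            if cand > best[c][0]:
                best[c] = (cand, best[c - w[i]][1] + [i])
    return best[W][1]  # indices of the selected stacks

# Example: pick stacks under a 2-hour budget.
chosen = select_examples([0.9, 0.4, 0.7], [3600.0, 1800.0, 5400.0], 7200.0)
```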
3.1 Uncertainty Measure
Uncertainty (or informativeness) is at the core of active learning techniques. It can be estimated from single-model outputs [6] or from a committee of models [5]. The idea of query-by-committee (QBC) is to run multiple models on the same example and use their disagreement to estimate uncertainty. Experimentally, we found that QBC consistently works better than single-model uncertainty. Within the QBC framework, we tried various uncertainty measures and found the Jensen-Shannon (JS) divergence to work best. Concretely, assume we have N models in the committee and the output distribution of model i is \(P_i\). The JS divergence is then defined as:

$$ JS(P_1, \ldots, P_N) = H\!\left(\frac{1}{N}\sum_{i=1}^{N} P_i\right) - \frac{1}{N}\sum_{i=1}^{N} H(P_i), $$

where H is the entropy function.
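A minimal numpy sketch of this committee disagreement, computed per pixel from the class probabilities of the N models, is shown below; the epsilon for numerical stability is a detail added here for the sketch.

```python
import numpy as np

def js_divergence(probs, eps=1e-12):
    """Pixelwise Jensen-Shannon divergence of an ensemble.

    probs: array of shape (N, C, H, W) -- per-model class probabilities
           for C classes over an H x W patch or frame.
    Returns an (H, W) uncertainty map.
    """
    def entropy(p):
        return -(p * np.log(p + eps)).sum(axis=-3)  # sum over the class axis

    mean_p = probs.mean(axis=0)                      # (C, H, W) committee average
    return entropy(mean_p) - entropy(probs).mean(axis=0)
```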
We average all pixelwise uncertainties within a patch to obtain the uncertainty of that patch. The stack uncertainty is obtained by averaging the top K most uncertain patches within the stack. The choice of K is a balance between taking the max (\(K=1\)) and the mean (\(K=\infty \)) over the whole stack. In all AL experiments in this paper, we set \(K=200\) and the number of models \(N=4\). We tried larger N but observed no performance gain. A visualization of this uncertainty can be found in Fig. 6.
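The aggregation from pixels to patches and from patches to a stack score can be sketched as follows; the non-overlapping patch tiling used here is a simplification for illustration.

```python
def patch_uncertainties(pixel_unc, patch=160, stride=160):
    """Mean pixelwise uncertainty inside each (non-overlapping) patch."""
    H, W = pixel_unc.shape
    scores = []
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            scores.append(float(pixel_unc[y:y + patch, x:x + patch].mean()))
    return scores

def stack_uncertainty(per_frame_pixel_unc, K=200):
    """Average the K most uncertain patches over the whole stack."""
    all_patches = []
    for frame_unc in per_frame_pixel_unc:   # one uncertainty map per frame
        all_patches.extend(patch_uncertainties(frame_unc))
    all_patches.sort(reverse=True)
    top = all_patches[:K]                   # if fewer than K patches, use them all
    return sum(top) / len(top)
```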
3.2 Labeling Time Prediction
First, we need to ask what the optimal unit of labeling is – a patch, a frame, or a stack? Drawing on our neuroradiology expertise, we settled on labeling stacks. While labeling patches or frames may seem more effective from a machine learning perspective, it carries a severe overhead: the whole stack needs to be retrieved and examined by the radiologist anyway. It is therefore less efficient than labeling stacks.
To apply active learning in practice, we need to ensure that it actually saves labeling cost or effort. This is crucial, as per-stack labeling times in our data span 3 orders of magnitude. We use linear regression to predict the log labeling time \(\log {t}\) from two log-transformed features: (1) the mask boundary length B, and (2) the number of connected components M.
Figure 3 shows the effectiveness of our log-transform and the goodness of fit on both features. We used 61 data points to fit the linear model, which we found to be sufficient. At test time we compute the features from the pixelwise predictions of our network. We also tried using deep FCN features from an intermediate layer directly, but found the predictions to be less stable.
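A sketch of this regression using scikit-learn is shown below; the feature extraction from a predicted binary mask and the +1 offsets inside the logs (to handle empty masks) are assumptions made for this illustration.

```python
import numpy as np
from scipy import ndimage
from sklearn.linear_model import LinearRegression

def time_features(pred_mask):
    """Log-transformed features from a predicted boolean hemorrhage mask:
    boundary length B (perimeter pixel count) and number of connected components M."""
    eroded = ndimage.binary_erosion(pred_mask)
    boundary = np.logical_and(pred_mask, np.logical_not(eroded))
    boundary_len = int(boundary.sum())                 # B
    _, n_components = ndimage.label(pred_mask)         # M
    return [np.log(boundary_len + 1), np.log(n_components + 1)]

def fit_time_model(masks, labeling_times):
    """Fit log(t) ~ linear(log B, log M) on stacks with known labeling times (seconds)."""
    X = np.array([time_features(m) for m in masks])
    y = np.log(np.array(labeling_times))
    return LinearRegression().fit(X, y)

def predict_time(model, mask):
    """Predicted labeling time in seconds for a new stack."""
    return float(np.exp(model.predict(np.array([time_features(mask)]))[0]))
```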
4 Data Collection
Our pixelwise-labeled dataset contains 1247 clinical head CT scans (29095 valid frames) performed from 2010 to 2017 on 64-detector-row CT scanners (GE, Siemens) at our affiliated hospitals. Each scan is a stack of 27–38 frames with in-plane resolution close to 0.5 mm and z-axis resolution of 5 mm. Scans were anonymized by removing all protected health information as well as the skull, scalp, and face. A board-certified neuroradiologist with specialization in traumatic brain injury (TBI) identified areas of acute intracranial hemorrhage at the pixel level. We randomly split the dataset into trainval/test sets of 934/313 stacks, called \(S_{trainval}\) and \(S_{test}\) respectively (S for seed).
The unlabeled set was collected using key-phrase searches of radiology reports, searching independently for positive and negative cases. The search for positive cases over one year yielded 1755 cases; a separate search over a shorter period identified 640 negative cases. We call this set U (for unlabeled) to distinguish it from set S. In addition, 120 randomly selected cases from U (called \(U_{test}\)) were annotated at the stack level in order to benchmark our system in this domain.
5 Experiments
5.1 Core-Set Active Learning
A core-set is a subset of the training set on which the empirical loss of a model is similar to that on the entire training set. In this experiment, we grow the core-set iteratively and study how performance improves [4, 8]. For a fair comparison, we strip away the cost-prediction and Knapsack-solving parts of our full system (see Fig. 1) and select examples based on their uncertainty scores alone.
We use the average precision (AP) metric to compare algorithms. Figure 4 shows the performance of our query-by-committee system (QBC), the suggestive annotation system (QBC + Similarity) [4], and a random baseline. In this comparison, we improve on [4] by using the patch-based approach for the QBC + Similarity baseline, because PatchFCN [13] gives better uncertainty and similarity measures than a vanilla FCN; without it, we observed a significant performance drop. Following [4], we tried diversifying the ensemble with bootstrapping, but did not see a benefit.
The experiment began with a seed set of 1/32 of the training set, which was doubled in each round by either random sampling or active learning; the doubled set then becomes the new seed set and the process repeats. In each round, we trained an ensemble for all methods in order to compute the QBC uncertainty. Figure 4 shows that our system’s performance with half the dataset (S2) closely matches that with the whole dataset (S1) on every AP metric, similar to [4, 8]. However, here we use a dataset that is two orders of magnitude larger and much harder to overfit on.
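In outline, each core-set round proceeds as in the sketch below, where `train_ensemble` and `stack_uncertainty` stand in for training the committee and for the measure of Sect. 3.1; this is a schematic of the experimental loop, not our actual code.

```python
def core_set_rounds(seed, pool, rounds, train_ensemble, stack_uncertainty):
    """Grow the labeled seed set by doubling it each round with the most
    uncertain stacks from the pool (uncertainty-only selection, no cost model)."""
    for _ in range(rounds):
        ensemble = train_ensemble(seed)                      # committee on current seed
        ranked = sorted(pool,
                        key=lambda s: stack_uncertainty(ensemble, s),
                        reverse=True)
        picked = ranked[:len(seed)]                          # double the seed set
        seed = seed + picked
        pool = [s for s in pool if s not in picked]
    return seed
```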
Our experiment indicates that on a large dataset, QBC uncertainty alone can be sufficient to yield competitive, if not state-of-the-art, performance. Without bootstrapping or pairwise similarity, our system beats the random baseline by a good margin and compares favorably with [4] in performance and time complexity. The time complexity of core-set approaches [4, 8] is dominated by the pairwise similarity computation, which is quadratic and can be expensive in practice when the seed set is too large to be grown by brute-force labeling. In contrast, our system has linear time complexity because it computes everything on the fly.
5.2 Cost-Sensitive Active Learning
After validating the core-set AL, we model the cost with the full system described in Fig. 1. We randomly select half of our labeled training set as the seed set, to mimic the scenario where the seed set is large enough that naive labeling is impractical for growing the data, while keeping the pool at least as large as the seed. In each iteration, we grow the data by allocating additional labeling time and adding the examples selected by solving the Knapsack problem. For the random baseline, we randomly add examples until no further example fits in the given time. Figure 5 shows the superiority of our system (QBC) over both uniform-cost AL (UAL) and the random baseline in this setting. The result is consistent with Fig. 6, where UAL is biased toward examples with large bleeds and long labeling times. In fact, UAL selected 8/11 stacks in the first/second rounds, whereas cost-sensitive AL (CAL) selected 94/107 stacks. Due to this lack of stack diversity, UAL performs worse than CAL at the stack level.
The strong gain of CAL at (+10%) not carrying over to (+20%) is explained by the ratio of the unlabeled pool to the labeled training set. When the ratio is small, there is insufficient data for the AL system to choose from. In Fig. 4, the ratio starts at 3100% and ends at 100% at S2. In Fig. 5, the ratio starts at 100%; after the (+10%) round, it is 66% for CAL and 80% for Rand. The leveling off of CAL performance indicates that most of the informative examples were already selected in the (+10%) round.
5.3 Active Learning in the Wild
Finally, we apply our system to the unlabeled pool described in Sect. 4. First, we train an ensemble on the entire labeled set. Then we select examples from the unlabeled pool under a budget of 100 h. A neuroradiologist examined the selected cases and determined that there were 115 negatives and 64 positives; another 51 subacute or postsurgical cases were excluded. The actual labeling time turned out to be within \(10\%\) of our estimate. We call these newly annotated examples \(U_{train}\), to distinguish them from \(S_{trainval}\) defined in Sect. 4. To qualitatively assess the impact of cost modeling, we show examples mined by both uniform-cost and cost-sensitive AL in Fig. 6.
For quantitative benchmarking, we trained an ensemble of 4 PatchFCNs from scratch on the newly augmented data (Ensemble \(S_{trainval}\)+\(U_{train}\)) and compared it with the ensemble trained on the original data (Ensemble \(S_{trainval}\)). The results on \(S_{test}\) and \(U_{test}\) are shown in Table 1. We benchmark on two test sets because we care about performance in both the seed (S) and pool (U) domains, which in practice are often not exactly the same. The gain on \(S_{test}\) shows that our method works despite the domain shift, and the strong gain on \(U_{test}\) demonstrates how a model trained on large data can be improved by collecting a little more data judiciously.
6 Conclusion
In this paper, we proposed a cost-sensitive, query-by-committee active learning system for intracranial hemorrhage detection. We validated it on a substantially larger pixelwise labeled dataset than earlier works and applied it to improve the model by annotating new data from the wild. Our study demonstrates the potential of growing large medical datasets to the next level with cost-sensitive active learning.
References
Sirinukunwattana, K., et al.: Gland segmentation in colon histology images: the GlaS challenge contest. Med. Image Anal. 35, 489–502 (2017)
Zhang, Y., Ying, M.T., Yang, L., Ahuja, A.T., Chen, D.Z.: Coarse-to-fine stacked fully convolutional nets for lymph node segmentation in ultrasound images. In: BIBM (2016)
Settles, B., Craven, M., Friedland, L.: Active learning with real annotation costs. In: NIPS Workshop on Cost-sensitive Learning (2008)
Yang, L., Zhang, Y., Chen, J., Zhang, S., Chen, D.Z.: Suggestive annotation: a deep active learning framework for biomedical image segmentation. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 399–407. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_46
Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Workshop on Computational Learning Theory (1992)
Lewis, D.D., Gale, W.A.: A sequential algorithm for training text classifiers. In: SIGIR (1994)
Mahapatra, D., Schüffler, P.J., Tielbeek, J.A.W., Vos, F.M., Buhmann, J.M.: Semi-supervised and active learning for automatic segmentation of Crohn's disease. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 214–221. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40763-5_27
Sener, O., Savarese, S.: Active learning for convolutional neural networks: a core-set approach. In: ICLR (2018)
Settles, B.: Active learning. In: Lectures on AI and ML (2012)
Tomanek, K.: Resource-aware annotation through active learning (2010)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
Yuh, E., Mukherjee, P., Manley, G.: Interpretation and quantification of emergency features on head computed tomography. Provisional Application no. 62/269,778 (2015)
Kuo, W.C., Häne, C., Yuh, E., Mukherjee, P., Malik, J.: PatchFCN for intracranial hemorrhage detection. arXiv preprint arXiv:1806.03265 (2018)
Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)
Acknowledgments
This work was supported in part by the California Initiative to Advance Precision Medicine. Christian Häne received funding from the Swiss National Science Foundation (165245). Amazon Web Services provided part of the compute time.