
1 Introduction

Clinical applications set a very high bar for machine learning algorithms, because any misdiagnosis could affect treatment plans and gravely harm the patient. Supervised learning is the leading technique for reaching the required performance, and its success is well established. However, supervised learning requires a large amount of labeled data, especially when deep neural networks are used, and expert labeling of medical images demands enormous time and cost. The problem is exacerbated when accurate pixelwise labeling is required. Accordingly, medical segmentation datasets tend to be relatively small [1, 2].

Fig. 1. Overview. First, the stack runs through the ensemble of PatchFCNs trained on the seed set S, which produces the mean hemorrhage heatmap and the Jensen-Shannon (JS) divergence uncertainty heatmap. From the mean hemorrhage heatmap, we apply multiple thresholds to compute the mean boundary length \(B_i\) and the number of connected components \(N_i\). Our log-regression model then takes \(B_i\) and \(N_i\) to predict the stack labeling time \(T_i\). The sum of the uncertainties of the top-K uncertain patches is defined to be the stack uncertainty \(V_i\). Given any fixed labeling budget (time) Q, we treat each stack in the unlabeled pool as an item of weight \(T_i\) and value \(V_i\). The optimal set of items for annotation is obtained by solving a 0-1 Knapsack problem with dynamic programming.

Active learning (AL) aims to address the paucity of labeled data by a reasoned choice of which available unlabeled examples to annotate [3,4,5,6,7]. A limitation of many prior studies of AL is that they validated AL only in a core-set selection setting [8], rather than demonstrating its utility in growing the labeled data, and they did not attempt to model the cost of labeling [3, 4, 7]. However, the potential value of AL lies not in achieving comparable performance with less data, but in improving the model while minimizing labeling costs. On other problems it has been shown that labeling costs vary greatly from one example to another [3, 9, 10]. In the case of intracranial hemorrhage, we observe that the time needed for pixelwise labeling varies by up to 3 orders of magnitude across cases (see Fig. 3). Most AL studies to date select examples without addressing this wide variation in labeling time [4,5,6,7,8].

In this paper, we propose a cost-sensitive AL system by combining the query-by-committee [5] approach with labeling time prediction for each example. Our uniform-cost AL system compares favorably with the state of the art [4], while the cost-sensitive system gives a further boost under labeling time constraints. All experiments are conducted on our pixelwise-labeled dataset (29095 frames), which is about two orders of magnitude larger than standard MICCAI segmentation datasets [1, 2]. Moreover, our system is simpler, faster, and uses less memory than earlier works [4, 8]. Through the example of intracranial hemorrhage detection, we demonstrate the potential of cost-sensitive active learning to scale up medical datasets efficiently.

2 Supervised Learning System

As the machine learning system we use a fully convolutional neural network (FCN), a convolutional network that makes pixelwise predictions. The standard approach is to feed the entire image into the FCN and obtain pixelwise predictions in a single forward pass [11, 12]. We instead use an FCN that takes a patch as input and predicts the presence of hemorrhage for each pixel within that patch, which we call PatchFCN. This architecture forces the network to make its predictions from the local morphology and hence makes it less prone to overfitting to the global context, which results in better test-time accuracy than standard FCNs. At test time we apply the PatchFCN in a sliding-window fashion (see Fig. 2). We extensively tested this architecture in a separate technical report [13] and established that it outperforms whole-image baselines for various underlying FCN architectures. As the specific FCN architecture we use the 38-layer dilated residual network (DRN), which uses dilated convolutions to preserve spatial resolution together with residual connections [14]. We also group the pixelwise predictions into regions using connected component analysis and aggregate them into frame and stack classification scores. This facilitates hemorrhage detection at the pixel, region, frame and stack level.
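To make the sliding-window inference concrete, the sketch below tiles a frame with overlapping 160\(\,\times \,\)160 patches and averages the per-pixel probabilities where patches overlap. It is a minimal illustration, assuming a PyTorch model that maps a single-patch input to per-pixel hemorrhage logits of shape (1, 1, 160, 160); it is not our exact implementation.

```python
# Minimal sliding-window PatchFCN inference sketch (illustrative only).
# Assumes `model` is a torch module mapping a (1, C, 160, 160) patch to
# per-pixel logits of shape (1, 1, 160, 160), and that H, W >= patch.
import torch

def patchfcn_predict(model, image, patch=160, stride=80):
    """Tile `image` (C, H, W) with overlapping patches and average the
    per-pixel probabilities where patches overlap."""
    C, H, W = image.shape
    probs = torch.zeros(H, W)
    counts = torch.zeros(H, W)
    ys = list(range(0, H - patch + 1, stride))
    xs = list(range(0, W - patch + 1, stride))
    # make sure the bottom/right borders are covered
    if ys[-1] != H - patch:
        ys.append(H - patch)
    if xs[-1] != W - patch:
        xs.append(W - patch)
    model.eval()
    with torch.no_grad():
        for y in ys:
            for x in xs:
                crop = image[:, y:y + patch, x:x + patch].unsqueeze(0)
                p = torch.sigmoid(model(crop))[0, 0]        # (patch, patch)
                probs[y:y + patch, x:x + patch] += p
                counts[y:y + patch, x:x + patch] += 1
    return probs / counts.clamp(min=1)
```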

3 Cost-Sensitive Active Learning

Let us define our active learning problem as follows: given a labeled seed set S and an unlabeled pool set U, find a small subset P of U for labeling that maximizes a suitable test set metric. Our system, depicted in Fig. 1, estimates an uncertainty score (see Sect. 3.1) and a labeling time (see Sect. 3.2) for each example. The goal is to select the set of examples whose total uncertainty is maximized under the constraint that the total estimated labeling time stays within a given budget. This selection reduces to the well-known 0-1 Knapsack problem, which can be solved with dynamic programming.
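The budgeted selection can be written down directly as the classic dynamic program. The sketch below is illustrative rather than our exact implementation; it assumes the predicted labeling times and the budget are expressed in the same integer-friendly unit (e.g., minutes).

```python
# 0-1 Knapsack selection sketch: items are stacks with weight = predicted
# labeling time and value = stack uncertainty V_i.
def select_stacks(times, values, budget):
    """Return indices of stacks maximizing total value within `budget`."""
    n = len(times)
    w = [max(1, int(round(t))) for t in times]   # integer weights (e.g. minutes)
    B = int(budget)
    dp = [0.0] * (B + 1)                         # dp[j] = best value at capacity j
    keep = [[False] * (B + 1) for _ in range(n)]
    for i in range(n):
        for j in range(B, w[i] - 1, -1):         # reverse order: each item used once
            if dp[j - w[i]] + values[i] > dp[j]:
                dp[j] = dp[j - w[i]] + values[i]
                keep[i][j] = True
    # backtrack to recover the chosen set
    chosen, j = [], B
    for i in range(n - 1, -1, -1):
        if keep[i][j]:
            chosen.append(i)
            j -= w[i]
    return chosen[::-1]
```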

3.1 Uncertainty Measure

Uncertainty (or informativeness) is at the core of active learning techniques. It can be estimated from the outputs of a single model [6] or of a committee of models [5]. The idea of query-by-committee (QBC) is to run multiple models on the same example and use their disagreement to estimate uncertainty. Experimentally, we found that QBC consistently works better than single-model uncertainty. Within the QBC framework, we tried various uncertainty measures and found the Jensen-Shannon (JS) divergence to work best. Concretely, let us assume we have N models in the committee and the output distribution of model i is \(P_i\). The JS divergence is then defined as:

$$\begin{aligned} JS(P_1,P_2,\ldots ,P_N) = H\left(\frac{1}{N}\sum _{i=1}^{N}P_i\right) - \frac{1}{N}\sum _{i=1}^{N}H(P_i) \end{aligned}$$
(1)

where H is the entropy function.
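For a binary (hemorrhage vs. background) output, Eq. (1) can be evaluated per pixel from the committee's probability maps. The following minimal sketch assumes each of the N members outputs a Bernoulli probability map of the same shape; it is illustrative, not our exact implementation.

```python
# Per-pixel Jensen-Shannon committee disagreement, Eq. (1), for binary outputs.
import numpy as np

def entropy(p, eps=1e-12):
    """Elementwise binary entropy of a probability map."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def js_divergence(prob_maps):
    """prob_maps: array of shape (N, H, W), one map per committee member."""
    mean_p = prob_maps.mean(axis=0)              # H(mean of distributions) ...
    return entropy(mean_p) - entropy(prob_maps).mean(axis=0)  # ... minus mean entropy
```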

Fig. 2. PatchFCN system. We train the network on patches and test it in a sliding window fashion. The optimal crop size is found to be 160\(\,\times \,\)160 for our task.

Fig. 3. Left: Time vs Log (Boundary Length). Right: Time vs Log (Number of Connected Components). Both plots show the goodness of our linear fit and the normality of residuals after the log transform. Note that the y-axis is displayed in log scale.

We average all pixelwise uncertainties within each patch to obtain the uncertainty of a patch. The stack uncertainty is obtained by averaging the top K uncertain patches within the stack. The choice of K is a balance between taking the max (\(K=1\)) and the mean (\(K=\infty \)) of the whole stack. In all AL experiments in this paper, we set \(K=200\) and the number of models \(N=4\). We tried larger N but did not gain any performance. A visualization of this uncertainty can be found in Fig. 6.
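A sketch of this patch/stack aggregation is given below. It follows the top-K averaging described above and assumes each per-frame JS map is at least one patch in size; the patch size, stride and K shown are illustrative parameters rather than a statement of our exact pipeline.

```python
# Aggregate per-pixel JS maps into patch scores, then into a stack score.
import numpy as np

def stack_uncertainty(js_maps, patch=160, stride=160, k=200):
    """js_maps: list of per-frame (H, W) JS divergence maps for one stack."""
    patch_scores = []
    for js in js_maps:
        H, W = js.shape
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                # patch uncertainty = mean pixelwise uncertainty in the patch
                patch_scores.append(js[y:y + patch, x:x + patch].mean())
    top_k = np.sort(patch_scores)[::-1][:k]      # K most uncertain patches
    return float(np.mean(top_k))
```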

3.2 Labeling Time Prediction

First, we need to ask what the optimal unit of labeling is: patch, frame or stack? Drawing on our neuroradiology expertise, we settled on labeling stacks. While labeling patches or frames may seem more effective from a machine learning perspective, it comes with a severe overhead: the whole stack needs to be retrieved and examined by the radiologist anyway. Therefore, it is less efficient than labeling stacks.

To apply active learning in practice, we need to ensure that it actually saves labeling cost or effort. This is crucial, as per-stack labeling times in our data span 3 orders of magnitude. We use linear regression to predict the log labeling time \(\log {t}\) from two log-transformed features: (1) the mask boundary length B and (2) the number of connected components M:

$$\begin{aligned} \log {t} = \alpha \log {B} + \beta \log {M} + \gamma \end{aligned}$$
(2)

Figure 3 shows the effectiveness of our log transform and the goodness of fit for both features. 61 data points were used to fit the linear model, which we found to be sufficient. To compute the features at test time, we use the pixelwise predictions of our network. We also tried using deep FCN features from an intermediate layer directly, but found the prediction to be less stable.
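As an illustration of Eq. (2), the sketch below extracts the boundary length B and the number of connected components M from a thresholded prediction mask and fits the log-log linear model. The use of scipy/scikit-learn and the 0.5 threshold are assumptions for the sketch, not a description of our exact pipeline.

```python
# Feature extraction and log-linear labeling-time regression, Eq. (2).
import numpy as np
from scipy import ndimage
from sklearn.linear_model import LinearRegression

def mask_features(prob_map, thresh=0.5):
    mask = prob_map > thresh
    _, m = ndimage.label(mask)                    # number of connected components
    eroded = ndimage.binary_erosion(mask)
    boundary_len = int((mask & ~eroded).sum())    # boundary pixel count
    return max(boundary_len, 1), max(m, 1)        # avoid log(0)

def fit_time_model(feature_pairs, times):
    """feature_pairs: list of (B, M); times: measured labeling times."""
    X = np.log(np.asarray(feature_pairs, dtype=float))
    y = np.log(np.asarray(times, dtype=float))
    return LinearRegression().fit(X, y)           # learns alpha, beta, gamma

def predict_time(model, B, M):
    return float(np.exp(model.predict(np.log([[B, M]]))[0]))
```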

Fig. 4. Core-set selection curves. Our system (QBC) starts to outperform [4] (QBC + Similarity) on region, frame and stack level as the dataset grows beyond one fourth of the whole set. Both QBC algorithms maintain a large gap with random baselines on pixel and region APs. For the frame and stack APs, our system still maintains a healthy margin above the random baseline for all data sizes. The region AP is computed following the definition of [13].

4 Data Collection

Our pixelwise labeled dataset contains 1247 clinical head CT scans (29095 valid frames) performed from 2010 to 2017 on 64-detector-row CT scanners (GE, Siemens) at our affiliated hospitals. Each scan is a stack of 27-38 frames with in-plane resolution close to 0.5 mm and z-axis resolution of 5 mm. Scans were anonymized by removing all protected health information as well as the skull, scalp and face. A board-certified neuroradiologist with specialization in traumatic brain injury (TBI) identified areas of acute intracranial hemorrhage at the pixel level. We randomly split the dataset into trainval/test sets of 934/313 stacks, called \(S_{trainval}\) and \(S_{test}\) respectively (S for seed).

The unlabeled set was collected using key-phrase searches of radiology reports. We searched independently for positive and negative cases. The search for positive cases over 1 year yielded 1755 cases; a separate search over a shorter period identified 640 negative cases. We call this set U (for unlabeled), to be distinguished from set S. In addition, 120 randomly selected cases from U (called \(U_{test}\)) were annotated at the stack level in order to benchmark our system in this domain.

5 Experiments

5.1 Core-Set Active Learning

A core-set is a subset of the training set on which the empirical loss of a model is similar to that on the entire training set. In this experiment, we grow the core-set iteratively and study how performance improves [4, 8]. For a fair comparison, we strip away the cost prediction and Knapsack-solving parts of our full system (see Fig. 1) and select examples based on their uncertainty scores alone.

We use the average precision (AP) metric to compare algorithms. Figure 4 shows the performance of our query-by-committee system (QBC), the suggestive annotation system (QBC + Similarity) [4], and the random baseline. In this comparison, we strengthen [4] by using the patch-based approach for the QBC + Similarity baseline, because PatchFCN [13] gives better uncertainty and similarity measures than a vanilla FCN; without it, we observed a significant performance drop. Following [4], we tried diversifying the ensemble with bootstrapping, but did not see a benefit.

The experiment began with a seed set of 1/32 of the training set, which was then doubled by either random sampling or active learning. In the next round, this doubled set becomes the new seed set and the process repeats. In each round, we trained an ensemble for all methods in order to compute the QBC uncertainty. Figure 4 shows that our system's performance at half the dataset (S2) closely matches the performance of using the whole dataset (S1) for every AP, similar to [4, 8]. However, here we use a dataset that is two orders of magnitude larger and much harder to overfit on.
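Schematically, each doubling round can be summarized as in the sketch below, where train_ensemble and score_uncertainty stand in for the committee training and the stack uncertainty of Sect. 3.1, and examples are assumed to be identified by hashable IDs. This is a sketch of the protocol, not our exact code.

```python
# Core-set doubling protocol: at each round, train a committee on the current
# labeled set, score the pool by QBC uncertainty, and add the most uncertain
# examples until the labeled set doubles.
def coreset_active_learning(seed, pool, rounds, train_ensemble, score_uncertainty):
    labeled = list(seed)
    pool = list(pool)
    for _ in range(rounds):
        ensemble = train_ensemble(labeled)                 # committee of PatchFCNs
        scores = {x: score_uncertainty(ensemble, x) for x in pool}
        ranked = sorted(pool, key=lambda x: scores[x], reverse=True)
        picked = set(ranked[:len(labeled)])                # double the labeled set
        labeled += list(picked)
        pool = [x for x in pool if x not in picked]
    return labeled
```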

Our experiment indicates that on a large dataset, QBC uncertainty alone can be sufficient to yield competitive, if not state-of-the-art, performance. Without bootstrapping or pairwise similarity, our system beats the random baseline by a good margin and compares favorably with [4] in performance and time complexity. The time complexity of core-set approaches [4, 8] is dominated by the pairwise similarity computation, which is quadratic and can be expensive in practice when the seed set is too large to be grown by brute-force labeling. In contrast, our system has linear time complexity because it computes everything on the fly.

5.2 Cost-Sensitive Active Learning

After validating core-set AL, we model the cost with the full system described in Fig. 1. We randomly select half of our labeled training set as the seed set to mimic the scenario where the seed set is large enough to render naive labeling impractical for growing the data, while keeping the pool at least as large as the seed. In each iteration, we increment the data by allocating additional time and adding the labeled examples chosen by solving the Knapsack problem. For the random baseline, we randomly add examples until no example fits within the given time anymore. Figure 5 shows the superiority of our cost-sensitive system over both uniform-cost AL (UAL) and the random baseline in this setting. The result is consistent with Fig. 6, where UAL is biased toward examples with large bleeds and long labeling times. In fact, UAL selected 8/11 stacks in the first/second rounds, whereas cost-sensitive AL (CAL) selected 94/107 stacks. Due to this lack of stack diversity, UAL performs worse than CAL at the stack level.

The strong gain of CAL at (+10%) not carrying over to (+20%) is explained by the ratio of the unlabeled pool to the labeled training set. When the ratio is small, there is not enough data for the AL system to choose from. In Fig. 4, the ratio starts at 3100% and stops at 100% at S2. In Fig. 5, the ratio starts at 100%; after the (+10%) round, it is 66% for CAL and 80% for Rand. The leveling off of CAL performance shows that most of the informative examples were already selected in the (+10%) round.

Fig. 5. Cost-sensitive active learning. At the first iteration, the system achieves much better performance than the random baseline for all metrics. The random baseline does not improve over the seed set. In the next round, the random baseline improves the stack AP while the ALs remain the same. The error bars of AL come from the network initialization and the stochastic gradient descent (SGD) training. The error bars of the random baseline mostly come from the random addition of data, plus the same sources of AL randomness. The time increment is \(10\%\) of the total labeling time of the pool, which simulates the situation where our budget is only a small fraction of the total labeling cost.

5.3 Active Learning in the Wild

Finally, we apply our system to the unlabeled pool described in Sect. 4. First, we train an ensemble on the entire labeled set. Then we select examples from the unlabeled pool under a budget of 100 h. A neuroradiologist examined the selected cases and determined that there were 115 negatives and 64 positives; 51 subacute or postsurgical cases were excluded. The actual labeling time turned out to be within \(10\%\) of our estimate. We call these newly annotated examples \(U_{train}\), to be distinguished from \(S_{trainval}\) defined in Sect. 4. To qualitatively assess the impact of cost modeling, we show examples mined by both the uniform-cost and cost-sensitive AL systems in Fig. 6.

Table 1. Left: Performance on \(S_{test}\). Compared to Ensemble \(S_{trainval}\), Ensemble \((S \cup U)_{train}\) performs just as well at the pixel level and slightly outperforms it at the stack level. Right: Performance on \(U_{test}\). Ensemble \((S \cup U)_{train}\) beats Ensemble \(S_{trainval}\) by a good margin on the pool set.
Fig. 6. Examples selected by the cost-sensitive and uniform-cost AL systems. Blue boxes are the original images, while orange boxes are the images overlaid with the Jensen-Shannon divergence. The brightness of the green color indicates uncertainty. The examples selected by the uniform-cost system mostly contain massive bleeds and are substantially more time-consuming to annotate, whereas the examples selected by the cost-sensitive system are diverse and meaningful, maximizing the return on investment.

For quantitative benchmarking, we trained an ensemble of 4 PatchFCNs from scratch on the newly augmented data (Ensemble \((S \cup U)_{train}\)) and compared it with the ensemble trained on the original data (Ensemble \(S_{trainval}\)). The results on \(S_{test}\) and \(U_{test}\) are shown in Table 1. We benchmark on two test sets because we care about performance in both the seed (S) and pool (U) domains, which in practice are often not exactly the same. The gain on \(S_{test}\) shows that our method works despite the domain shift, and the strong gain on \(U_{test}\) demonstrates how a model trained on large data can be improved by judiciously collecting a little more data.

6 Conclusion

In this paper, we proposed a cost-sensitive, query-by-committee active learning system for intracranial hemorrhage detection. We validated it on a substantially larger pixelwise labeled dataset than earlier works and applied it to improve the model by annotating new data from the wild. Our study demonstrates the potential of growing large medical datasets to the next level with cost-sensitive active learning.