1. Introduction
The urban population is expected to grow at the highest rates in human history in the coming decades, with an increase of 1.2 billion urban residents worldwide by 2030 [1]. Densely populated areas are hotspots of numerous environmental problems, including air pollution [2,3] and hydrological disturbance [4,5], and are also linked to mental illness and poor health [6,7]. Global climate change affects climate patterns, and the increase in surface temperature has led to more frequent, longer, and more severe heatwaves [8] and, likewise, has increased the occurrence of floods [9]. In this scenario, urban forests could play an important role in mitigating some of these threats [10] and bridging the gap between sustainable and livable cities. These systems are important assets for achieving urban sustainability, which was established as one of the Sustainable Development Goals (SDG 11) by the United Nations [11].
Urban forests are composed of trees in urban areas, from individual trees to clusters of trees and publicly accessible green spaces [12]. These systems provide an array of ecosystem services and help to mitigate the ills caused by the concentrated poverty and sickness that often occur in cities [13]. However, the proper management of these systems requires accurate data on the quantity and distribution of individual trees throughout cities. Individual tree detection provides key information for multiple applications, including resource inventory and the management of hazards and stress [14].
The demand for urban tree inventories is increasing due to rapid urbanization, climate change, and global trade [15]. Nonetheless, urban tree inventories often lack information due to the costs associated with tree mapping and monitoring [16]. Individual tree detection remains an open challenge that is especially difficult given the variety of vegetation canopies, crown sizes, and densities, as well as the overlapping of crowns, among other situations [17]. Further, due to the proximity and overlapping of tree crowns, it is not always possible to use segmentation as a strategy to detect each tree individually.
In this sense, methods that can perform this task on RGB images could unlock data at larger scales [18] and provide insights for policymakers and the community. Developing approaches to detect trees automatically is decisive for building more sustainable cities, helping with the planning and management of urban forests and trees. Most urban tree inventories are conducted manually, which is slow and costly and makes it difficult to map large areas and follow the temporal evolution of these assets.
To that end, remote sensing has been seen as a less time-consuming alternative to tree field surveys [19]. The recent development of remote sensing platforms has allowed researchers to collect data with higher spatial, temporal, and spectral resolutions, unlocking new scales of Earth observation. High-resolution images are recommended for individual tree detection [20], especially in urban areas, where images are heterogeneous and complex. Nonetheless, this increase in data volume has made manual processing impractical. As an alternative, object detection methods based on deep learning have been applied successfully in remote sensing applications, mainly using convolutional neural networks [21,22,23,24,25,26].
Object detection methods can be divided into anchor-based and anchor-free detectors. Anchor-based methods first build an extensive number of anchors on the image, predict the category, refine each anchor's coordinates, and then output the refined anchors as predictions. These techniques can be categorized into two groups: one-stage and two-stage detectors. One-stage detectors commonly use a direct fully convolutional architecture, while two-stage detectors first filter the regions that may contain an object and feed a convolutional network with these regions [27].
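To make the anchor-tiling step concrete, the sketch below illustrates it in Python; the scales, aspect ratios, and stride are hypothetical placeholders, and a real detector would additionally predict a class score and coordinate offsets for every anchor.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Tile (scale, ratio) anchor boxes over every cell of a feature map.

    Returns (feat_h * feat_w * len(scales) * len(ratios), 4) boxes in
    (x1, y1, x2, y2) image coordinates.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell, mapped back to image space.
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # equal area, varying aspect
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# Even a modest 64 x 64 feature map with stride 8 yields 36,864 candidate boxes,
# which is why anchor-based detectors must score and refine anchors densely.
print(generate_anchors(64, 64, stride=8).shape)  # (36864, 4)
```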
Usually, one-stage methods (e.g., YOLO and SSD) provide high inference speed, while two-stage methods (e.g., Faster R-CNN) offer high localization accuracy [28]. In contrast to anchor-based detectors, anchor-free methods find objects directly, without anchors, using a key point, center point, or region of the object, and have performance similar to that of anchor-based methods [29]. The design of object detection methods using deep learning is often driven by natural images and, with the constant development of new methods, it is essential to evaluate the latest methods in remote sensing.
Within remote sensing and deep learning, many data sources have been investigated for individual tree detection, including multi- and hyperspectral images, LiDAR (Light Detection And Ranging) data, and their combinations [30,31,32,33,34,35,36]. However, these data sources are costly and can be problematic to process due to their high dimensionality. Alternatively, some studies have combined RGB images with other data sources [18,37,38]. Compared to other data sources, RGB sensors are cheaper, and RGB imagery is easier to process, despite the absence of three-dimensional information about the tree crown shape [18]. However, few studies have tackled this task using only remote sensing RGB data [19,39,40,41,42].
Santos et al. [19] applied three deep-learning methods (Faster R-CNN, YOLOv3, and RetinaNet) to detect one tree species, Dipteryx alata Vogel (Fabaceae), in Unmanned Aerial Vehicle (UAV) high-resolution RGB images. The authors found that RetinaNet achieved the best results. Culman et al. [39] implemented RetinaNet to detect palm trees in aerial high-resolution RGB images, achieving a mean average precision of 0.861. Further, Oh et al. [40] used YOLOv3 to count cotton plants. Roslan et al. [41] applied RetinaNet to this task in super-resolution RGB images of a tropical forest, and [42] also evaluated RetinaNet in a tropical forest. However, most of the research in this field has used methods, such as Faster R-CNN and RetinaNet, that date from before 2018. With the constant development of new methods, there is a need to assess their performance in remote sensing applications.
Despite these initial efforts, there is a lack of studies assessing the performance of novel deep-learning methods for individual tree crown detection, regardless of tree species or size, in urban areas. This task is challenging in the urban context due to the heterogeneity of these scenes [20], with different tree types and sizes combined with overlap between objects, shadows, and other situations. Our objective is to benchmark anchor-based and anchor-free detectors for tree crown detection in high-resolution RGB images of urban areas.
To the best of our knowledge, our study is the first to present a large assessment of novel deep-learning detection methods for individual tree crown detection in urban areas. Further, we provide an analysis covering the main lines of research in computer vision for anchor-based methods (one- and two-stage) and anchor-free methods. Different from previous studies, our focus is to detect all trees, regardless of species or size, in an urban environment. Thus, our study intends to fill this gap and demonstrate the performance of the most advanced object detection methods in remote sensing applications.
Two high-resolution RGB orthoimages were manually annotated and split into non-overlapping patches. We evaluate 21 novel deep-learning methods for the proposed task, covering the main directions in object detection research. We present a quantitative and qualitative analysis of the performance of each method and of each main type of detector. The dataset is publicly available at https://github.com/pedrozamboni/individual_urban_tree_crown_detection (accessed on 21 June 2021).
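Tiling an orthoimage into non-overlapping patches can be done in a few lines; the sketch below is a minimal illustration, and the 512-pixel patch size is a hypothetical placeholder rather than the exact value used to build our dataset.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch: int = 512):
    """Split an H x W x C orthoimage array into non-overlapping patch x patch tiles.

    Border strips narrower than `patch` are discarded in this sketch.
    Returns a list of ((row, col) offset, tile) pairs, so annotations can be
    shifted into each tile's local coordinates.
    """
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tiles.append(((y, x), image[y:y + patch, x:x + patch]))
    return tiles

# A 5000 x 4000 pixel mosaic yields (5000 // 512) * (4000 // 512) = 63 tiles.
mosaic = np.zeros((5000, 4000, 3), dtype=np.uint8)
print(len(split_into_patches(mosaic)))  # 63
```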
3. Results
Here, we present the results of our experiments. First, we performed a quantitative and qualitative analysis of all 21 methods. The results are separated by the type of method, i.e., anchor-based (AB-OS: one-stage; AB-TS: two-stage; and AB-MS: multi-stage) and anchor-free (AF). In the quantitative analysis, we evaluated the methods using an IoU threshold of 0.5 ($AP_{50}$). The qualitative analysis was conducted to identify the situations in which the models performed well or poorly under different conditions, such as shadow and overlap by other objects.
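For reference, the $AP_{50}$ criterion counts a detection as a true positive when it overlaps an unmatched ground-truth box with an IoU of at least 0.5; the snippet below is a minimal sketch of that overlap test on axis-aligned boxes, with hypothetical pixel coordinates.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical predicted and ground-truth tree crowns (pixel coordinates).
pred, gt = (10, 10, 60, 60), (20, 15, 70, 65)
print(iou(pred, gt))         # 0.5625
print(iou(pred, gt) >= 0.5)  # True: counted as a hit at the AP50 threshold
```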
Later, we present the results of the second part of the experiments, with the top five models plus Faster R-CNN and RetinaNet. The images presented in this section are from the test set and therefore provide a better indication of the performance of the models. Even though different areas (with different tree species, tree crown sizes, and distributions) were used to train and test the models, the two images are from the same city. Thus, it is not possible to comment on the generalizability of these models to different datasets.
3.1. Anchor-Based (AB) Detectors
In this section, we discuss the performance of the one-, two-, and multi-stage anchor-based detectors. For one-stage methods, the average $AP_{50}$ was 0.657 ± 0.032. Table 3 shows the test set results for the one-stage methods. We observed that the Gradient Harmonized Single-stage Detector outperformed all the others in $AP_{50}$, with an increase in performance ranging from 1.4% to 10%. RetinaNet, NAS-FPN, and SABL provided similar results. Probabilistic Anchor Assignment and Generalized Focal Loss presented similar performances, and YOLOv3 was the worst method.
Table 4 shows the performance of the two-stage and multi-stage (DetectoRS) methods. On the test set, the two-stage and multi-stage methods reached an average $AP_{50}$ of 0.669 ± 0.023. Among these methods, Double Heads achieved the best $AP_{50}$, outperforming the others by 0.2% to 6.8%. The CARAFE and Empirical Attention methods obtained performances similar to Double Heads in terms of $AP_{50}$. Faster R-CNN, DetectoRS, Deformable ConvNets v2, and Dynamic R-CNN reached similar performances, and Weight Standardization provided the worst results.
Figure 5 and Figure 6 show the tree detections achieved using the one-stage methods. As can be seen in Figure 5, for smaller and even medium-sized tree crowns, the one-stage methods had good assertiveness. However, for larger crowns (Figure 6), we observed a decrease in performance, with Probabilistic Anchor Assignment being the only method with good performance. For more irregular trees, where the crown did not have a circular shape, the methods usually detected more than one bounding box for a given ground-truth annotation. In areas with large agglomerations of trees, the methods either did not detect the trees or detected only part of them.
Figure 7 and Figure 8 present the detections of the two-stage methods. Similar to the one-stage methods, the two-stage methods presented good performance in detecting smaller and medium-sized tree crowns. For larger crowns and in areas with a greater agglomeration of objects (Figure 8), the two-stage methods performed substantially better than the one-stage methods. Thus, these methods appeared to generalize the problem better, with better assertiveness in detecting tree crowns in more complex scenes. Further, we observed that the presence of shadow did not cause a great decrease in detection. The main challenge was to detect single trees with larger crowns and in areas where the limits of each object were not clear. In such cases (Figure 6 and Figure 9), it is difficult to separate the trees from each other, even for the human eye.
3.2. Anchor-Free (AF) Detectors
The results obtained for the anchor-free (AF) methods are described in Table 5. On the test set, the anchor-free methods achieved an average $AP_{50}$ of 0.686 ± 0.014. FSAF reached the best $AP_{50}$ with 0.701, a superior performance over the others ranging from 0.9% to 3.7%. FoveaBox, ATSS, and VarifocalNet (2) had similar $AP_{50}$ results, and VarifocalNet (1) had the worst performance.
The anchor-free methods demonstrated behavior similar to that of the one-stage methods. These models performed well for small trees (Figure 10). For occluded objects and more irregular tree crowns, we observed a decrease in performance, with multiple detections or the detection of only part of the object. For areas with larger tree crowns and more agglomerations, the performance also decreased. VarifocalNet (2) was the only method that managed to produce relatively good detections in the most complex images (Figure 9). This highlights that areas with larger canopies and more agglomerations are the main challenge for the methods.
3.3. Analysis of the Best Methods
Here, we present the best five methods considering $AP_{50}$: FSAF, Double Heads, CARAFE, ATSS, and FoveaBox. We also included Faster R-CNN and RetinaNet, since these two are commonly used in remote sensing. We noticed that none of the five best was a one-stage method. As seen in the previous sections, the anchor-free methods showed better average performance than the one- and two-stage methods in terms of $AP_{50}$. Figure 11 shows the box plot for the methods, and Figure 12, Figure 13, Figure 14 and Figure 15 show some results for the best methods.
We observed that Double Heads, a two-stage method, achieved the best average $AP_{50}$ (0.732), with differences ranging from 0.4%, when compared to ATSS, to 4.6%, when compared to RetinaNet. ATSS and CARAFE achieved similar values, with averages of 0.728 and 0.724, respectively, close to Double Heads. FSAF and FoveaBox had slightly worse performances, with average values of 0.720 and 0.719. Faster R-CNN (average of 0.700) and RetinaNet (average of 0.686) obtained the worst average results.
In addition to the performance analysis conducted using $AP_{50}$, we performed a One-Way ANOVA to assess whether the average $AP_{50}$ values of the best methods were significantly different. The One-Way ANOVA for the top five methods, Faster R-CNN, and RetinaNet indicated a p-value of 0.019, which is less than the significance level ($\alpha = 0.05$). Therefore, we can reject the null hypothesis that the results were similar. We continued the evaluation using a post hoc test to identify differences between pairs of algorithms.
A simple strategy in multiple comparisons is to evaluate each p-value against $\alpha/m$, which is the Bonferroni correction. However, this criterion is overly rigorous and can lead to the failure to reject a false null hypothesis (Type II error). The Holm–Bonferroni method adjusts the rejection criterion for each comparison, reducing the chance of a Type II error while still controlling the Type I error rate. It sorts the p-values in increasing order, creating the rank $p_1 \leq p_2 \leq \dots \leq p_m$, and compares each $p_k$ with $\alpha/(m + 1 - k)$, where $k$ is the ranking order in the comparison. When $p_k < \alpha/(m + 1 - k)$ is false, the procedure stops, and we cannot reject the null hypotheses of the subsequent p-values.
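As an illustration of this procedure (not the paper's actual measurements), the sketch below runs a One-Way ANOVA with SciPy and then applies the Holm–Bonferroni rule; the per-run $AP_{50}$ samples and the pairwise t-tests are hypothetical placeholders for whatever pairwise comparison produces the p-values.

```python
import numpy as np
from scipy import stats

def holm_bonferroni(pvals, alpha=0.05):
    """Sort m p-values ascending and compare the k-th smallest with
    alpha / (m + 1 - k); stop at the first comparison that fails."""
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for k, idx in enumerate(order, start=1):
        if pvals[idx] < alpha / (m + 1 - k):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values also fail
    return reject

# Hypothetical per-run AP50 samples for three methods (placeholder data).
rng = np.random.default_rng(0)
a, b, c = rng.normal([0.73, 0.72, 0.69], 0.01, size=(5, 3)).T

f_stat, p = stats.f_oneway(a, b, c)  # omnibus test across all methods
pairwise = [stats.ttest_ind(x, y).pvalue for x, y in [(a, b), (a, c), (b, c)]]
print(f"ANOVA p = {p:.4f}, Holm-Bonferroni rejections: {holm_bonferroni(pairwise)}")
```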
Table 6 shows the results of the Holm–Bonferroni test. For simplicity, the corrected comparison is represented in the column "p-value corr"; when its value is lower than 0.05, we can reject the null hypothesis.
The results indicate that RetinaNet was significantly different from ATSS, CARAFE, and Double Heads, while the comparisons between the other methods showed no statistically significant differences. The test indicates that RetinaNet, among the tested models, is not recommended for the proposed task.
4. Discussion
Anchor-based one-stage methods achieved the worst average precision (0.657). The Gradient Harmonized Single-Stage Detector was the best, with an $AP_{50}$ of 0.691, and YOLOv3 was the worst, with 0.591. The commonly used RetinaNet had an $AP_{50}$ of 0.650, making it the second worst one-stage method.
A previous study [39] implemented RetinaNet to detect Phoenix palms with a best AP value of 0.861; however, the authors split the dataset only into training and validation sets and used a score threshold of 0.2 and an IoU of 0.4, which may have led to better performance. They also considered only one tree species as the target. Roslan et al. [41] used RetinaNet to detect individual trees and achieved superior results with a precision of 0.796, and similar results were found by [42]. Both [41,42] utilized images of non-urban areas (tropical forests). In our experiments, RetinaNet provided less accurate results among the one-stage methods.
Anchor-based two-stage methods had the second highest average $AP_{50}$, of 0.669. Double Heads had the best performance among these methods, and DetectoRS had the worst. Faster R-CNN and RetinaNet (baseline) had similar results. Santos et al. [19] investigated both methods and concluded that RetinaNet outperformed Faster R-CNN and YOLOv3 in the detection of a single tree species, achieving an $AP_{50}$ higher than 0.9. Wu et al. [65] proposed a hybrid model that used Faster R-CNN as the detector to detect and segment apple tree crowns in UAV imagery.
In the detection stage, the authors achieved high precision for the task. However, they considered only one tree species and used images with higher resolution and small variation in scale. These factors may lead to better performance of the methods. On the other hand, the anchor-free methods had the best average precision, with 0.686. FSAF, ATSS, and FoveaBox stood out among the others. The results for the anchor-free methods corroborate the study of Gomes et al. [23], where ATSS also outperformed Faster R-CNN and RetinaNet by about 4%.
Previous studies [28,29] also reported that two-stage methods had higher performance than one-stage methods, which corroborates our findings. We found that anchor-free methods performed similarly to anchor-based two-stage methods, a behavior that has already been reported in the literature. The advantage of anchor-free detectors is the removal of the hyper-parameters associated with the anchors, implying potentially better generalization [29]. RetinaNet (one-stage) and Faster R-CNN (two-stage) showed relatively poor results when compared with the top five methods selected. It is important to note that these two methods have been reported in the literature as having superior performance in other remote sensing applications [19,24].
As previously stated, our experiment aimed to detect all the trees in urban scenes. Compared to previous work that targeted only a single tree species, our objective was considerably more challenging. First, urban scenes are more complex and heterogeneous. Second, our dataset presented various tree species and tree crown sizes, with overlap between objects, shadows, and other situations. In the Campo Grande city urban area, there are 161 tree species and more than 150 thousand trees. This complexity led to better performance of the two-stage anchor-based methods, especially in more challenging images, as can be seen in Figure 15. These methods first filter the regions that may contain an object and then eliminate most negative regions [27]. Comparatively, [66] proposed the identification of trees in urban areas using street-level imagery and Mask R-CNN [67], finding an $AP_{50}$ between 0.620 and 0.682.
5. Conclusions
Here, we presented a large assessment of the performance of novel deep-learning methods to detect single tree crowns in urban high-resolution aerial RGB images. We evaluated a total of 21 object detection methods, including anchor-based (one-, two-, and multi-stage) and anchor-free detectors, in a relevant remote sensing application. We provided a quantitative and qualitative analysis of each type of method, as well as a statistical analysis of the best methods plus RetinaNet and Faster R-CNN.
Our results indicate that the anchor-free methods showed the highest average $AP_{50}$, followed by the anchor-based two-stage and anchor-based one-stage methods. Our findings suggest that the best methods for the current task were the two-stage anchor-based and anchor-free detectors. Among the one-stage anchor-based detectors, only the Gradient Harmonized Single-stage Detector performed close to (slightly worse than) the best methods. This may be an indication that one-stage methods are not recommended for the proposed task. Meanwhile, the two-stage (Double Heads and CARAFE) and anchor-free (FSAF, ATSS, and FoveaBox) detectors achieved superior performance and are this study's suggestion for urban single tree crown detection.
Our experimental results demonstrated that RetinaNet, one of the most used methods in remote sensing, did not have satisfactory performance for the proposed task and was outperformed by several of the best methods (ATSS, CARAFE, and Double Heads). This may indicate that this method is not suitable for the proposed task. Faster R-CNN had slightly inferior results compared with the best methods; however, no statistically significant difference was found. Nevertheless, it is worth mentioning that research aimed at detecting single trees in an urban environment is still incipient, and further investigation regarding the most appropriate techniques is needed. In our work, we set out to detect all tree crowns in an urban environment. This task is considerably more complex than detecting specific species or types of trees, since there is a greater variety of trees. Likewise, images of urban environments are more complex and challenging than those of rural environments, as they present a more heterogeneous scene.
Our work demonstrates the potential of existing deep-learning techniques by leveraging the application of different methods to remote sensing data. This study may contribute to innovations in remote sensing based on deep-learning object detection. The majority of the research applying deep learning in remote sensing has used methods that date from before 2018 (e.g., Faster R-CNN and RetinaNet), and, with the development of new methods, it is essential to evaluate their performance in these tasks. The development of techniques capable of accurately detecting trees in RGB images is essential for preserving and maintaining forest systems. These tools are essential for cities, where accelerated population growth and climate change are becoming significant threats. Future work will focus on developing a method capable of handling high densities of objects. We also intend to increase the size of the dataset with images from different cities in order to obtain models with better generalization capabilities.