Lit Review Rainfall Springer
1 Introduction
Natural processes on Earth can be classified into several categories, including
hydrological processes like storm waves and groundwater; biological processes
like forest growth; atmospheric processes like thunderstorms and rainfall; human
processes like urban development; and geological processes like earthquakes. The
field of physical geography seeks to investigate the distribution of the different
features/parameters that describe the landscape and functioning of the Earth
by analyzing the processes that shape it. These features/parameters have been
referred to as geophysical parameters in the literature [38].

* E.A.H. acknowledges financial support from the South African National Research
Foundation (NRF CSUR Grant Number 121291 for the HIPPO project) and from
the Telkom-Openserve-Aria Technologies Center of Excellence at the Department of
Computer Science of the University of the Western Cape.
Rainfall is a key geophysical parameter that is essential for many applications
in water resource management, especially in the agriculture sector. Predicting
rainfall can help managers in various sectors to make decisions regarding a range
of important activities such as crop planting, traffic control, the operation of
sewer systems, and managing disasters like droughts and floods [32]. A number
of countries such as Malaysia and India depend on the agriculture sector as a
major contributor to the economy [32, 59] and as a source of food security. Hence,
an accurate prediction of rainfall is needed to make better future decisions to
help manage activities such as the ones mentioned above.
Rainfall is considered to be one of the most complicated parameters to fore-
cast in the hydrological cycle [32, 34, 53]. This is due to the dynamic nature of
environmental factors and random variations, both spatially and temporally, in
these factors [32]. Therefore, to address random variations in rainfall, several ma-
chine learning (ML) tools including artificial neural networks (ANN), k-nearest
neighbours (KNNs), decision trees (DT), etc. are used in the literature to learn
patterns in the data to forecast rainfall. In this chapter, a review of past work
in the area of rainfall prediction using ML models is carried out.
A number of related review papers exist, as follows. The authors in [52] focused
on reviewing studies that use ML for flood prediction, which closely resembles
rainfall prediction. The authors in [71] focused on the use of ML for generic
spatiotemporal sequence forecasting. Finally, the authors in [59] conducted a
survey on the use of ML for rainfall prediction; however, the study was limited
to rainfall prediction in India.
This chapter serves as an addition to the field by surveying recent relevant
studies focusing on the use of ML in rainfall prediction in a variety of geographic
locations from 2016–2020. After detailing the methods used to forecast rainfall,
one of the important contributions of this chapter is to demonstrate various
pitfalls that lead to an overestimation of the performance of ML models in
various papers. This in turn leads to unrealistic hype and expectations
surrounding ML in the current literature, and to an unrealistic understanding
of the advancements in, and gains made by, ML research in this field. It is
therefore important to clearly state and demonstrate these pitfalls in order to
help researchers avoid them.
The rest of this review is organized as follows: Section 2 discusses the method-
ology used to survey and review the literature which defines the discussion
framework used in all subsequent sections; Section 3 describes the data sets
used; Section 4 provides a description of the output objective in the various
papers; Sections 5 – 7 describe the input features used, common methods of pre-
processing and the ML models used; Section 8 summarizes the results obtained
in various studies; and Section 9 then provides a discussion of the procedures
used, specifically pointing out the pitfalls, mentioned above, that lead to
over-estimated and unrealistic results. Section 10 concludes the chapter.
2 Methodology
This chapter carries out an in-depth review of relevant literature to reveal the
different practices authors take to predict rainfall. The review covers several
aspects which relate to the input into, output from, and methods used in the
various systems devised in the literature for this purpose. The review specif-
ically focuses on studies that use supervised learning for both regression and
classification problems.
Google Scholar was used to collect papers from 2016 to 2020, with the following
key words: ("machine learning" OR "deep learning") AND ("precipitation
prediction" OR "rainfall prediction" OR "precipitation nowcasting"). Almost
1,240 results were obtained, and of these only supervised rainfall prediction
papers that used meteorological data from, e.g., radar, satellites, and stations
were selected, while papers that used data from ordinary cameras (e.g.
photographs) were excluded. Even though this review focuses on the prediction of rainfall, the
methods used to achieve this can be extended and applied to other geophysical
parameters like temperature and wind. Hence, the conclusions and discussions
of this chapter can be adapted to other parameters.
The total number of reviewed papers is 66, a combination of conference and
journal papers published from 2016–2020, except for one paper [69], which was
published in 2015 and is a seminal work in this field. Figure 1 shows the
reviewed studies per year. Tables summarizing the reviewed papers can be found
in Appendices A and B.
Fig. 1. Pie chart showing proportions by publication year for papers in this review.
Figure 2 shows the generic structure of supervised ML models. This structure
was used as a guideline to construct a set of questions used to systematically
categorize and analyze the 66 papers. The questions are as follows:
1. What data sets are used and where are they sourced?
2. What is the output objective in the various papers, i.e. what is the goal
of prediction/forecasting?
3. What input features are extracted from the data set(s) to be used to achieve
the output objective?
4. What pre-processing methods are used prior to classification/regression?
5. What ML models are used to achieve classification/regression towards the
output objective?
6. What results were obtained from the above-mentioned steps, and how were
they reported?
Fig. 2. Basic flow for building machine learning (ML) models [52]
These questions provide the framework for the rest of this paper. Sections
3–8 address questions 1–6 in sequence. Section 9 discusses the findings in the
previous six sections, and Section 10 provides conclusions.
3 Data Sets
This section provides a breakdown of the data sets used in the 66 studies sur-
veyed, based on the sources of the data sets, availability, and geographical loca-
tions where the data sets were collected.
Figure 3 (left) provides a breakdown of the studies based on the sources/availability
of the data sets used in those studies. About 75% of the studies used private data,
sourced from meteorological stations of their respective countries [61, 62, 85, 86,
83, 72, 56, 7, 27, 47, 39, 69, 73, 37, 6, 18, 76, 68, 79, 77, 70, 26, 16, 87, 28, 23, 20, 41, 63,
74, 4, 1, 64, 67, 8, 49, 10, 55, 2, 40, 31, 29, 81, 5, 19, 11, 14, 48, 80, 30, 50]. Most of these
data sets are not readily available for use. Only 10% of the studies use data
sourced from freely available sources such as Kaggle (www.kaggle.com), and the
National Oceanic and Atmospheric Administration (NOAA) [65, 15, 57, 84, 60,
22, 5]. The remaining 13% of studies in this review use data from both private
and publicly available sources [78, 82, 25, 17, 33, 12, 13, 42, 3].
Figure 3 (right) summarizes the geographical regions included in this review.
The continent of Asia accounts for around 68% of all studies [62, 82, 83, 47, 39, 69,
65, 70, 12, 13, 74, 4, 8, 49, 22, 42, 10, 29, 19, 11, 48, 80, 86, 17, 27, 33, 37, 18, 76, 68, 15,
79, 77, 26, 28, 81, 30, 7, 23, 41, 63, 64, 40, 50]. Of these, studies that focus on China
and India make up almost one quarter and one tenth respectively of all studies in
this review. The remaining Asian studies focus on countries such as Iran, South
Korea and Japan.
The rest of the chart is distributed as follows: the Americas make up 12.1%
of studies [78, 72, 73, 57, 84, 16, 87, 14]; Europe accounts for 9.1% [25, 6, 20, 67, 55,
3]; Australia comprises 6.1%; [56, 1, 2, 31], and the remaining 4.5% either involve
multiple regions, or involve the use of the whole global map [61, 60, 5].
Fig. 3. Pie chart of the percentage of data sets in this survey in terms of
source/availability (left) and geographical region (right).
4 Output Objectives
5 Input Features
In order to make future predictions, studies make use of data from one or more
time steps (called “lags” or “time lags”) as input features to predict one or more
future lags. For example, to predict rainfall at lag T , two previous time lags
(T − 1) and (T − 2) may be used.
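The lag construction described above can be sketched in a few lines. This is a minimal illustration only; the function name and the toy rainfall values are our own, not taken from any reviewed study:

```python
import numpy as np

def make_lag_features(series, n_lags):
    """Build (X, y) pairs where each row of X holds the n_lags
    previous values used to predict the value at lag T."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])  # e.g. lags (T-2), (T-1)
        y.append(series[t])             # target at lag T
    return np.array(X), np.array(y)

# Toy rainfall series (hypothetical values)
rain = np.array([10.0, 0.0, 5.0, 12.0, 3.0, 8.0])
X, y = make_lag_features(rain, n_lags=2)
# first row: [10.0, 0.0] is used to predict 5.0
```

With two input lags, a series of length 6 yields four training pairs, each pairing the two previous observations with the next one.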
The actual input features in each lag vary across studies. In general, the
input features used in the studies in this review were found to be of two types:
1D input features in which each time lag in the data set represents one or a set
of geophysical parameters that have been collected at static known locations i.e.
meteorological stations; and 2D input features in which each time lag in the data
set is a 2D spatial map of values representing rainfall in the geographical area
under review, usually collected by satellite or radar.
1D input features used include geophysical parameters such as tempera-
ture, humidity, wind speed and air pressure [63, 4, 8, 22, 3, 56, 47, 39, 44, 81]. In
a smaller number of cases, climatic indices such as the Pacific Decadal Oscilla-
tion may also be used [28, 1, 42, 30, 78]. Studies that use 1D input features tend
to use a relatively small number of overall input features, ranging from 2–12
features used for prediction.
With 2D input features, one or more images are taken as input features,
depending on the number of time lags used as input e.g. two time lags used as
input implies that two images are used as input. The number of time lags used
as input is henceforth referred to as the “sequence length”.
There is no rule of thumb for how many time lags should be used as input, and
this is mostly selected arbitrarily, and in fewer cases via trial and error. The vast
majority of the studies under review select a fixed sequence length. The sequence
length can be viewed as a hyper-parameter that affects the prediction outcome,
but the optimization of this hyper-parameter is not investigated in the studies
under review. The studies under review were found to be more focused on the
machine learning component, mostly at devising new deep learning architectures,
than selecting and tuning other aspects of their systems.
The most common sequence lengths used are 5 frames [73, 37, 65, 79] and 10
frames [73, 37, 65, 79]. Other sequence lengths are also used, such as 2 [73], 4 [65],
7 [79] and 20 [37].
Studies that use 2D input features tend to use a relatively large number
of input features. This can be attributed to the fact that the feature vectors
produced are associated with one or more 2D images, resulting in vectors of
size (Image width × Image height × Sequence length). Overall, the number of
features can grow as high as several thousands.
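The resulting feature-vector size can be checked with a short sketch (the frame dimensions here are hypothetical, chosen only to illustrate the arithmetic):

```python
import numpy as np

# Hypothetical radar sequence: 5 frames of 100 x 100 pixels
seq_len, height, width = 5, 100, 100
frames = np.zeros((seq_len, height, width))

# Flattening the sequence yields a single feature vector of size
# image_width * image_height * sequence_length
features = frames.reshape(-1)
print(features.size)  # 50000
```

Even at this modest resolution, the flattened vector already has tens of thousands of entries, which is why 2D studies lean on automated feature reduction.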
Typically, 1D or 2D inputs are used to predict 1D or 2D outputs, respec-
tively. As noted in the previous section, longer-term predictions tend to make
1D predictions, so it follows these studies also tend to use 1D data [28, 23, 20,
41, 63, 74, 4, 1, 64, 8, 49, 22, 42, 10, 55, 2, 40, 29, 81, 14, 11, 19, 48, 50, 3], while those
that make shorter-term predictions tend towards the use of 2D data [69, 73, 37,
65, 6, 18, 76, 68, 15, 79, 77, 70, 13].
6 Pre-Processing
Before ML tools are applied to make predictions on the available data, the in-
put data is usually pre-processed to reformat the data into a form that will
make training of, and prediction by, the ML tool(s) easier and faster. The pre-
processing techniques usually applied in geophysical parameter forecasting can
be broken down into three broad categories, namely data imputation; feature
selection/reduction; and data preparation for classification. The following sub-
sections describe these categories, as well as their application in the papers in
this review.
6.1 Data Imputation
Data sets are regularly found to have missing data entries, caused by a range
of factors such as data corruption, sensor malfunction, etc. This is a serious
issue faced by researchers in data mining and analysis, and needs to
be addressed as part of pre-processing before feature selection/preparation and
training.
The techniques used to infer and substitute missing data are collectively
referred to as data imputation techniques. Data imputation is challenging and
is an on-going research area. In the papers in this review, it was found that
very little focus was placed on this problem, with most of the studies making
use of simple statistical techniques such as averaging to interpolate missing data
entries [74, 31, 14, 11, 83, 56]. While not used in the papers in this review, more
advanced data imputation techniques exist beyond the use of simple statistics,
such as the use of ML to impute the data. The interested reader may refer to
[75, 66, 58].
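The simple statistical imputation most of these studies rely on can be sketched as follows (a minimal mean-based example; the function name and toy values are our own):

```python
import numpy as np

def impute_mean(x):
    """Replace missing (NaN) entries with the mean of the observed
    values -- the kind of simple statistical interpolation most of
    the reviewed studies rely on."""
    x = np.asarray(x, dtype=float).copy()
    x[np.isnan(x)] = np.nanmean(x)
    return x

# Toy rainfall record with two missing entries
rain = np.array([4.0, np.nan, 6.0, np.nan, 8.0])
filled = impute_mean(rain)  # NaNs replaced by 6.0, the observed mean
```

Note that computing the mean over the entire data set (training and testing together), as sketched here, is itself a source of data leakage, a point returned to later in this review; in practice the statistic should come from the training portion only.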
6.2 Feature Selection/Reduction
Fig. 5. Methods used to account for seasonality in studies with long-term data, by
percentage.
6.3 Data Preparation for Classification
7 Machine Learning Models
The studies in this survey made use of a wide range of ML techniques, which can
be subdivided into two main groups: "classical" techniques such as multivariate
linear regression (MLR), KNNs, ANNs, support vector machines (SVMs), and
random forests (RF); and modern deep learning methods such as convolutional
neural networks (CNNs) and long short-term memory networks (LSTMs). It was observed
that classical ML models tended to work with 1D data from meteorological
stations, such as in [46, 28, 23, 4, 62, 82, 63] for short-term data and [28, 20, 41,
63, 1, 67, 49, 22, 48, 30, 3] for long-term data.
Some papers use hybrid models that combine two or more approaches. A
popular hybrid approach is to combine ML with optimization tools such as
genetic algorithms and particle swarm optimization to optimize hyper-parameters [48, 10,
49, 26, 27]. Multiple ML techniques are combined in [19, 72, 61], and ML is used
with ARIMA in [62].
Deep learning models usually require huge data sets to avoid overfitting, which
explains their popularity among short-term data sets, especially those using 2D
data [69, 73, 37, 65, 6, 18, 76, 68, 15, 79, 77, 70, 57, 26, 84, 16, 60, 87,
12, 13]. 2D data in particular has a huge feature space, which requires authors
to implement automated feature reduction models like CNNs [84, 16, 60, 87, 12].
In order to accommodate the time dimension in the data, many researchers adapt
time series models such as LSTMs for 1D data [40, 29, 81, 5, 14, 85, 60]. For
2D data, models combining CNNs with LSTMs (designated ConvLSTM models) were
first used in [69] in 2015, and several variations have subsequently been
implemented [73, 37, 65, 6, 18, 76, 68, 15, 79, 77, 70].
8 Results
Several different metrics are used in the literature to measure the performance
of the ML models, according to the type of problem. In classification problems,
authors tend to use metrics such as precision, recall, and accuracy [30, 50, 3,
83, 25, 72, 56, 17, 47, 39, 33, 73, 84, 16, 60, 87, 12]. If the data is not
balanced, then the F1-score is used rather than accuracy, since accuracy does
not take the imbalance between the classes into account [83, 25, 72, 56, 73,
60]. For sequence classification prediction, other metrics are used, such as
the critical success index (CSI) [65, 6, 18, 76, 68, 77, 70]. For continuous
outputs, the mean absolute error (MAE) and the root mean squared error (RMSE)
are the most commonly used metrics in the literature [61, 62, 78, 85, 86, 82,
28, 23, 20, 41, 63, 74, 4, 1, 64, 67, 8, 49, 22, 42, 10, 55, 2,
40, 31, 29, 81, 5, 14, 11, 19, 48].
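For concreteness, these common metrics have simple definitions; a sketch follows (the toy arrays and the 0.5 rain/no-rain threshold are our own choices, not taken from any reviewed paper):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def csi(obs, pred, threshold=0.5):
    """Critical success index: hits / (hits + misses + false alarms),
    after thresholding rainfall into binary rain/no-rain events."""
    o, p = obs >= threshold, pred >= threshold
    hits = np.sum(o & p)
    misses = np.sum(o & ~p)
    false_alarms = np.sum(~o & p)
    return hits / (hits + misses + false_alarms)

y_true = np.array([0.0, 1.0, 2.0, 0.0])
y_pred = np.array([0.0, 0.5, 2.5, 1.0])
# mae -> 0.5; csi with threshold 0.5 -> 2/3
```

CSI penalizes both missed events and false alarms, which is why it is preferred over plain accuracy for thresholded rainfall sequences.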
A direct comparison of these results across different papers is a nearly
impossible task, since each paper uses its own models, pre-processing, metrics,
data sets and parameters. However, individual authors frequently compare multiple
algorithms, and there are a few ML algorithms that stand out as being most
frequently mentioned as better performers. ANNs and deep learning are most
frequently mentioned as best performing models, for both long-term prediction
[28, 41, 4, 1, 64, 8, 2, 80, 5, 40, 31, 29, 14, 19, 48] and especially for short-term pre-
diction [69, 73, 37, 65, 6, 18, 76, 68, 15, 79, 77, 70, 57, 26, 84, 16, 60, 87, 12, 13, 85, 86,
25, 39].
Other algorithms mentioned as best performers are SVMs in six studies [82, 47,
17, 27, 67, 49], ensemble methods in [78, 83, 63, 55, 81, 3], logistic
regression in three studies [56, 7, 30], and KNNs in two studies [33, 20].
9 Discussion
The above sections clearly demonstrate that there is a robust, growing literature
on rainfall prediction, which covers an extremely wide variety of time-scales,
features used, pre-processing techniques, and ML algorithms used. From a high-
level perspective, the field can be divided into short versus long time scales
(time intervals of one day or less, versus intervals of a month or more), which
tend to have divergent characteristics.
Short term studies typically rely on huge datasets, and require deep learning
applied to large feature sets to find hidden patterns in those datasets. On the
other hand, long term studies rely more on pre-processing methods such as
feature selection, data imputation, and data balancing in order to make effective
predictions. ANNs and deep learning appear to be becoming increasingly
prevalent in long-term studies as well as short-term ones: since 2018, 7 of 23
papers on long-term prediction utilized deep learning tools.
There are reasons to regard the trend towards more complicated models with
skepticism. Some recent studies have shown that much simpler models such as
KNNs can sometimes outperform advanced ML techniques like RNNs [43, 35, 45,
20, 36]. Similar findings have been reported for other ML applications, such as
the top n recommendation problem [21].
These results underscore the importance of providing simple but statistically
well-motivated baselines to verify whether ML truly is effective in improving
predictive accuracy. However, many papers do not provide simple baselines, but
rather compare several variations or architectures of more advanced ML methods
such as SVR or MLP [51, 11, 48, 14, 61, 17, 27, 47, 26, 84, 60, 12, 13]. Of the total
reviewed papers, almost half (48.2%) of papers did not supply simple baselines.
Of those papers that did supply baselines, a variety of methods was used. For
short-term image data, the previous image is frequently used as an untrained
predictor for the next image [76, 68, 77, 70]. For monthly data, some papers use
MLR based on multiple previous lags [19, 28, 20, 41]; while same-month averages,
though statistically well-motivated, are used much less frequently [81].
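These baselines can be sketched numerically with toy values (our own): a persistence predictor mirrors the "previous image/value" baseline, and a training-mean "climatology" mirrors the same-period average.

```python
import numpy as np

# Hypothetical rainfall series (arbitrary toy values)
rain = np.array([3.0, 7.0, 2.0, 9.0, 4.0, 6.0])

# Persistence baseline: predict each value as the previous observation,
# analogous to using the previous image as an untrained predictor
persist_pred = rain[:-1]
persist_true = rain[1:]
persist_mae = np.mean(np.abs(persist_true - persist_pred))

# Climatology baseline: predict every test value as the training mean,
# analogous to the (statistically well-motivated) same-period average
train, test = rain[:4], rain[4:]
clim_pred = np.full_like(test, train.mean())
clim_mae = np.mean(np.abs(test - clim_pred))
```

A trained ML model that cannot beat both numbers has arguably learned nothing beyond what these untrained predictors already capture.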
Besides the issue of baselining, the use of error bars is essential for
comparison purposes, as it highlights whether the improvement obtained by a
model is significant. Unfortunately, most of the ML literature does not provide
error bars around the measured metrics. In the case of the literature reviewed
here, 88% of the papers did not give error bars.
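One simple way to produce such error bars is a percentile bootstrap over the per-sample errors. This sketch is our own illustration, not a procedure taken from the reviewed papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(errors, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the mean error:
    resample the per-sample errors with replacement and take the
    empirical quantiles of the resampled means."""
    means = np.array([
        rng.choice(errors, size=len(errors), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Hypothetical per-sample absolute errors from some model
abs_errors = np.abs(rng.normal(loc=1.0, scale=0.3, size=100))
low, high = bootstrap_ci(abs_errors)  # report MAE with [low, high]
```

Two models whose intervals overlap heavily cannot honestly be ranked against each other, which is precisely the information a bare point metric hides.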
A final issue of concern is data leakage. Data leakage refers to allowing data
from the testing set to influence the training set. Data leakage occurs during the
pre-processing of the data, and can take various forms as follows:
– Random shuffling, which involves choosing sequences from a common data
pool for both training and testing;
– Imputation, which involves filling missing records using statistical methods
computed on the entire data set (including both training and testing);
– De-seasonalization, which utilizes monthly averages computed from the entire
data set;
– Using current lags, e.g. using temperature at a time T to predict rainfall
at the same time T (depending on the application, this may or may not
constitute data leakage);
– Combination, which uses two of the above-mentioned techniques.
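The first two forms of leakage can be contrasted with their leakage-free counterparts in a minimal sketch (toy data; our own illustration, not drawn from any reviewed study):

```python
import numpy as np

data = np.arange(100, dtype=float)  # a time-ordered series (toy data)

# Leaky: random shuffling draws training and testing sequences from a
# common pool, mixing future and past across the split
idx = np.random.permutation(len(data))
leaky_train, leaky_test = data[idx[:80]], data[idx[80:]]

# Leakage-free: a chronological split keeps every test point strictly
# after the training period
train, test = data[:80], data[80:]
assert train.max() < test.min()

# Imputation (and de-seasonalization) statistics must likewise be
# computed from the training split only
train_mean = train.mean()  # use this, not data.mean(), to fill gaps
```

The chronological split reproduces the situation a deployed forecaster actually faces: no statistic it relies on may be derived from observations that lie in its future.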
Figure 6 shows the reviewed papers in terms of data leakage. The top chart
focuses on long-term data, while the bottom focuses on short-term data. As
mentioned previously, long-term data often undergoes more pre-processing than
short-term data. This is reflected in the figure, as leakage-producing methods
are more than twice as common for long-term as for short-term data. Random
shuffling was performed in [28, 41, 42, 10, 29, 30, 50, 3] for long-term data,
and in [78, 27, 47, 57, 26, 16] for short-term data. Data imputation was
performed in [74, 31, 14, 11] for long-term data, and in [56] for short-term
data. Faulty de-seasonalization was carried out in [49] for long-term data.
Use of current lags was observed only in [63]. Multiple leakage issues
(denoted as "combination" in the figure) were observed in [19, 83].
Fig. 6. Percentage of papers which introduced data leakage during pre-processing, for
long term data (top) and short term data (bottom).
10 Conclusions
Appendices
This appendix contains four tables which summarize the findings for the reviewed papers: Tables 1 and 2 for long-term data, and
Tables 3 and 4 for short-term data. Tables 1 and 3 contain information regarding the source, period, region, input, and output,
while Tables 2 and 4 include information about the pre-processing tools, data leakage, and the ML models used.