Prediction of Algal Blooms Via Data-Driven Machine Learning Models: An Evaluation Using Data From A Well-Monitored Mesotrophic Lake

Model description paper
Geosci. Model Dev., 16, 35–46, 2023

https://doi.org/10.5194/gmd-16-35-2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.
Shuqi Lin1,3, Donald C. Pierson1, and Jorrit P. Mesman1,2
1 Erken Laboratory and Limnology Department, Uppsala University, Uppsala, Sweden
2 Département F.-A. Forel des sciences de l’environnement et de l’eau, Université de Genève, Geneva, Switzerland
3 Environment and Climate Change Canada, Canada Centre for Inland Waters, Burlington, L7R 4A6 ON, Canada
Correspondence: Shuqi Lin ([email protected])
Received: 6 July 2022 – Discussion started: 2 August 2022

Revised: 11 October 2022 – Accepted: 28 November 2022 – Published: 3 January 2023
Published by Copernicus Publications on behalf of the European Geosciences Union.

Abstract. With increasing lake monitoring data, data-driven machine learning (ML) models might be able to capture the complex algal bloom dynamics that cannot be completely described in process-based (PB) models. We applied two ML models, the gradient boost regressor (GBR) and long short-term memory (LSTM) network, to predict algal blooms and seasonal changes in algal chlorophyll concentrations (Chl) in a mesotrophic lake. Three predictive workflows were tested, one based solely on available measurements and the others applying a two-step approach, first estimating lake nutrients that have limited observations and then predicting Chl using observed and pre-generated environmental factors. The third workflow was developed using hydrodynamic data derived from a PB model as additional training features in the two-step ML approach. The performance of the ML models was superior to a PB model in predicting nutrients and Chl. The hybrid model further improved the prediction of the timing and magnitude of algal blooms. A data sparsity test based on shuffling the order of training and testing years showed the accuracy of ML models decreased with increasing sample interval, and model performance varied with training–testing year combinations.

1 Introduction

Harmful algal blooms, which are a serious threat to natural water systems, have been increasing throughout the world (Burford et al., 2020; Watson et al., 2016), primarily as a consequence of both climate change and increased nutrient loading from anthropogenic activities (Brookes and Carey, 2011; Paerl and Huisman, 2008). Moreover, as indicated by Carey et al. (2012) and Huisman et al. (2018), more intense and longer periods of thermal stratification could specifically favour blooms of toxic Cyanobacteria. To better manage and mitigate the effects of algal blooms, methods to forecast their timing and magnitude are needed. However, the factors regulating algal blooms are complex, variable, and site-specific, often involving high-order interactions of environmental factors and biogeochemical processes (Reichwaldt and Ghadouani, 2012; Richardson et al., 2018).

Process-based (PB) models encode our understanding of biogeochemical processes into a framework of numerical formulations, but these are inevitable simplifications that lead to an incomplete description of complex biogeochemical interactions and a low level of model confidence (Elliott, 2012). Based on innovative data mining and statistical techniques, data-driven machine learning (ML) models have been applied to identify patterns within observed data (Peretyatko et al., 2012; Mellios et al., 2020), and with the recent proliferation of lake monitoring data (Marcé et al., 2016), ML models have been applied as an alternative to PB models for bloom prediction (Rousso et al., 2020). Previously applied ML models, including random forest (Nelson et al., 2018), support vector machine (Jimeno-Sáez et al., 2020), and artificial neural network models (Xiao et al., 2017; Recknagel et al., 1998; Wei et al., 2001), can improve predictions of the timing and seasonality of algal Chl patterns, apparently by accounting for complexity that is difficult to encode within the framework of a PB model. However, a downside of data-driven ML models is that they lack the interpretability and generalization found in the explicit structure of the

PB model. In recent years, the process-guided deep learning (PGDL) model has emerged and has been applied to water temperature (Jia et al., 2019; Read et al., 2019) and water quality (Hanson et al., 2020) simulations, which explicitly combine well-defined physical theories into the training of ML models, enhancing their interpretability. While this approach has achieved promising results, it is difficult to apply it to phytoplankton dynamics due to numerous nonlinear interactions within the biogeochemical cycles and the difficulty in defining measurable processes or mass balances that can be used as a physical constraint on knowledge-guided decisions. Also, the sparsity of lake water quality (e.g. nutrients and Chl concentration) observations can limit the application of ML models in algal bloom modelling (Rousso et al., 2020).

In this study, our objectives are to (1) apply the ML models to predict algal blooms in a well-monitored mesotrophic lake, (2) evaluate model performance and assess model uncertainties, and (3) explore approaches to improve the model performance and widen the model applications. We first tested the ability of ML models to predict algal Chl concentrations via available environmental factors, including observed lake nutrient data, and then proposed a two-step ML approach for predicting algal dynamics that first estimates lake nutrient concentrations, which often have limited observations, and secondly predicts variations in algal Chl using these pre-generated nutrient concentrations combined with other observed environmental factors that are collected at higher frequency. We also tested a simple hybrid model architecture that, by adding hydrodynamic features derived from the PB model into the training features of the two-step ML approach, allowed us to include additional information describing physical lake processes expected to affect variations in algal growth and succession in the machine learning prediction.

We applied the above workflows to predict changing Chl concentration, as a proxy for the occurrence of algal blooms, via the gradient boost regressor (GBR) and long short-term memory network (LSTM). Two shuffling year tests were conducted. One assessed the uncertainty of ML models in predicting Chl during the same 2-year period, and the other evaluated the sensitivity of ML accuracy to various training–testing year combinations and lake nutrient sampling intervals. Model performance and potential applications in algal bloom forecasting are discussed.

2 Methods

2.1 Study site

The study site, Lake Erken, is a mesotrophic lake located in east-central Sweden that has a surface area of 24 km², a maximum depth of 21 m, and an average retention time of 7 years. The lake is dimictic, with seasonal stratification commonly beginning in May–June and ending in August–September. The onset of ice cover usually begins in December–February, and the loss of ice occurs in March–April (Persson and Jones, 2008). Located near the Baltic coast, Lake Erken is wind-exposed and susceptible to periodic wind-induced turbulent mixing.

Changes in algal Chl in Lake Erken have a typical seasonal pattern, with spring and summer peaks in concentration (Pettersson et al., 2003). Spring blooms are dominated by dinoflagellates and diatoms (Pettersson, 1985) and initiated by overwintering species from the last autumn (Yang et al., 2016). Cyanobacteria dominate summer peaks in Chl, given that they can optimize their vertical position with regard to nutrients and light (Paerl, 1988; Pierson et al., 1992).

2.2 Data

Lake Erken has a long-running automated monitoring programme that provides hourly meteorological data, water temperature profiles between 0.5 and 15 m at 0.5 m intervals, and the flow from the inflow and outflow (Fig. 1). A manual sampling programme collects samples during ice-free time at 5–7 d intervals for all major nutrient concentrations (e.g. NOx, NH4, PO4, total P, and Si), dissolved oxygen (O2), and Chl concentration. The timing of the onset and loss of ice cover is also monitored yearly by the lab. More detailed information on the sampling programme is in the Supplement (see Sect. S1) and Moras et al. (2019).

2.3 Modelling methods

2.3.1 Process-based (PB) lake model

In this study, a PB hydrodynamic lake model, GOTM (General Ocean Turbulence Model; Burchard et al., 1999), was used to generate water temperature profiles and other hydrodynamic metrics. GOTM also served as the foundation of water quality simulations made with the SELMAPROTBAS model (Mesman et al., 2022) that is coupled to GOTM through the Framework for Aquatic Biogeochemical Models (FABM; Bruggeman and Bolding, 2014).

2.3.2 Data-driven machine learning (ML) models

Tree models have been widely applied in modelling phytoplankton dynamics in freshwater systems (Harris and Graham, 2017; Fornarelli et al., 2013; Rousso et al., 2020). The gradient boosting regressor (GBR) is one of these tree models, iteratively generating an ensemble of estimator trees with each tree improving upon the performance of the previous one. Details about the GBR model can be found in Friedman (2001). The hyperparameters in GBR are optimized via the RandomizedSearchCV function within the Scikit-Learn library. The loss function of the model is chosen as “huber”, which is a combination of the squared error and absolute error of regression. Since the target variable in our research Chl
concentration has peak values during algal blooms, which could be regarded as outliers, the huber loss function, which penalizes large residuals linearly rather than quadratically, is more robust to these peak values than the mean squared error function.

Figure 1. Map of Lake Erken. The locations of the monitoring systems are shown.

The long short-term memory (LSTM) network is part of a class of deep learning architectures, called recurrent neural networks (RNNs), built for sequential and time series modelling (Hochreiter and Schmidhuber, 1997). The core concepts of LSTM are the cell and hidden states and its three gates (input gate, forget gate, and output gate; see Fig. S2 in the Supplement). Essentially, the LSTM model defines a transition relationship for a hidden representation through an LSTM cell which combines the input features at each time step with the inherited information from previous time steps. This architecture is suitable for extracting information from sequential data (Rahmani et al., 2021; Read et al., 2019). The hyperparameter settings in LSTM can be found in the Supplement (see Sect. S2).

Compared to the GBR model, LSTM has a more complex model architecture, carrying the “memory” from the previous time steps. In this study, the GBR and LSTM were applied to assess the performance of ML models without and with memory, respectively. Both ML models are built in Python using the Scikit-Learn (https://scikit-learn.org/stable/, last access: 19 September 2022) and TensorFlow (https://www.tensorflow.org/, last access: 19 September 2022) libraries.

2.4 Design of predictive workflows and shuffling year data sparsity tests

In this study, we tested three workflows using a dataset split for training (years 2004–2016) and testing (years 2017–2020). In all three workflows, a 5-fold cross-validation using the training dataset was used to optimize the hyperparameters in the ML models. Workflow 1 directly predicts Chl concentration based on available environmental observations (Table 1). The training and testing datasets were limited by the frequency of lake nutrient observations, which resulted in 5–7 d gaps between data points. The time step of LSTM was set to 1; that is, the environmental factors on the target date and previous observation date, which may be 5–7 d ago, were used to train the model and make predictions.

In workflows 2 and 3, a two-step approach was applied (Table 1). Daily measurements of physical factors were used to pre-generate daily variations in lake nutrients via separate ML models, and the ML models were trained at a daily time step using the measured environmental factors and pre-generated nutrient concentrations. The time step of LSTM was then set to 7 d.

In workflow 3, three hydrodynamic features, i.e. mixing layer depth (ze), Wedderburn number (Wn), and the seasonal thermocline depth (thermD), derived from the GOTM model were regarded as daily training features in the two-step ML approach. The definitions and calculations of these features are explained in the next section, Sect. 2.5, “Feature selection and processing for ML models”, and the Supplement (Sect. S3).

Following the two-step approach and using workflow 3, we set up two tests. (1) To assess the uncertainty induced by variations in the data used to train the ML models, we shuffled the training years, randomly taking 13 years out of the 2004–2018 dataset 30 times, and tested the model predictions of Chl during 2019–2020. (2) To test if the workflow could be used for other water systems which may have less frequent lake nutrient monitoring data, we conducted a data sparsity test that evaluated the sensitivity of models to the lake nutrient and Chl sampling interval. For this test the lake nutrient and Chl concentration observations in the training dataset were downsampled to a 7, 14, 21, 28, and 35 d sampling interval. Then for each sampling interval using the 2004–2020 dataset, Chl was predicted for different consecutive 4-year periods when the ML models were trained by the remaining 13 years of data. Data shuffling was conducted 13 times so that every 4-year period in our dataset was tested.

2.5 Feature selection and processing for ML models

The feature selection process is based on some a priori knowledge of the underlying phenomena related to algal blooms. All workflows made use of the daily automated monitoring data. In addition, the temperature difference (ΔT) between surface water (averaged over the upper 3 m) and bottom water (15 m) was also used to represent the thermal structure of the lake, and the duration of ice cover in

the previous winter and the number of days from ice-off date were used.

In workflows 2 and 3 nutrients are predicted sequentially, with each pre-generated nutrient prediction included in the training data of the next nutrient prediction (Table 1). Workflow 3 added ze, computed using the GOTM-simulated vertical eddy diffusivity (Kz) profiles; thermD, estimated using Lake Analyzer (Read et al., 2011) based on GOTM-simulated temperature profiles; and Wn, a dimensionless parameter measuring the balance between wind stress and the pressure gradient resulting from the slope of the interface (see Sect. S3), as additional daily training features.

Table 1. List of training features and target variables in each workflow. Stars (*) indicate training features, circles (o) indicate target variables, and squares (□) indicate variables that are the target variables in step 1, used to produce a daily training feature for use in step 2. The nutrient models are run in sequence from top to bottom of the table (NOx to Si).

Variables                                Sample      Workflow 1  Workflow 2        Workflow 3
                                         interval                Step 1   Step 2   Step 1   Step 2
Inflow                                   Daily       *           *        *        *        *
Meteorological data (air temperature,    Daily       *           *        *        *        *
  wind speed, short-wave radiation,
  precipitation, humidity, and
  cloud cover)
ΔT                                       Daily       *           *        *        *        *
Ice duration                             Daily       *           *        *        *        *
Days from ice-off date                   Daily       *           *        *        *        *
ze                                       Daily                                     *        *
Wn                                       Daily                                     *        *
thermD                                   Daily                                     *        *
NOx                                      1–2 weeks   *           □        *        □        *
O2                                       1–2 weeks   *           □        *        □        *
PO4                                      1–2 weeks   *           □        *        □        *
Total P                                  1–2 weeks   *           □        *        □        *
NH4                                      1–2 weeks   *           □        *        □        *
Si                                       1–2 weeks   *           □        *        □        *
Chl                                      1–2 weeks   o                    o                 o

2.6 Evaluating metrics

Model performance was evaluated by comparing the simulated and measured Chl concentrations and by calculating the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R²). To evaluate the accuracy of the model in detecting the onset of an algal bloom, we calculated a confusion matrix in workflows 2 and 3, where the observations were linearly interpolated to daily values, and predicted daily Chl concentrations were smoothed with a 7 d rolling mean. Using these data, the onset of a bloom was categorized as occurring when the daily change of Chl (ΔChl) exceeded a threshold, 0.35 mg m⁻³ d⁻¹. This works well in Lake Erken where Chl concentrations are frequently monitored (near weekly), and the linear interpolation can be expected to be reasonably representative of the Chl concentrations between measured samples. Considering the randomization in the ML models, we also added a 3 d window on the bloom onset prediction; that is, we considered the prediction of a bloom valid if the measured data suggested a bloom the day before or after the simulated onset. We used the true positive rate (TPR), false positive rate (FPR), and modified accuracy (kappa), which considers the possibility of the agreement occurring by chance (McHugh, 2012), to identify the potential of ML models to correctly capture the algal bloom onset (see Table S1). A model with 100 % TPR, 0 % FPR, and 100 % kappa would constitute a perfect fit.
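The onset-detection procedure of Sect. 2.6 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, the rolling mean is assumed to be a trailing window, and the way the ±1 d tolerance enters the confusion matrix is one possible reading of the text, since the exact definitions in Table S1 are not reproduced here. Kappa is computed as Cohen's kappa from the four confusion-matrix counts.

```python
from typing import List, Sequence, Tuple

ONSET_THRESHOLD = 0.35  # mg m^-3 d^-1, daily Chl change threshold from Sect. 2.6


def interpolate_daily(days: Sequence[int], values: Sequence[float]) -> List[float]:
    """Linearly interpolate sparse observations (sorted day indices) to daily values."""
    daily = []
    for k in range(len(days) - 1):
        d0, d1, v0, v1 = days[k], days[k + 1], values[k], values[k + 1]
        for d in range(d0, d1):
            daily.append(v0 + (v1 - v0) * (d - d0) / (d1 - d0))
    daily.append(values[-1])
    return daily


def rolling_mean(series: Sequence[float], window: int = 7) -> List[float]:
    """Trailing rolling mean (assumption); shorter windows are used at the start."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def onset_flags(series: Sequence[float], threshold: float = ONSET_THRESHOLD) -> List[bool]:
    """Flag day i as a bloom onset when Chl[i] - Chl[i-1] exceeds the threshold."""
    return [False] + [b - a > threshold for a, b in zip(series, series[1:])]


def onset_scores(obs: Sequence[bool], pred: Sequence[bool], tol: int = 1) -> Tuple[float, float, float]:
    """Return (TPR, FPR, kappa), matching onsets within +/- tol days (assumed rule)."""
    n = len(obs)
    tp = fp = fn = tn = 0
    for i in range(n):
        lo, hi = max(0, i - tol), min(n, i + tol + 1)
        if obs[i]:
            # Observed onset: a hit if any predicted onset falls inside the window.
            if any(pred[lo:hi]):
                tp += 1
            else:
                fn += 1
        elif pred[i] and not any(obs[lo:hi]):
            # Predicted onset with no observed onset nearby: false alarm.
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    p_obs = (tp + tn) / n  # observed agreement
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)  # chance agreement
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else float("nan")
    return tpr, fpr, kappa
```

In use, `onset_flags(interpolate_daily(days, chl_obs))` would give the observed onsets and `onset_flags(rolling_mean(chl_pred))` the predicted ones, which are then passed to `onset_scores`.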

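The shuffled-training-year design of Sect. 2.4 can likewise be sketched with the standard library. The helper names, the fixed random seed, and the downsampling rule (keep an observation only if at least the target interval has passed since the last kept one) are assumptions made for illustration; the text does not specify how the downsampling was implemented.

```python
import random
from typing import Iterable, List


def shuffled_training_years(pool: Iterable[int], n_train: int = 13,
                            n_draws: int = 30, seed: int = 0) -> List[List[int]]:
    """Draw n_draws random training-year subsets of size n_train from the pool."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    years = list(pool)
    return [sorted(rng.sample(years, n_train)) for _ in range(n_draws)]


def downsample(sample_days: Iterable[int], interval_days: int) -> List[int]:
    """Thin observation days so that kept samples are >= interval_days apart (assumed rule)."""
    kept: List[int] = []
    for day in sorted(sample_days):
        if not kept or day - kept[-1] >= interval_days:
            kept.append(day)
    return kept


# Test (1): 13 of the 15 years 2004-2018 for training, 30 draws, tested on 2019-2020.
draws = shuffled_training_years(range(2004, 2019))
test_years = [2019, 2020]

# Test (2): sampling intervals evaluated in the data sparsity test.
intervals = [7, 14, 21, 28, 35]
```

Each element of `draws` would define one training set; a model would be refitted per draw and scored on `test_years`, which gives the spread of MAE and TPR reported for the shuffling test.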
3 Results

3.1 Workflow 1: direct prediction based on observations

In workflow 1, both the GBR and LSTM clearly reproduced spring and summer blooms (Fig. 2a) but underestimated the intensity of blooms (Fig. 2a, b). Neither ML model captured the extraordinarily high Chl (∼ 15–30 mg m⁻³) in the summer of 2019. Although the abnormal summer bloom in 2019 could contribute to the higher RMSE and MAE in the testing dataset than the mean values in the training dataset, the cross-validation on the training dataset (see Table S2) suggests a possible overfitting issue in both models. The achieved accuracy of models is attributed to the daily availability of physical inputs and the fact that in Lake Erken water samples are collected frequently at 5–7 d intervals. Workflow 1 may be most valuable in reconstructing previous variations in algal Chl, filling the gaps between measured Chl observations, and ranking feature importance (see Fig. S4). But when using this workflow, future forecasts will be limited by the absence of future nutrient data.

3.2 Workflow 2: two-step ML models based on pre-generated daily nutrients and observed physical factors

As in workflow 1, both ML models in workflow 2 had poor fit in the summer of 2019 and suffered from overfitting, leading to higher MAE and RMSE and lower R² in testing datasets than training datasets (see Table S2).

Overall, both the GBR and LSTM showed slightly higher MAE (4.22 and 3.87 mg m⁻³, respectively) and RMSE (6.27 and 6.00 mg m⁻³, respectively) than in workflow 1 (Table 2). But they also showed improved performance in terms of capturing the peak values of Chl during spring blooms (Figs. 2, S5). Both workflows outperformed the SELMAPROTBAS PB model in simulating concentrations of lake nutrients (see Fig. S6). The ML models were more accurate in predicting the low values of NOx and peak values of PO4 and total P. However, both ML models and the PB model failed in predicting the extremely high values of measured lake nutrients, such as the autumn peak of NH4 in 2017 (Fig. S6e) and the spring peak of O2 in 2018 (Fig. S6c). Thus, higher workflow 2 MAE and RMSE (Table 2) are presumably due to the inaccuracies in the pre-generated nutrient training data, but the improved daily predictions that better capture the bloom events outweigh these flaws.

3.3 Workflow 3: based on workflow 2 and including hydrodynamic training features derived from the GOTM model

Including hydrodynamic training information in workflow 3 did not significantly improve lake nutrient predictions compared to workflow 2 (see Fig. S6), and when using workflow 3 both ML models showed comparable performance in Chl predictions compared to workflow 1. However, the predictions of the spring bloom in all years improved compared to workflows 1 and 2, in terms of the magnitude and timing of the spring bloom (Fig. 2e). This was the case in 2019–2020 (Fig. 2a), which was an abnormally warm winter with only 5 d of ice cover and had an unusually early spring algal bloom. Neither the GBR nor LSTM in workflows 2 and 3 captured the extremely intensive bloom (with peak values close to 30 mg m⁻³) in summer of 2019, and neither did the PB model.

Furthermore, adding hydrodynamic features derived from the PB model improved predictions of the onset of algal blooms (Figs. 2e and 4), with the overall TPR increasing by 15 % and 5 % and FPR increasing around 5 % and 3 % in the GBR and LSTM models, respectively. Compared with the PB model, which showed lower TPR (15 %) and FPR (6 %), ML models are more likely to predict algal bloom at the correct time. The optimal TPR was from LSTM in workflow 3, which could detect the onset of algal blooms with TPR close to 50 %. However, the concomitant higher FPRs indicate that an incorrect warning of an algal bloom is also more likely to occur in the ML models, since the PB model is more likely to miss the bloom entirely. The kappa values of both ML models and the PB model are close to 80 %, showing that all models simulated the entire period (blooms and the periods between blooms) to a moderate–strong level (McHugh, 2012).

3.4 Effects of shuffling training years on 2019–2020 predictions

The results presented so far are based on a typical strategy of training ML models for a historical period, in this case 2004–2016, and then assessing model performance in a second period, 2017–2020. The accuracies of the model predictions were to some extent related to the range and variability in the training data. To evaluate the importance of this, we randomly removed 2 years from a 2004–2018 training dataset and made 30 different predictions of Chl during 2019–2020 when the models had difficulties predicting spring and summer blooms (Fig. 5). When trained with the various shuffled combinations, both ML models were capable of reproducing the seasonal variations in algal Chl with a 4.5 % and 5.8 % coefficient of variation (CV) in MAE and a 24.0 % and 16.4 % CV in TPR of GBR and LSTM, respectively (see Table S3 in the Supplement). This provides an indication of the uncertainty that may arise as a consequence of differences in the training datasets used in our workflows. It also shows that even a relatively long training period of 13 years cannot totally capture the system behaviour in such a way as to lead to nearly similar bloom predictions.

Although none of the model runs captured the intensive summer bloom in 2019, the spring bloom in both years was

Figure 2. Time series of observed and predicted Chl from GBR and LSTM models in (a) workflow 1 and (c) workflow 3, and the corre-
sponding scatter plots of observations vs ML predictions of Chl in workflow 1 and workflow 3 are shown in panels (b) and (d), with the
black and blue dots and lines representing the predictions from GBR and LSTM, respectively. Panel (e) shows the observed and predicted
algal bloom onsets in 2017–2020 using the same colour coding as the previous panels. Results from the PB model simulation in Mesman et
al. (2022) are also shown in (c) and (e).
Table 2. Comparisons of model performance during the testing period based on RMSE, MAE, and R². The unit of Chl is milligrams per cubic metre (mg m⁻³). An asterisk (*) marks the best fit of each statistical metric. For comparison of training and testing periods, see Table S2.

Model   PB       ML-workflow 1      ML-workflow 2      ML-workflow 3
                 GBR      LSTM      GBR      LSTM      GBR      LSTM
RMSE    7.18     5.77     5.64*     6.27     6.00      5.94     5.81
MAE     4.77     3.55*    3.58      4.22     3.87      3.99     3.71
R²      −0.25    0.13     0.20*     0.05     0.13      0.14     0.18
well represented, especially by LSTM, in terms of timing and magnitude.

Despite comparable RMSE and MAE in LSTM and the GBR (Fig. 4c), the higher TPRs (with median of 60 %) and FPRs (with median of 18 %) in LSTM indicate that the LSTM model was more aggressive in making algal bloom predictions. The GBR model’s apparent advantage in FPRs (with median 10 %) is largely the result of it making a lower number of bloom predictions since the low concentrations between spring and summer blooms in 2020 were not well represented (Fig. 4b).

3.5 Shuffling year data sparsity test

To examine the possible use of workflow 3 when data are less frequently available, lake nutrient and Chl data were downsampled so that the effects of sampling frequency on model predictions could be evaluated. Each downsampled dataset was also rearranged into 13 different 13-year training periods and 4-year testing periods. The variability in predictions provided a measure of model performance and uncertainty. Figure 5 shows the uncertainty in model predictions as a consequence of the chosen sampling intervals.

The MAEs and RMSEs of both GBR and LSTM models tended to increase with the longer sample intervals. The median MAE was always slightly higher for the LSTM model,
except when trained with the original dataset (Fig. 5a). While our initial evaluation of TPR using 2017–2020 as the testing period and 2004–2016 as the training period suggested the LSTM model was more accurate in terms of detection of algal bloom onsets (Fig. 3), Fig. 5c showed the median TPR of the GBR model calculated by the shuffling year test was over 50 %, higher than that found when using the original testing and training periods. This can be explained by the fact that the 2017–2020 testing period, as in Fig. 3 and shown as large points in Fig. 5, was unusually difficult for the GBR to simulate. Consequently, even though the GBR model usually performs better in the shuffled data test in Fig. 5, Fig. 3, which shows the results of the 2017–2020 testing period, presented the opposite result. This illustrates the importance of the sequence of training and testing years for evaluating model performance.

For the first three sampling intervals the GBR model clearly had better TPR values than the LSTM model. The median TPRs of the GBR model started to drop below 30 % once the sample interval reached 21 d. For LSTM, median TPRs remained lower than 30 % for all sampling intervals but also showed a much wider range of variability (Table S4) dependent on the training and testing datasets used. In general, both models performed best at the original and 7 d sampling interval but then showed slightly worse performance that was consistent up to a sample interval of 21 d. In terms of the errors evaluated over the entire 4-year testing period (Fig. 5a, b) the GBR model had lower errors and, therefore, better predicted the seasonal variations of Chl concentration. The time series comparison of observed and predicted Chl from this shuffling year data sparsity test can be found in the Supplement (Figs. S7–S9).

Figure 3. TPR, FPR, and kappa of GBR and LSTM models in workflows 2 and 3 and of the PB model.

4 Discussion

4.1 Performance of ML models

In three workflows, the ML models successfully reproduced the Chl seasonal patterns, capturing the spring and summer bloom events, with lower averaged RMSEs and MAEs than a PB model simulation that was previously calibrated for Lake Erken. In all three workflows, the LSTM model always showed slightly lower RMSE and MAE and higher R² in predicting Chl concentrations than the GBR model and higher TPR in detecting the onset of algal bloom events. Workflow 1, which predicted Chl based on all available environmental factors including lake nutrient observations, showed that both ML models can reproduce the seasonal dynamics of algal Chl with promising accuracy (MAE = 3.55 and 3.58 mg m⁻³, RMSE = 5.77 and 5.64 mg m⁻³, and R² = 0.13 and 0.20, for GBR and LSTM, respectively) via the direct input of available environmental observations. These ML models can be applied to reconstruct past patterns of algal Chl, fill the gaps between measured Chl observations, and interpret the mechanisms that drive phytoplankton dynamics. Workflows 2 and 3 adopted a two-step approach, first using separate ML models to estimate daily changes in lake nutrient concentration and, in workflow 3, also including PB-model-derived physical factors as training features of the algal ML model. These two workflows allowed for daily predictions of changes in algal Chl concentration using both observations and pre-generated lake nutrient concentrations at a consistent daily time step. At only a minor decrease in performance compared to workflow 1, workflows 2 and 3 demonstrated a wider potential range of applications (e.g. interpolation, reconstruction of historical data, and algal bloom forecasting) by making daily forecasts with less-than-daily measured nutrient observations.

The one clear failure of both the ML- and PB-based model predictions occurred during July–August 2019, when Chl concentrations in integrated samples collected between the surface and 6–12 m exceeded 20 mg m⁻³ over a 5-week period. Neither the PB model nor ML models captured this unusually persistent bloom (Figs. 2, S3). At this time the phytoplankton were dominated by the Cyanobacteria Gloeotrichia and Anabaena, which form a resting akinete life stage at the end of their yearly bloom, which can initiate the following year’s bloom as they are transformed to vegetative cells that migrate from the sediment to the upper water column. We hypothesize that the large summer bloom in 2019 was the result of unusually large recruitment of akinetes in this year (Karlsson-Elfgren et al., 2005, 2004). The life cycle of Cyanobacteria is not a process included in the PB model (but see Hense and Beckmann, 2006, and Jöhnk et al., 2011), so

Figure 4. (a) Time series of observed (red stars) and predicted Chl from GBR (black) and LSTM (blue) models in the shuffling training year
test. The shades represent the range between minimum and maximum prediction, and the solid lines represent the median prediction. Panel
(b) shows the box plot of TPR, FPR, and kappa, and panel (c) shows the box plot of MAE and RMSE of both models in the shuffling training
year test.
Figure 5. Comparisons of (a) MAE, (b) RMSE, and (c) TPR between GBR and LSTM during the testing period created under various
sample intervals. Circles along the box show the result from the testing period of all shuffled training–testing year combinations, and the
bigger circles represent 2004–2016 training and 2017–2020 testing year combination, as used in Fig. 2.

increased recruitment of akinetes could explain the underestimation of the 2019 summer bloom. Even the LSTM algorithms could not account for previous conditions so far back in time as to affect the formation and deposition of Cyanobacteria akinetes (this may require the memory of the last ice-free season). The consequent poor fit of summer bloom in 2019 partially led to the higher MAE and RMSE in the testing dataset compared to the training dataset in all three workflows, in both GBR and LSTM models.

Warm winters can initiate a chain of events, i.e. shortening the ice cover duration, extending spring circulation, affecting nutrient availability, and causing an earlier spring bloom (Adrian et al., 2006; Yang et al., 2016). According to the ice record in Lake Erken (see Fig. S1), in 2020, the lake was covered by very thin ice for only 5 d, which is the shortest duration since observations were first recorded in 1954. The spring bloom in 2020 did occur earlier than other years (see Fig. S3), and both ML models which considered the timing of lake ice show fairly good performance in predicting the timing and magnitude of this abnormally early spring bloom (Figs. 2, 5).

4.1.1 Performance of hybrid PB ML models

One-dimensional PB hydrodynamic models can accurately simulate both water temperature profiles and other hydrodynamic features in Lake Erken using the same forcing data that are commonly input to ML models. The hybrid model structure tested here provides a richer set of input data, leading to more accurate ML predictions of algal Chl at little additional computational cost or data requirements. Using data from the hydrothermal PB model allowed for the seasonal deepening of the thermocline and variations in the surface mixing layer depth and upwelling events, represented by Wn, to be encoded into the ML algorithms. These factors can affect the underwater light climate, the internal loading of phosphorus, and the transport of resting Cyanobacteria colonies from the hypolimnion into the epilimnion, favouring summer blooms of Cyanobacteria (Pierson et al., 1992; Pettersson, 1998). The inclusion of these factors did increase the accuracy of the ML models, especially in the case of unusual environmental conditions (e.g. spring of 2020, Figs. 2, 5) that did not frequently occur in the remaining meteorological, hydrological, and biogeochemical training data.

4.1.2 Prediction of bloom timing

For the purposes of water management, it may be most important to first predict the potential occurrence of a bloom and then, once it is underway, improve predictions of its magnitude. The best model performance in predicting the timing of algal blooms was obtained after adding hydrodynamic features derived from a PB model in workflow 3, with TPR above 45 % in detecting the onset of algal bloom during 2017–2020 and a modified accuracy (kappa) around 80 %, indicating a moderate–strong level of prediction.

Based on our shuffling year tests of bloom timing, the GBR model showed relatively higher median TPRs than the LSTM model for sample intervals less than 1 month. However, in some training and testing year combinations, TPRs are close to 0 % (Fig. 5), and CVs of the TPRs are highly variable, even at the original sample interval, being over 30 % for GBR and over 60 % for LSTM, indicating that the correct detection of algal blooms in both models is highly dependent on the years used to train the models. Thus, while the ML models can be better than the PB models at predicting the onset of algal blooms, they still may not be good enough for operational forecasting. The resulting variability provided a more accurate estimate of the model performance at each downsampled data interval and showed that increasing sample interval led to reduced performance for both ML models, in terms of MAE, RMSE, and the CV of TPR. These tests also highlighted that the performance of both ML models, especially LSTM, varied with the sampled history of events in the training period for evaluating a specific pattern of change in the testing period. We suggest that testing strategies similar to the shuffle methods used in this study are needed to accurately evaluate the expected accuracy of ML models when applied to any given site. The estimated uncertainty in shuffling training year tests (Fig. 4) and shuffling training–testing year tests (Fig. 5) can be used to better represent the uncertainty of ML-derived forecasts.

4.2 Future applications in short-term forecasts and water management

To reach the goal of incorporating ML models into operational forecasts either for short-term management support or longer-term evaluation and planning, two steps must occur. First, the ML model must be developed, trained, and evaluated on the water body of interest due to the unique physical characteristics and water quality dynamics in different systems. Secondly, future forcing data for the model must be obtained and integrated into a workflow that makes the future predictions. With regard to the second point, a lack of frequent water monitoring (Stanley et al., 2019) is a major deterrent to applying ML models to many lakes. The data sparsity test (Fig. 5) showed that, at least for Lake Erken, the ML models can still detect the seasonal algal dynamics even for sample intervals approaching 1 month (Figs. S7–S9). If this result holds for other lakes, the use of the two-step ML workflow could offer a method of forecasting seasonal variations in algal Chl, even in lakes with relatively infrequent nutrient monitoring but higher frequency meteorological and hydrological data.

The hybrid PB/ML models have the potential to provide reasonably accurate and timely short-term algal bloom forecasts, working as part of an early-warning system for water resource management (Baracchini et al., 2020), and
clearly have the ability to predict broader seasonal variations in algal Chl concentration. However, since a large number of water temperature and water quality samples are required for ML training, and since our results only apply to one well-studied lake, obtaining more datasets to test and evaluate the workflows developed here is necessary. Monitoring networks (e.g. Global Lake Ecological Observatory Network, GLEON; https://gleon.org/, last access: 19 September 2022) could provide the data to allow more extensive testing and application of hybrid PB/ML models, and we are presently working in the GLEON network to test the methods developed in this paper on many other lakes.

Code availability. Model version 1.0 has been archived in Zenodo under https://doi.org/10.5281/zenodo.7149563 (Lin, 2022) and is available at https://github.com/Shuqi-Lin/Erken_Algal_Bloom_Machine_Learning_Model.git (last access: 21 December 2022).

Data availability. All data from this study have been archived with the code in Zenodo under https://doi.org/10.5281/zenodo.7149563 (Lin, 2022) in the “training data” folder. Here we also provide the model forcing data in the format used in the machine learning models. Data collected by the Erken laboratory in the archived format used by the Swedish Infrastructure for Ecosystem Science (SITES) are available from the SITES data archive at https://hdl.handle.net/11676.1/qZYc4CMTOyxgvjv_gTAW08SO (Erken Laboratory, 2022).

Supplement. The supplement related to this article is available online at: https://doi.org/10.5194/gmd-16-35-2023-supplement.

Author contributions. The concept of ML model workflow was designed by SL and DCP. SL developed the ML model code and performed the simulations. JPM conducted the PB model simulations. SL wrote the manuscript with contributions from DCP and JPM.

Competing interests. The contact author has declared that none of the authors has any competing interests.

Disclaimer. Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements. Shuqi Lin and this study are funded by the EU and FORMAS project 2018-02771, in the frame of the collaborative international Consortium BLOOWATER (https://www.bloowater.eu/, last access: 19 September 2022), financed under the ERA-NET WaterWorks2017 Cofunded Call. This ERA-NET is an integral part of the 2018 Joint Activities developed by the Water Challenges for a Changing World Joint Program Initiative (Water JPI). Jorrit P. Mesman was funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement nos. 722518 (MANTEL ITN) and 101017861 (SMARTLAGOON). This study has been made possible by the Swedish Infrastructure for Ecosystem Science (SITES), in this case by data from the Erken Laboratory of Uppsala University. SITES receives funding through the Swedish Research Council under grant no. 2017-00635.

Financial support. This research has been supported by the Svenska Forskningsrådet Formas (grant no. 2018-02771).

Review statement. This paper was edited by Le Yu and reviewed by two anonymous referees.

References

Adrian, R., Wilhelm, S., and Gerten, D.: Life-history traits of lake plankton species may govern their phenological response to climate warming, Glob. Change Biol., 12, 652–661, https://doi.org/10.1111/j.1365-2486.2006.01125.x, 2006.
Baracchini, T., Wüest, A., and Bouffard, D.: Meteolakes: An operational online three-dimensional forecasting platform for lake hydrodynamics, Water Res., 172, 115529, https://doi.org/10.1016/j.watres.2020.115529, 2020.
Brookes, J. D. and Carey, C. C.: Resilience to Blooms, Science, 334, 46–47, https://doi.org/10.1126/science.1207349, 2011.
Bruggeman, J. and Bolding, K.: A general framework for aquatic biogeochemical models, Environ. Modell. Softw., 61, 249–265, https://doi.org/10.1016/j.envsoft.2014.04.002, 2014.
Burchard, H., Bolding, K., and Villarreal, M. R.: GOTM, a General Ocean Turbulence Model: Theory, Implementation and Test Cases, European Commission, Joint Research Centre, Space Applications Institute, 103, https://books.google.be/books/about/GOTM_a_General_Ocean_Turbulence_Model.html?id=zsJUHAAACAAJ&redir_esc=y (last access: 19 September 2022), 1999.
Burford, M. A., Carey, C. C., Hamilton, D. P., Huisman, J., Paerl, H. W., Wood, S. A., and Wulff, A.: Perspective: Advancing the research agenda for improving understanding of cyanobacteria in a future of global change, Harmful Algae, 91, 101601, https://doi.org/10.1016/j.hal.2019.04.004, 2020.
Carey, C. C., Ibelings, B. W., Hoffmann, E. P., Hamilton, D. P., and Brookes, J. D.: Eco-physiological adaptations that favour freshwater cyanobacteria in a changing climate, Water Res., 46, 1394–1407, https://doi.org/10.1016/j.watres.2011.12.016, 2012.
Elliott, J. A.: Is the future blue-green? A review of the current model predictions of how climate change could affect pelagic freshwater cyanobacteria, Water Res., 46, 1364–1371, https://doi.org/10.1016/j.watres.2011.12.018, 2012.
Erken Laboratory: Meteorological data from Erken, Malma island, 1988-10-12–2021-12-31, Swedish Infrastructure for Ecosystem Science (SITES) [data set], https://hdl.handle.net/11676.1/qZYc4CMTOyxgvjv_gTAW08SO, last access: 19 September 2022.
Fornarelli, R., Galelli, S., Castelletti, A., Antenucci, J. P., and Marti, C. L.: An empirical modeling approach to predict and
understand phytoplankton dynamics in a reservoir affected by interbasin water transfers, Water Resour. Res., 49, 3626–3641, https://doi.org/10.1002/wrcr.20268, 2013.
Friedman, J. H.: Greedy Function Approximation: A Gradient Boosting Machine, Ann. Stat., 29, 1189–1232, 2001.
Hanson, P. C., Stillman, A. B., Jia, X., Karpatne, A., Dugan, H. A., Carey, C. C., Stachelek, J., Ward, N. K., Zhang, Y., Read, J. S., and Kumar, V.: Predicting lake surface water phosphorus dynamics using process-guided machine learning, Ecol. Model., 430, 109136, https://doi.org/10.1016/j.ecolmodel.2020.109136, 2020.
Harris, T. D. and Graham, J. L.: Predicting cyanobacterial abundance, microcystin, and geosmin in a eutrophic drinking-water reservoir using a 14-year dataset, Lake Reserv. Manage., 33, 32–48, https://doi.org/10.1080/10402381.2016.1263694, 2017.
Hense, I. and Beckmann, A.: Towards a model of cyanobacteria life cycle – effects of growing and resting stages on bloom formation of N2-fixing species, Ecol. Model., 195, 205–218, https://doi.org/10.1016/j.ecolmodel.2005.11.018, 2006.
Hochreiter, S. and Schmidhuber, J.: Long Short-Term Memory, Neural Comput., 9, 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735, 1997.
Huisman, J., Codd, G. A., Paerl, H. W., Ibelings, B. W., Verspagen, J. M. H., and Visser, P. M.: Cyanobacterial blooms, Nat. Rev. Microbiol., 16, 471–483, https://doi.org/10.1038/s41579-018-0040-1, 2018.
Jia, X., Willard, J., Karpatne, A., Read, J., Zwart, J., Steinbach, M., and Kumar, V.: Physics Guided RNNs for Modeling Dynamical Systems: A Case Study in Simulating Lake Temperature Profiles, in: Proceedings of the 2019 SIAM 558–566, https://doi.org/10.1137/1.9781611975673.63, 2019.
Jimeno-Sáez, P., Senent-Aparicio, J., Cecilia, J. M., and Pérez-Sánchez, J.: Using Machine-Learning Algorithms for Eutrophication Modeling: Case Study of Mar Menor Lagoon (Spain), Int. J. Env. Res. Pub. He., 17, 1189, https://doi.org/10.3390/ijerph17041189, 2020.
Jöhnk, K. D., Brüggemann, R., Rücker, J., Luther, B., Simon, U., Nixdorf, B., and Wiedner, C.: Modelling life cycle and population dynamics of Nostocales (cyanobacteria), Environ. Modell. Softw., 26, 669–677, https://doi.org/10.1016/j.envsoft.2010.11.001, 2011.
Karlsson-Elfgren, I., Rengefors, K., and Gustafsson, S.: Factors regulating recruitment from the sediment to the water column in the bloom-forming cyanobacterium Gloeotrichia echinulata, Freshwater Biol., 49, 265–273, https://doi.org/10.1111/j.1365-2427.2004.01182.x, 2004.
Karlsson-Elfgren, I., Hyenstrand, P., and Rydin, E.: Pelagic growth and colony division of Gloeotrichia echinulata in Lake Erken, J. Plankton Res., 27, 145–151, https://doi.org/10.1093/plankt/fbh165, 2005.
Lin, S.: Shuqi-Lin/Erken_Algal_Bloom_Machine_Learning_Model: Erken_Algal_Bloom_Machine_Learning_Model (v1.1), Zenodo [code and data set], https://doi.org/10.5281/zenodo.7149563, 2022.
Marcé, R., George, G., Buscarinu, P., Deidda, M., Dunalska, J., de Eyto, E., Flaim, G., Grossart, H.-P., Istvanovics, V., Lenhardt, M., Moreno-Ostos, E., Obrador, B., Ostrovsky, I., Pierson, D. C., Potužák, J., Poikane, S., Rinke, K., Rodríguez-Mozaz, S., Staehr, P. A., Šumberová, K., Waajen, G., Weyhenmeyer, G. A., Weathers, K. C., Zion, M., Ibelings, B. W., and Jennings, E.: Automatic High Frequency Monitoring for Improved Lake and Reservoir Management, Environ. Sci. Technol., 50, 10780–10794, https://doi.org/10.1021/acs.est.6b01604, 2016.
McHugh, M. L.: Interrater reliability: the kappa statistic, Biochem. Medica, 22, 276–282, 2012.
Mellios, N., Moe, S. J., and Laspidou, C.: Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern European Lakes, Water, 12, 1191, https://doi.org/10.3390/w12041191, 2020.
Mesman, J. P., Ayala, A. I., Goyette, S., Kasparian, J., Marcé, R., Markensten, H., Stelzer, J. A. A., Thayne, M. W., Thomas, M. K., Pierson, D. C., and Ibelings, B. W.: Drivers of phytoplankton responses to summer wind events in a stratified lake: A modeling study, Limnol. Oceanogr., 67, 856–873, https://doi.org/10.1002/lno.12040, 2022.
Moras, S., Ayala, A. I., and Pierson, D. C.: Historical modelling of changes in Lake Erken thermal conditions, Hydrol. Earth Syst. Sci., 23, 5001–5016, https://doi.org/10.5194/hess-23-5001-2019, 2019.
Nelson, N. G., Muñoz-Carpena, R., Phlips, E. J., Kaplan, D., Sucsy, P., and Hendrickson, J.: Revealing Biotic and Abiotic Controls of Harmful Algal Blooms in a Shallow Subtropical Lake through Statistical Machine Learning, Environ. Sci. Technol., 52, 3527–3535, https://doi.org/10.1021/acs.est.7b05884, 2018.
Paerl, H. W.: Nuisance phytoplankton blooms in coastal, estuarine, and inland waters, Limnol. Oceanogr., 33, 823–843, https://doi.org/10.4319/lo.1988.33.4part2.0823, 1988.
Paerl, H. W. and Huisman, J.: Blooms Like It Hot, Science, 320, 57–58, https://doi.org/10.1126/science.1155398, 2008.
Peretyatko, A., Teissier, S., De Backer, S., and Triest, L.: Classification trees as a tool for predicting cyanobacterial blooms, Hydrobiologia, 689, 131–146, https://doi.org/10.1007/s10750-011-0803-4, 2012.
Persson, I. and Jones, I. D.: The effect of water colour on lake hydrodynamics: a modelling study, Freshwater Biol., 53, 2345–2355, https://doi.org/10.1111/j.1365-2427.2008.02049.x, 2008.
Pettersson, K.: The Availability of Phosphorus and the Species Composition of the Spring Phytoplankton in Lake Erken, Internationale Revue der gesamten Hydrobiologie und Hydrographie, 70, 527–546, https://doi.org/10.1002/iroh.19850700407, 1985.
Pettersson, K.: Mechanisms for internal loading of phosphorus in lakes, Hydrobiologia, 373, 21–25, https://doi.org/10.1023/A:1017011420035, 1998.
Pettersson, K., Grust, K., Weyhenmeyer, G., and Blenckner, T.: Seasonality of chlorophyll and nutrients in Lake Erken – effects of weather conditions, Hydrobiologia, 506, 75–81, https://doi.org/10.1023/B:HYDR.0000008582.61851.76, 2003.
Pierson, D. C., Pettersson, K., and Istvanovics, V.: Temporal changes in biomass specific photosynthesis during the summer: regulation by environmental factors and the importance of phytoplankton succession, Hydrobiologia, 243, 119–135, https://doi.org/10.1007/BF00007027, 1992.
Rahmani, F., Lawson, K., Ouyang, W., Appling, A., Oliver, S., and Shen, C.: Exploring the exceptional performance of a deep learning stream temperature model and the value of streamflow data, Environ. Res. Lett., 16, 024025, https://doi.org/10.1088/1748-9326/abd501, 2021.
Read, J. S., Hamilton, D. P., Jones, I. D., Muraoka, K., Winslow, L. A., Kroiss, R., Wu, C. H., and Gaiser, E.: Derivation
of lake mixing and stratification indices from high-resolution lake buoy data, Environ. Modell. Softw., 26, 1325–1336, https://doi.org/10.1016/j.envsoft.2011.05.006, 2011.
Read, J. S., Jia, X., Willard, J., Appling, A. P., Zwart, J. A., Oliver, S. K., Karpatne, A., Hansen, G. J. A., Hanson, P. C., Watkins, W., Steinbach, M., and Kumar, V.: Process-Guided Deep Learning Predictions of Lake Water Temperature, Water Resour. Res., 55, 9173–9190, https://doi.org/10.1029/2019WR024922, 2019.
Rousso, B. Z., Bertone, E., Stewart, R., and Hamilton, D. P.: A systematic literature review of forecasting and predictive models for cyanobacteria blooms in freshwater lakes, Water Res., 182, 115959, https://doi.org/10.1016/j.watres.2020.115959, 2020.
Recknagel, F., Fukushima, T., Hanazato, T., Takamura, N., and Wilson, H.: Modelling and prediction of phyto- and zooplankton dynamics in Lake Kasumigaura by artificial neural networks, Lakes & Reservoirs: Research & Management, 3, 123–133, https://doi.org/10.1111/j.1440-1770.1998.tb00039.x, 1998.
Reichwaldt, E. S. and Ghadouani, A.: Effects of rainfall patterns on toxic cyanobacterial blooms in a changing climate: Between simplistic scenarios and complex dynamics, Water Res., 46, 1372–1393, https://doi.org/10.1016/j.watres.2011.11.052, 2012.
Richardson, J., Miller, C., Maberly, S. C., Taylor, P., Globevnik, L., Hunter, P., Jeppesen, E., Mischke, U., Moe, S. J., Pasztaleniec, A., Søndergaard, M., and Carvalho, L.: Effects of multiple stressors on cyanobacteria abundance vary with lake type, Glob. Change Biol., 24, 5044–5055, https://doi.org/10.1111/gcb.14396, 2018.
Stanley, F. K. T., Irvine, J. L., Jacques, W. R., Salgia, S. R., Innes, D. G., Winquist, B. D., Torr, D., Brenner, D. R., and Goodarzi, A. A.: Radon exposure is rising steadily within the modern North American residential environment, and is increasingly uniform across seasons, Scientific Reports, 9, 18472, https://doi.org/10.1038/s41598-019-54891-8, 2019.
Watson, S. B., Miller, C., Arhonditsis, G., Boyer, G. L., Carmichael, W., Charlton, M. N., Confesor, R., Depew, D. C., Höök, T. O., Ludsin, S. A., Matisoff, G., McElmurry, S. P., Murray, M. W., Peter Richards, R., Rao, Y. R., Steffen, M. M., and Wilhelm, S. W.: The re-eutrophication of Lake Erie: Harmful algal blooms and hypoxia, Harmful Algae, 56, 44–66, https://doi.org/10.1016/j.hal.2016.04.010, 2016.
Wei, B., Sugiura, N., and Maekawa, T.: Use of artificial neural network in the prediction of algal blooms, Water Res., 35, 2022–2028, https://doi.org/10.1016/S0043-1354(00)00464-4, 2001.
Xiao, X., He, J., Huang, H., Miller, T. R., Christakos, G., Reichwaldt, E. S., Ghadouani, A., Lin, S., Xu, X., and Shi, J.: A novel single-parameter approach for forecasting algal blooms, Water Res., 108, 222–231, https://doi.org/10.1016/j.watres.2016.10.076, 2017.
Yang, Y., Stenger-Kovács, C., Padisák, J., and Pettersson, K.: Effects of winter severity on spring phytoplankton development in a temperate lake (Lake Erken, Sweden), Hydrobiologia, 780, 47–57, https://doi.org/10.1007/s10750-016-2777-8, 2016.

Published by Copernicus Publications on behalf of the European Geosciences Union.
