Prediction of Algal Blooms Via Data-Driven Machine Learning Models: An Evaluation Using Data From A Well-Monitored Mesotrophic Lake
Abstract. With increasing lake monitoring data, data-driven machine learning (ML) models might be able to capture the complex algal bloom dynamics that cannot be completely described in process-based (PB) models. We applied two ML models, the gradient boost regressor (GBR) and long short-term memory (LSTM) network, to predict algal blooms and seasonal changes in algal chlorophyll concentrations (Chl) in a mesotrophic lake. Three predictive workflows were tested, one based solely on available measurements and the others applying a two-step approach, first estimating lake nutrients that have limited observations and then predicting Chl using observed and pre-generated environmental factors. The third workflow was developed using hydrodynamic data derived from a PB model as additional training features in the two-step ML approach. The performance of the ML models was superior to a PB model in predicting nutrients and Chl. The hybrid model further improved the prediction of the timing and magnitude of algal blooms. A data sparsity test based on shuffling the order of training and testing years showed the accuracy of ML models decreased with increasing sample interval, and model performance varied with training–testing year combinations.

1 Introduction

Harmful algal blooms, which are a serious threat to natural water systems, have been increasing throughout the world (Burford et al., 2020; Watson et al., 2016), primarily as a consequence of both climate change and increased nutrient loading from anthropogenic activities (Brookes and Carey, 2011; Paerl and Huisman, 2008). Moreover, as indicated by Carey et al. (2012) and Huisman et al. (2018), more intense and longer periods of thermal stratification could specifically favour blooms of toxic Cyanobacteria. To better manage and mitigate the effects of algal blooms, methods to forecast their timing and magnitude are needed. However, the factors regulating algal blooms are complex, variable, and site-specific, often involving high-order interactions of environmental factors and biogeochemical processes (Reichwaldt and Ghadouani, 2012; Richardson et al., 2018).

Process-based (PB) models encode our understanding of biogeochemical processes into a framework of numerical formulations, but these are inevitable simplifications that lead to an incomplete description of complex biogeochemical interactions and a low level of model confidence (Elliott, 2012). Based on innovative data mining and statistical techniques, data-driven machine learning (ML) models have been applied to identify patterns within observed data (Peretyatko et al., 2012; Mellios et al., 2020), and with the recent proliferation of lake monitoring data (Marcé et al., 2016), ML models have been applied as an alternative to PB models for bloom prediction (Rousso et al., 2020). Previously applied ML models, including random forest (Nelson et al., 2018), support vector machine (Jimeno-Sáez et al., 2020), and artificial neural network models (Xiao et al., 2017; Recknagel et al., 1998; Wei et al., 2001), can improve predictions of the timing and seasonality of algal Chl patterns, apparently by accounting for complexity that is difficult to encode within the framework of a PB model. However, a downside of data-driven ML models is that they lack the interpretability and generalization found in the explicit structure of the
PB model. In recent years, the process-guided deep learning (PGDL) model has emerged and has been applied to water temperature (Jia et al., 2019; Read et al., 2019) and water quality (Hanson et al., 2020) simulations, which explicitly combine well-defined physical theories into the training of ML models, enhancing their interpretability. While this approach has achieved promising results, it is difficult to apply it to phytoplankton dynamics due to numerous nonlinear interactions within the biogeochemical cycles and the difficulty in defining measurable processes or mass balances that can be used as a physical constraint on knowledge-guided decisions. Also, the sparsity of lake water quality (e.g. nutrients and Chl concentration) observations can limit the application of ML models in algal bloom modelling (Rousso et al., 2020).

In this study, our objectives are to (1) apply the ML models to predict algal blooms in a well-monitored mesotrophic lake, (2) evaluate model performance and assess model uncertainties, and (3) explore approaches to improve model performance and widen model applications. We first tested the ability of ML models to predict algal Chl concentrations from available environmental factors, including observed lake nutrient data, and then proposed a two-step ML approach for predicting algal dynamics that first estimates lake nutrient concentrations, which often have limited observations, and secondly predicts variations in algal Chl using these pre-generated nutrient concentrations combined with other observed environmental factors that are collected at higher frequency. We also tested a simple hybrid model architecture that, by adding hydrodynamic features derived from the PB model into the training features of the two-step ML approach, allowed us to include additional information describing physical lake processes expected to affect variations in algal growth and succession in the machine learning prediction.

We applied the above workflows to predict changing Chl concentration, as a proxy for the occurrence of algal blooms, via the gradient boost regressor (GBR) and long short-term memory network (LSTM). Two shuffling year tests were conducted. One assessed the uncertainty of ML models in predicting Chl during the same 2-year period, and the other evaluated the sensitivity of ML accuracy to various training–testing year combinations and lake nutrient sampling intervals. Model performance and potential applications in algal bloom forecasting are discussed.

2 Methods

2.1 Study site

The study site, Lake Erken, is a mesotrophic lake located in east-central Sweden that has a surface area of 24 km², a maximum depth of 21 m, and an average retention time of 7 years. The lake is dimictic, with seasonal stratification commonly beginning in May–June and ending in August–September. The onset of ice cover usually occurs in December–February, and the loss of ice occurs in March–April (Persson and Jones, 2008). Located near the Baltic coast, Lake Erken is wind-exposed and susceptible to periodic wind-induced turbulent mixing.

Changes in algal Chl in Lake Erken have a typical seasonal pattern, with spring and summer peaks in concentration (Pettersson et al., 2003). Spring blooms are dominated by dinoflagellates and diatoms (Pettersson, 1985) and initiated by overwintering species from the previous autumn (Yang et al., 2016). Cyanobacteria dominate summer peaks in Chl, given that they can optimize their vertical position with regard to nutrients and light (Paerl, 1988; Pierson et al., 1992).

2.2 Data

Lake Erken has a long-running automated monitoring programme that provides hourly meteorological data, water temperature profiles between 0.5 and 15 m at 0.5 m intervals, and the flow from the inflow and outflow (Fig. 1). A manual sampling programme collects samples during ice-free time at 5–7 d intervals for all major nutrient concentrations (e.g. NOx, NH4, PO4, total P, and Si), dissolved oxygen (O2), and Chl concentration. The timing of the onset and loss of ice cover is also monitored yearly by the lab. More detailed information on the sampling programme is in the Supplement (see Sect. S1) and Moras et al. (2019).

Figure 1. Map of Lake Erken. The locations of the monitoring systems are shown.

2.3 Modelling methods

2.3.1 Process-based (PB) lake model

In this study, a PB hydrodynamic lake model, GOTM (General Ocean Turbulence Model; Burchard et al., 1999), was used to generate water temperature profiles and other hydrodynamic metrics. GOTM also served as the foundation of water quality simulations made with the SELMAPROTBAS model (Mesman et al., 2022) that is coupled to GOTM through the Framework for Aquatic Biogeochemical Models (FABM; Bruggeman and Bolding, 2014).

2.3.2 Data-driven machine learning (ML) models

Tree models have been widely applied in modelling phytoplankton dynamics in freshwater systems (Harris and Graham, 2017; Fornarelli et al., 2013; Rousso et al., 2020). The gradient boosting regressor (GBR) is one of these tree models, iteratively generating an ensemble of estimator trees with each tree improving upon the performance of the previous. Details about the GBR model can be found in Friedman (2001). The hyperparameters in GBR are optimized via the RandomizedSearchCV function within the Scikit-Learn library. The loss function of the model was chosen as “huber”, which is a combination of the squared error and absolute error of regression. Since the target variable in our research Chl
concentration has peak values during algal blooms, which could be regarded as outliers, the Huber loss function is more robust than the mean squared error function, because residuals beyond its threshold are penalized linearly rather than quadratically.

The long short-term memory (LSTM) network is part of a class of deep learning architectures, called recurrent neural networks (RNNs), built for sequential and time series modelling (Hochreiter and Schmidhuber, 1997). The core concepts of LSTM are the cell and hidden states and its three gates (input gate, forget gate, and output gate; see Fig. S2 in the Supplement). Essentially, the LSTM model defines a transition relationship for a hidden representation through an LSTM cell which combines the input features at each time step with the inherited information from previous time steps. This architecture is suitable for extracting information from sequential data (Rahmani et al., 2021; Read et al., 2019). The hyperparameter settings in LSTM can be found in the Supplement (see Sect. S2).

Compared to the GBR model, LSTM has a more complex model architecture, carrying the “memory” from the previous time steps. In this study, the GBR and LSTM were applied to assess the performance of ML models without and with memory, respectively. Both ML models are built in Python using the Scikit-Learn (https://scikit-learn.org/stable/, last access: 19 September 2022) and TensorFlow (https://www.tensorflow.org/, last access: 19 September 2022) libraries.

2.4 Design of predictive workflows and shuffling year data sparsity tests

In this study, we tested three workflows using a dataset split for training (years 2004–2016) and testing (years 2017–2020). In all three workflows, a 5-fold cross-validation using the training dataset was used to optimize the hyperparameters in the ML models. Workflow 1 directly predicts Chl concentration based on available environmental observations (Table 1). The training and testing datasets were limited by the frequency of lake nutrient observations, which resulted in 5–7 d gaps between data points. The time step of LSTM was set to 1; that is, the environmental factors on the target date and previous observation date, which may be 5–7 d ago, were used to train the model and make predictions.

In workflows 2 and 3, a two-step approach was applied (Table 1). Daily measurements of physical factors were used to pre-generate daily variations in lake nutrients via separate ML models, and the ML models were trained at a daily time step using the measured environmental factors and pre-generated nutrient concentrations. The time step of LSTM was then set to 7 d.

In workflow 3, three hydrodynamic features, i.e. mixing layer depth (ze), Wedderburn number (Wn), and the seasonal thermocline depth (thermD), derived from the GOTM model were used as daily training features in the two-step ML approach. The definitions and calculations of these features are explained in the next section, Sect. 2.5, “Feature selection and processing for ML models”, and the Supplement (Sect. S3).

Following the two-step approach and using workflow 3, we set up two tests. (1) To assess the uncertainty induced by variations in the data used to train the ML models, we shuffled the training years, randomly taking 13 years out of the 2004–2018 dataset 30 times, and tested the model predictions of Chl during 2019–2020. (2) To test whether the workflow could be used for other water systems which may have less frequent lake nutrient monitoring data, we conducted a data sparsity test that evaluated the sensitivity of models to the lake nutrient and Chl sampling interval. For this test the lake nutrient and Chl concentration observations in the training dataset were downsampled to a 7, 14, 21, 28, and 35 d sampling interval. Then for each sampling interval using the 2004–2020 dataset, Chl was predicted for different consecutive 4-year periods when the ML models were trained by the remaining 13 years of data. Data shuffling was conducted 13 times so that every 4-year period in our dataset was tested.

2.5 Feature selection and processing for ML models

The feature selection process is based on some a priori knowledge of the underlying phenomena related to algal blooms. All workflows made use of the daily automated monitoring data. In addition, the temperature difference (ΔT) between surface water (averaged over the upper 3 m) and bottom water (15 m) was also used to represent the thermal structure of the lake, and the duration of ice cover in the previous winter and the number of days from ice-off date were used.

In workflows 2 and 3, nutrients are predicted sequentially, with each pre-generated nutrient prediction included in the training data of the next nutrient prediction (Table 1). Workflow 3 added ze, computed using the GOTM-simulated vertical eddy diffusivity (Kz) profiles; thermD, estimated using Lake Analyzer (Read et al., 2011) based on GOTM-simulated temperature profiles; and Wn, a dimensionless parameter measuring the balance between wind stress and the pressure gradient resulting from the slope of the interface (see Sect. S3), as additional daily training features.

Table 1. List of training features and target variables in each workflow. Stars (*) indicate training features, circles (o) indicate target variables, and squares (□) indicate variables that are target variables in step 1 and are used to produce a daily training feature for step 2. The nutrient models are run in sequence from top to bottom based on position in the table (NOx to Si).

2.6 Evaluating metrics

Model performance was evaluated by comparing the simulated and measured Chl concentrations and by calculating the mean absolute error (MAE), root mean square error (RMSE), and correlation coefficient (R²). To evaluate the accuracy of the model in detecting the onset of an algal bloom, we calculated a confusion matrix in workflows 2 and 3, where the observations were linearly interpolated to daily values, and predicted daily Chl concentrations were smoothed with a 7 d rolling mean. Using these data, the onset of a bloom was categorized as occurring when the daily change of Chl (ΔChl) exceeded a threshold, 0.35 mg m⁻³ d⁻¹. This works well in Lake Erken where Chl concentrations are frequently monitored (near weekly), and the linear interpolation can be expected to be reasonably representative of the Chl concentrations between measured samples. Considering the randomization in the ML models, we also added a 3 d window on the bloom onset prediction; that is, we considered the prediction of a bloom valid if the measured data suggested a bloom the day before or after the simulated onset. We used the true positive rate (TPR), false positive rate (FPR), and modified accuracy (kappa), which considers the possibility of the agreement occurring by chance (McHugh, 2012), to identify the potential of ML models to correctly capture the algal bloom onset (see Table S1). A model with 100 % TPR, 0 % FPR, and 100 % kappa would constitute a perfect fit.
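As an illustration of the bloom-onset evaluation described in Sect. 2.6, the following is a minimal sketch rather than the authors' implementation. The function names, the trailing form of the rolling mean, and the event-matching rule (a predicted onset counts as a hit if an observed onset lies within one day either side) are assumptions made for this example; the exact bookkeeping used in the study is given in Table S1, which is not reproduced here. Kappa is computed in the standard Cohen form from the confusion matrix.

```python
def rolling_mean(values, window=7):
    """Trailing rolling mean (assumed form); early days average what is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def onset_flags(chl, threshold=0.35):
    """Flag days whose increase in Chl over the previous day exceeds the threshold (mg m^-3 d^-1)."""
    flags = [False]  # no daily change is defined for the first day
    for prev, curr in zip(chl, chl[1:]):
        flags.append(curr - prev > threshold)
    return flags


def match_onsets(observed, predicted, tolerance=1):
    """Return (TP, FP, FN, TN) day counts, allowing +/- `tolerance` days (assumed matching rule)."""
    n = len(observed)

    def near(flags, i):
        return any(flags[max(0, i - tolerance): i + tolerance + 1])

    tp = sum(1 for i in range(n) if predicted[i] and near(observed, i))
    fp = sum(1 for i in range(n) if predicted[i] and not near(observed, i))
    fn = sum(1 for i in range(n) if observed[i] and not near(predicted, i))
    tn = n - tp - fp - fn
    return tp, fp, fn, tn


def onset_scores(tp, fp, fn, tn):
    """True positive rate, false positive rate, and Cohen's kappa."""
    n = tp + fp + fn + tn
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    p_obs = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_obs - p_chance) / (1 - p_chance) if p_chance != 1 else float("nan")
    return tpr, fpr, kappa


# Hypothetical 10-day example: two observed onsets, two predicted onsets.
observed = [False, False, True, False, False, False, False, False, True, False]
predicted = [False, True, False, False, False, True, False, False, False, False]
counts = match_onsets(observed, predicted)
tpr, fpr, kappa = onset_scores(*counts)
```

In this toy series the prediction on day 1 is accepted because an observed onset follows one day later, whereas the prediction on day 5 is a false alarm and the observed onset on day 8 is missed.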
Figure 2. Time series of observed and predicted Chl from GBR and LSTM models in (a) workflow 1 and (c) workflow 3, and the corresponding scatter plots of observations vs. ML predictions of Chl in workflow 1 and workflow 3 are shown in panels (b) and (d), with the black and blue dots and lines representing the predictions from GBR and LSTM, respectively. Panel (e) shows the observed and predicted algal bloom onsets in 2017–2020 using the same colour coding as the previous panels. Results from the PB model simulation in Mesman et al. (2022) are also shown in (c) and (e).
Table 2. Comparisons of model performance during the testing period based on RMSE, MAE, and R². The unit of Chl is milligrams per cubic metre (mg m⁻³). In bold are the best fits of each statistical metric. For comparison of training and testing periods, see Table S2.
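The three statistics named in the Table 2 caption can be sketched as follows. This is an illustrative example only: the function names and sample values are hypothetical, and because the text describes R² as a correlation coefficient, the sketch assumes the squared Pearson correlation between observed and predicted values. A coefficient of determination would generally give a different number.

```python
import math


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Squared Pearson correlation (assumed meaning of R^2 here)."""
    mean_o = sum(obs) / len(obs)
    mean_p = sum(pred) / len(pred)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


# Hypothetical Chl values in mg m^-3; the last prediction overshoots a peak.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 6.0]
scores = (mae(obs, pred), rmse(obs, pred), r_squared(obs, pred))
```

Because RMSE squares the residuals, the single overshoot raises it more than it raises MAE (1.0 against 0.5 in this example), which is worth keeping in mind when comparing models on bloom peaks.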
well represented, especially by LSTM, in terms of timing and magnitude.

Despite comparable RMSE and MAE in LSTM and the GBR (Fig. 4c), both higher TPRs (with median of 60 %) and FPRs (with median of 18 %) in LSTM indicate that the LSTM model was more aggressive in making algal bloom predictions. The GBR model’s apparent advantage in FPRs (with median 10 %) is largely the result of it making a lower number of bloom predictions since the low concentrations between spring and summer blooms in 2020 were not well represented (Fig. 4b).

3.5 Shuffling year data sparsity test

To examine the possible use of workflow 3 when data are less frequently available, lake nutrient and Chl data were downsampled so that the effects of sampling frequency on model predictions could be evaluated. Each downsampled dataset was also rearranged into 13 different 13-year training periods and 4-year testing periods. The variability in predictions provided a measure of model performance and uncertainty. Figure 5 shows the uncertainty in model predictions as a consequence of the chosen sampling intervals.

The MAEs and RMSEs of both GBR and LSTM models tended to increase with the longer sample intervals. The median MAE was always slightly higher for the LSTM model,
4 Discussion
Figure 4. (a) Time series of observed (red stars) and predicted Chl from GBR (black) and LSTM (blue) models in the shuffling training year
test. The shades represent the range between minimum and maximum prediction, and the solid lines represent the median prediction. Panel
(b) shows the box plot of TPR, FPR, and kappa, and panel (c) shows the box plot of MAE and RMSE of both models in the shuffling training
year test.
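The shuffled training-year draws behind Fig. 4 (13 training years taken at random from 2004–2018, repeated 30 times, with 2019–2020 held out for testing) could be generated along the following lines. This is a sketch under assumptions: the function name, the fixed seed, and sampling without replacement within each draw are choices made for illustration, not details confirmed by the text.

```python
import random


def shuffled_training_years(pool, n_train=13, n_draws=30, seed=0):
    """Draw `n_draws` random sets of `n_train` training years from `pool`."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [sorted(rng.sample(pool, n_train)) for _ in range(n_draws)]


pool = list(range(2004, 2019))   # 2004-2018 inclusive, 15 candidate years
test_years = [2019, 2020]        # fixed testing period
draws = shuffled_training_years(pool)
```

Only 105 distinct 13-year subsets of 15 years exist, so some of the 30 draws may coincide; whether repeated combinations were filtered out in the study is not stated.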
Figure 5. Comparisons of (a) MAE, (b) RMSE, and (c) TPR between GBR and LSTM during the testing period created under various
sample intervals. Circles along the box show the result from the testing period of all shuffled training–testing year combinations, and the
bigger circles represent 2004–2016 training and 2017–2020 testing year combination, as used in Fig. 2.
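The two ingredients of the data sparsity test summarised in Fig. 5, thinning the observations to a coarser sampling interval and rotating the 4-year testing period through the record, can be sketched as below. Both functions are hypothetical illustrations: the greedy minimum-gap thinning rule is an assumption, and a plain enumeration of consecutive 4-year windows over 2004–2020 yields 14 splits, whereas the text reports 13 combinations without saying how they were chosen, so the sketch does not attempt to reproduce that selection.

```python
from datetime import date, timedelta


def downsample(dates, min_gap_days):
    """Keep the first date, then every date at least `min_gap_days` after the last kept one."""
    kept = []
    for d in sorted(dates):
        if not kept or (d - kept[-1]).days >= min_gap_days:
            kept.append(d)
    return kept


def consecutive_test_windows(years, test_len=4):
    """Pair each consecutive `test_len`-year testing window with the remaining training years."""
    splits = []
    for start in range(len(years) - test_len + 1):
        test = years[start:start + test_len]
        train = [y for y in years if y not in test]
        splits.append((train, test))
    return splits


# Hypothetical sampling dates every 6 days, thinned to an interval of at least 14 days.
sampling_dates = [date(2010, 5, 1) + timedelta(days=6 * k) for k in range(11)]
kept = downsample(sampling_dates, 14)
kept_offsets = [(d - sampling_dates[0]).days for d in kept]

splits = consecutive_test_windows(list(range(2004, 2021)))
```

Each split leaves 13 training years and 4 testing years, matching the proportions described for the test.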
increased recruitment of akinetes could explain the underestimation of the 2019 summer bloom. Even the LSTM algorithms could not account for previous conditions so far back in time as to affect the formation and deposition of Cyanobacteria akinetes (this may require the memory of the last ice-free season). The consequent poor fit of the summer bloom in 2019 partially led to the higher MAE and RMSE in the testing dataset compared to the training dataset in all three workflows, in both GBR and LSTM models.

Warm winters can initiate a chain of events, i.e. shortening the ice cover duration, extending spring circulation, affecting nutrient availability, and causing an earlier spring bloom (Adrian et al., 2006; Yang et al., 2016). According to the ice record in Lake Erken (see Fig. S1), in 2020, the lake was covered by very thin ice for only 5 d, which is the shortest duration since observations were first recorded in 1954. The spring bloom in 2020 did occur earlier than in other years (see Fig. S3), and both ML models, which considered the timing of lake ice, show fairly good performance in predicting the timing and magnitude of this abnormally early spring bloom (Figs. 2, 5).

4.1.1 Performance of hybrid PB/ML models

One-dimensional PB hydrodynamic models can accurately simulate both water temperature profiles and other hydrodynamic features in Lake Erken using the same forcing data that are commonly input to ML models. The hybrid model structure tested here provides a richer set of input data, leading to more accurate ML predictions of algal Chl at little additional computational cost or data requirements. Using data from the hydrothermal PB model allowed for the seasonal deepening of the thermocline and variations in the surface mixing layer depth and upwelling events, represented by Wn, to be encoded into the ML algorithms. These factors can affect the underwater light climate, the internal loading of phosphorus, and the transport of resting Cyanobacteria colonies from the hypolimnion into the epilimnion, favouring summer blooms of Cyanobacteria (Pierson et al., 1992; Pettersson, 1998). The inclusion of these factors did increase the accuracy of the ML models, especially in the case of unusual environmental conditions (e.g. spring of 2020, Figs. 2, 5) that did not frequently occur in the remaining meteorological, hydrological, and biogeochemical training data.

4.1.2 Prediction of bloom timing

For the purposes of water management, it may be most important to first predict the potential occurrence of a bloom and then, once it is underway, improve predictions of its magnitude. The best model performance in predicting the timing of algal blooms was obtained after adding hydrodynamic features derived from a PB model in workflow 3, with TPR above 45 % in detecting the onset of algal bloom during 2017–2020 and a modified accuracy (kappa) around 80 %, indicating a moderate–strong level of prediction.

Based on our shuffling year tests of bloom timing, the GBR model showed relatively higher median TPRs than the LSTM model for sample intervals less than 1 month. However, in some training and testing year combinations, TPRs are close to 0 % (Fig. 5), and CVs of the TPRs are highly variable, even at the original sample interval, being over 30 % for GBR and over 60 % for LSTM, indicating that the correct detection of algal blooms in both models is highly dependent on the years used to train the models. Thus, while the ML models can be better than the PB models at predicting the onset of algal blooms, they still may not be good enough for operational forecasting. The resulting variability provided a more accurate estimate of the model performance at each downsampled data interval and showed that increasing sample interval led to reduced performance for both ML models, in terms of MAE, RMSE, and the CV of TPR. These tests also highlighted that the performance of both ML models, especially LSTM, varied with the sampled history of events in the training period for evaluating a specific pattern of change in the testing period. We suggest that testing strategies similar to the shuffle methods used in this study are needed to accurately evaluate the expected accuracy of ML models when applied to any given site. The estimated uncertainty in shuffling training year tests (Fig. 4) and shuffling training–testing year tests (Fig. 5) can be used to better represent the uncertainty of ML-derived forecasts.

4.2 Future applications in short-term forecasts and water management

To reach the goal of incorporating ML models into operational forecasts either for short-term management support or longer-term evaluation and planning, two steps must occur. First, the ML model must be developed, trained, and evaluated on the water body of interest due to the unique physical characteristics and water quality dynamics in different systems. Secondly, future forcing data for the model must be obtained and integrated into a workflow that makes the future predictions. With regard to the second point, a lack of frequent water monitoring (Stanley et al., 2019) is a major deterrent to applying ML models to many lakes. The data sparsity test (Fig. 5) showed that, at least for Lake Erken, the ML models can still detect the seasonal algal dynamics even for sample intervals approaching 1 month (Figs. S7–S9). If this result holds for other lakes, the use of the two-step ML workflow could offer a method of forecasting seasonal variations in algal Chl, even in lakes with relatively infrequent nutrient monitoring but higher frequency meteorological and hydrological data.

The hybrid PB/ML models have the potential to provide reasonably accurate and timely short-term algal bloom forecasts, working as part of an early-warning system for water resource management (Baracchini et al., 2020), and
clearly have the ability to predict broader seasonal variations in algal Chl concentration. However, since a large number of water temperature and water quality samples are required for ML training, and since our results only apply to one well-studied lake, obtaining more datasets to test and evaluate the workflows developed here is necessary. Monitoring networks (e.g. Global Lake Ecological Observatory Network, GLEON; https://gleon.org/, last access: 19 September 2022) could provide the data to allow more extensive testing and application of hybrid PB/ML models, and we are presently working in the GLEON network to test the methods developed in this paper on many other lakes.

Code availability. Model version 1.0 has been archived in Zenodo under https://doi.org/10.5281/zenodo.7149563 (Lin, 2022) and is available at https://github.com/Shuqi-Lin/Erken_Algal_Bloom_Machine_Learning_Model.git (last access: 21 December 2022).

JPI). Jorrit P. Mesman was funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement nos. 722518 (MANTEL ITN) and 101017861 (SMARTLAGOON). This study has been made possible by the Swedish Infrastructure for Ecosystem Science (SITES), in this case by data from the Erken Laboratory of Uppsala University. SITES receives funding through the Swedish Research Council under grant no. 2017-00635.

Financial support. This research has been supported by the Svenska Forskningsrådet Formas (grant no. 2018-02771).

Review statement. This paper was edited by Le Yu and reviewed by two anonymous referees.

References
understand phytoplankton dynamics in a reservoir affected by interbasin water transfers, Water Resour. Res., 49, 3626–3641, https://doi.org/10.1002/wrcr.20268, 2013.
Friedman, J. H.: Greedy Function Approximation: A Gradient Boosting Machine, Ann. Stat., 29, 1189–1232, 2001.
Hanson, P. C., Stillman, A. B., Jia, X., Karpatne, A., Dugan, H. A., Carey, C. C., Stachelek, J., Ward, N. K., Zhang, Y., Read, J. S., and Kumar, V.: Predicting lake surface water phosphorus dynamics using process-guided machine learning, Ecol. Model., 430, 109136, https://doi.org/10.1016/j.ecolmodel.2020.109136, 2020.
Harris, T. D. and Graham, J. L.: Predicting cyanobacterial abundance, microcystin, and geosmin in a eutrophic drinking-water reservoir using a 14-year dataset, Lake Reserv. Manage., 33, 32–48, https://doi.org/10.1080/10402381.2016.1263694, 2017.
Hense, I. and Beckmann, A.: Towards a model of cyanobacteria life cycle – effects of growing and resting stages on bloom formation of N2-fixing species, Ecol. Model., 195, 205–218, https://doi.org/10.1016/j.ecolmodel.2005.11.018, 2006.
Hochreiter, S. and Schmidhuber, J.: Long Short-Term Memory, Neural Comput., 9, 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735, 1997.
Huisman, J., Codd, G. A., Paerl, H. W., Ibelings, B. W., Verspagen, J. M. H., and Visser, P. M.: Cyanobacterial blooms, Nat. Rev. Microbiol., 16, 471–483, https://doi.org/10.1038/s41579-018-0040-1, 2018.
Jia, X., Willard, J., Karpatne, A., Read, J., Zwart, J., Steinbach, M., and Kumar, V.: Physics Guided RNNs for Modeling Dynamical Systems: A Case Study in Simulating Lake Temperature Profiles, in: Proceedings of the 2019 SIAM 558–566, https://doi.org/10.1137/1.9781611975673.63, 2019.
Jimeno-Sáez, P., Senent-Aparicio, J., Cecilia, J. M., and Pérez-Sánchez, J.: Using Machine-Learning Algorithms for Eutrophication Modeling: Case Study of Mar Menor Lagoon (Spain), Int. J. Env. Res. Pub. He., 17, 1189, https://doi.org/10.3390/ijerph17041189, 2020.
Jöhnk, K. D., Brüggemann, R., Rücker, J., Luther, B., Simon, U., Nixdorf, B., and Wiedner, C.: Modelling life cycle and population dynamics of Nostocales (cyanobacteria), Environ. Modell. Softw., 26, 669–677, https://doi.org/10.1016/j.envsoft.2010.11.001, 2011.
Karlsson-Elfgren, I., Rengefors, K., and Gustafsson, S.: Factors regulating recruitment from the sediment to the water column in the bloom-forming cyanobacterium Gloeotrichia echinulata, Freshwater Biol., 49, 265–273, https://doi.org/10.1111/j.1365-2427.2004.01182.x, 2004.
Karlsson-Elfgren, I., Hyenstrand, P., and Riydin, E.: Pelagic growth and colony division of Gloeotrichia echinulata in Lake Erken, J. Plankton Res., 27, 145–151, https://doi.org/10.1093/plankt/fbh165, 2005.
Lin, S.: Shuqi-Lin/Erken_Algal_Bloom_Machine_Learning_Model: Erken_Algal_Bloom_Machine_Learning_Model (v1.1), Zenodo [code and data set], https://doi.org/10.5281/zenodo.7149563, 2022.
Marcé, R., George, G., Buscarinu, P., Deidda, M., Dunalska, J., de Eyto, E., Flaim, G., Grossart, H.-P., Istvanovics, V., Lenhardt, M., Moreno-Ostos, E., Obrador, B., Ostrovsky, I., Pierson, D. [...] E.: Automatic High Frequency Monitoring for Improved Lake and Reservoir Management, Environ. Sci. Technol., 50, 10780–10794, https://doi.org/10.1021/acs.est.6b01604, 2016.
McHugh, M. L.: Interrater reliability: the kappa statistic, Biochem. Medica, 22, 276–282, 2012.
Mellios, N., Moe, S. J., and Laspidou, C.: Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern European Lakes, Water, 12, 1191, https://doi.org/10.3390/w12041191, 2020.
Mesman, J. P., Ayala, A. I., Goyette, S., Kasparian, J., Marcé, R., Markensten, H., Stelzer, J. A. A., Thayne, M. W., Thomas, M. K., Pierson, D. C., and Ibelings, B. W.: Drivers of phytoplankton responses to summer wind events in a stratified lake: A modeling study, Limnol. Oceanogr., 67, 856–873, https://doi.org/10.1002/lno.12040, 2022.
Moras, S., Ayala, A. I., and Pierson, D. C.: Historical modelling of changes in Lake Erken thermal conditions, Hydrol. Earth Syst. Sci., 23, 5001–5016, https://doi.org/10.5194/hess-23-5001-2019, 2019.
Nelson, N. G., Muñoz-Carpena, R., Phlips, E. J., Kaplan, D., Sucsy, P., and Hendrickson, J.: Revealing Biotic and Abiotic Controls of Harmful Algal Blooms in a Shallow Subtropical Lake through Statistical Machine Learning, Environ. Sci. Technol., 52, 3527–3535, https://doi.org/10.1021/acs.est.7b05884, 2018.
Paerl, H. W.: Nuisance phytoplankton blooms in coastal, estuarine, and inland waters, Limnol. Oceanogr., 33, 823–843, https://doi.org/10.4319/lo.1988.33.4part2.0823, 1988.
Paerl, H. W. and Huisman, J.: Blooms Like It Hot, Science, 320, 57–58, https://doi.org/10.1126/science.1155398, 2008.
Peretyatko, A., Teissier, S., De Backer, S., and Triest, L.: Classification trees as a tool for predicting cyanobacterial blooms, Hydrobiologia, 689, 131–146, https://doi.org/10.1007/s10750-011-0803-4, 2012.
Persson, I. and Jones, I. D.: The effect of water colour on lake hydrodynamics: a modelling study, Freshwater Biol., 53, 2345–2355, https://doi.org/10.1111/j.1365-2427.2008.02049.x, 2008.
Pettersson, K.: The Availability of Phosphorus and the Species Composition of the Spring Phytoplankton in Lake Erken, Internationale Revue der gesamten Hydrobiologie und Hydrographie, 70, 527–546, https://doi.org/10.1002/iroh.19850700407, 1985.
Pettersson, K.: Mechanisms for internal loading of phosphorus in lakes, Hydrobiologia, 373, 21–25, https://doi.org/10.1023/A:1017011420035, 1998.
Pettersson, K., Grust, K., Weyhenmeyer, G., and Blenckner, T.: Seasonality of chlorophyll and nutrients in Lake Erken – effects of weather conditions, Hydrobiologia, 506, 75–81, https://doi.org/10.1023/B:HYDR.0000008582.61851.76, 2003.
Pierson, D. C., Pettersson, K., and Istvanovics, V.: Temporal changes in biomass specific photosynthesis during the summer: regulation by environmental factors and the importance of phytoplankton succession, Hydrobiologia, 243, 119–135, https://doi.org/10.1007/BF00007027, 1992.
Rahmani, F., Lawson, K., Ouyang, W., Appling, A., Oliver, S., and Shen, C.: Exploring the exceptional performance of a deep learning stream temperature model and the value of streamflow data, Environ. Res. Lett., 16, 024025, https://doi.org/10.1088/1748-
C., Potužák, J., Poikane, S., Rinke, K., Rodríguez-Mozaz, S., 9326/abd501, 2021.
Staehr, P. A., Šumberová, K., Waajen, G., Weyhenmeyer, G. Read, J. S., Hamilton, D. P., Jones, I. D., Muraoka, K., Winslow,
A., Weathers, K. C., Zion, M., Ibelings, B. W., and Jennings, L. A., Kroiss, R., Wu, C. H., and Gaiser, E.: Derivation
of lake mixing and stratification indices from high-resolution Stanley, F. K. T., Irvine, J. L., Jacques, W. R., Salgia, S. R.,
lake buoy data, Environ. Modell. Softw., 26, 1325–1336, Innes, D. G., Winquist, B. D., Torr, D., Brenner, D. R., and
https://doi.org/10.1016/j.envsoft.2011.05.006, 2011. Goodarzi, A. A.: Radon exposure is rising steadily within the
Read, J. S., Jia, X., Willard, J., Appling, A. P., Zwart, J. A., Oliver, S. modern North American residential environment, and is in-
K., Karpatne, A., Hansen, G. J. A., Hanson, P. C., Watkins, W., creasingly uniform across seasons, Scientific Reports, 9, 18472,
Steinbach, M., and Kumar, V.: Process-Guided Deep Learning https://doi.org/10.1038/s41598-019-54891-8, 2019.
Predictions of Lake Water Temperature, Water Resour. Res., 55, Watson, S. B., Miller, C., Arhonditsis, G., Boyer, G. L., Carmichael,
9173–9190, https://doi.org/10.1029/2019WR024922, 2019. W., Charlton, M. N., Confesor, R., Depew, D. C., Höök, T.
Rousso, B. Z., Bertone, E., Stewart, R., and Hamilton, D. P.: A O., Ludsin, S. A., Matisoff, G., McElmurry, S. P., Murray,
systematic literature review of forecasting and predictive models M. W., Peter Richards, R., Rao, Y. R., Steffen, M. M., and
for cyanobacteria blooms in freshwater lakes, Water Res., 182, Wilhelm, S. W.: The re-eutrophication of Lake Erie: Harm-
115959, https://doi.org/10.1016/j.watres.2020.115959, 2020. ful algal blooms and hypoxia, Harmful Algae, 56, 44–66,
Recknagel, F., Fukushima, T., Hanazato, T., Takamura, N., and Wil- https://doi.org/10.1016/j.hal.2016.04.010, 2016.
son, H.: Modelling and prediction of phyto- and zooplankton Wei, B., Sugiura, N., and Maekawa, T.: Use of artificial neural net-
dynamics in Lake Kasumigaura by artificial neural networks, work in the prediction of algal blooms, Water Res., 35, 2022–
Lakes & Reservoirs: Research & Management, 3, 123–133, 2028, https://doi.org/10.1016/S0043-1354(00)00464-4, 2001.
https://doi.org/10.1111/j.1440-1770.1998.tb00039.x, 1998. Xiao, X., He, J., Huang, H., Miller, T. R., Christakos,
Reichwaldt, E. S. and Ghadouani, A.: Effects of rainfall patterns on G., Reichwaldt, E. S., Ghadouani, A., Lin, S., Xu,
toxic cyanobacterial blooms in a changing climate: Between sim- X., and Shi, J.: A novel single-parameter approach for
plistic scenarios and complex dynamics, Water Res., 46, 1372– forecasting algal blooms, Water Res., 108, 222–231,
1393, https://doi.org/10.1016/j.watres.2011.11.052, 2012. https://doi.org/10.1016/j.watres.2016.10.076, 2017.
Richardson, J., Miller, C., Maberly, S. C., Taylor, P., Globevnik, Yang, Y., Stenger-Kovács, C., Padisák, J., and Pettersson, K.: Ef-
L., Hunter, P., Jeppesen, E., Mischke, U., Moe, S. J., fects of winter severity on spring phytoplankton development in
Pasztaleniec, A., Søndergaard, M., and Carvalho, L.: Ef- a temperate lake (Lake Erken, Sweden), Hydrobiologia, 780, 47–
fects of multiple stressors on cyanobacteria abundance 57, https://doi.org/10.1007/s10750-016-2777-8, 2016.
vary with lake type, Glob. Change Biol., 24, 5044–5055,
https://doi.org/10.1111/gcb.14396, 2018.