Prediction Article - Scientific

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Journal of Marine Systems 208 (2020) 103347

Contents lists available at ScienceDirect

Journal of Marine Systems


journal homepage: www.elsevier.com/locate/jmarsys

Statistical and machine learning ensemble modelling to forecast sea surface T


temperature
Stefan Wolffa, Fearghal O'Donnchab, , Bei Chenb

a
PI, University of Bonn, Bonn, Germany
b
IBM Research, Dublin, Ireland

ARTICLE INFO ABSTRACT

Keywords: In situ and remotely sensed observations have potential to facilitate data-driven predictive models for oceano-
Machine learning graphy. A suite of machine learning models, including regression, decision tree and deep learning approaches
Sea surface temperature were developed to estimate sea surface temperatures (SST). Training data consisted of satellite-derived SST and
Forecasting atmospheric data from The Weather Company. Models were evaluated in terms of accuracy and computational
Modelling
complexity. Predictive skill were assessed against observations and a state-of-the-art, physics-based model from
Statistical models
the European Centre for Medium Weather Forecasting. Results demonstrated that by combining automated
feature engineering with machine-learning approaches, accuracy comparable to existing state-of-the-art can be
achieved. Models captured seasonal patterns in the data and qualitatively reproduce short-term variations driven
by atmospheric forcing. Further, it demonstrated that machine-learning-based approaches can be used as
transportable prediction tools for ocean variables – the data-driven nature of the approach naturally integrates
with automatic deployment frameworks, where model deployments are guided by data rather than user-para-
metrisation and expertise. The low computational cost of inference makes the approach particularly attractive
for edge-based computing where predictive models could be deployed on low-power devices in the marine
environment.

1. Introduction ocean state by combining models with observations to improve short-


term predictions by providing more representative initial conditions. A
Sea surface temperature (SST) is a common indicator of primary state-of-the-art reanalysis is the ERA5 global dataset from the European
productivity in aquaculture (O'Donncha & Grant, 2019), critical for Centre for Medium-Range Weather Forecasts (ECMWF) (Hirahara et al.,
operation of marine-based industries such as power plants (Huang 2016). It provides short-term SST forecasts (and hindcasts) on a 32 km
et al., 2014), while being central to better understanding interactions horizontal grid at hourly intervals from a numerical synthesis of ocean
between the ocean and the atmosphere (Bayr et al., 2019). Recent models, atmospheric forcing fluxes, and SST measurements.
decades has seen enormous progress in approaches to sample SST. In These analysis and forecasting systems face a number of scientific,
particular, satellite technology has vastly increased the granularity of technical, and practical challenges.
measurements that are possible, providing long-term global measure- • The computational and operational requirements for ocean si-
ments at varying spatial and temporal resolution. MODIS (or Moderate mulations at appropriate scales are immense and require high perfor-
Resolution Imaging Spectroradiometer) is a key instrument aboard the mance computing (HPC) facilities to provide forecasts and services in
Terra and Aqua satellites, which acquire imagery data for 36 spectral practical time frames (Bell et al., 2015).
bands, from which information on a range of oceanic processes, in- • Operational forecasting systems require robust data assimilation
cluding SST, can be extracted. schemes that takes account of biases and errors in models and ob-
Concurrently, improvements in high-resolution ocean models to- servations (Rawlins et al., 2007).
gether with increased computational capabilities have made sophisti- A consequence of these challenges is that operational forecasting
cated data-assimilation (DA) schemes feasible – leading to a number of systems are only feasible for large research centres or collaborations
reanalysis products that provide accurate forecasts across broad spatial who have access to large-scale computing resources and scientific ex-
and temporal scales. Reanalyses yield numerical estimates of the true pertise.


Corresponding author.
E-mail address: [email protected] (F. O'Donncha).

https://doi.org/10.1016/j.jmarsys.2020.103347
Received 19 July 2019; Received in revised form 31 March 2020; Accepted 1 April 2020
Available online 06 May 2020
0924-7963/ © 2020 Elsevier B.V. All rights reserved.
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

An alternative approach based on data-intensive computing (Hey et al., 2017; Wiewel et al., 2019). Application of ML-based approaches
et al., 2009), leverages the large datasets generated by ocean mon- has been categorised into three areas (Lary et al., 2016):
itoring and modelling tools to train machine-learning-based forecasting 1. The system's deterministic model is computationally expensive
models. Once trained, the computational expense of these products are and ML can be used as a code accelerator.
negligible, and conceptually, one can develop transportable models that 2. There is no deterministic model but an empirical ML-based
can be trained to learn features at different geographical location. This model can be derived using existing data.
paper presents a suite of data-driven modelling approaches for devel- 3. Classification problems where one wish to identify specific
oping robust systems to predict sea-surface temperature (SST). An au- spatial processes or events.
tomatic feature-engineering module was implemented to identify the A number of studies have investigated data-driven approaches to
key features at disparate geographical locations to provide a transpor- provide computationally cheaper surrogate models, applied to such
table forecasting system. Finally, the different models were averaged things as wave forecasting (James et al., 2018), air pollution (Hähnel
using a model-scoring and weighting approach to provide an ensemble et al., 2020), viscoelastic earthquake simulation (DeVries et al., 2017),
prediction that outperformed the best-performing individual model. and water-quality investigation (Arandia et al., 2018). Pertinent ex-
Contributions are as follows: amples include: ML based approaches to spatially interpolate environ-
• We evaluated the predictive skill of a range of data-driven mental variables and improve precision of solution (Li et al., 2011); DL-
modelling approaches from the perspective of (1) balancing computa- based approaches to increase the resolution of satellite imagery through
tional complexity with predictive skill and (2) leveraging ensemble down-scaling techniques (Ducournau & Fablet, 2016); and data-mining
aggregation to improve robustness. applied to the large datasets generated by ocean monitoring and
• We developed an autonomous feature-engineering module to (1) modelling tools to identify pertinent events such as harmful algal
improve the portability of the model to different geographical locations blooms (Gokaraju et al., 2011).
and (2) reduce the appetite for training data by providing a more in- Distinctive characteristics of SST are their complex temporal-de-
telligent supply of explanatory variables. pendence structure and multi-level seasonality. There are only a few
• Finally, we assessed performance of the modelling framework options to describe systems with such characteristics, including: (1)
globally, against a state-of-the-art physics-based model. Generalised Additive Models (GAMs) from classic statistics, (2) Random
While the idea of using machine learning (ML) to provide compu- Forest (RF) and extreme gradient boosting (XGBoost) from ML, and (3)
tationally cheaper surrogate models has been previously explored, the Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM)
distinctive characteristics of SST lie in their complex temporal depen- models from DL. These five models are all considered in this paper.
dence structure and multi-level seasonality. To our knowledge, this
application has not yet been considered in the existing literature. We 3. Machine learning
demonstrate the viability of the approach to capture the short- and
long-term trends: integrating different ML based models with different Given sufficient data, ML models have the potential to successfully
temporal performance characteristics in an ensemble approach pro- detect, quantify, and predict various phenomena in the geosciences.
vides accuracy on par with large scale complex models. While physics-based modelling involves providing a set of inputs to a
In the next section, we discuss prior research in the domain. model which generates the corresponding outputs based on a non-linear
Subsequently, the different models are introduced along with the fea- mapping encoded from a set of governing equations, supervised ma-
ture extraction and ensemble aggregation techniques. Section 5 com- chine learning instead learns the requisite mapping by being shown
pares performances of the different models along with the predictive large number of corresponding inputs and outputs. In ML parlance, the
accuracy of individual and ensemble aggregated models. The port- model is trained by being shown a set of inputs (called features) and
ability of the system to different geographical locations is discussed. corresponding outputs (termed labels) from which it learns the pre-
Finally, we present conclusions from the research and discuss future diction task – in our case, given some specific atmospheric measure-
work. ments we wish to predict the sea surface temperature. With availability
of sufficient data, the challenge reduces to selecting the appropriate ML
2. Related work model or algorithm, and prescribing suitable model settings or hy-
perparameters. A model hyperparameter is a characteristic of a model
A wide variety of operational SST forecasting products exist that that is external to the model and whose value cannot be estimated from
leverage physics-based circulation modelling and data assimilation to data. In contrast, a parameter is an internal characteristic of the model
resolve temperature distributions. A representative example is the and its value can be estimated from data during training.
forecasting system for the North-West Atlantic from the NEMO Classical works in machine learning and optimisation, introduced
Community Ocean Model, which provides a variety of ocean variables the "no free lunch” theorem (Wolpert & Macready, 1997), demon-
at 12 km resolution. Inputs to the system include: lateral boundary strating that no single machine learning algorithm can be universally
conditions from the open-ocean supplied by a (coarser) global model, better than any other in all domains – in effect, one must try multiple
atmospheric fluxes from the Met Office Unified Model and river inputs models and find one that works best for a particular problem. This study
from 320 European rivers (O'Dea et al., 2012). Other examples include considers five different machine learning algorithms to predict SST. The
the National Centers for Environmental Prediction Climate Forecast study aims to 1) evaluate the performance of each to predict SST, 2)
System, which provides global predictions of SST at 110 km resolution investigate whether simple model aggregation techniques can improve
(Saha et al., 2014), and the US Navy HYCOM Global Forecasting predictive skill and 3) provide insight that can be used to guide selec-
System, which provides 5-day forecasts at resolution ranging from 4 to tion of appropriate model for future studies. While the specifics of each
20 km (Chassignet et al., 2009), together with localised, regional individual model vary, the fundamental approach consists of solving an
models at higher resolution (Chao et al., 2009; Haidvogel et al., 2008; optimisation problem on the training data until the outputs of the
O'Donncha et al., 2015). The common feature of these modelling sys- machine learning model consistently approximates the results of the
tems is the high computational demands that generally limit either the training data. In the remainder of this section, we will describe each ML
precision (coarse global models) or the size of the domain (high-re- model used and provide heuristics for the selection of appropriate hy-
solution, regional models). perparameters. The objective can be summarised as relating a uni-
Due to the heavy computational overhead of physical models, there variate response variable y to a set of explanatory variables x = x1, x2,
is an increasing trend to apply data-driven deep-learning (DL)/ma- …, xi (representing for example, air temperature, seasonal identifier,
chine-learning methods to model physical phenomena (de Bezenac current SST, etc.).

2
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

3.1. Generalised additive models splitting criterion and selects the splitting predictor from a randomly
selected subset of predictors (the subset is different at each split). Each
Linear regression models are ubiquitous in statistical modelling and node in the regression tree corresponds to the average of the response
prediction providing a simple technique to relate predictors or features within the subdomains of the features corresponding to that node. The
to the outcome. The relationship is linear and can be written for a single node impurity gives a measure of how badly the observations at a given
instance as: node fit the model. In regression trees this is typically measured by the
residual sum of squares within that node. Each tree is constructed from
y= + 1 x1 + ...+ i x i + (1)
a bootstrap sample drawn with replacement from the original data set,
0

where the βi's are unknown parameters or coefficients that must be and the predictions of all trees are finally aggregated through majority
determined, the variables xi are features that can explain the response voting. (Boulesteix et al., 2012)
variable y, and the error ε is a Gaussian random variable with ex- While RF is popular for its relatively good performance with little
pectation zero. hyperparameter tuning (i.e. works well with the default values specified
The appeal of the linear regression model lies primarily with its in the software library), as with all machine learning models it is ne-
simplicity and ease of interpretability. Since prediction is modelled as a cessary to consider the bias-variance tradeoff – the balance between a
weighted sum of the features, one can easily quantify the effect of model that tracks the training data perfectly but does not generalise to
changes to features on the outcome. This simplicity is also its greatest new data and a model that is biased or incapable of learning the
weakness since in many real-world situations: the relationship between training data characteristics. Some of the hyperparameters to tune in-
the features and the outcome might be nonlinear, features may interact clude number of trees, maximum depth of each tree, number of features
with each other, and the assumption of a Gaussian distribution of errors to consider when looking for the best split, and splitting criteria (Probst
may be untrue. et al., 2019).
Generalised Additive Models (GAMs) extend on linear models by
instead relating the outcome to unknown smooth functions of the fea- 3.3. XGBoost
tures. Predicting y from the vector of covariates x, at time t is as (Hastie
& Tibshirani, 1990): While XGBoost shares many characteristics and advantages with RF
g (y ) = + f1 (x1) + f2 (x2 ) + ...+fi (x i ) + , (2) (namely interpretability, predictive performance and simplicity), a key
difference facilitating performance gain is that decision trees are built
where each fi(⋅) is an unspecified function and g(.) is a link function sequentially rather than independently. The XGBoost algorithm was de-
defining how the response variable relates to the linear predictor of veloped at the University of Washington in 2016 and since its in-
explanatory variables (e.g. binomial, normal, Poisson) (Wijaya et al., troduction has been credited with winning numerous Kaggle competi-
2015). tions and being used in multiple industry applications. XGBoost
The functions fi(⋅) can be estimated in many ways, most of which provides algorithmic improvements such as sparsity-aware algorithm
involve computer-intensive statistical methods. The basic building for sparse data and weighted quantile sketch for approximate tree
block of all these variations is a scatterplot smoother, which takes a learning, together with optimisation towards distributed computing, to
scatter plot and returns a fitted function that reasonably balances build a scalable tree boosting system that can process billions of ex-
smoothness of the function against fit to the data. The estimated amples (Chen & Guestrin, 2016).
function fi(xi) can then reveal possible nonlinearities in the effect of the The tree ensemble model follows a similar framework to RF with
explanatory variable xi. GAM models are particularly appealing for prediction of the form (Chen & Guestrin, 2016):
analysing time-series datasets in the geosciences due to interpretability,
K
additivity of signal and regularisation: as mentioned, GAM lends itself
yi = (x i ) = fk (x i ), fk ,
towards interpretable models where the contribution of each ex- k=1 (3)
planatory variable is easily visualised and interpreted; time-series sig-
nals can be often explained by multiple additive components such as where we consider K trees, ℱ = {f(x) = wq(x)} represents a set of
trends, seasonality and daily fluctuations which can be readily in- classification and regression trees (CART), q represents each in-
corporated in GAM models; as opposed to simpler regression models dependent decision-tree structure, and wq(x) is the weight of the leaf
targeted only at reducing the error, GAM admits a tuning parameter λ which is assigned to the input x. ℱ is computed by minimising the
that guides the” smoothness” of the model prediction (allowing us to objective function (Chen & Guestrin, 2016):
explicitly balance the bias/variance tradeoff) (Friedman et al., 2001).
= l (yi , yi ) + (fk ),
This parameter as well as the number of splines and polynomial-spline i k
order are typically specified by the user based on heuristics, experience 1
and model performance. with (f ) = w 2,
2 (4)

3.2. Random Forest with l being a differentiable convex loss function (for example the mean
squared error) of the difference between the prediction yi and the ob-
Moving from statistical learning models such as GAM to those from servation yi for each realisation i. The regularisation term, Ω, smooths
the machine learning library, Random Forests (RF) have demonstrated the final weights to avoid over-fitting (λ is a regularisation coefficient).
excellent performance in complex prediction problems characterised by Furthermore, a restriction to a maximal tree depth serves to regulate
a large number of explanatory variables and nonlinear dynamics. RF is model complexity.
a classification and regression method based on the aggregation of a
large number of decision trees. Decision trees are a conceptually simple 3.4. Multi-layer perceptron
yet powerful prediction tool that breaks down a dataset into smaller
and smaller subsets while at the same time an associated decision tree is The first DL-based approach investigated was a Multi-Layer
incrementally developed. The resulting intuitive pathway from ex- Perceptron (MLP) model. An MLP network solves an optimisation
planatory variables to outcome serves to provide an easily interpretable problem to compute the weights and biases that represent the nonlinear
model. function mapping inputs to the best representation of outputs, y :
In RF (Breiman, 2001), each tree is a standard Classification or
g (x; ) = y . (5)
Regression Tree (CART) that uses what is termed node” impurity” as a

3
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

Fig. 1. Schematic of an MLP machine learning network as illustrated in (James et al., 2018).

Θ denotes the mapping matrix of weights and biases that represents the space of nonlinear functions mapping from x to y.
the relationship between SST and explanatory variables, x in the form Hyperparameter tuning is required to balance the effective capacity
of a neural network. of the model and the complexity of the task. In neural network type
An MLP model is organised in sequential layers made up of inter- approaches, increasing the number of layers and of hidden units per
connected neurons. As illustrated in Fig. 1, the value of neuron n in layer increases the capacity of the model to represent complicated
hidden layer ℓ is calculated as: functions. Hence increasing the depth of the network can improve
N
performance on the training data but run the risk of overfitting – thereby
1
an( )
=f wk(, n) ak( 1)
+ bn( )
, reducing generalisation potential. Standard hyperparameters to tune in
k=1 (6) neural networks include the number of layers, number of nodes and the
regularisation coefficient λ.
where f is the activation function, N 1 is the number of nodes in layer
ℓ − 1, wk, n(ℓ) is the weight projecting from node k in layer ℓ − 1 to node
n in layer ℓ, ak(ℓ−1) is the activation of neuron k in hidden layer ℓ − 1, 3.5. Long short-term memory model
and bn(ℓ) is the bias added to hidden layer ℓ contributing to the sub-
sequent layer. The activation function selected for this application was Cognisant of the temporal nature of the data, we investigated the
the rectified linear unit (ReLU) (Nair & Hinton, 2010): performance of recurrent neural network (RNN) type models. A fun-
damental extension of RNNs compared to MLP is parameter sharing
f (z ) = max(0, z ). (7) across different parts of the model. This has intuitive applicability to the
A loss function is defined in terms of the squared error between the forecasting of time-series variables with historical dependency. An RNN
observations and the machine-learning prediction plus a regularisation with a single cell recursively computes the hidden vector sequence h
contribution controlled by λ: and output vector sequence y iteratively from t=1,…, T in the form
m
(Graves et al., 2013):
1 2 2
= y (k ) y (k ) 2 + 2,
2 (8) ht = (Wxh xt + Whh ht 1 + bh ),
k=1
yt = Why yt + b y . (9)
where the ‖ ⋅ ‖2 indicates the L2 norm. The regularisation term penalises
complex models by enforcing weight decay, which prevents the mag- where the W terms denote weight matrices (e.g. Wxh is the input-hidden
nitude of the weight vector from growing too large because large layer weight matrix), the b terms denote bias vectors (e.g. bh is hidden
weights can lead to overfitting – a condition where the model fits the layer bias vector) and ℋ is the hidden layer function which is typically
training data well but does not generalise to new data (Goodfellow implemented as a sigmoid function. In effect, the RNN has two inputs,
et al., 2015). the present state and the past.
By minimising the loss function, the supervised machine learning Standard RNN approaches have been shown to fail when lags be-
algorithm identifies the Θ that yields y y . As shown in Fig. 1, a tween response and explanatory variables exceed 5–10 discrete time-
machine learning algorithm transforms an input vector (layer) to an steps (Gers, 1999). Repeated applications of the same parameters can
output layer through a number of hidden layers. The machine learning give rise to vanishing, or exploding gradients leading to model stag-
model is trained on a data set to establish the weights parameterising nation or instability (Goodfellow et al., 2015). A number of approaches

4
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

have been proposed in the literature to address this, with the most pattern of the data is evident with yearly cycle capturing a significant
popular being LSTM. portion of the data variance. The data residuals largely represent short-
Instead of a simple weighted dependency, ‘LSTM cells’ also have an term fluctuations in the data (together with sensor uncertainty com-
internal recurrence (a self-loop), that serves to guide the flow of in- ponent). The objective of the modelling was to learn the nonlinear re-
formation and reduce susceptibility to vanishing or exploding gra- lationships between the explanatory variables and the long- and short-
dients. Each cell has the same inputs and outputs as an ordinary re- term signals of the data.
current network, but also has more parameters and a system of gating For machine learning forecasts, the raw data themselves are rarely
units that controls the flow of information. An LSTM model has a the most informative and a number of combinations and transforma-
number of gates: input, output and forget gates that decide whether to tions of the raw data must be considered. The feature variables used for
let information in, forget information because it is not important, or let this study consisted of SST historical time series data from MODIS Aqua
it impact output at the current timestep, respectively. As new input satellite, atmospheric data from The Weather Company (TWC), and
comes in, it's impact can be accumulated to the cell, forgotten or pro- time features (season, day of year, etc.). From these raw data, several
pagated to the final state depending on the activation of the relevant different types of features were designed and investigated. The feature
gates (Shi et al., 2015). In analogy to the MLP, we use L2-regularisation engineering process combined domain expertise to initially select
of weights represented by the parameter λ, in an equivalent manner to known variables influencing SST, with statistical analysis to explore
Eq. (8). More details on LSTM are provided in Gers (Gers, 1999). strength of relationship between a large number of features and the
response variable. The dependence or correlation between each feature
3.6. Feature engineering and response variable was determined based on a univariate feature
selection using the SciKit-Learn (Buitinck et al., 2013) feature selection
In traditional modelling based on solving a set of partial differential library. A subset of features with highest F-scores (Guyon et al., 2008)
equations (PDE), the relationship between inputs and outputs are clear were retained. The implementation of the feature selection approach is
– founded on well-understood physics. Machine learning on the other described in more detail in Section 4.2.
hand relies on the concept of learning complex, nonlinear relationships
between inputs and outputs. While the outputs are clear (the variable 4. Methodology
we wish to predict), the inputs are more opaque and one wishes to
consider all variables that potentially contribute to the output response, Application of ML techniques can be reduced to a number of steps
while avoiding superfluous data that may hinder performance. When related to: selection of appropriate ML algorithms, providing sufficient
predicting SST, some of the variables that may contribute include a data to train the models, and selecting the correct model hyperpara-
wide range of atmospheric conditions (air temperature, solar radiation, meters (settings or parameters that must be defined outside the learning
cloud cover, precipitation, wind speed, etc.), autoregressive features algorithm) for the model. In the remainder of this section, we will de-
(i.e. past values of the response variable – SST), temporal information scribe the training data used, provide details on each model considered
(e.g. season, day of year, time of day), and potentially values at and outline the application of each model to the problem of forecasting
neighbouring spatial locations. Feature engineering is the process of SST.
using domain expertise and statistical analysis to extract the most ap-
propriate set of features for a particular problem from the entire set of 4.1. Input data
data that may contribute. The role of feature engineering is to improve
predictive accuracy and expedite model convergence by selecting the Training data were from the MODIS instrument aboard the NASA
most appropriate features that explain the response variable and pro- Aqua satellite. MODIS SSTs are produced and made available to the
vide maximum value. Excluding important data will limit the predictive public by the NASA GFSC Ocean Biology Processing Group. The MODIS
skill of the model while superfluous data tends to add noise to the sensor measures ocean temperature (along with other ocean products
model. such as salinity and Chlorophyll concentration) from a layer < 1 mm
Fig. 2 shows multi-year SST data illustrating primary patterns. A thick at the sea surface. Data are available from 2002 to present at 4 km
monthly rolling mean of the data (middle plot) was subtracted from the horizontal resolution and daily intervals (Ocean Biology Processing
raw data (top plot) with residuals presented (bottom plot). The seasonal Group, 2015). Calibration of the Pathfinder algorithm coefficients and
tuning of instrument configurations produce accurate measurements of
SST with mean squared error (MSE) against in situ sensors < 0.2 °C
(Kilpatrick et al., 2015). These accurate global SST measurement over a
multi-decade period, serve as an ideal dataset to extract insights using
ML. Daily, weekly (8 day), monthly and annual MODIS SST products
are available at both 4.63 and 9.26 km spatial resolution and for both
daytime and nighttime passes. The particular dataset we used was the
MODIS Aqua, thermal-IR SST level 3, 4 km, daily, daytime product
downloaded from the Physical Oceanography Distributed Active Ar-
chive Center (PODAAC) (Ocean Biology Processing Group, 2015). This
MODIS SST data served as labels to the machine learning algorithm
while the data were also used as autoregressive (lagged) features to the
model.
As described in Section 3.6, various combinations of atmospheric
variables were provided as features to the model, extracted from The
Weather Company through their public API (IBM, 2018). The variables
used were the 18 atmospheric quantities included as part of the stan-
dard weather variables described in the API documentation (The
Weather Company, 2018). While we do not have rights to redistribute
Fig. 2. SST time series from MODIS measurements (upper panel), monthly The Weather Company data, a free API key can be obtained to down-
rolling mean (middle panel), and residuals after subtraction of the monthly load the data from the vendor.
rolling mean from the SST data (lower panel). A key part of any modelling study is validation of the prediction and

5
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

comparison against benchmark values. While not provided as inputs to dependent atmospheric quantities reporting standard meteorological
the models, we used data from ECMWF model data to assess predictive variables such as, air temperature, solar radiation flux, cloud cover and
skill. ECMWF ERA5 dataset provides an atmospheric reanalysis of the winds (IBM, 2018). Three different model scenarios were created from
global climate at 32 km horizontal grid at hourly intervals from a nu- these with different combinations of atmospheric data, namely:
merical synthesis of ocean models and atmospheric forcing fluxes • all 18 atmospheric quantities at the desired time were fed to the
(Hirahara et al., 2016). We downloaded SST data at the nearest grid cell model (we refer to this scenario as TWC1).
to the MODIS dataset using the ECMWF Climate Data Store API • To reduce the number of covariates (and hence network size and
(CDSAPI) to serve as a validation dataset. associated demands for training data), a feature-selection module
quantified the most important variables. Univariate feature selection
4.2. Model setup and training was performed by computing F-scores from the correlation of each
single features with the output label (Guyon et al., 2008) and retaining
As described in Section 3, there are three primary steps to deploy- the three atmospheric features with highest scores as described in
ment of a machine learning model: Section 3.6 (referred to as TWC2).
• Feature engineering, where the requisite explanatory data are • This concept was further extended with time-dependent in-
extracted, processed and combined to be fed to the model (described in formation by assigning univariate scores to lagged values of the selected
Section 3.6). atmospheric features in scenario 2, and choosing the lags with the
• Selection of the most appropriate hyperparameters for the model. highest scores as features (referred to as TWC3). This reflected that SST
• Training the model by feeding the training data to the model is also likely to be influenced by atmospheric conditions (e.g. air tem-
which finds patterns in the data that map the input data attributes to perature) at previous days.
the target. The resultant set of features to be considered for each model were,
The first two steps were conducted on a dataset extracted from an AR features with specific lag (AR), time features (time) and most ap-
arbitrary location in the North Atlantic: (27∘28'46.45'' N, 32∘25'43.71'' propriate combination of weather features (TWC1, TWC2 and TWC3).
W). MODIS SST data were collected over 16 years from July 2002 For all five models these set of features were investigated and for each
(earliest available data) to December 2018. Satellite measurements are model the best performing combination were selected that minimised
prone to missing data – for instance due to cloud cover. For this loca- error against the test dataset. Emanating from the different character-
tion, data was missing 57% of days, which was representative of data istics and complexities of the models it was not expected that a single
availability at other locations also. As data gaps are problematic for the feature combination would provide best performance across all models.
training of time-series models (due to auto regressive dependencies), Instead effective machine learning implementations requires a careful
linear interpolation between adjacent values replaced the missing data. balance of appropriate features, model or algorithm complexity, and
While this can introduce artefacts to the data, the fact that missing hyperparameter selection.
values were evenly distributed across the entire dataset and the time The next stage of model setup considered hyperparameter optimi-
series nature of the data made interpolation the best approach. sation for each of the models. In general machine learning models have
Moreover, the secondary weather input, TWC reanalysis data, were a number of hyperparameters, and the selection of the most appropriate
complete. is a combination of heuristics, expertise and trial-and-error. For each
As previously described, the experiments compared a number of model, hyperparameter optimisations adopted a greedy, grid-search
feature selection approaches to incorporate atmospheric and auto- approach over the user-defined parameter ranges summarised in
regressive effects. Initially the most appropriate number of lags to Table 1. The x and y data were split into two groups, to form the
specify as autoregressive SST features were selected based on heuristics training-data set composed of 90% of the 6018 rows of data, and the
(different temporal scales involved such as daily, seasonal, year) and a test-data set the remaining 10%. For each model the learning algorithm
trial-and-error search of a limited number of possible lags. To simplify was trained on the training data and then applied to the test data set
analysis of lag selection, the models were considered as autoregressive and the MSE between test data vector, y, and its machine-learning re-
models for this stage (i.e. we only supplied SST at previous timesteps as presentation y was calculated. The hyperparameter combination that
inputs and did not include weather data). We observed that these minimised this MSE was selected for each model. The selected values
simple autoregressive models provided adequate predictive skill for are presented in Table 1 and discussed in more detail in Section 5 where
short-term forecasting of up to two days (for longer-term predictions, we evaluate model performance.
atmospheric features were critical for performance). Nevertheless, this A number of Python toolkit libraries were used to access high-level
simplified modelling study enabled insight into the most suitable
number of lags (or number of AR steps) to include in each model de-
ployment. For the GAM, RF and XGBoost models, the optimal lags were Table 1
found to be approximately 30 days, which balanced computational Hyperparameters and ranges used for model design. See Section 3 for
tractability with predictive skill. To incorporate seasonal effects (and details on each model hyperparameter.
also due to greater computational efficiency), the MLP and LSTM model Model Hyperparameters
were fed data from up to the previous 400 days (to extend beyond one
year of historical trend). It's worth noting that when including AR GAM # of splines/features ∈{10,15,20}
polynomial-spline order ∈{3,5,8}
features, it introduces a temporal dependency which is important if one
λ ∈ {0.001,0.01,0.1,1,10,100}
wishes to make forecast multiple days in advance – i.e. to make forecast RF # of trees/features ∈{100,200,500}
for day t + 2, predicted SST for day t + 1 is provided as a feature. This max # of features ∈{3,5,10}
allowed for long-term prediction but introduced the possibility of sys- max depth ∈{5,10,15,20}
XGBoost # of trees/features ∈{500,700,1000}
tematic model error and bias (i. e. prediction error accumulated). This
max depth ∈{5,10,15,20}
is analogous to model drift observed in numerical modelling studies λ ∈ {0.01,0.05,0.1,0.5}
where model forecast can diverge from true state over time (Lorenz, MLP # of layers ∈{5,10,20}
1963). # of nodes/layer ∈{20,50,75}
The AR features described above were combined with time features λ ∈ {0.001,0.01,0.1,1}
LSTM # of layers ∈{1,2}
(season, month and week of year), and various combinations of atmo-
# of units/layer ∈{1,2,3}
spheric features to construct different model scenario inputs. Weather λ ∈ {0.001,0.01,0.1,1}
feature were selected from atmospheric data consisting of 18 time-

6
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

programming interfaces to statistical and machine learning libraries predictive skill. Table 2 presents model-selection results considering
and to cross validate results. The GAM model was implemented using hyperparameters and feature engineering. The combination of model
the LinearGAM API from pyGAM (Servn & Brummitt, 2018), Random complexity and size of the features datasets are evident. Relatively
Forest from the widely-used SciKit-Learn (Buitinck et al., 2013) toolkit, simple models like GAM and RF provided best performance with more
and XGBoost from the python implementation of the software library sophisticated feature engineering that reduced the size of the dataset.
(Chen & Guestrin, 2016). The deep learning models, namely MLP and However, MLP and XGBoost both yielded the lowest test MAPE when
LSTM were implemented using the popular Keras library which serves provided with the full atmospheric dataset and allowed to infer re-
as a high-level neural network API (Chollet et al., 2015). lationships from all variables and data labels.
In addition to MAPE accuracy, Table 2 also lists the run times
4.3. Model scoring and aggregation needed to train the corresponding models (on a commodity laptop).
Training times were within acceptable limits for all models, although
To assess the different modelling approaches, hyperparameters, and significant variability existed. As expected, the LSTM had the largest
combinations of input features, the time series was split into training computational demand – however it also had the highest MAPE. This
(90%) and test (10%) sets. The models were trained to make a pre- non-intuitive result demonstrates the need to balance model complexity
diction one day ahead based on feeding the previously described fea- with the nature of the data. That is, naïve selection might suggest that
tures and labels (measured value of SST for that day). The test datasets an RNN-based model such as LSTM is most suitable for a time-series
were then used to evaluate the performance of the model prediction dataset. However, results demonstrated that the LSTM model failed to
against measured values. capture the high-frequency variations in the data and only captured the
As prediction depended on historic estimates of SST (i.e. a predic- general seasonal patterns (the monthly rolling-mean trends reported in
tion for one day ahead required information on the current SST), the Fig. 2). The inability to capture short-scale variations is due to the “long
model prediction was fed back as a feature to the model in a recurrent memory” for this model that interfered with learning short-term var-
fashion. Specifically, the first test prediction (t=1) was made with iations. In contrast, simpler models with time-series information ex-
measured values of SST (at time t=0) as a feature. For future predic- plicitly included as features better learned short-term dynamics.
tions, the measured SST feature was replaced with the prediction from Fig. 3 compares model predictions (see Table 2) for the test period
the previous day, i.e. prediction at time t=2 received as input feature, to MODIS data. Observing the time evolution of SST reveals that a
model prediction for time t=1 instead of the satellite derived value of suitable model must represent two distinct time scale components. On
SST, which in practise would not be available for forecasting multiple the one hand there is the smooth SST evolution governed by season-
days in advance. Scoring for the entire test dataset (562 days) pro- ality. This component of SST evolution benefited from suppression of
ceeded in this manner. This made the study sensitive to propagation of large fluctuations. Of the models studied, this criterion was fulfilled by
error, where a low skill prediction propagates through the entire fore- the GAM approach, which yielded lowest test MAPE with a large reg-
casting period (in a similar manner that error in initial condition or ularisation parameter λ = 10 (i. e., the penalty on the second-order
boundary forcing formulation can propagate through the prediction of a derivative of fitted single-feature functions). The large regularisation
physics-based model). effect together with the piecewise polynomial components of GAM
The mean average error (MAE) and mean absolute percentage error models contributed to a smoother time-series prediction that still cap-
(MAPE) assessed the accuracy of each model: tured long-term trends, including correlation of data between years.
Ntest Ntest
Similarly, the RF approach led to a comparably smooth SST evolution
1 100 (y y) but at significantly lower MAPE than the GAM model. The most obvious
MAE = (y y ) , MAPE = ,
Ntest i=1
Ntest i=1
y (10) reflection of the seasonal pattern is evident in the LSTM prediction
which produces a highly smoothed representation of the training data.
where Ntest is the size of the training data, y is the measured data, and y
The model fails to capture any small-scale dynamics at the daily or
the model-predicted equivalent. Finally, the models were aggregated
weekly level instead reproducing the seasonal heating/cooling effects
into a single best prediction weighted by the inverse MAPE of the test
only. Further analysis of model parameters suggested this to be a result
data (Adhikari & Agrawal, 2012). Convex weights for the models con-
of the retained long-term memory informing the broader trend only. On
sidered in this study were computed of the form:
the other hand, the seasonal cycle has superimposed on it short-term
1 behaviour dominated by peak events occurring at daily to weekly time
MAPEm scales. This is particularly evident in the XGBoost and the MLP ap-
Wm = p proaches where both yielded best performances for smaller λ, which
(MAPEm ) enabled them to better capture short-term events. It's worth noting that
m (11)
while XGBoost and MLP captured the small-scale fluctuations better, RF
where Wm is the weight for model m, p is the number of models, and returned lowest MAPE. To simultaneously take both aspects into ac-
MAPEm is the MAPE of model m. count, the final plot aggregates models based on an inverse MAPE
weighting as presented in Eq. (11). As a preprocessing step, due to the
5. Results comparably poor performance of the LSTM model, it was excluded from
the ensemble. The ensemble average generates lowest MAPE indicating
For each model implementation, we focused on identifying the op- that a relatively simple model-weighting aggregation approach can
timal combination of features and hyperparameters that maximise outperform an individual best performing model (O'Donncha et al.,

Table 2
Model-selection result for the two North Atlantic locations.
Model Features Hyperparameters Training time [sec] MAPE

GAM TWC3, time # of splines = 20, spline order = 8, λ = 10 1.67 2.27


RF AR, TWC3, time # of trees = 500, max # of features = 3, max depth = 20 19.09 1.97
XGBoost TWC1, time # of trees = 1000, max depth = 5, λ = 0.05 0.21 2.17
MLP TWC1, time # of layers = 20, of units/layer = 20, λ = 0.01 11.18 2.07
LSTM TWC2 # of layers = 1, # of units/layer = 2, λ = 0.1 71.67 2.54

7
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

estimates on SST which returned values of 0.56 °C and 12.3%, respec-


tively. While ECMWF reports lower absolute error, relative errors are
noticeably higher. This suggests a tendency of the numerical outputs to
perform poorer in periods when temperatures are lower (increasing
relative error).
Fig. 4 indicates some spatial variations in performance. In general
MAE is lower in the inter-tropics region than in southern or northern
latitudes. This effect is more pronounced when we consider MAPE va-
lues due to lower ambient temperatures making relative differences
more pronounced. Further analysis indicates that this spatial bias is
largely driven by reduced data availability in locations away from the
tropics. Fig. 5 presents the percentage of days for which data was
available for the study period (100% indicates that data is available
every day). We see a distinct pattern of higher data availability over the
inter-tropical regions which is possibly a result of increased cloud
coverage in southern and northern latitudes (Ruiz-Arias et al., 2013;
O'Carroll et al., 2019). For all locations we replaced missing data using
linear interpolation which enables the models to act on data-sparse
regions but limits the amount of true data available to learn the com-
plex SST relationship. Further all error metrics were computed on the
raw data (without linear interpolation) which biases the evaluation
further towards locations with higher data coverage.

5.2. Discussion of results

Time-series forecasts are vital in many areas of scientific, industrial,


and economic activity. Many ML methods have been applied to such
problems and the advantages of RNN-type approaches are well docu-
mented. The ability of these DL algorithms to implicitly include effects
from preceding time steps is intuitively a natural fit. However, this
study demonstrated that when the relationship between predicted va-
lues was not based solely on AR features – well-designed feature se-
Fig. 3. Test-prediction (orange curve) of SST at 27∘28'46.45'' N, −32∘25'43.71'' lection in conjunction with simpler ML methods capable to more ra-
W from different ML models trained on 12 years of preceding historical data
pidly adjust to short-scale fluctuations outperformed the DL
compared to measured SSTs (blue curve). Bottom figure presents ensemble
approaches.
average of all models based on individual model MAPE. Feature combinations
and hyperparameters adopted for each model are summarised in Table 2. (For
Table 2 demonstrated that the important features needed to predict
interpretation of the references to colour in this figure legend, the reader is at time t + 1 are SST values at time t (and values at earlier time steps
referred to the web version of this article.) dependent on selected AR features), atmospheric information at time
t + 1 (and potentially AR features of those), and time of year in-
formation. Hence, prediction required inclusion of AR features while
2018; ODonncha et al., 2019).
inferring relationships between forecasted values of atmospheric data
and the response variable, SST. Fig. 3 illustrated that the LSTM model
5.1. Transportability and comparison to state-of-the-art failed to adequately learn the relationship between explanatory vari-
ables and SST. Specifically, the model closely approximated seasonal
As the feature engineering and hyperparameter selection process is behaviour (i. e., the long-term characteristics of the SST) while failing
complex and cumbersome, it is desirable to execute this procedure once to capture high-frequency variations (i. e., variations in response to
and then use the selected model at different locations. The objective atmospheric inputs). In effect, the DL approach maintained “memory”
being to identify the most appropriate model inputs (features) and of the long-term SST trends to the detriment of incorporating effects of
settings (hyperparameters) from a small dataset, which are then used to shorter time scales. A more focused feature-engineering module that
train (on new data) and deploy (i.e. make forecast) the models at any guided the data-length fed to the LSTM model may improve perfor-
location we wish to make SST forecast. We investigated the perfor- mance. However, this contravenes the philosophy of RNN-type ap-
mance of the model at a set of globally distributed locations. Data (SST proaches that aims to implicitly learn the nature of cyclic data. Another
measurements and TWC weather variables) were collected in a 6° × 6° point worth noting is that DL approaches have a larger appetite for
grid of points within 1° of shorelines between ± 54° latitude. This re- training data than some of the simpler models adopted. Some reduction
sulted in 730 locations globally. While the features and hyperpara- in MAPE may be possible by extending the size of the training data.
meters were selected as noted in Table 2, the models were retrained at Again, however, when evaluating different modelling approaches, as-
each location in a similar manner as previously described using a 90%/ pects such as computational complexity and ability to learn on smaller
10% train and test data split. The resulting prediction was again an datasets are key points that demand consideration (further, there are
aggregation of GAM, RF, XGBoost, and MLP results, where each model practical limits on amount of available data).
was weighted by the inverse MAPE at each location to favour models This study considered a framework to develop a transportable
with better performance in the weighted average. Fig. 4 presents the model suite applied to a nonlinear, real-world dataset. Key points
MAE and MAPE computed at these 730 locations. Results demonstrate considered were design of an automatic feature-engineering module,
that MAE and MAPE were < 1 °C and 10%, respectively, at most lo- which, together with a standard hyperparameter optimisation routine,
cations. Table 3 presents average error metrics over all locations. The facilitated ready deployment at disparate geographical locations.
MAPE-weighted ensemble average returned MAE and MAPE of 0.68 °C Results demonstrated that the different models adopted had inherent
and 7.9% respectively. These values are comparable to ECMWF characteristics that governed accuracy and level of regularisation or

8
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

Fig. 4. Predictive skill of a weighted ensemble


average of GAM, RF, XGBoost, and MLP models at a
set of locations distributed equally between ± 54°
latitude. MAE (top) and MAPE (bottom) metrics are
presented to inform on absolute and relative errors.
The features and hyperparameters prescribed are
presented in Table 2. The blue circles on the top plot
denote locations that are analysed in more detail in
Section 5.2 and presented in Fig. 6. (For interpreta-
tion of the references to colour in this figure legend,
the reader is referred to the web version of this ar-
ticle.)

Table 3 overfit to training data.


MAE and MAPE averaged across all spatial locations presented in Fig. 4. Metrics We compared performance of ML models with a state-of-the-art
are presented for each model individually, an ensemble averaged weighted by physics-based approach from ECMWF. As expected, the physics-based
the inverse MAPE and the final column presents error metrics for a benchmark model provided close agreement with satellite measurements – the
ECMWF model against MODIS measurements. ECMWF prediction is a reanalysis product which assimilates measure-
Metric GAM RF XGBoost MLP Ens. Ave. ECMWF ment (including satellite) data daily to update the accuracy of the
product. This study demonstrated, however, that the machine learning
MAE 0.78 0.72 0.79 0.89 0.68 0.56
based approaches achieve accuracy comparable to ECMWF model, at a
MAPE 9.7 9.4 8.8 10.6 7.9 12.3
fraction of the computational expense. Aggregating the models im-
proved the robustness of this approach and served to regularise small-

Fig. 5. Percentage of the time (number of days over the entire 2002–2019 study period) that the MODIS Aqua sensor reported SST estimates for all global points
considered in Fig. 4. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

9
S. Wolff, et al. Journal of Marine Systems 208 (2020) 103347

Fig. 6. Time series plots comparing the performance


of individual model (green line) and weighted en-
semble average of GAM, RF, XGBoost, and MLP
models (orange line) against 1) ECMWF model esti-
mate at nearest grid cell (red line) and 2) SST test
data from MODIS satellite (blue circles). The location
of the four points considered are illustrated in Fig. 4
and described in more detail in Table 4 (note that
subfigure title denote the latitude and longitude co-
ordinates). For each of the four plots the orange line
presents a different model to provide an illustration
of different model characteristics, namely (from top-
left) GAM, RF, XGBoost and MLP. (For interpretation
of the references to colour in this figure legend, the
reader is referred to the web version of this article.)

Table 4 the concept of the” no free lunch” - no single machine learning algo-
MAE (top table) and MAPE (bottom table) for individual models and ensemble rithm necessarily outperforms all others and one must select the ap-
average of all models (GAM, RF, XGBoost, and MLP) for the selected number of propriate algorithm for the problem. However, the ensemble averaging
locations presented in Fig. 6. Forecast skill of ECMWF model also presented for approach outperforms individual models providing a framework to
illustrative purposes. improve average predictive skill. The low computational cost of pre-
Lat Lon GAM RF XGBoost MLP Ens. Ave. ECMWF diction enabled by machine learning is particularly amenable to en-
semble modelling approaches where multiple models can be readily
0 −150 0.41 0.44 0.43 0.42 0.38 0.33
deployed (O'Donncha et al., 2018; ODonncha et al., 2019).
−42 6 0.6 0.73 0.88 0.7 0.67 0.63
−12 60 0.48 0.49 0.55 0.50 0.45 0.37 Interrogating temporal evolution of model error over the 18-month
42 156 1.96 1.31 1.53 1.09 1.16 1.03 test period demonstrated some biases in individual models e.g. GAM
MAPE
outperformed RF during the summer period but is significantly poorer
0 −150 1.52 1.64 1.61 1.58 1.42 1.27 during periods of lower temperature. The ensemble aggregation fra-
−42 6 5.38 6.80 8.0 6.52 6.16 5.78 mework we implemented reduced error over the duration of the test
−12 60 1.77 1.79 2.01 1.86 1.66 1.38 period compared to arbitrarily selected individual models, but more
42 156 12.5 8.27 10.4 7.26 7.56 6.62
Fig. 6 compares the ensemble predictions to the ECMWF results, satellite-measured SST and predictions from a selected ML model at four locations across the globe (location details provided in Fig. 4 and Table 4). The four plots illustrate the varying temporal characteristics of SST data at different geographical points and the ability of the ML models to capture those characteristics. Generally, the models are seen to capture both the seasonal patterns and shorter-scale fluctuations (e.g. unseasonably warm autumn temperatures at location [0, −150]). The individual model prediction (green line) provides good predictive skill comparable to ECMWF, while the aggregated model is 'smoother' (possibly more robust to short-scale fluctuations) while achieving comparable accuracy.

Classical works on ensemble forecasting demonstrated that the ensemble mean should give a better forecast than a single deterministic forecast (Epstein, 1969; Leith, 1974). Assigning inverse-MAPE weights to individual models provides a simple and effective method to rank model contributions based on performance. To illustrate the forecast skill of different models, Table 4 presents MAE and MAPE for each individual model, an ensemble model aggregation, and the ECMWF estimate against MODIS measurements for the locations plotted in Fig. 6. Results demonstrate that the variation in error of individual models can be "regularised" by the ensemble approach. We observe that individual models perform better at different locations (e.g. GAM performs best at location [−42, 6], while MLP performs best at [42, 156]), illustrating the concept of "no free lunch" (Wolpert and Macready, 1997): no single machine learning algorithm necessarily outperforms all others, and one must select the appropriate algorithm for the problem. However, the ensemble averaging approach outperforms individual models, providing a framework to improve average predictive skill. The low computational cost of prediction enabled by machine learning is particularly amenable to ensemble modelling approaches where multiple models can be readily deployed (O'Donncha et al., 2018, 2019).
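To make the weighting scheme concrete, the sketch below implements inverse-MAPE aggregation under stated assumptions: per-model MAPE scores are computed on a held-out validation period (the numbers shown are illustrative), and each model supplies a forecast array of matching shape. None of the names are taken from the study's code.

```python
import numpy as np

# Validation-period MAPE per model; illustrative values only.
val_mape = {"GAM": 9.7, "RF": 9.4, "XGBoost": 8.8, "MLP": 10.6}

# Normalised inverse-MAPE weights: better-performing models
# (lower MAPE) receive proportionally larger weights.
inv = {name: 1.0 / score for name, score in val_mape.items()}
total = sum(inv.values())
weights = {name: w / total for name, w in inv.items()}

def ensemble_predict(predictions):
    """Weighted ensemble average of per-model SST forecasts.

    predictions: dict mapping model name -> np.ndarray of forecasts,
    with matching shapes across models.
    """
    return sum(weights[name] * predictions[name] for name in weights)
```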
Interrogating the temporal evolution of model error over the 18-month test period demonstrated some biases in individual models, e.g. GAM outperformed RF during the summer period but was significantly poorer during periods of lower temperature. The ensemble aggregation framework we implemented reduced error over the duration of the test period compared to arbitrarily selected individual models but, more importantly, also served to reduce error and biases at distinct periods of the prediction window. Table 5 presents seasonal MSE against satellite data for each individual model and the ensemble aggregation over the duration of the study period, averaged over the same locations as Fig. 6.
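The seasonal breakdown in Table 5 amounts to grouping squared residuals by meteorological season. A minimal pandas sketch follows, assuming a daily DataFrame with illustrative column names ('date', 'sst_obs', 'sst_pred') rather than the study's own data layout:

```python
import pandas as pd

# Map calendar month to meteorological season.
SEASON = {12: "Winter", 1: "Winter", 2: "Winter",
          3: "Spring", 4: "Spring", 5: "Spring",
          6: "Summer", 7: "Summer", 8: "Summer",
          9: "Autumn", 10: "Autumn", 11: "Autumn"}

def seasonal_mse(df):
    """Seasonal MSE from a DataFrame with a datetime 'date' column
    and 'sst_obs'/'sst_pred' columns (column names are illustrative)."""
    squared_error = (df["sst_obs"] - df["sst_pred"]) ** 2
    season = df["date"].dt.month.map(SEASON)
    return squared_error.groupby(season).mean()
```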
Table 5
Seasonal MSE for individual models and ensemble average.

Season   GAM    RF     XGBoost   MLP    Ens. Ave.
Spring   1.37   1.04   1.01      0.90   0.94
Summer   0.18   0.40   0.45      0.68   0.30
Autumn   0.06   0.10   0.40      0.80   0.17
Winter   0.26   0.14   0.39      0.34   0.22

This study presented a time-series forecasting framework applied to satellite measurements of SST. We considered the SST data as a set of disparate points. In reality, the ocean surface more closely resembles an image with interaction between neighbouring points. Results demonstrated that treating the data as distinct time-series points provided good results. However, scope exists to combine this approach with image-processing techniques such as convolutional neural networks (CNNs) to incorporate neighbouring effects into predictions. Future work will explore the viability and value of combining CNNs with time-series forecasting models to further improve the robustness of the framework.
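As a pointer to the kind of architecture envisaged, the fragment below sketches a single-layer ConvLSTM in Keras (cf. Shi et al., 2015) mapping a week of gridded SST fields to a next-day field. The input shape and hyperparameters are placeholders rather than a tested configuration:

```python
from tensorflow.keras import layers, models

# Placeholder dimensions: sequences of 7 daily SST fields on a
# 64 x 64 grid with a single channel.
model = models.Sequential([
    layers.ConvLSTM2D(filters=16, kernel_size=3, padding="same",
                      return_sequences=False,
                      input_shape=(7, 64, 64, 1)),
    layers.Conv2D(filters=1, kernel_size=1),  # next-day SST field
])
model.compile(optimizer="adam", loss="mse")
```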

6. Conclusions

This paper demonstrates the viability of applying ML-based approaches, addressing transportability, biases and robustness by combining feature selection and disparate models with specific characteristics in a weighted aggregation based on average model performance. This study aimed to assess the ability of data-driven approaches to accurately predict SST characterised by seasonal patterns, temporal dependencies and short-term fluctuations. Results demonstrate performance comparable to physics-based model simulations, at low computational cost and with an approach that is easily parametrised to other geographical locations. The low computational cost of the approach has many advantages. First, it enables separation of SST forecasting models from HPC centres – the suite of models presented here can be trained on a laptop and applied to any geographic location. Once trained, the inference step is of negligible computational expense and can be readily deployed on edge-type devices (e.g. in-situ devices deployed in the ocean). Deploying large-scale models is a complex task highly dependent on user skill to correctly configure and parametrise to specific locations. Data-driven approaches present an alternative that enables rapid prediction, contingent on the availability of sufficient data.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme as part of the RIA GAIN project under grant agreement No. 773330.

References
Adhikari, R., Agrawal, R.K., 2012. Combining multiple time series models through a robust weighted mechanism. In: 2012 1st International Conference on Recent Advances in Information Technology (RAIT), pp. 455–460. https://doi.org/10.1109/RAIT.2012.6194621.
Arandia, E., O'Donncha, F., McKenna, S., Tirupathi, S., Ragnoli, E., 2018. Surrogate modeling and risk-based analysis for solute transport simulations. Stochastic Environmental Research and Risk Assessment, 1–15. https://doi.org/10.1007/s00477-018-1549-6.
Bayr, T., Wengel, C., Latif, M., Dommenget, D., Lübbecke, J., Park, W., 2019. Error compensation of ENSO atmospheric feedbacks in climate models and its influence on simulated ENSO dynamics. Clim. Dyn. 53, 155–172.
Bell, M., Schiller, A., Le Traon, P.-Y., Smith, N., Dombrowsky, E., Wilmer-Becker, K., 2015. An introduction to GODAE OceanView. Journal of Operational Oceanography 8, s2–s11. https://doi.org/10.1080/1755876X.2015.1022041.
de Bezenac, E., Pajot, A., Gallinari, P., 2017. Deep learning for physical processes: incorporating prior scientific knowledge. arXiv preprint arXiv:1711.07970. http://arxiv.org/abs/1711.07970.
Boulesteix, A.-L., Janitza, S., Kruppa, J., König, I.R., 2012. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 493–507.
Breiman, L., 2001. Random forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324.
Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., Varoquaux, G., 2013. API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.
Chao, Y., Li, Z., Farrara, J., McWilliams, J.C., Bellingham, J., Capet, X., Chavez, F., Choi, J.-K., Davis, R., Doyle, J., Li, P., Marchesiello, P., Moline, M.A., Paduan, J., Ramp, S., 2009. Development, implementation and evaluation of a data-assimilative ocean forecasting system off the central California coast. Deep Sea Research Part II: Topical Studies in Oceanography 56, 100–126. https://doi.org/10.1016/j.dsr2.2008.08.011.
Chassignet, E., Hurlburt, H., Metzger, E.J., Smedstad, O., Cummings, J., Halliwell, G., Bleck, R., Baraille, R., Wallcraft, A., Lozano, C., Tolman, H., Srinivasan, A., Hankin, S., Cornillon, P., Weisberg, R., Barth, A., He, R., Werner, F., Wilkin, J., 2009. US GODAE: global ocean prediction with the HYbrid Coordinate Ocean Model (HYCOM). Oceanography 22, 64–75. https://doi.org/10.5670/oceanog.2009.39.
Chen, T., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16. ACM Press, New York, NY, pp. 785–794. https://doi.org/10.1145/2939672.2939785.
Chollet, F., et al., 2015. Keras. https://keras.io.
DeVries, P.M.R., Thompson, T.B., Meade, B.J., 2017. Enabling large-scale viscoelastic calculations via neural network acceleration. Geophysical Research Letters 44, 2662–2669. https://doi.org/10.1002/2017GL072716.
Ducournau, A., Fablet, R., 2016. Deep learning for ocean remote sensing: an application of convolutional neural networks for super-resolution on satellite-derived SST data. In: 2016 9th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS). IEEE, pp. 1–6. https://doi.org/10.1109/PRRS.2016.7867019.
Epstein, E.S., 1969. Stochastic dynamic prediction. Tellus 21, 739–759.
Friedman, J., Hastie, T., Tibshirani, R., 2001. The Elements of Statistical Learning, vol. 10. Springer Series in Statistics, New York.
Gers, F., 1999. Learning to forget: continual prediction with LSTM. In: 9th International Conference on Artificial Neural Networks: ICANN '99. IEE, pp. 850–855. https://doi.org/10.1049/cp:19991218.
Gokaraju, B., Durbha, S.S., King, R.L., Younan, N.H., 2011. A machine learning based spatio-temporal data mining approach for detection of harmful algal blooms in the Gulf of Mexico. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 4, 710–720. https://doi.org/10.1109/JSTARS.2010.2103927.
Goodfellow, I., Bengio, Y., Courville, A., 2015. Deep Learning. https://www.deeplearningbook.org/.
Graves, A., Mohamed, A.-r., Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 6645–6649. https://doi.org/10.1109/ICASSP.2013.6638947.
Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A., 2008. Feature Extraction: Foundations and Applications, vol. 207. Springer.
Hähnel, P., Mareček, J., Monteil, J., O'Donncha, F., 2020. Using deep learning to extend the range of air pollution monitoring and forecasting. J. Comput. Phys. 408, 109278.
Haidvogel, D.B., Arango, H., Budgell, W.P., Cornuelle, B.D., Curchitser, E., Di Lorenzo, E., Fennel, K., Geyer, W.R., Hermann, A.J., Lanerolle, L., et al., 2008. Ocean forecasting in terrain-following coordinates: formulation and skill assessment of the Regional Ocean Modeling System. J. Comput. Phys. 227, 3595–3624.
Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. CRC Press. https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118445112.stat03141.
Hey, T., Tansley, S., Tolle, K., et al., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery, vol. 1. Microsoft Research, Redmond, WA.
Hirahara, S., Balmaseda, M.A., De Boisseson, E., Hersbach, H., 2016. Sea Surface Temperature and Sea Ice Concentration for ERA5. Technical Report 26, Reading, UK. http://old.ecmwf.int/publications/.
Huang, S.-J., Lin, J.-T., Lo, Y.-T., Kuo, N.-J., Ho, C.-R., 2014. The coastal sea surface temperature changes near the nuclear power plants of northern Taiwan observed from satellite images. In: OCEANS 2014 - TAIPEI. IEEE, pp. 1–5. https://doi.org/10.1109/OCEANS-TAIPEI.2014.6964547.
IBM, 2018. Weather Company Data API. Technical Report. The Weather Company. https://www.ibm.com/us-en/marketplace/weather-company-data-packages/details.
James, S.C., Zhang, Y., O'Donncha, F., 2018. A machine learning framework to forecast wave conditions. Coastal Engineering 137, 1–10. https://doi.org/10.1016/j.coastaleng.2018.03.004.
Kilpatrick, K., Podestá, G., Walsh, S., Williams, E., Halliwell, V., Szczodrak, M., Brown, O., Minnett, P., Evans, R., 2015. A decade of sea surface temperature from MODIS. Remote Sensing of Environment 165, 27–41. https://doi.org/10.1016/j.rse.2015.04.023.
Lary, D.J., Alavi, A.H., Gandomi, A.H., Walker, A.L., 2016. Machine learning in geosciences and remote sensing. Geosci. Front. 7, 3–10.
Leith, C., 1974. Theoretical skill of Monte Carlo forecasts. Mon. Weather Rev. 102, 409–418.
Li, J., Heap, A.D., Potter, A., Daniell, J.J., 2011. Application of machine learning methods to spatial interpolation of environmental variables. Environmental Modelling & Software 26, 1647–1659. https://doi.org/10.1016/j.envsoft.2011.07.004.
Lorenz, E.N., 1963. Deterministic nonperiodic flow. J. Atmos. Sci. 20, 130–141. https://doi.org/10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
O'Carroll, A.G., Armstrong, E.M., Beggs, H., Bouali, M., Casey, K.S., Corlett, G.K., Dash, P., Donlon, C., Gentemann, C.L., Høyer, J.L., et al., 2019. Observational needs of sea surface temperature. Front. Mar. Sci. 6, 420.
Ocean Biology Processing Group, 2015. MODIS Aqua Level 3 SST Thermal IR Daily 4km Daytime v2014.0. https://podaac.jpl.nasa.gov/dataset/MODIS_AQUA_L3_SST_THERMAL_DAILY_4KM_DAYTIME_V2014.0. https://doi.org/10.5067/MODSA-MO4D4.
O'Dea, E.J., Arnold, A.K., Edwards, K.P., Furner, R., Hyder, P., Martin, M.J., Siddorn, J.R., Storkey, D., While, J., Holt, J.T., Liu, H., 2012. An operational ocean forecast system incorporating NEMO and SST data assimilation for the tidally driven European North-West shelf. Journal of Operational Oceanography 5, 3–17. https://doi.org/10.1080/1755876X.2012.11020128.
O'Donncha, F., Grant, J., 2019. Precision aquaculture. IEEE Internet of Things Magazine 2, 26–30.
O'Donncha, F., Hartnett, M., Nash, S., Ren, L., Ragnoli, E., 2015. Characterizing observed circulation patterns within a bay using HF radar and numerical model simulations. J. Mar. Syst. 142, 96–110.
O'Donncha, F., Zhang, Y., Chen, B., James, S.C., 2018. An integrated framework that combines machine learning and numerical models to improve wave-condition forecasts. J. Mar. Syst. 186, 29–36.
O'Donncha, F., Zhang, Y., Chen, B., James, S.C., 2019. Ensemble model aggregation using a computationally lightweight machine-learning model to forecast ocean waves. J. Mar. Syst. 199, 103206.
Probst, P., Wright, M.N., Boulesteix, A.-L., 2019. Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9, e1301.
Rawlins, F., Ballard, S., Bovis, K., Clayton, A., Li, D., Inverarity, G., Lorenc, A., Payne, T., 2007. The Met Office global four-dimensional variational data assimilation scheme. Q. J. R. Meteorol. Soc. 133, 347–362.
Ruiz-Arias, J., Dudhia, J., Gueymard, C., Pozo-Vázquez, D., 2013. Assessment of the level-3 MODIS daily aerosol optical depth in the context of surface solar radiation and numerical weather modeling. Atmos. Chem. Phys. 13, 675–692.
Saha, S., Moorthi, S., Wu, X., Wang, J., Nadiga, S., Tripp, P., Behringer, D., Hou, Y.-T., Chuang, H.-y., Iredell, M., Ek, M., Meng, J., Yang, R., Mendez, M.P., van den Dool, H., Zhang, Q., Wang, W., Chen, M., Becker, E., 2014. The NCEP Climate Forecast System Version 2. Journal of Climate 27, 2185–2208. https://doi.org/10.1175/JCLI-D-12-00823.1.
Servén, D., Brummitt, C., 2018. pygam: generalized additive models in Python. https://doi.org/10.5281/zenodo.1208723.
Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-k., Woo, W.-c., 2015. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems 28.
The Weather Company, 2018. Cleaned Observations API Documentation. Technical Report. TWC. http://cleanedobservations.wsi.com/documents/WSI%20Cleaned%20Observations%20API%20Documentation.pdf.
Wiewel, S., Becher, M., Thuerey, N., 2019. Latent space physics: towards learning the temporal evolution of fluid flow. Computer Graphics Forum 38 (2), 71–82.
Wijaya, T.K., Sinn, M., Chen, B., 2015. Forecasting uncertainty in electricity demand. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. https://www.aaai.org/ocs/index.php/WS/AAAIW15/paper/viewPaper/10104.
Wolpert, D.H., Macready, W.G., 1997. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1, 67–82.
