Paper 13

Geophys. J. Int., doi: 10.1093/gji/ggab139
Advance Access publication 2021 April 12
GJI Seismology

Accepted 2021 March 25. Received 2021 March 10; in original form 2021 January 6

© The Author(s) 2021. Published by Oxford University Press on behalf of The Royal Astronomical Society.
SUMMARY
Precise real-time estimates of earthquake magnitude and location are essential for early warning and rapid response. While multiple deep learning approaches for the fast assessment of earthquakes have recently been proposed, they usually rely on seismic records either from a single station or from a fixed set of seismic stations. Here we introduce a new model for real-time magnitude and location estimation using attention-based transformer networks. Our approach incorporates waveforms from a dynamically varying set of stations and outperforms deep learning baselines in both magnitude and location estimation performance. Furthermore, it outperforms a classical magnitude estimation algorithm considerably and shows promising performance in comparison to a classical localization algorithm. Our model is applicable to real-time prediction and provides realistic uncertainty estimates based on probabilistic inference. In this work, we furthermore conduct a comprehensive study of the requirements on training data, the training procedures and the typical failure modes. Using three diverse and large-scale data sets, we conduct targeted experiments and a qualitative error analysis. Our analysis yields several key insights. First, we can precisely pinpoint the effect of large training data sets; for example, a four times larger training set reduces the average errors for both magnitude and location prediction by more than half, and reduces the required time for real-time assessment by a factor of four. Secondly, the basic model systematically underestimates large magnitude events. This issue can be mitigated, and in some cases completely resolved, by incorporating events from other regions into the training through transfer learning. Thirdly, location estimation is highly precise in areas with sufficient training data, but is strongly degraded for events outside the training distribution, sometimes producing massive outliers. Our analysis suggests that these characteristics are present not only in our model, but in most deep learning models for fast assessment published so far. They result from the black-box modelling, and their mitigation will likely require imposing physics-derived constraints on the neural network. These characteristics need to be taken into consideration for practical applications.
Key words: Neural networks, fuzzy logic; Probability distributions; Earthquake early warning.
1 INTRODUCTION

Recently, multiple studies have investigated deep learning on raw seismic waveforms for the fast assessment of earthquake parameters, such as magnitude (e.g. Lomax et al. 2019; Mousavi & Beroza 2020; van den Ende & Ampuero 2020), location (e.g. Kriegerowski et al. 2019; Mousavi & Beroza 2019; van den Ende & Ampuero 2020) and peak ground acceleration (e.g. Jozinović et al. 2020). Deep learning is well suited for these tasks, as it does not rely on manually selected features, but can learn to extract relevant information from the raw input data. This property allows the models to use the full information contained in the waveforms of an event. However, the models published so far use fixed time windows and cannot be applied to data of varying length without retraining. Similarly, except for the model by van den Ende & Ampuero (2020), all models either process waveforms from only a single seismic station or rely on a fixed set of seismic stations defined at training time. The model by van den Ende & Ampuero (2020) enables the use of a variable station set, but combines measurements from multiple stations using a simple pooling mechanism. While it has not been studied so far in a seismological context, it has been shown in the general domain that set pooling architectures are in practice limited in the complexity of functions they can model (Lee et al. 2019).
Despite minor changes in the seismic network configuration during the time covered by the catalogue, the station set used in the construction of this catalogue had been selected to provide a high degree of stability of the location accuracy throughout the observational period (Sippl et al. 2018). Similarly, the magnitude scale has been carefully calibrated to achieve a high degree of consistency in spite of significant variations of attenuation (Münchmeyer et al. 2020b). This data set therefore contains the highest quality labels among the data sets in this study. For the Chile data set, we use broad-band seismograms from the fixed set of 24 stations used for the creation of the original catalogue and magnitude scale. Although the Chile data set has the smallest number of stations of the three data sets, it comprises three …
Table 1. Overview of the data sets. The lower boundary of the magnitude category is the 5th percentile of the magnitude; this limit is chosen as each data set contains a small number of unrepresentative very small events. The upper boundary is the maximum magnitude. Magnitudes are given with two-digit precision for Chile, as the precision of the underlying catalogue is higher than for Italy and Japan. The Italy data set uses different magnitudes for different events, which are ML (>90 per cent of the events), MW (<10 per cent) and mb (<1 per cent). For depth and distance, the minimum, median and maximum are stated. Distance refers to the epicentral distance between stations and events. Note that the count of traces refers to the number of waveform triplets (groups of three components, or groups of six waveforms for the Japanese stations). The sensor types are broadband (BB) and strong motion (SM).

              Chile              Italy              Japan
Years         2007–2014          2008–2019          1997–2018
Training      01/2007–08/2011    01/2008–12/2015    01/1997–03/2012
… Italy, with two main shocks in August (MW = 6.0) and October (MW = 6.5). Notably, the largest event in the test set is significantly larger than the largest event in the training set (the MW = 6.1 L'Aquila event in 2009), representing a challenging test case. For Italy, we assign the remaining events to training and development set randomly with a 6:1 ratio.
2.2 The TEAM for magnitude and location amplitude is expected to be a key predictor for the event magnitude,
We build a model for real time earthquake magnitude and location estimation based on the core ideas of TEAM, as published in Münchmeyer et al. (2021). TEAM is an end-to-end peak ground acceleration (PGA) model calculating probabilistic PGA estimates based on incoming waveforms from a flexible set of stations. At the core of its algorithm, it uses the transformer network (Vaswani et al. 2017), an attention-based neural network which was developed in the context of natural language processing (NLP). Here, we adapt TEAM to calculate real time probabilistic estimates of event magnitude and hypocentral location. As our model closely follows the architecture and key ideas of TEAM, we use the name TEAM-LM to refer to the location and magnitude estimation model.

Similar to TEAM, TEAM-LM consists of three major components (Fig. 2): a feature extraction, which generates features from raw waveforms at single stations; a feature combination, which aggregates features across multiple stations; and an output estimation. Here, we briefly discuss the core ideas of the TEAM architecture and training and put a further focus on the necessary changes for magnitude and location estimation. For a more detailed account of TEAM and TEAM-LM we refer to Münchmeyer et al. (2021), Tables S1–S3 and the published implementation.
The input to TEAM consists of three-component seismograms from multiple stations and their locations. TEAM aligns all seismograms to start and end at the same times t0 and t1. We choose t0 to be 5 s before the first P arrival at any station. This allows the model to assess the noise conditions at all stations. We limit t1 to be at latest t0 + 30 s. In a real-time scenario t1 is the current time, that is, it reflects the amount of waveform data available so far, and we use the same approach to imitate real-time waveforms in training and evaluation. The waveforms are padded with zeros to a length of 30 s to achieve constant length input to the feature extraction.

TEAM uses a CNN architecture for feature extraction, which is applied separately at each station. The architecture consists of several convolution and pooling layers, followed by a multilayer perceptron (Table S1). To avoid scaling issues, each input waveform is normalized through division by its peak amplitude. As the amplitude is expected to be a key predictor for the event magnitude, we provide the logarithm of the peak amplitude as a further input to the multilayer perceptron inside the feature extraction network. We ensure that this transformation does not introduce a knowledge leak by calculating the peak amplitude only from the waveforms until t1. The full feature extraction returns one vector for each station, representing the measurements at the station.
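This preprocessing is easy to make concrete. The following is a minimal NumPy sketch of the per-station steps described above (zero padding to 30 s, blinding after t1, peak normalization and the log-peak side feature), assuming 100 Hz three-component traces; all names and constants are our own illustration, not the published implementation.

```python
import numpy as np

SAMPLING_RATE = 100  # Hz, assumed for illustration
WINDOW_S = 30        # fixed input length of the feature extraction

def preprocess_station(waveform, t1_samples):
    """Zero-pad a (channels, samples) waveform to 30 s, blind it after t1,
    normalize by its peak amplitude and return the log-peak as an extra
    scalar feature for the feature-extraction perceptron."""
    n_target = WINDOW_S * SAMPLING_RATE
    channels, n_samples = waveform.shape
    out = np.zeros((channels, n_target))
    # Keep only samples available until the current time t1.
    n_keep = min(t1_samples, n_samples, n_target)
    out[:, :n_keep] = waveform[:, :n_keep]
    # Peak amplitude from the blinded data only, so no future information
    # leaks into the features.
    peak = max(np.abs(out).max(), 1e-10)
    return out / peak, np.log10(peak)

# Example: 12 s of recorded data, evaluated at t1 = 8 s after the window start.
trace = np.random.randn(3, 12 * SAMPLING_RATE)
normed, log_peak = preprocess_station(trace, t1_samples=8 * SAMPLING_RATE)
```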
The feature vectors from multiple stations are combined using a transformer network (Vaswani et al. 2017). Transformers are attention-based neural networks, originally introduced for natural language processing. A transformer takes a set of n vectors as input and outputs again n vectors which incorporate the context of each other. The attention mechanism allows the transformer to put special emphasis on inputs that it considers particularly relevant and thereby model complex interstation dependencies. Importantly, the parameters of the transformer are independent of the number of input vectors n, allowing a transformer to be trained and applied on variable station sets. To give the transformer a notion of the position of the stations, TEAM encodes the latitude, longitude and elevation of the stations using a sinusoidal embedding and adds this embedding to the feature vectors.
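Since the transformer itself is permutation invariant, all spatial information enters through this embedding. The sketch below shows the standard sinusoidal construction applied to station coordinates; the dimensions and frequency constant are illustrative assumptions rather than the values used in TEAM.

```python
import numpy as np

def sinusoidal_embedding(value, dim, max_scale=10000.0):
    """Transformer-style embedding of one scalar coordinate: sine/cosine
    pairs at geometrically spaced wavelengths (Vaswani et al. 2017)."""
    half = dim // 2
    freqs = max_scale ** (-np.arange(half) / half)
    return np.concatenate([np.sin(value * freqs), np.cos(value * freqs)])

def station_embedding(lat, lon, elev_km, dim_per_coord=100):
    """Embed latitude, longitude and elevation separately and concatenate,
    yielding one position vector per station that is added to its feature
    vector before the transformer."""
    return np.concatenate([
        sinusoidal_embedding(lat, dim_per_coord),
        sinusoidal_embedding(lon, dim_per_coord),
        sinusoidal_embedding(elev_km, dim_per_coord),
    ])

pos = station_embedding(lat=-21.0, lon=-69.5, elev_km=1.2)  # shape (300,)
```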
TEAM adds the position embeddings of the PGA targets as additional inputs to the transformer. In TEAM-LM, we aim to extract information about the event itself, for which we do not know the position in advance. To achieve this, we add an event token, a vector with the same dimensionality as the positional embedding of a station location, which can be thought of as a query vector. This approach is inspired by the so-called sentence tokens in NLP that are used to extract holistic information on a sentence (Devlin et al. 2018). The elements of this event query vector are learned during the training procedure.

From the transformer output, we only use the output corresponding to the event token, which we term the event embedding and which is passed through another multilayer perceptron predicting the parameters and weights of a mixture of Gaussians (Bishop 1994). We use N = 5 Gaussians for magnitude and N = 15 Gaussians for location estimation. For computational and stability reasons, we constrain the covariance matrix of the individual Gaussians for location estimation to a diagonal matrix to reduce the output dimensionality. Even though uncertainties in latitude, longitude and depth are known to generally be correlated, this correlation can be modelled with diagonal covariance matrices by using the mixture.
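The quantity this output head is trained against (see the training description below) is the mixture negative log-likelihood. A minimal NumPy sketch for a diagonal-covariance Gaussian mixture, with toy values of our own:

```python
import numpy as np

def gaussian_mixture_nll(y, weights, means, stds):
    """Negative log-likelihood of target y under a diagonal Gaussian mixture.

    y:       (d,) target, e.g. d=1 for magnitude, d=3 for lat/lon/depth
    weights: (k,) mixture weights, summing to one
    means:   (k, d) component means
    stds:    (k, d) component standard deviations (diagonal covariance)
    """
    # Log density of y under each component (summed over dimensions).
    log_comp = -0.5 * np.sum(
        ((y - means) / stds) ** 2 + np.log(2 * np.pi * stds ** 2), axis=1)
    # Log-sum-exp over components, weighted by the mixture weights.
    log_mix = log_comp + np.log(weights)
    m = log_mix.max()
    return -(m + np.log(np.exp(log_mix - m).sum()))

# Example: a 3-component magnitude mixture evaluated at y = 5.0.
nll = gaussian_mixture_nll(
    y=np.array([5.0]),
    weights=np.array([0.2, 0.5, 0.3]),
    means=np.array([[4.5], [5.1], [6.0]]),
    stds=np.array([[0.4], [0.3], [0.5]]))
```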
The model is trained end-to-end using a log-likelihood loss with the Adam optimizer (Kingma & Ba 2014). We train separate models for magnitude and for location. As we observed difficulties in the onset of the optimization when starting from a fully random initialization, we pretrain the feature extraction network. To this end, we add a mixture density network directly after the feature extraction and train the resulting network to predict magnitudes from single station waveforms. We then discard the mixture density network and use the weights of the feature extraction as initialization for the end-to-end training. We use this pretraining method for both the magnitude and the localization networks.

Similarly to the training procedure for TEAM, we make extensive use of data augmentation during training. First, we randomly select a subset of up to 25 stations from the available station set. We limit the maximum number to 25 for computational reasons. Secondly, we apply temporal blinding by zeroing waveforms after a random time t1. This type of augmentation allows TEAM-LM to be applied to real time data. We note that this type of temporal blinding to enable real time predictions would most likely work for the previously published CNN approaches as well. To avoid knowledge leaks for Italy and Japan, we only use stations as inputs that triggered before time t1 for these data sets. This is not necessary for Chile, as there the maximum number of stations per event is below 25 and waveforms for all events are available for all stations active at that time, irrespective of whether the station actually recorded the event. Thirdly, we oversample large magnitude events, as they are strongly underrepresented in the training data set. We discuss the effect of this augmentation in further detail in the Results section.

In contrast to the station selection during training, in evaluation we always use the 25 stations picking first. Again, we only use stations and their waveforms as input once they have triggered, thereby ensuring that the station selection does not introduce a knowledge leak.
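A compact sketch of the two waveform-level augmentations, assuming an event is stored as a (stations, channels, samples) array; the helper name and the uniform draw of t1 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def augment_event(waveforms, max_stations=25, sampling_rate=100):
    """Training-time augmentation: random station subset plus temporal
    blinding (zeroing all samples after a random time t1).

    waveforms: (stations, channels, samples) array for one event."""
    n_stations, _, n_samples = waveforms.shape
    # First augmentation: a random subset of at most 25 stations.
    keep = rng.choice(n_stations, min(n_stations, max_stations), replace=False)
    subset = waveforms[keep].copy()
    # Second augmentation: blind the waveforms after a random t1, so the
    # model sees the same event at many different "current times".
    t1 = rng.integers(1, n_samples)
    subset[..., t1:] = 0.0
    return subset, keep, t1 / sampling_rate

event = np.random.randn(40, 3, 3000)  # 40 stations, 30 s at 100 Hz
aug, station_idx, t1_s = augment_event(event)
```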
2.3 Baseline methods

Recently, van den Ende & Ampuero (2020) suggested a deep learning method capable of incorporating waveforms from a flexible set of stations. Their architecture uses a similar CNN-based feature extraction as TEAM-LM. In contrast to TEAM-LM, for feature combination it uses maximum pooling to aggregate the feature vectors from all stations instead of a transformer. In addition, they do not add predefined position embeddings, but concatenate the feature vector for each station with the station coordinates and apply a multilayer perceptron to get the final feature vectors for each station. The model of van den Ende & Ampuero (2020) is both trained and evaluated on 100 s long waveforms. In its original form it is therefore not suitable for real time processing, although real time processing could be added with the same zero-padding approach used for TEAM and TEAM-LM. The differences in detail in the CNN structure and the real-time processing capability make a comparison of the exact model of van den Ende & Ampuero (2020) to TEAM-LM difficult.
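The difference between the two feature-combination strategies is easiest to see in code. The sketch below shows only the baseline's pooling path, with a random projection standing in for the learned per-station perceptron; it illustrates the concept and is not the published model:

```python
import numpy as np

def pooling_combination(features, coords):
    """Feature combination in the spirit of van den Ende & Ampuero (2020):
    concatenate each station's feature vector with its coordinates, apply a
    shared per-station transform, then max-pool over the station axis.

    features: (stations, f) per-station feature vectors
    coords:   (stations, 3) latitude, longitude, elevation"""
    combined = np.concatenate([features, coords], axis=1)
    w = np.random.randn(combined.shape[1], 128) * 0.1  # placeholder for the MLP
    per_station = np.tanh(combined @ w)
    # Permutation-invariant aggregation: one vector per event, independent
    # of the number and order of stations, but without attention weights.
    return per_station.max(axis=0)

event_feature = pooling_combination(np.random.randn(18, 64),
                                    np.random.randn(18, 3))
```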
To still compare TEAM-LM to the techniques introduced in this approach, we implemented a model based on the key concepts of van den Ende & Ampuero (2020). As we aim to evaluate the performance differences from the conceptual changes, rather than different hyperparameters, for example the exact size and number of the convolutional layers, we use the same architecture as TEAM-LM for the feature extraction and the mixture density output. Additionally, …

TEAM-LM outperforms the classical magnitude baseline consistently. On two data sets, Chile and Italy, the performance of TEAM-LM with only 0.5 s of data is superior to the baseline with 25 s of data. Even on the third data set, Japan, TEAM-LM requires only approximately a quarter of the time to reach the same precision as the classical baseline and achieves significantly higher precision after 25 s. The RMSE for TEAM-LM stabilizes after 16 s for all data sets, with final values of 0.08 m.u. for Chile, 0.20 m.u. for Italy and 0.22 m.u. for Japan. The performance differences between TEAM-LM and the classical baseline result from the simplified modelling assumptions of the baseline. While the relationship between early peak displacement and magnitude only holds approximately, …
… for Japan result from the wide spatial distribution of seismicity and the therefore highly variable station distribution. While in Italy most events are in Central Italy and in Chile the number of stations is limited, the seismicity in Japan occurs along the whole subduction zone, with additional onshore events. This complexity can likely be handled better with the flexibility of the transformer than with a pooling operation. This indicates that the gains from using a transformer compared to pooling with position embeddings are likely modest for small sets of stations, and highest for large heterogeneous networks.

In many use cases, the performance of magnitude estimation algorithms for large magnitude events is of particular importance. In Fig. 4, we compare the RMSE of TEAM-LM and the classical baselines binned by catalogue magnitude into small, medium and large events. For Chile/Italy/Japan we count events as small if their magnitude is below 3.5/3.5/4 and as large if their magnitude is at least 5.5/5/6. We observe a clear dependence on the event magnitude. For all data sets the RMSE for large events is higher than for intermediate sized events, which is in turn higher than for small events. On the other hand, the decrease in RMSE over time is strongest for larger events. This general pattern can also be observed for the classical baseline, even though the difference in RMSE between magnitude buckets is smaller. As both variants of the deep learning baseline show very similar trends to TEAM-LM, we omit them from this discussion.

We discuss two possible causes for these effects: (i) the magnitude distribution in the training set restricts the quality of the model optimization, and (ii) inherent characteristics of large events. Cause (i) arises from the Gutenberg–Richter distribution of magnitudes. As large magnitudes are rare, the model has significantly fewer examples to learn from for large magnitudes than for small ones. This should impact the deep learning models the strongest, due to their high number of parameters. Cause (ii) has a geophysical origin. As large events have longer rupture durations, the information gain from longer waveform recordings is larger for large events. At which point during the rupture the final rupture size can be accurately predicted is a point of open discussion (e.g. Meier et al. 2017; Colombelli et al. 2020). We probe the likely individual contributions of these causes in the following.

Estimations for large events not only show lower precision, but are also biased (Fig. 5, middle column). For Chile and Italy a clear saturation sets in for large events. Interestingly, the saturation starts at different magnitudes, around 5.5 for Italy and 6.0 for Chile. For Japan, events up to magnitude 7 are predicted without obvious bias. This saturation behaviour is not only visible for TEAM-LM, but has also been observed in prior studies, for example in Mousavi & Beroza (2020, their figs 3 and 4). In their work, with a network trained on significantly smaller events, the saturation already set in around magnitude 3. The different saturation thresholds indicate that the primary cause for saturation is not the longer rupture duration of large events or other inherent event properties, as in cause (ii), but is instead likely related to the low number of training examples for large events, rendering it nearly impossible to learn their general characteristics, as in cause (i). This explanation is consistent with the much higher saturation threshold for the Japanese data set, where the training data set contains a comparably large number of large events, encompassing the year 2011 with the Tohoku event and its aftershocks.

As a further check of cause (i), we trained models without upsampling large magnitude events during training, thereby reducing the occurrence of large magnitude events to the natural distribution observed in the catalogue (Fig. 5, left-hand column). While the overall performance stays similar, the performance for large events is degraded on each of the data sets. Large events are on average underestimated even more strongly. We tried different upsampling rates, but were not able to achieve significantly better performance for large events than with the configuration of the preferred model presented in the paper. This shows that upsampling yields improvements but cannot solve the issue completely, as it does not introduce actual additional data. On the other hand, the performance gains for large events from upsampling seem to cause no observable performance drop for smaller events. As the magnitude distribution in most regions approximately follows a Gutenberg–Richter law with b ≈ 1, upsampling rates similar to the ones used in this paper will likely work for other regions as well.
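The paper does not spell out the exact upsampling scheme, so the following sketch shows one simple weighting consistent with this discussion: sampling weights growing like 10**(b*M) to counteract a Gutenberg–Richter magnitude distribution, capped so that the few largest events do not dominate every batch. All names and values are our own illustration.

```python
import numpy as np

rng = np.random.default_rng()

def oversampling_weights(magnitudes, b=1.0, cap=100.0):
    """Weights that roughly flatten a Gutenberg-Richter distribution:
    event frequency decays like 10**(-b*M), so weighting by 10**(b*M)
    counteracts it; the cap limits the weight of the largest events."""
    w = 10.0 ** (b * (magnitudes - magnitudes.min()))
    return np.minimum(w, cap)

# Stand-in catalogue: magnitudes >= 2 following b = 1 (exponential in M).
mags = 2.0 + rng.exponential(scale=1.0 / np.log(10.0), size=10000)
w = oversampling_weights(mags)
batch_idx = rng.choice(len(mags), size=64, p=w / w.sum())  # oversampled batch
```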
The expected effects of cause (ii), inherent limitations to the predictability of rupture evolution, can be approximated with physical models. To this end, we look at the model from Trugman et al. (2019), which suggests a weak rupture predictability, that is, predictability only after 50 per cent of the rupture duration. Trugman et al. (2019) discuss the saturation of early peak displacement and the effects for magnitude predictions based on peak displacements.
Following their model, we would expect magnitude saturation at approximately magnitude 5.7 after 1 s, 6.4 after 2 s, 7.0 after 4 s and 7.4 after 8 s. Comparing these results to Fig. 5, the saturation for Chile and Italy clearly occurs below these thresholds, and even for Japan the saturation is slightly below the modelled threshold. As we assumed a model with only weak rupture predictability, this makes it unlikely that the observed saturation is caused by limitations of rupture predictability. This implies that our result does not allow any inference on rupture predictability, as the possible effects of rupture predictability are masked by the data sparsity effects.
3.2 Location estimation performance quantiles, better than TEAM-LM in all but the highest quantiles at
We evaluate the epicentral error distributions in terms of the 50th, 90th, 95th and 99th error percentiles (Fig. 6). In terms of the median epicentral error, TEAM-LM outperforms all baselines in all cases, except for the classical baseline at late times in Italy. For all data sets, TEAM-LM shows a clear decrease in median epicentral error over time. The decrease is strongest for Chile, going from 19 km at 0.5 s to 2 km at 25 s. For Italy the decrease is from 7 to 2 km, for Japan from 22 to 14 km. For all data sets the error distributions are heavy tailed. While for Chile even the errors at high quantiles decrease considerably over time, these quantiles stay nearly constant for Italy and Japan.

Similar to the difficulties for large magnitudes, the characteristics of the location estimation point to insufficient training data as a source of errors. The Chile data set covers the smallest region and has by far the lowest magnitude of completeness, leading to the highest event density. Consequently, the location estimation performance is best and outliers are very rare. For the Italy and Japan data sets, significantly more events occurred in regions with only few training events, causing strong outliers. The errors for the Japanese data set are highest, presumably related to the large number of offshore events with consequently poor azimuthal coverage.

We expect a further difference from the number of unique stations. While for a small number of unique stations, as in the Chile data set, the network can mostly learn to identify the stations using their position embeddings, it might be unable to do so for a larger number of stations with fewer training examples per station. Therefore, the task is significantly more complicated for Italy and Japan, where the concept of station locations has to be learned simultaneously with the localization task. This holds true even though we encode the station locations using continuously varying position embeddings. Furthermore, whereas for moderate and large events waveforms from all stations of the Chilean network will show the earthquake and can contribute information, the limitation to 25 stations in the current TEAM-LM implementation does not allow a full exploitation of the information contained in the hundreds of recordings of larger events in the Japanese and Italian data sets. This will matter in particular for out-of-network events, where the wavefront curvature and thus the event distance can only be estimated properly by considering stations with later arrivals.

Looking at the classical baseline, we see that it performs considerably worse than TEAM-LM in the Chile data set in all location quantiles, better than TEAM-LM in all but the highest quantiles at late times in the Italy data set, and worse than TEAM-LM at late times in the Japan data set. This strongly different behaviour can largely be explained by the pick quality and the station density in the different data sets. While the Chile data set contains high quality automatic picks, obtained using the MPX picker (Aldersons 2004), the Italy data set uses a simple STA/LTA and the Japan data set uses triggers from KiK-net. This reduces location quality for Italy and Japan, in particular in the case of a low number of picks available for location. On the other hand, the very good median performance of the classical approach for Italy can be explained by the very high station density, giving a strong prior on the location. An epicentral error of around 2 km after 8 s is furthermore consistent with the results from Festa et al. (2018). Considering the reduction in error due to the high station density in Italy, we note that the wide station spacing in Chile likely caused higher location errors than would be achievable with a denser seismic network designed for early warning.

In addition to the pick quality, the assumption of a 1-D velocity model for NonLinLoc introduces a systematic error into the localization, in particular for the subduction regions in Japan and Chile, where the 3-D structure deviates considerably from the 1-D model. Because of these limitations, the classical baseline could be improved by using more proficient pickers or fine-tuned velocity models. Nonetheless, in particular the results from Chile, where the classical baseline has access to high quality P picks, suggest that TEAM-LM can, given sufficient training data, outperform classical real-time localization algorithms.
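The evaluation metric reduces to great-circle distances and percentiles. A minimal sketch with our own helper, assuming coordinates in degrees:

```python
import numpy as np

def epicentral_errors_km(pred_lat, pred_lon, true_lat, true_lon):
    """Great-circle (haversine) distance in km between predicted and true
    epicentres; inputs are arrays of coordinates in degrees."""
    r = 6371.0  # mean Earth radius in km
    p_lat, p_lon, t_lat, t_lon = map(
        np.radians, (pred_lat, pred_lon, true_lat, true_lon))
    a = (np.sin((p_lat - t_lat) / 2) ** 2
         + np.cos(p_lat) * np.cos(t_lat) * np.sin((p_lon - t_lon) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Error distribution summarized by the percentiles used in Fig. 6.
errors = epicentral_errors_km(
    np.array([-20.1, -21.3]), np.array([-69.0, -70.2]),
    np.array([-20.0, -21.5]), np.array([-69.1, -70.0]))
p50, p90, p95, p99 = np.percentile(errors, [50, 90, 95, 99])
```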
For magnitude estimation, no consistent performance differences between the baseline approach with position embeddings and the approach with concatenated coordinates, as originally proposed by van den Ende & Ampuero (2020), are visible. In contrast, for location estimation, the approach with embeddings consistently outperforms the approach with concatenated coordinates. The absolute performance gain between the baseline with concatenation and the baseline with embeddings is even higher than the gain from adding the transformer to the embedding model. We speculate that the positional embeddings might show better performance because they explicitly encode information on how to interpolate between locations at different scales, enabling an improved exploitation of the information from stations with few or no training examples. This is more important for location estimation, where an explicit notion …
… transfer learning (Pan & Yang 2009), in our use case waveforms from other source regions. This way, the model is supposed to be taught the properties of earthquakes that are consistent across regions, for example attenuation due to geometric spreading or the magnitude dependence of source spectra. Note that a similar knowledge transfer is implicitly part of the classical baseline, as it was calibrated using records from multiple regions.
Here, we conduct a transfer learning experiment inspired by the transfer learning used for TEAM. We first train a model jointly on all data sets and then fine-tune it to each of the target data sets. This way, the model has more training examples, which is of special relevance for the rare large events, but is still adapted specifically to the target data set.
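The two-stage recipe can be written down directly. The sketch below uses a toy Keras regressor and a plain MSE loss as stand-ins for TEAM-LM and its mixture-density likelihood; only the structure, joint training followed by fine-tuning at a lower learning rate, reflects the experiment described above.

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Stand-in for TEAM-LM: a toy regressor mapping 10 features to magnitude.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])

# Random arrays stand in for the pooled Chile + Italy + Japan data and for
# one target region.
x_joint, y_joint = np.random.randn(512, 10), np.random.randn(512, 1)
x_target, y_target = np.random.randn(128, 10), np.random.randn(128, 1)

# Stage 1: joint training on all regions.
model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
model.fit(x_joint, y_joint, epochs=5, verbose=0)

# Stage 2: fine-tune on the target region only; the lower learning rate
# keeps the regionally shared features largely intact.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
model.fit(x_target, y_target, epochs=5, verbose=0)
```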
Figure 7. The 200 events with the highest location errors in the Chile data set overlaid on top of the spatial event density in the training data set. The location estimations use 16 s of data. Each event is denoted by a yellow dot for the estimated location, a green cross for the true location and a line connecting both. Stations are shown by black triangles. The event density is calculated using a Gaussian kernel density estimation and does not take into account the event depth. The inset shows the event density at the true event location in comparison to the event density at the predicted event location for the 200 events. Red circles mark locations of mine blast events. The inset waveforms show one example of a waveform from a mine blast (top) and an example waveform of an earthquake (bottom, 26 km depth) of similar magnitude (MA = 2.5) at a similar distance (60 km) on the transverse component. Similar plots for Italy and Japan can be found in the supplementary material (Figs S5 and S6).
We make three further observations from comparing the predictions to the true values (Fig. S7). First, for nearly all models the RMSE changes only marginally between 16 and 25 s, but the RMSE of this plateau increases significantly with a decreasing number of training events. Secondly, the lower the amount of training data, the lower is the saturation threshold above which all events are strongly underestimated. In addition, for 1/32 and 1/64 of the full data set, an 'inverse saturation' effect is noticeable for the smallest magnitudes. Thirdly, while for the full data set and the largest subsets all large events are estimated at approximately the saturation threshold, if at most one quarter of the training data is used, the largest events even fall significantly below the saturation threshold. For the models trained on the smallest subsets (1/8 to 1/64), the higher the true magnitude, the lower the predicted magnitude becomes. We assume that the larger the event is, the further away from the training distribution it is, and it is therefore estimated approximately at the most dense region of the training label distribution. These observations support the hypothesis that underestimations of large magnitudes for the full data set are caused primarily by insufficient training data.

While the RMSE for epicentre estimation shows a similar behaviour as the RMSE for magnitude, there are subtle differences. If the amount of training data is halved, the performance only degrades mildly and only at later times. However, the performance degradation is much more severe than for magnitude if only a quarter or less of the training data are available. This demonstrates that location estimation with high accuracy requires catalogues with a high event density.

The strong degradation further offers insights into the inner workings of TEAM-LM. Classically, localization should be a task where interpolation leads to good results, that is, the traveltimes for an event in the middle of two others should be approximately the average of the traveltimes for the other events. Following this argument, if the network were able to use interpolation, it should not suffer such significant degradation when faced with fewer data. This provides further evidence that the network does not actually learn some form of triangulation, but only an elaborate fingerprinting scheme, backing the finding from the qualitative analysis of location errors.
Figure 9. Magnitude predictions and uncertainties in the Chile data set as a function of time since the first P arrival. Solid lines indicate median predictions, while dashed lines (left-hand panel only) show the 20th and 80th quantiles of the prediction. The left-hand panel shows the predictions, while the right-hand panel shows the differences between the predicted and true magnitudes. The right-hand panel is focused on a shorter time frame to show the early prediction development in more detail. In both plots, each colour represents a different magnitude bucket. For each magnitude bucket, we sampled 1000 events around this magnitude and combined their predictions. If fewer than 1000 events were available within ±0.5 m.u. of the bucket centre, we use all events within this range. We only use events from the test set. To ensure that the actual uncertainty distribution is visualized, rather than the distribution of magnitudes around the bucket centre, each prediction is shifted by the magnitude difference between the bucket centre and the catalogue magnitude.
Figure 11. 90 per cent confidence areas for sample events around five example locations. For each location, the five closest events are shown. Confidence areas belonging to the same location are visualized using the same colour. Confidence areas were chosen as curves of constant likelihood, such that the probability mass above the likelihood equals 0.9. To visualize the result in 2-D, we marginalize out the depth. Triangles denote station locations for orientation. The top row shows results from a single model, while the bottom row shows results from an ensemble of 10 models.
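Such constant-likelihood regions can be constructed by Monte Carlo: sample from the predicted mixture, evaluate the mixture density at the samples, and take the density quantile that leaves the requested probability mass above it. A sketch under these assumptions (our own names; the published implementation may differ):

```python
import numpy as np

rng = np.random.default_rng()

def likelihood_threshold(weights, means, stds, mass=0.9, n_samples=100000):
    """Constant-likelihood threshold enclosing the requested probability
    mass for a diagonal Gaussian mixture: points with density above the
    returned value form the confidence region."""
    k, d = means.shape
    comp = rng.choice(k, size=n_samples, p=weights)
    samples = means[comp] + stds[comp] * rng.standard_normal((n_samples, d))
    # Mixture density at each sample (diagonal covariances).
    diff = (samples[:, None, :] - means[None]) / stds[None]
    log_comp = -0.5 * np.sum(
        diff ** 2 + np.log(2 * np.pi * stds[None] ** 2), axis=2)
    dens = np.sum(weights * np.exp(log_comp), axis=1)
    # P(density >= threshold) = mass, so take the (1 - mass) quantile.
    return np.quantile(dens, 1.0 - mass)

# Toy 2-component, 2-D mixture (e.g. latitude/longitude after
# marginalizing out depth).
thr = likelihood_threshold(
    weights=np.array([0.6, 0.4]),
    means=np.array([[0.0, 0.0], [2.0, 1.0]]),
    stds=np.array([[0.5, 0.5], [1.0, 0.7]]))
```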
Fig. 10 shows the P-P plots of u in comparison to a uniform distribution. For all data sets and all times the model is significantly miscalibrated, as estimated using Kolmogorov–Smirnov test statistics (Section SM 2). Miscalibration is considerably stronger for Italy and Japan than for Chile. More precisely, the model is always overconfident, that is, it estimates narrower confidence bands than the actually observed errors. Further, in particular at later times, the model is biased towards underestimating the magnitudes. This is least visible for Chile. We speculate that this is a result of the large training data set for Chile, which ensures that for most events the density of training events in their magnitude range is high.
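This diagnostic is straightforward to reproduce: for a perfectly calibrated model the PIT values u (the predicted cumulative distribution evaluated at the true values) are uniform, the P-P curve lies on the diagonal and the Kolmogorov–Smirnov statistic is small. A minimal sketch with an overconfident toy example, where u piles up near 0 and 1:

```python
import numpy as np
from scipy import stats

def pp_calibration(u):
    """Empirical CDF of the PIT values for a P-P plot against the diagonal,
    plus the Kolmogorov-Smirnov test against the uniform distribution."""
    u = np.sort(u)
    ecdf = np.arange(1, len(u) + 1) / len(u)
    ks_stat, p_value = stats.kstest(u, "uniform")
    return u, ecdf, ks_stat, p_value

# Overconfident toy model: true values fall in the tails of the predicted
# distributions, so the PIT values concentrate near 0 and 1.
u = np.random.beta(0.5, 0.5, size=1000)
x, y, ks, p = pp_calibration(u)
```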
To mitigate the miscalibration, we trained ensembles (Hansen & Salamon 1990) …

… pitfalls and particularities of deep learning for this task. We showed that TEAM-LM achieves state of the art results in magnitude estimation, outperforming both a classical baseline and a deep learning baseline. Given sufficiently large catalogues, magnitude can be assessed with a standard deviation of ∼0.2 magnitude units within 2 s of the first P arrival and a standard deviation of 0.07 m.u. within the first 25 s. For location estimation, TEAM-LM outperforms a state of the art deep learning baseline and compares favourably with a classical baseline.

Our analysis showed that the quality of model predictions depends crucially on the training data. While performance in regions with abundant data is excellent, in regions of data sparsity prediction quality degrades significantly. For magnitude estimation this …
ACKNOWLEDGEMENTS

We thank Christian Sippl for providing the P picks for the Chile catalogue. We thank Sebastian Nowozin for insightful discussions on neural network calibration and probabilistic regression. We thank Martijn van den Ende for his comments that helped improve the manuscript. Jannes Münchmeyer acknowledges the support of the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS). We use obspy (Beyreuther et al. 2010), tensorflow (Abadi et al. 2016) and colour scales from Crameri (2018).

DATA AVAILABILITY

…

REFERENCES

Guo, C., Pleiss, G., Sun, Y. & Weinberger, K.Q., 2017. On calibration of modern neural networks, in International Conference on Machine Learning, PMLR.
Hansen, L.K. & Salamon, P., 1990. Neural network ensembles, IEEE Trans. Pattern Anal. Mach. Intell., 12(10), 993–1001.
ISIDe Working Group, 2007. Italian seismological instrumental and parametric database (ISIDe), Istituto Nazionale di Geofisica e Vulcanologia (INGV), doi:10.13127/ISIDE.
Istituto Nazionale di Geofisica e Vulcanologia (INGV), 2008. INGV experiments network, Istituto Nazionale di Geofisica e Vulcanologia (INGV).
Istituto Nazionale di Geofisica e Vulcanologia (INGV), Istituto di Geologia Ambientale e Geoingegneria (CNR-IGAG), Istituto per la Dinamica dei Processi Ambientali (CNR-IDPA), Istituto di Metodologie per l'Analisi …
Münchmeyer, J., Bindi, D., Leser, U. & Tilmann, F., 2021. The transformer earthquake alerting model: a new versatile approach to earthquake early warning, Geophys. J. Int., 225(1), 646–656, doi:10.1093/gji/ggaa609.
Münchmeyer, J., Bindi, D., Sippl, C., Leser, U. & Tilmann, F., 2020b. Low uncertainty multifeature magnitude estimation with 3-D corrections and boosting tree regression: application to North Chile, Geophys. J. Int., 220(1), 142–159.
Münchmeyer, J., Bindi, D., Leser, U. & Tilmann, F., 2021a. Fast earthquake assessment dataset for Chile, GFZ Data Services, doi:10.5880/GFZ.2.4.2021.002.
Münchmeyer, J., Bindi, D., Leser, U. & Tilmann, F., 2021b. TEAM – the transformer earthquake alerting model, GFZ Data Services, doi:10.5880/GFZ.2.4.2021.003.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. & Polosukhin, I., 2017. Attention is all you need, in Proceedings of the 31st Conference on Advances in Neural Information Processing Systems, Long Beach, CA, USA, pp. 5998–6008.
Wigger, P., Salazar, P., Kummerow, J., Bloch, W., Asch, G. & Shapiro, S., 2016. West-Fissure- and Atacama-Fault seismic network (2005/2012), Deutsches GeoForschungsZentrum GFZ, doi:10.14470/3s7550699980.

SUPPORTING INFORMATION

Supplementary data are available at GJI online.
… encoded using colour. There are ∼20 additional events far offshore in the catalogue, which are outside the displayed map region.
Figure S4. Distribution of the hypocentral errors for TEAM-LM, the pooling baseline with position embeddings (POOL-E), the pooling baseline with concatenated position (POOL-C), TEAM-LM with transfer learning (TEAM-TRA) and a classical baseline. Vertical lines mark the 50th, 90th, 95th and 99th error percentiles. The time indicates the time since the first P arrival at any station. We use the mean predictions.
Figure S5. The 100 events with the highest location error in the Italy data set overlaid on top of the spatial event density in the training data set. The estimations use 16 s of data. Each event is denoted by a dot for the estimated location, a cross for the true location and a line connecting both. Stations are not shown as station coverage is dense. The event density is calculated using a Gaussian kernel density estimation and does not take into account the event depth. The inset shows the event density at the true event location in comparison to the event density at the predicted event location.
Figure S7. True and predicted magnitudes after 8 s using only parts of the data sets for training. All plots show the Chile data set. The fraction in the corner indicates the amount of training and validation data used for model training. All models were evaluated on the full test data set.