How Deutsche Bank achieved high availability and scalability with Spanner
Michael Otmar Kaiser
Lead Engineer, Deutsche Bank AG
Eike Falkenberg
Engineering Manager, Google
Deutsche Bank is the leading German bank with strong European roots and a global network. The bank provides financial services to companies, governments, institutional investors, small and medium-sized businesses and private individuals.
For its German retail banking business, the bank recently completed the consolidation of two separate IT systems — Deutsche Bank and Postbank — to create one modern IT platform. This migration of roughly 19 million Postbank product contracts alongside the data of 12 million customers into the IT systems of Deutsche Bank was one of the largest and most complex technology migration projects in the history of the European banking industry.
As part of this modernization, the bank opted to design an entirely new online banking platform, partnering with Google Cloud to migrate from traditional on-premises servers to the cloud. Integral to this migration, as shown by the first production instance serving 5 million Postbank customers, is Spanner, Google Cloud’s fully managed database service. Spanner's high availability, external consistency, and virtually unlimited horizontal scalability made it the ideal choice for this business-critical application. Read on to learn about the benefits that Deutsche Bank achieved from migrating to Spanner, and some best practices it developed to reliably and efficiently scale the platform.
Scaling in high availability environments
Scaling in high availability environments can be challenging, but Spanner does the heavy lifting for Deutsche Bank. Spanner scales horizontally, which allows Deutsche Bank to start small and scale up or down as needed.
In a traditional on-prem project, fixed resources would have been assigned to the online banking databases, provisioned high enough to respond to customer requests quickly even during peaks. In such a setup, the resources remain unused most of the time, as the online banking load profile varies with the time of day (more specifically, with the number of users online at a given time). Traffic is low overnight, increases sharply in the morning to a high load throughout the day, and drops again in the evening hours. Spanner supports elasticity with horizontal scaling based on nodes that can be added and removed at any time, without disrupting any active workloads.
The number of nodes can be changed via the Google Cloud console, gcloud, and the REST API. For automation, Google Cloud provides an open source Autoscaler that runs entirely on Google Cloud. The bank utilized Autoscaler in all environments (including non-production environments) to maximize cost-efficiency while still provisioning enough Spanner capacity for a seamless user experience.
For any components subject to high availability requirements, the Autoscaler used to manage those components must be highly available, too. Below are some of the lessons the bank learned from using it, along with the contributions that will soon be given back to the open-source community.
Faster instance scale up
By default, the Autoscaler checks Spanner instances once per minute. To scale out as early as possible, this interval can be shortened, which increases how often the Autoscaler queries the Cloud Monitoring API. This change, along with choosing the right scaling methods, helped the bank to fulfill its latency service level objectives.
Multi-cluster deployment
Projects running a high availability GKE cluster should consider deploying the Spanner Autoscaler on GKE rather than on Cloud Functions, because the GKE deployment can span multiple regions, which mitigates issues potentially caused by a regional outage. To avoid race conditions between the poller pods, simple semaphore logic can be added so that only one pod manages the Spanner resources at any given time. This is simple to do, since the Autoscaler already persists its state in either Firestore or Spanner.
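The semaphore logic described above can be pictured as a lease with an expiry time. The following is only a minimal sketch under stated assumptions, not the bank's implementation: the names Lease, InMemoryStateStore, try_acquire and poll_once are hypothetical, and the in-memory store stands in for the Firestore or Spanner state store, where the check-and-write would have to run inside a transaction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str        # ID of the pod that currently holds the lease
    expires_at: float  # Unix timestamp after which the lease is free again


class InMemoryStateStore:
    """Hypothetical stand-in for the Autoscaler state store.

    A real store (Firestore or Spanner) would run try_acquire inside a
    transaction so that two pods can never both win the lease.
    """

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None

    def try_acquire(self, pod_id: str, ttl_seconds: float, now: float) -> bool:
        lease = self._lease
        # The lease can be taken if it is free, expired, or already ours (renewal).
        if lease is None or lease.expires_at <= now or lease.holder == pod_id:
            self._lease = Lease(holder=pod_id, expires_at=now + ttl_seconds)
            return True
        return False


def poll_once(store: InMemoryStateStore, pod_id: str, now: float,
              ttl_seconds: float = 90.0) -> str:
    """Run one polling cycle only if this pod holds the lease."""
    if store.try_acquire(pod_id, ttl_seconds, now):
        # The real poller would query Cloud Monitoring and call the scaler here.
        return "polled"
    return "skipped"
```

With a lease time-to-live somewhat longer than the polling interval, a pod in a second region takes over automatically once the active pod stops renewing, for example during a regional outage.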
Manageable complexity
Customizing the Spanner Autoscaler does not require deep expertise. All changes can be made without touching the Autoscaler's poller-core or scaler-core. Semaphore handling and monitoring integration can be implemented in custom wrappers, like the wrappers provided by Google Cloud in the respective poller and scaler folders. For a multi-cluster deployment, you can amend the example kpt files or add custom Helm charts, selecting the option that best suits your needs.
Decouple configuration from deployment
When multiple teams are working with Spanner instances, it can be inconvenient to redeploy the Autoscaler each time the scaling configuration changes. To avoid this, Deutsche Bank fetches the instance configuration from sources external to the image and deployment.
There are two ways to do this:
- Store the configuration separately from the instance, e.g., in Cloud Storage
- Add the configuration to the instance itself, e.g., by setting appropriate Spanner instance labels via Terraform
To read the instance configuration and build the poller's internal instances configuration on the fly, the Google Cloud client libraries (googleapis) provide convenient methods to list and access either files in Cloud Storage buckets or Spanner instances along with their metadata, such as labels.
If you are using Terraform, it’s a good idea to exclude the Spanner instance processing units from Terraform management after the instance has been created. Otherwise, any terraform apply run would reset the autoscaled processing units to the fixed value set in the Terraform configuration. Terraform provides a lifecycle ignore_changes meta-argument that will do the trick.
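As an illustration of the label-based option, the sketch below turns instance labels into poller configuration entries. It rests on several assumptions: the autoscaler_* label keys are a hypothetical naming convention, the fallback sizes are chosen only for this example, and the output field names (projectId, instanceId, units, minSize, maxSize, scalingMethod) are modeled on the Autoscaler's poller parameters and should be checked against its documentation. In a real deployment the input list would come from a Spanner instance listing call instead of a literal.

```python
from typing import Dict, List, Optional

LABEL_PREFIX = "autoscaler_"  # hypothetical label naming convention


def config_from_labels(project_id: str, instance_id: str,
                       labels: Dict[str, str]) -> Optional[dict]:
    """Build one poller config entry from instance labels, or None if not opted in."""
    if labels.get(LABEL_PREFIX + "enabled") != "true":
        return None
    # Label values are always strings; fallbacks are assumptions of this sketch.
    min_size = int(labels.get(LABEL_PREFIX + "min_pu", "100"))
    max_size = int(labels.get(LABEL_PREFIX + "max_pu", "2000"))
    if min_size > max_size:
        raise ValueError(f"{instance_id}: min_pu {min_size} exceeds max_pu {max_size}")
    return {
        "projectId": project_id,
        "instanceId": instance_id,
        "units": "PROCESSING_UNITS",
        "minSize": min_size,
        "maxSize": max_size,
        # Label values must be lowercase, so the method is upper-cased here.
        "scalingMethod": labels.get(LABEL_PREFIX + "method", "stepwise").upper(),
    }


def build_poller_config(instances: List[dict]) -> List[dict]:
    """Collect config entries for all instances that opted in via labels."""
    entries = []
    for inst in instances:
        entry = config_from_labels(inst["project_id"], inst["instance_id"],
                                   inst.get("labels", {}))
        if entry is not None:
            entries.append(entry)
    return entries
```

Because the configuration is rebuilt on every polling cycle, a team can change its scaling bounds by updating labels, without a new Autoscaler deployment.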
Scale at will
The Autoscaler default metrics work well for most use cases. In special cases where scaling needs to be based on different parameters, custom metrics can be configured on an instance level.
A decoupled configuration makes it easy to create custom metrics and test them upfront. By making the custom metric part of the compiled image, using it on an instance level becomes less error prone. By following this approach, scaling a particular instance won’t accidentally stop because of a typo made in a metric definition during a configuration change.
By default, Autoscaler determines scaling decisions based on current storage utilization, 24-hour rolling CPU load, and current high priority CPU load. In cases where scaling parameters differ from this default, e.g., when scaling on medium-priority CPU load, custom metrics can be set in less than one minute.
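One way to picture custom metrics that are part of the compiled image is a registry of vetted definitions that instance configurations reference by name only. The sketch below is an assumption-laden illustration: VETTED_METRICS and resolve_metrics are hypothetical names, the threshold values are placeholders, the field names are modeled loosely on the Autoscaler's metric parameters, and the Cloud Monitoring filter string should be verified against the Spanner metrics documentation.

```python
from typing import Dict, List

# Hypothetical registry shipped inside the image. Thresholds are placeholder values.
VETTED_METRICS: Dict[str, dict] = {
    "medium_priority_cpu": {
        "name": "medium_priority_cpu",
        # Assumed Cloud Monitoring filter; verify before use.
        "filter": (
            'metric.type="spanner.googleapis.com/instance/cpu/utilization_by_priority" '
            'AND metric.label.priority="medium"'
        ),
        "regional_threshold": 65,
        "multi_regional_threshold": 45,
    },
}


def resolve_metrics(requested_names: List[str]) -> List[dict]:
    """Map metric names from an instance configuration to vetted definitions.

    An unknown name fails loudly when the configuration is loaded, so a typo
    is caught early and cannot silently stop scaling for that instance.
    """
    unknown = [n for n in requested_names if n not in VETTED_METRICS]
    if unknown:
        raise ValueError(f"unknown custom metric(s): {', '.join(unknown)}")
    return [dict(VETTED_METRICS[n]) for n in requested_names]
```

The instance configuration then only carries short names such as medium_priority_cpu, while the error-prone filter strings are reviewed and tested once, together with the image.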
Scale ahead of traffic
One minor shortcoming of the Autoscaler is its inability to compensate for sudden load peaks in real time. To prepare for expected peaks, it is advisable to temporarily increase the configured minimum number of processing units. This is easy to implement once the Autoscaler’s instance configuration is decoupled from the Autoscaler image.
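Raising the minimum ahead of an expected peak can be expressed as a small schedule that is evaluated whenever the instance configuration is built. This is a sketch only: PeakWindow and effective_min_size are hypothetical names, the numbers are illustrative, and windows are assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List


@dataclass(frozen=True)
class PeakWindow:
    start: time                # inclusive start of the expected peak
    end: time                  # exclusive end of the expected peak
    min_processing_units: int  # minimum size to enforce during the window


def effective_min_size(base_min: int, windows: List[PeakWindow],
                       now: datetime) -> int:
    """Return the minimum size the Autoscaler should use at the given time."""
    current = now.time()
    boosted = [w.min_processing_units for w in windows
               if w.start <= current < w.end]
    # Never go below the regular minimum, even if a window specifies less.
    return max([base_min, *boosted])


# Example: hold at least 3000 processing units during a morning login peak.
windows = [PeakWindow(time(7, 30), time(9, 0), 3000)]
print(effective_min_size(1000, windows, datetime(2024, 1, 15, 8, 0)))
print(effective_min_size(1000, windows, datetime(2024, 1, 15, 10, 0)))
```

Starting the window some minutes before the peak gives the scale-out time to complete before traffic arrives.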
If changing the configuration isn’t an option, you can either send a POST request to the scaler’s metric endpoint or script gcloud commands that update the timestamps for the last scaling operation in the Autoscaler’s state database and set the instance processing units directly. The first solution may cause concurrent scaling operations, in which case you should be aware of the Autoscaler's internal cooldown settings. By default, the Autoscaler waits 30 minutes before processing a scale-in event and 5 minutes before processing a scale-out event. The second solution fixes the processing units to any value of your choice for n minutes by manipulating the state database timestamps.
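The interplay between the cooldown settings and the stored timestamp can be sketched as follows. The function names are hypothetical, the cooldown values are the defaults quoted above, and whether the Autoscaler accepts a last-scaling timestamp that lies in the future is an assumption that should be checked against the version in use.

```python
from datetime import datetime, timedelta, timezone

SCALE_OUT_COOLDOWN = timedelta(minutes=5)   # default quoted in the text
SCALE_IN_COOLDOWN = timedelta(minutes=30)   # default quoted in the text


def may_scale(direction: str, last_scaling: datetime, now: datetime) -> bool:
    """Return True if the cooldown for the given direction ('in' or 'out') has passed."""
    cooldown = SCALE_OUT_COOLDOWN if direction == "out" else SCALE_IN_COOLDOWN
    return now - last_scaling >= cooldown


def pinned_timestamp(now: datetime, pin_minutes: int,
                     cooldown: timedelta) -> datetime:
    """Timestamp to store as 'last scaling' so that the given cooldown
    ends exactly pin_minutes from now."""
    return now + timedelta(minutes=pin_minutes) - cooldown


now = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
# Hold a manually set size against scale-in for 45 minutes.
pinned = pinned_timestamp(now, 45, SCALE_IN_COOLDOWN)
print(may_scale("in", pinned, now + timedelta(minutes=44)))
print(may_scale("in", pinned, now + timedelta(minutes=45)))
```

Note that a timestamp computed from the scale-in cooldown releases scale-out events earlier, since their cooldown is shorter; this is usually the desired behavior when the goal is to keep capacity from shrinking.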
Conclusion
The open-source Autoscaler is a valuable tool for balancing cost control and performance needs when using Spanner. Autoscaler automatically scales your database instances up and down based on load, which avoids over-provisioning and reduces cost.
The Autoscaler is easy to set up and runs on Google Cloud. Google provides the Autoscaler as open source, which allows full customization of the scaling logic. The core project team at Deutsche Bank worked closely with Google to further improve the tool’s stability and is excited to contribute its enhancements back to the open source community in the near future.
To learn more about the open-source Autoscaler for Spanner, follow the official documentation. You can read more about the Deutsche Bank and Google Cloud partnership in the official Deutsche Bank press release.