
Eight essential best practices for production cloud services

With a new cloud service, developers are focused on the product and the user experience. But as services move from beta to production, many organizations are late to address best practices that are essential to run secure and reliable cloud services at scale:

• Autoscaling
• Immutable Infrastructure
• Monitoring
• Redundancy
• Recovery
• Log Consolidation
• Identity Management
• Secrets Management

Organizations that do not address these concerns as they grow eventually run into service breakdowns and security breaches.

This article will help you identify the gaps in your production cloud platform, and plan a roadmap for your cloud strategy. The intended audience is management, so we keep the tech-talk at a high level. I refer to some specific tools in the Amazon Web Services (AWS) ecosystem, but the concepts apply in any cloud environment.

Introduction

Say your company has developed an application, such as an enterprise, fintech, IoT or social app. After a successful beta you made improvements to the app based on customer feedback, and you feel ready for general production. Next month you will bump up the size of your servers to “large” and launch a marketing campaign to target a few hundred or a few thousand users by end of quarter.

Wait. This approach worked in beta but there’s a whole new set of technical challenges that you will need to surmount before your service is production ready.

This is how breaches happen

Scaling up to production is more than bumping the number of users by an order of magnitude or two; it’s a fundamentally different game. With a larger user base, failures and incidents become more costly, but rolling out fixes also becomes more difficult. And as your user base expands, so does your exposure to bad actors within it.

All of these challenges have best-practice solutions for security and reliability, but they must be addressed together. For example, when a user experiences a bug, by definition, something in your system is in an unexpected state. That’s not just a user experience problem; it’s a security threat. If you are not immediately alerted to that unexpected state, the threat is much worse. When an incident causes your system to fail and the recovery process is not automatic, people scramble to respond. When there are a lot of users, people get stressed, make mistakes and leave doors open, which you may not detect for a long time. This is how breaches happen.

Devops is good, immutable infrastructure is better

Companies have become more aware of the importance of automation in recent years, and devops has become well established. Devops means approaching operations like a developer and capturing processes in code. Fortunately, as Git for revision control, continuous integration (CI/CD) and containers (Docker) have become standard practice among developers, this has become easier. These days, devops people can largely automate release management with well-defined and well-tested releases.

As with your application, the underlying infrastructure must also be well-defined and well tested, but this is often not the case. “Sys-admins” often spin up servers, databases, load-balancers, etc. manually, and it is common that no one knows exactly what configuration is running in production.

The solution to this is what devops folks call infrastructure as code, which means you write code to automate deploying infrastructure, using tools like Terraform or CloudFormation on AWS. Everyone knows by now that this is the right way, but the entrenched culture of “sys-admins” has not yet taken it to heart, despite having adopted “devops” in their titles. I think a lot of admins actually prefer to click than code, and many developers who would write good “infrastructure as code” don’t want the job because “devops” is less glamorous than “full-stack developer”.

Nonetheless, if your devops person is on it, you can achieve the holy grail of devops: immutable infrastructure, which means you never tweak the configuration of infrastructure after it’s deployed. Rather, you always change the “infrastructure as code”, check it into Git, and deploy new infra through an automated process. The benefit of this approach is that you always know exactly what tested configuration you are running in prod, and you can casually blow away the entire environment with a click at any time and deploy a brand new one without breaking a sweat. Some places even have robots that automatically terminate and redeploy bits of infrastructure from code at random, just to keep their devops people agile and on their toes.
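To make the idea concrete, here is a minimal sketch of infrastructure as code using the AWS CDK for Python (Terraform and CloudFormation are equivalent options); the stack and resource names are hypothetical, and a real stack would define far more:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class AppNetworkStack(Stack):
    """Hypothetical network stack: the VPC is defined in code, reviewed in Git,
    and redeployed from scratch rather than tweaked by hand."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Two availability zones give basic redundancy out of the box.
        ec2.Vpc(self, "AppVpc", max_azs=2)

app = App()
AppNetworkStack(app, "AppNetworkStack")
app.synth()
```

Because the whole environment is reproducible from code like this, tearing it down and redeploying it becomes a routine operation rather than an emergency.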

Devops is not enough

Devops automates things but leaves a lot of other important concerns still open, such as monitoring, redundancy, autoscaling, recovery, IAM, etc. Many companies neglect these concerns until it’s too late, and run into service breakdowns and security breaches. In the rest of this article we look at these other aspects of a holistic and well-architected production platform.

Monitoring

A modern car internally monitors its own health across thousands of parameters, from oil level to tire pressure to emissions. Most often, a warning light on your car’s dashboard comes on before you notice anything is wrong, so you can get to a garage before the car fails. Similarly, the many parameters that reflect the health of your cloud platform are all available internally, but it’s up to you to build a dashboard, and most companies wait too long before they get around to it.

With a well conceived monitoring dashboard you will be alerted, for example, when memory usage is creeping up, when processes are failing and restarting, or when application logs indicate that a user is experiencing a bug. Then you can analyze the anomaly calmly and plan a response before things get critical.

In AWS, these health parameters are available through CloudWatch, but you need to be an expert to interpret the data. Third-party products such as Datadog can consolidate and present health status in a way that non-technical folks can understand, but it still takes an expert to configure and maintain them.
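As an illustration, here is a rough sketch of one such alert defined with the AWS SDK for Python (boto3); the memory metric comes from the CloudWatch agent, and the group name and SNS topic are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when average memory usage on the web fleet stays above 80% for 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="web-memory-high",
    Namespace="CWAgent",               # published by the CloudWatch agent
    MetricName="mem_used_percent",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],  # hypothetical group
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],     # hypothetical topic
)
```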

Monitoring must not be an afterthought. If the internal state of your platform is not normal and you don’t know it, it’s a security problem. Once again, a car metaphor is appropriate: If you’re doing 75 mph on the highway and you don’t realize that one of your tires is half deflated, you have a security problem.

Autoscaling

A well-architected cloud platform must be elastic; i.e.: the platform must scale up when the service load increases and it must do so automatically, otherwise your service will fail. Conversely, the platform must automatically scale back when the service load decreases again, otherwise you will waste money on unused capacity. Autoscaling continuously and automatically adjusts the capacity of your platform according to the load.

Unfortunately, this is easier said than done. With autoscaling on AWS or GCP, you have to figure out which resources to scale, such as CPU, memory or database throughput. Then you need to figure out which parameters should trigger the scaling, such as the number of users, processes or database connections. Then you have to figure out how much to scale up or down in each step to make the whole thing responsive and smooth. Frankly, the facilities for autoscaling are fairly crude, and tuning them usually involves trial and error. The situation is getting better, however, with improvements in container management (e.g.: Kubernetes, ECS) and serverless (e.g.: Lambda), but these newer technologies bring their own complexity, and the trade-offs depend on the business.
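For example, a target-tracking policy on AWS keeps a fleet near a chosen utilization level; this boto3 sketch assumes a hypothetical Auto Scaling group named web-asg:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Add or remove instances automatically to keep average CPU around 60%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",          # hypothetical group
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```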

Redundancy

There must be no single point of failure in your production platform, and the platform should be self-healing; i.e.: it should automatically replace failed components without human intervention.

If you have implemented autoscaling you probably have redundancy at the server level. Further, AWS and GCP provide redundancy at the load-balancer, network and database levels out-of-the-box. Nonetheless, reliable redundancy requires that all the automation, deploy processes, configuration, testing and monitoring work perfectly together. If not, even redundant infrastructure may end up in a spiral of failures.
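As a sketch of what server-level redundancy looks like in practice, the Auto Scaling group below (all names and IDs are hypothetical) spans two availability zones and replaces any instance that fails the load balancer’s health check:

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",                          # hypothetical
    LaunchTemplate={"LaunchTemplateName": "web-template"},   # hypothetical
    MinSize=2,                                               # at least one instance per zone
    MaxSize=6,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",     # two availability zones
    HealthCheckType="ELB",             # replace instances the load balancer marks unhealthy
    HealthCheckGracePeriod=300,
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123"
    ],
)
```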

Testing and maintaining a fully redundant and reliable self-healing platform is a challenge.

Recovery

Even with a fully redundant architecture, you still need to plan for failure and disaster. For example, you may deploy a release with a critical bug and need to fall back, or a database migration may fail and you may need to roll back. In the worst case, human error may require that you completely redeploy the platform from scratch and restore the data from a backup. This must be possible.

There are two fundamental questions to answer ahead of time:

  1. Recovery Time Objective (RTO) – What is the maximum acceptable elapsed time to restore a failed service?

  2. Recovery Point Objective (RPO) – If you must restore data from a backed-up checkpoint, what is the maximum acceptable period of data loss?

The inherent trade-off in designing recovery processes is that, as RTO and RPO times get shorter, the implementation cost gets exponentially higher. Companies must make this trade-off consciously and design automated recovery processes ahead of time.
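To make RTO and RPO concrete: with automated backups enabled on an AWS RDS database, a point-in-time restore can be scripted ahead of time; the instance identifiers here are hypothetical:

```python
import boto3

rds = boto3.client("rds")

# Restore the production database to the most recent recoverable point.
# How recent that point is bounds your data loss (RPO); how long the restore
# takes to complete is what counts against your recovery time (RTO).
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="prod-db",          # hypothetical
    TargetDBInstanceIdentifier="prod-db-restored",
    UseLatestRestorableTime=True,
)
```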

Identity and Access Management (IAM)

At the end of the day, people will run your production platform. No matter how much you invest in security, the last mile in security will be that you trust those people. Yet even the most trustworthy people make mistakes. Many of my clients trust me with the highest level of access to their data, the root account, but I almost never use the root account, because I also can, and do, make mistakes. We can reduce the likelihood of mistakes becoming costly by giving people only the specific access that they need to do their job. This is the point of IAM.

AWS and GCP both have a powerful framework for IAM policies and processes, which lets you finely control who can do what, and with which resources, in your environment. However, well-managed IAM is highly specific to each business and organization; very little comes out-of-the-box. There are direct trade-offs between security, cost and convenience. For a small company working with data that is not very sensitive, it can be pretty simple. But for companies with compliance requirements (e.g.: PCI, HIPAA) and larger organizations, IAM can be very complex.
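As a small illustration of least privilege, the policy below (bucket and policy names are hypothetical) grants an application read-only access to a single S3 bucket and nothing else:

```python
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-app-data",      # hypothetical bucket
                "arn:aws:s3:::example-app-data/*",
            ],
        }
    ],
}

# A user or service role attached to this policy can read the bucket,
# but cannot write to it or touch any other resource.
iam.create_policy(
    PolicyName="app-data-read-only",
    PolicyDocument=json.dumps(policy_document),
)
```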

IAM is also concerned with authenticating users’ identities, e.g.: with passwords or tokens or external identities, such as from Facebook or Google. Again, AWS provides tools (i.e.: Cognito) to authenticate and manage identities.

IAM intersects with everything else in your cloud platform: data, processes, infrastructure, security, compliance; so you really need to think about IAM early and review it continuously.

Logging

Servers and applications write reams of logs, and people almost never look at them. Logs get filled with millions of WARNINGs and ERRORs and no one notices because they are buried in files on servers you almost never log into. Logs should also record who has accessed or modified application data, which is critical for security and compliance, but these logs are often not managed in a secure way.

The solution to these problems is log consolidation, which means that logs are continuously copied to a secure cloud-based repository. Logs are actively monitored in the repository, so when there is an error or unusual access, it shows up on a dashboard where you can see it and take action. In the event of an incident or a breach, you have a searchable and correlated audit trail for compliance and forensics.

AWS has an extensive facility for logging access to its own infrastructure (CloudTrail), but it’s very low-level and doesn’t do much for application logs. There are many third-party solutions, however.
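As one concrete piece of this, CloudWatch Logs can turn errors in a consolidated application log group into a metric you can alarm on; the log group name and namespace here are hypothetical:

```python
import boto3

logs = boto3.client("logs")

# Count every log line containing "ERROR" in the consolidated application log group.
logs.put_metric_filter(
    logGroupName="/app/production",               # hypothetical log group
    filterName="application-errors",
    filterPattern="ERROR",
    metricTransformations=[
        {
            "metricName": "ApplicationErrors",
            "metricNamespace": "App/Production",  # hypothetical namespace
            "metricValue": "1",
        }
    ],
)
```

An alarm on this metric, like the monitoring example earlier, brings errors out of buried log files and onto a dashboard where someone will actually see them.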

Deciding what should be logged and the granularity of the log data requires quite a lot of thought from the points of view of compliance, the application data model and organizational processes. This may have to be a joint effort involving a compliance specialist, a subject matter specialist and a cloud architect.

Secrets and key management

Most applications require access to certain secrets to run, such as database passwords, private keys for SSL certificates, and encryption keys for data. However, many organizations do not manage these secrets in a secure way. In the worst case, passwords and keys may be embedded in source code, visible to developers and unsecured. This is a security hole the size of a barn door.

Fortunately, the solution to this problem is straightforward: on AWS, for example, use the Key Management Service (KMS) and Secrets Manager, or on Kubernetes use Kubernetes Secrets. These tools keep secrets secure and give applications runtime access to them through well-defined and auditable IAM policies.
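For illustration, this is roughly what runtime access to a secret looks like with Secrets Manager and boto3; the secret name and its JSON fields are hypothetical:

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

# The database credentials live in Secrets Manager, not in source code or config files.
response = secrets.get_secret_value(SecretId="prod/app/db-credentials")  # hypothetical name
credentials = json.loads(response["SecretString"])

db_user = credentials["username"]
db_password = credentials["password"]
```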

If your company is not using native tools for managing secrets and keys, you should review your approach.

Conclusion

Security, reliability and scalability must be addressed together. There are a lot of moving parts and they must all work together or they will break down. All the aspects of a well-architected cloud platform must be considered before you go to production and the cost trade-offs need to be prioritized. As you scale up, the architecture and trade-offs must be continually reviewed.


As an independent AWS certified cloud architect, I can help you: • Plan your cloud strategy • Find and fix existing security vulnerabilities • Deploy secure and scalable cloud infrastructure • Reduce cloud spending.

Get more information at merritt.cloud or contact me.
