Embracing the resilience imperative: A five-step guide for forward-thinking executives
For executives accountable for business and operational resilience, the mandates are coming—internally and externally. Here’s the path forward.
—
For CIOs, CTOs, and other executives overseeing business and operational stability, the calls for resiliency are coming from inside—and outside—the house. System hiccups erode hard-earned brand equity and lead to customer churn,[1] and investing in reliability is necessary to maintain a competitive edge.[2] Internal mandates to shore up systems against disruption are only increasing,[3] while the General Data Protection Regulation (GDPR)[4] and the Digital Operational Resilience Act (DORA)[5] add a layer of regulatory complexity.
For the executives we talk to—from the board level to C-suite to senior leadership—there are four pressing issues involving resiliency:
● System problems impacting brand reputation and customer experience.
● Operational complexity arising from an intricate, interconnected tech landscape.
● Increased blast radius due to operational vulnerabilities.
● Regulatory compliance around data protection and failover mechanisms.
The stakes are high. Failing to address these challenges can lead to brand damage, operational risk quietly absorbed by the business, and a constant reactive posture.[6] Perhaps most critically, these issues become exponentially more costly to fix down the road than to tackle early. The smart money is on getting out ahead of these expectations.
If you try to bring resilience along later in the life cycle rather than addressing it upfront, you end up with technology sprawl and drift. It becomes much harder to make changes and fix things along the way, especially while you're servicing the business. This maxim echoes across industries, from financial services to healthcare to retail. Proactive resilience is no longer a nice-to-have but a business imperative.
Executives should consider taking calculated risks and tackling unknowns early
Given these pressures, what's the optimal approach for enterprise leaders? Our research and experience point to taking calculated risks while addressing known issues early. This means looking at technology resilience holistically rather than in silos.
Use an interconnected, tightly coupled approach to tackle known risks upfront instead of waiting. Unknown unknowns call for early scenario planning; they can be addressed as they arise if you're prepared with a graceful approach, such as phased rollouts (sketched below). Being too reactionary makes it much harder to fix issues down the line. This holistic view requires breaking down traditional organizational silos and fostering collaboration between IT, security, operations, and business units.
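To make the "graceful approach" concrete, here is a minimal Python sketch of a phased (canary) rollout gate. The stage fractions, error budget, and the get_error_rate hook are illustrative assumptions, not a production-ready implementation.

```python
import random

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]  # fraction of traffic on the new version
ERROR_BUDGET = 0.001                       # illustrative max tolerable error rate

def route_request(stage_fraction: float) -> str:
    """Send a request to 'new' or 'stable' based on the current stage."""
    return "new" if random.random() < stage_fraction else "stable"

def advance_rollout(get_error_rate) -> bool:
    """Walk through stages, halting gracefully if the new version misbehaves."""
    for fraction in ROLLOUT_STAGES:
        if get_error_rate("new") > ERROR_BUDGET:  # hypothetical metric source
            print(f"Halting rollout at {fraction:.0%}; all traffic stays on stable")
            return False
        print(f"Promoting new version to {fraction:.0%} of traffic")
    return True

# Example: a rollout where the new version stays healthy.
advance_rollout(lambda version: 0.0005)
```

The point of the sketch is the shape of the control loop: every stage is a checkpoint where the rollout can stop or reverse before the blast radius grows.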
The five key trends enabling next-gen resilience
Fortunately, emerging technology trends are enabling more robust approaches to resilience that keep firms from being caught on the back foot. Here are five key developments leaders should leverage.
Site Reliability Engineering (SRE) adoption and automation: SREs are at the forefront of driving automation and Infrastructure as Code into environments, embedding resilience into the culture of operations. SREs are not just automating processes—they are redefining how reliability is managed by applying engineering principles to IT operations, ensuring that automation is infused with resiliency from the start. By leveraging SRE practices such as continuous monitoring, automated remediation, and failure-injection testing, organizations can reduce manual intervention, increase system robustness, and drive a proactive approach to operational resilience.
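As one illustration of failure-injection testing, the Python sketch below wraps a dependency call in a decorator that randomly raises errors so retry and fallback logic can be exercised before a real outage does it for you. The fault rate and the fetch_inventory stand-in are hypothetical.

```python
import functools
import random

def inject_faults(fault_rate: float = 0.1):
    """Decorator that randomly raises, simulating a flaky downstream dependency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < fault_rate:
                raise ConnectionError(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(fault_rate=0.2)  # illustrative rate for a test environment
def fetch_inventory(sku: str) -> int:
    return 42  # stand-in for a real downstream call
```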
Shift to service level indicator/objective (SLI/SLO) frameworks: SLIs and SLOs set clear, measurable targets for system performance and reliability, creating a shared accountability model that aligns product development with operational excellence. This approach, combined with strategic automation and AI-driven methods throughout the software lifecycle, enables earlier issue detection, increased velocity, and improved code quality. By enforcing appropriate controls and baking in quality from the start, organizations can shift from reactive to proactive measures. For example, automated remediation tied to SLOs can trigger corrective actions like scaling infrastructure when performance dips, preventing downtime and maintaining high reliability.
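Here is a minimal sketch of that remediation pattern, assuming hypothetical request counters and a hypothetical scale_out() hook; the 99.9% target and the burn threshold are illustrative.

```python
SLO_TARGET = 0.999  # 99.9% availability over the measurement window

def availability_sli(requests_total: int, requests_failed: int) -> float:
    """SLI: fraction of successful requests in the window."""
    if requests_total == 0:
        return 1.0
    return (requests_total - requests_failed) / requests_total

def remaining_error_budget(sli: float) -> float:
    """Share of the error budget left; negative means the SLO is breached."""
    allowed = 1.0 - SLO_TARGET
    return 1.0 - (1.0 - sli) / allowed

def scale_out() -> None:
    print("scaling out")  # hypothetical hook, e.g., an autoscaler API call

def evaluate(requests_total: int, requests_failed: int) -> None:
    budget = remaining_error_budget(availability_sli(requests_total, requests_failed))
    if budget < 0.25:  # budget burning fast: act before the SLO is breached
        scale_out()

evaluate(requests_total=1_000_000, requests_failed=900)  # 99.91% SLI -> low budget, scale out
```

In practice the counters would come from your monitoring stack, and the remediation hook would call a real autoscaler or runbook automation rather than print a message.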
Observability 2.0: Observability is undergoing a significant evolution, moving from basic monitoring to what we refer to as Observability 2.0. While traditional monitoring focuses on predefined metrics, next-gen observability collects rich events and profiles to enable flexible, deep analysis of system behavior, integrating with the software development life cycle (SDLC).[7] Key capabilities include managing SLOs and error budgets aligned with development lifecycles, focusing on ecosystem hygiene and code quality, and optimizing user experience. This enhanced visibility allows organizations to catch issues earlier, understand complex system interactions, and make data-driven decisions about resilience improvements, ultimately prioritizing investments based on customer experience and business outcomes.
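To illustrate the difference from predefined metrics, here is a sketch that emits one wide, structured event per request instead of a handful of pre-aggregated counters. The field names (region, build_sha, and so on) are illustrative, not any specific vendor's schema.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("events")

def handle_checkout(user_id: str, cart_size: int) -> None:
    start = time.monotonic()
    # ... business logic would run here ...
    event = {
        "event": "checkout",
        "user_id": user_id,
        "cart_size": cart_size,
        "region": "us-east-1",  # deployment context for slicing later
        "build_sha": "abc123",  # ties runtime behavior back to the SDLC
        "duration_ms": round((time.monotonic() - start) * 1000, 3),
        "status": "ok",
    }
    logger.info(json.dumps(event))  # one wide event, queryable along any field

handle_checkout("u-42", cart_size=3)
```

Because every field rides along with every event, questions you didn't anticipate ("is latency worse for large carts on this build?") can be answered after the fact without shipping new instrumentation.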
Synthetic data for chaos engineering: As systems grow more complex, synthetic data plays a crucial role in testing system resilience. It allows for comprehensive non-functional testing, such as load and fault injection, without the risks of using real-world data. By leveraging synthetic data, organizations can simulate business processes at scale, uncover hidden edge cases, and safely test under chaotic conditions. This is particularly valuable for regulated industries where compliance concerns restrict the use of real customer data in testing environments.
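As a simple illustration, the sketch below generates structurally realistic but entirely synthetic customer records using only Python's standard library; the fields and distributions are assumptions chosen to surface edge cases, not a real schema.

```python
import random
import string

def synthetic_customer(rng: random.Random) -> dict:
    """A structurally realistic record containing no real PII."""
    return {
        "customer_id": "".join(rng.choices(string.ascii_uppercase + string.digits, k=10)),
        "age": rng.randint(18, 90),
        "balance": round(rng.lognormvariate(7, 1.2), 2),  # heavy tail surfaces edge cases
        "country": rng.choice(["US", "DE", "IN", "BR", "JP"]),
    }

rng = random.Random(1234)  # seeded, so chaos experiments are reproducible
load_test_batch = [synthetic_customer(rng) for _ in range(10_000)]
```

Seeding the generator matters for chaos engineering: a failure found under one batch can be replayed exactly while the fix is verified.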
Self-healing systems: Self-healing systems use smart code designed to degrade, flex, and adapt autonomously to various scenarios, improving resilience without human intervention. This approach leverages coding techniques that enhance code hygiene and integrate algorithms to scale and adjust system behavior dynamically.
By embedding intelligence directly into applications, these systems can predict and address issues proactively through autonomous failovers, graceful shutdowns, and real-time adjustments. This can reduce downtime and minimize disruptions by fixing problems before they are noticed, significantly cutting mean time to recovery. Self-healing capabilities not only boost reliability but also allow IT teams to shift focus from reactive maintenance to strategic innovation.
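One common building block for this behavior is a circuit breaker that fails over to a degraded fallback while the primary dependency is unhealthy, then probes it again after a cool-down. The Python sketch below is illustrative; the thresholds and the fallback strategy would be tuned per system.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fall back autonomously, retry after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures    # illustrative threshold
        self.reset_after_s = reset_after_s  # illustrative cool-down
        self.failures = 0
        self.opened_at = 0.0

    def call(self, primary, fallback):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()  # degrade gracefully, no human intervention
            self.failures = 0      # cool-down elapsed: probe the primary again
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

The cool-down is what makes the system self-healing rather than merely fault-tolerant: the breaker re-admits traffic on its own once the primary recovers, cutting mean time to recovery without a pager going off.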
The way forward: a five-step approach
For executives looking to level up their resilience capabilities, we suggest a five-step approach:
Step one: Identify and prioritize services and dependencies. Start by mapping out critical services and their dependencies across your technology landscape. Understanding which services are vital to your operations and how they interconnect helps prioritize resilience efforts and ensures that key business functions are protected.
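As a starting point, a dependency map can be as simple as a graph walked in code. The Python sketch below uses illustrative service names to find everything a critical service transitively relies on.

```python
# Illustrative service-dependency graph: service -> direct dependencies.
DEPENDENCIES = {
    "checkout": ["payments", "inventory"],
    "payments": ["fraud-check", "ledger"],
    "inventory": ["warehouse-db"],
    "fraud-check": [],
    "ledger": ["ledger-db"],
    "warehouse-db": [],
    "ledger-db": [],
}

def transitive_deps(service: str) -> set[str]:
    """All services that must be healthy for `service` to function."""
    seen: set[str] = set()
    stack = list(DEPENDENCIES.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(DEPENDENCIES.get(dep, []))
    return seen

print(transitive_deps("checkout"))  # everything checkout ultimately depends on
```

Even this toy version makes the prioritization conversation concrete: the services that appear in the most transitive closures are the ones whose failures hurt the widest set of business functions.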
Step two: Identify failures and control the blast radius. Proactively identify potential failure points to help minimize impact. This involves designing systems with isolation in mind to contain disruptions, ensuring that issues in one area don’t cascade across the entire operation.
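One isolation technique that contains blast radius is a bulkhead: a cap on concurrent calls into a dependency so a slow or failing downstream cannot exhaust the caller's resources. The sketch below is illustrative, and the concurrency limit is an assumption.

```python
import threading

_bulkhead = threading.BoundedSemaphore(10)  # illustrative slot limit per dependency

def call_with_bulkhead(func, *args, **kwargs):
    """Reject fast when the dependency's slot pool is exhausted."""
    if not _bulkhead.acquire(blocking=False):
        raise RuntimeError("bulkhead full: shedding load to contain blast radius")
    try:
        return func(*args, **kwargs)
    finally:
        _bulkhead.release()
```

Failing fast here is deliberate: a quick, contained error in one call path is far cheaper than a thread-pool exhaustion that cascades across the entire operation.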
Step three: Define metrics and scale adoption of SLIs/SLOs. Establish clear SLIs and SLOs to measure performance and reliability. These metrics drive the right behaviors, aligning teams around shared goals and helping to scale resilience practices across the organization.
Step four: Address technical debt by prioritizing backlogs and drifts. Regularly review and prioritize your technical debt to manage code drifts and outdated technologies. By actively addressing these issues, you can prevent hidden vulnerabilities from compromising your resilience efforts.
Step five: Build a continuous feedback mechanism to learn and adapt. Implement a feedback loop that captures insights from incidents and operational performance. This mechanism allows your organization to continuously learn, adapt, and refine resilience strategies, fostering an environment of ongoing improvement.
The time to act is now
You're not too late to start, but the time to act is now. While overhauling technology resilience is a continuous journey, there are opportunities to make incremental changes that deliver real impact. These improvements compound over time, creating a virtuous cycle of enhanced reliability, improved customer experience, and reduced operational risk.
Organizations that take a proactive, holistic approach to technology resilience will be better positioned to thrive in an increasingly uncertain and complex business landscape. Those who wait may find themselves perpetually playing catch-up, struggling to meet regulatory requirements, and losing ground to more agile competitors. Technology resilience isn't just an IT issue; it's a board-level business concern.
Ready to take the next step?
Reach out to learn more about our framework and how we can help you build a more robust, reliable technology foundation for your business. Our team can guide you through the process of assessing your current resilience posture, identifying key areas for improvement, and implementing targeted solutions that drive measurable business value.
Don't wait for the next crisis to expose vulnerabilities in your systems—take action now to build more resilient, future-proof technology.
This article is part of a series on technology resilience. For further insights on building resilient operations, stay tuned for the next piece in the series.
Part 0: Why technology is the cornerstone of business resilience
Part 1: Embracing the resilience imperative: A five-step guide for forward-thinking executives
Part 2: How “resilience as code” empowers continuous reliability in complex systems
A blog post by Ganesh Seetharaman, Consulting Management Director and Tech Resiliency Offering Leader, Deloitte Consulting LLP, and Nicholas Merizzi, Principal and Platform & Infrastructure Offering Leader, Deloitte Consulting LLP.
This blog post contains general information only and Deloitte is not, by means of this blog post, rendering accounting, business, financial, investment, legal, tax, or other professional advice or services. This blog post is not a substitute for such professional advice or services, nor should it be used as a basis for any decision or action that may affect your business. Before making any decision or taking any action that may affect your business, you should consult a qualified professional advisor.
Deloitte shall not be responsible for any loss sustained by any person who relies on this blog post. As used in this document, “Deloitte” means Deloitte Consulting LLP, a subsidiary of Deloitte LLP. Please see www.deloitte.com/us/about for a detailed description of our legal structure. Certain services may not be available to attest clients under the rules and regulations of public accounting.
[1] “Operational Risk Management for Brands,” Deloitte, August 21, 2019.
[2] Dunigan O’Keeffe, “How to Succeed in an Era of Volatility,” Harvard Business Review, March 1, 2024.
[3] “IT’s Changing Mandate in an Age of Disruption,” The Economist, September 14, 2021.
[4] Catherine Stupp, “European Privacy Regulators Step up Scrutiny of Business Data Practices,” Wall Street Journal, January 18, 2023.
[5] Hugo Atzema and Noah Brandwijk, “What Can We Expect from the Digital Operational Resilience Act,” Deloitte.
[6] Beth Stackpole, “Cybersecurity Plans Should Center on Resilience,” MIT Sloan, May 2, 2024.
[7] David Linthicum, “Observability—Taking ‘Monitoring’ to the Next Level,” Deloitte, September 16, 2022.