How to handle a service overload? Just turn features off...at least that is what Facebook does

Hello, dear reader. Everyone working on large-scale infrastructure (e.g. as an SRE) has experienced this situation: the system gets overloaded, typically triggered by one (or, in the worst case, several in a row) of a few conditions, such as an unplanned loss of capacity or an unexpected surge in demand.

In all of these cases, the system becomes overloaded and either degrades in performance or goes down entirely. Because it manages the infrastructure for some of the biggest sites on the web, Facebook/Meta probably runs into these issues more than almost anyone, and it came up with a clever solution: on-demand off switches for features, both on the server side and in the clients.

This means that when their systems are overloaded, they have easy knobs to turn that degrade the user experience but protect the system (e.g. turning off comments on posts, which is bad for users but a lot better than the entire system going down).
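To make the idea of a knob concrete, here is a minimal sketch in Python. It is not Defcon's actual interface: the names `Knob`, `KnobRegistry`, `render_post` and the numeric overload levels are all assumptions made up for illustration. Each feature registers a knob with the overload level at which it may be switched off, and the serving code checks the knob before doing the optional work.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Knob:
    """Hypothetical off switch guarding one non-critical feature."""
    name: str
    level: int            # lowest overload level at which the feature is switched off
    enabled: bool = True


@dataclass
class KnobRegistry:
    """Hypothetical registry that operators use to flip knobs during overload."""
    knobs: Dict[str, Knob] = field(default_factory=dict)

    def register(self, name: str, level: int) -> None:
        self.knobs[name] = Knob(name, level)

    def set_overload_level(self, current: int) -> None:
        # Level 0 means healthy; a higher level means a more severe overload.
        # A feature stays on only while the current level is below its threshold.
        for knob in self.knobs.values():
            knob.enabled = current < knob.level

    def is_enabled(self, name: str) -> bool:
        # Features without a registered knob stay on.
        knob = self.knobs.get(name)
        return True if knob is None else knob.enabled


def render_post(registry: KnobRegistry, post: dict) -> dict:
    """Build a page, skipping the comments when their knob is off."""
    page = {"body": post["body"]}
    if registry.is_enabled("comments"):
        page["comments"] = post["comments"]
    return page
```

In this sketch, registering `comments` at level 2 and then calling `set_overload_level(2)` makes `render_post` return the post body without its comments, while `set_overload_level(0)` brings them back. The point of the design is that the decision lives in one place that operators control, and the product code only asks whether a feature is currently allowed.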

Exactly this system, called "Defcon", is explained in this week's paper. It is definitely an inspiration for all SREs and infra people out there.


Abstract:

Every day, billions of people depend on Internet services for communication, commerce, and entertainment. Yet planetary-scale data center infrastructures consisting of millions of servers experience unplanned capacity outages and unexpected demand for resources; how can such infrastructures remain reliable in the face of capacity and workload flux? In this paper, we introduce Defcon, a system for improving the availability of large-scale, globally-distributed Internet services using graceful feature degradation. In response to overload conditions, Defcon enables site operators to gradually disable less-critical features in order to reduce resource demand. Defcon presents a common interface to product developers to define feature knobs that represent degradation capabilities. Defcon automatically tests knobs to understand each knob’s product- and infrastructure-level trade-offs. At Meta, we have used Defcon to improve global product availability in the face of worldwide demand-surges in addition to large-scale infrastructure failures.

Download Link:

https://www.usenix.org/system/files/osdi23-meza.pdf

