In late 2016, while at Microsoft, I wrote a piece of code that caused severe crashes across 8+ regions and significantly hurt our Service Level Agreements (SLAs). Within 30 hours, our team had jumped into action and resolved the crisis. This is the story of one of my biggest career mistakes and what it taught me.

It all started with a subtle error: a null pointer exception in a rarely used code path. I thought it wasn't urgent and even considered going on vacation. But as life would have it, another team made changes that increased how often that code path was hit, leading to massive crashes in multiple regions and serious SLA impact.

I was in shock when I realized the magnitude of what had happened. My heart pounded, but I knew I couldn't freeze. I took ownership and immediately informed leadership. Initially, they thought I was joking, but they soon realized the severity of the issue. I involved the Product Management team to communicate with impacted customers while I focused on finding a fix. Within 30-40 minutes, I had a solution. I tested it thoroughly, validated it in a test region, and gathered approvals for a hotfix. Within 30 hours, we rolled out the fix to all regions.

This experience taught me:
1. High-Quality Code Is Non-Negotiable: Quality code and thorough testing are critical, especially at scale.
2. Ownership Earns Respect: Taking responsibility rather than deflecting blame is crucial in resolving issues.
3. Communication Is Key: Proactive communication with leadership and customers maintains trust.
4. Learn and Reflect: Reflecting on mistakes and learning from them is what makes us better.

I survived one of my worst mistakes by owning, fixing, and growing. Mistakes happen, but it's how we respond that defines us. What's your biggest mistake, and what did it teach you?
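To make the failure mode concrete, here is a minimal Java sketch of a rarely exercised fallback path that assumes a value is always present, and a defensive version of it. All class, field, and method names are hypothetical and are not taken from the actual service described in the post.

```java
import java.util.Optional;

// Hypothetical illustration of an NPE hiding in a rarely used path:
// the fallback endpoint was almost never needed, so the missing null
// check only surfaced once other changes made the path hot.
public class RegionFailover {

    // Before (would throw NullPointerException when never configured):
    //   return backupEndpoint.trim();

    private final String backupEndpoint; // may legitimately be null

    public RegionFailover(String backupEndpoint) {
        this.backupEndpoint = backupEndpoint;
    }

    // After: the rare path handles the missing value explicitly
    // instead of assuming it is always present.
    public Optional<String> resolveBackupEndpoint() {
        return Optional.ofNullable(backupEndpoint)
                       .map(String::trim)
                       .filter(s -> !s.isEmpty());
    }

    public static void main(String[] args) {
        RegionFailover failover = new RegionFailover(null);
        // Prints a safe default instead of crashing the region.
        System.out.println(failover.resolveBackupEndpoint()
                                   .orElse("no backup endpoint configured"));
    }
}
```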
Never delete a deployed resource late at night; you may just end up deleting a full-blown resource group 🤯
Something similar happened to me at a big FANG company I used to work for... it went from millions of requests per minute to zero with just one code change. What made the difference was not only that I owned my mistake as soon as I realized what had happened, but also how my manager handled the situation. Managers can make an amazing impact on someone's career.
I have a different opinion on this. While I agree that, as the primary developer, it may be a miss on our end, I would also like to point out improvements:
1. Enhanced unit tests.
2. Aggressive peer code reviews (have +1s and +2s).
3. Deep test coverage by the Test/QA team (L1 and L2).
4. Dedicated CI jobs that run sanity checks to ensure there's no regression.
And even if the bug isn't caught with multiple stakeholders involved, one shouldn't feel guilty about being solely responsible. At the end of the day, it is the team that provides the deliverable, not the individual developer.
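On point 1, a hedged sketch of what a regression unit test for the rarely used path might look like. JUnit 5 is assumed, and every name here is illustrative rather than from any real codebase.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Optional;
import org.junit.jupiter.api.Test;

// Hypothetical regression test pinning the behaviour of a rare fallback
// path, so a future change cannot silently reintroduce the NPE.
class FallbackPathTest {

    // The kind of helper that is only exercised in the rare path.
    static String describeBackup(String backupEndpoint) {
        return Optional.ofNullable(backupEndpoint)
                       .map(String::trim)
                       .orElse("no backup endpoint configured");
    }

    @Test
    void rarePathToleratesMissingEndpoint() {
        // This case used to throw a NullPointerException.
        assertEquals("no backup endpoint configured", describeBackup(null));
    }

    @Test
    void rarePathTrimsConfiguredEndpoint() {
        assertEquals("https://backup.example", describeBackup(" https://backup.example "));
    }
}
```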
I worked at a startup software company back in December 2000. We had a database of several hundred customers, and my job was to email these customers our holiday newsletter consisting of company news and season's greetings. There was a critically important field that I neglected to check called "opt-out." If a customer indicated "opt-out," they were saying, "Do not send me any correspondence; I do not want to hear from you." Consequently, I ended up email-blasting all of our customers, and we received a fair number of angry replies from customers who had opted out.

When I realized my mistake, I immediately informed my boss, who happened to be the president of our small company. I could tell from his facial expression and tone that he was very displeased (understandably) with the mess I had caused; he never expressed one iota of appreciation for my integrity, responsiveness, and ownership of my mistake. Instead, I was left with a feeling of guilt, as though I had committed an unforgivable crime and, if it happened again, I'd be out the door.

Lesson learned: adhere to my highest values of integrity, courage, and taking responsibility for my mistakes without expecting anything in return, like a pat on the back or an expression of appreciation.
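A minimal sketch of the check that was missed, assuming a simple customer record with an opt-out flag; the class and field names are hypothetical, not from the actual system described.

```java
import java.util.List;

// Hypothetical illustration: filter out opted-out customers
// before any bulk newsletter send.
public class NewsletterSender {

    record Customer(String email, boolean optedOut) {}

    static List<Customer> eligibleRecipients(List<Customer> customers) {
        // The single check whose absence caused the angry replies.
        return customers.stream()
                        .filter(c -> !c.optedOut())
                        .toList();
    }

    public static void main(String[] args) {
        List<Customer> all = List.of(
                new Customer("a@example.com", false),
                new Customer("b@example.com", true)); // opted out

        // Only a@example.com should receive the holiday newsletter.
        eligibleRecipients(all).forEach(c -> System.out.println("send to " + c.email()));
    }
}
```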
I've caused a similar incident at MSFT, also due to a null pointer exception in a rare path, affecting 500K customers in all regions and escalated to sev1 at some point. I was driving to pick up my young toddler on a heavily trafficked road when the pager rang, one hand holding the phone and the other on the steering wheel. My heart pounded, I was anxious, and my brain froze for a second. Then I told myself I had to stay absolutely focused: park the car at the curbside, call people who could back me up one by one until I found someone available, communicate the context, show gratitude, and tell him I'd be online in 30 minutes, all done within five minutes. Then I did all the things you mentioned, and it went fine. After this, I encountered multiple situations of similar or worse severity but never panicked like that again. I know that if I stay razor focused, things will be just fine.
I work in procurement, so a different lane, but I did an analysis of part usage when I was two months out of college. Engineering saw my analysis and insisted I wasn't ordering enough, but I didn't listen. Lo and behold, I was scrambling to expedite more parts a couple of weeks later. I took ownership and owned up to not listening. My boss was understanding and said, "You should be able to go off just your analysis, but typically it's good to hear out the other teams." Apparently a lot of usage wasn't tracked in our inventory transactions; 100 would be issued but manufacturing really used 150 (these were very small, low-cost seals). The learning moment was to consider someone's conflicting view. They see something I don't. Had I considered it, I would have asked the question, "Why is the drawing calling for more than my forecast based on historical usage?" I would have spent a couple of weeks figuring out a solution to track the inventory more effectively rather than scrambling to get parts. The good news was the supply chain gods had my back: another business unit within the company had mass-ordered these parts two years prior and had them sitting in inventory; they were ecstatic to get them off their books and not add to their wasted-inventory column.
Once I had to deploy a patch coming from other developers. I suspected a small code path that could be problematic. I informed them, but the developer said it was a small one and should not have any effect. I deployed the change in one region during IST hours; the users were in US time zones. As soon as the users came online, we saw an outage. We immediately had to roll back and do a root cause analysis. After some days of deep analysis, we were able to reproduce the issue at the exact line of code I had identified. Learning: we should test each and every path of a change, no matter how small it looks.
When I accidentally touched a live electric wire, it caused my body to go into shock for 8 minutes. My consciousness left my body. These are the 8 things I learned from that incident that made me better at not risking my life:
1. Don't touch everything you see.
2. Don't touch anything that you don't know.
3. Electricity can kill you.
4. Curb your curiosity.
5. Don't f around to find out.
6. Call for help ASAP.
7. Electricity is needed for our livelihood.
8. No amount of electricity can kill me compared to this cringe post. One day you will go, Microsoft will die, your profile will die, LinkedIn will die, so the micro-learnings from your career aren't needed by anyone.
I wonder how an "already known" NPE took 30-40 minutes to fix, and on top of that, 30 hours to release. That too at MS. And it's still called a "hotfix." Apologies for being blunt, but I feel there is a need to improve the process as well, to stop the bleeding quickly.
Teams Android monthly active users dropped from millions to almost zero. Everyone thought it was an anomaly initially, and then we realized something was wrong. There was a big PR that commented out the last line in the telemetry fire event. That was me trying to avoid a runtime bug caused by broken build dependencies. The PR was so huge that nobody noticed, and it went to production. Thankfully there was no user impact, but our telemetry was messed up. I was on vacation. The team was kind enough not to make a big fuss about it, but some friends still talk about how I got away with it.
- We added explicit reviewer groups for telemetry framework code.
- I learnt not to send massive PRs.
- It solidified my trust in the team.
That's the kind of team everyone should build, where past mistakes are forgiven and fixed forward. No point crying over spilt milk; best to build a positive path forward.