
Friday Nov 28, 2025
Black Friday War Stories: Lessons from E-Commerce's Worst Days
Why do major retailers with enormous budgets still crash on Black Friday? This episode digs through the graveyard of e-commerce outages, from J.Crew's five-hour crash that cost $775,000 to the AWS typo that cost $150 million.
In this Black Friday special episode, we examine:
📊 THE HALL OF FAME CRASHES
• J.Crew 2018: 323,000 shoppers affected, $775,000 lost in 5 hours
• Walmart 2018: $9 million lost before Black Friday even started
• Best Buy 2014: Infrastructure optimized for desktop, but 78% of traffic was mobile
• Cloudflare 2024: 99.3% of Shopify stores frozen (6M+ domains)
💥 THE FAMOUS NON-BLACK-FRIDAY DISASTERS
• AWS S3 2017: One typo took down half the internet for 4+ hours
• GitLab 2017: 5 backup systems, none working, 300GB data deleted
• k8s.af: The community treasure trove of Kubernetes failures
🛡️ THE PLATFORM ENGINEER'S PLAYBOOK
• Load test at 5-10x (not 2x)
• Multi-CDN/multi-cloud strategies
• Monthly backup restore tests
• Practice chaos before it finds you
• Design mobile-first (78%+ of traffic)
• Safeguards on dangerous commands
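To make the last item concrete, here is a minimal sketch of a safeguard on a dangerous command, in the spirit of the AWS S3 lesson above. It assumes a hypothetical capacity-removal tool; `RemovalPolicy`, `check_removal`, and the limits used are illustrative names and values, not any vendor's actual tooling.

```python
from dataclasses import dataclass


class UnsafeOperationError(Exception):
    """Raised when a requested removal would breach a safety limit."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Hypothetical limits for illustration; tune them to your own fleet.
    min_remaining: int        # servers that must stay in service
    max_percent_per_run: int  # largest share (0-100) one command may remove


def check_removal(total: int, requested: int, policy: RemovalPolicy) -> int:
    """Return the number of servers to remove, or raise if the request is unsafe."""
    if requested < 0 or requested > total:
        raise ValueError(f"requested={requested} is outside 0..{total}")

    # Guard 1: cap the blast radius of a single command (catches typos like 50 vs 5).
    max_allowed = total * policy.max_percent_per_run // 100
    if requested > max_allowed:
        raise UnsafeOperationError(
            f"refusing to remove {requested} servers; limit per run is {max_allowed}"
        )

    # Guard 2: never drop the fleet below its minimum required capacity.
    if total - requested < policy.min_remaining:
        raise UnsafeOperationError(
            f"removal would leave {total - requested} servers; "
            f"minimum is {policy.min_remaining}"
        )

    return requested
```

With `RemovalPolicy(min_remaining=80, max_percent_per_run=5)` and a fleet of 100, a request to remove 5 servers passes, while a mistyped request for 50 raises `UnsafeOperationError` before anything is touched. The point of the design is that the check runs before the destructive step, so a typo fails loudly instead of executing.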
The uncomfortable truth: These outages aren't caused by lack of budget or talent. They're caused by complexity, assumptions, and the gap between "should work" and "actually tested."
🔗 Full transcript & notes: https://platformengineeringplaybook.com/podcasts/00039-black-friday-war-stories
Episode Tags: Black Friday, e-commerce outages, AWS S3, GitLab, Kubernetes, platform engineering, SRE, incident response, chaos engineering, load testing