Platform Engineering Playbook Podcast
The Platform Engineering Playbook Podcast is where AI meets open-source infrastructure knowledge—and you're part of the editorial process. Every episode is researched, scripted, and produced with AI, then reviewed by the community and published on GitHub for anyone to improve. Facing tool sprawl across 130+ platforms? Justifying PaaS costs to your CFO? Navigating the Shadow AI crisis hitting 85% of organizations? We tackle the messy realities of platform engineering that most content avoids, delivering data-backed insights and decision frameworks you can use Monday morning. Built for senior engineers, SREs, and DevOps practitioners with 5+ years in production, we dissect cloud economics, AI governance, infrastructure trade-offs, and career strategy—with the receipts to back it up. Think we got something wrong? Have better data? Open a pull request at platformengineeringplaybook.com. This is infrastructure podcasting as a living document, where the community keeps us honest and the content gets better with every contribution.
Read the playbook at https://platformengineeringplaybook.com
Episodes

Sunday Dec 07, 2025
DORA metrics revolutionized how we measure DevOps performance, but are we missing the bigger picture? This episode explains DORA from the ground up—the four key metrics, how they're measured, and why elite teams deploy more AND fail less. Then we explore what DORA misses: developer satisfaction, cognitive load, and flow state. From SPACE to DevEx to DX Core 4, discover the frameworks changing how we measure developer productivity.
In this episode:
- DORA's Four Key Metrics: Deployment Frequency, Lead Time, Change Failure Rate, and MTTR (now Failed Deployment Recovery Time)
- Elite vs. low performers: Elite teams deploy multiple times daily with a <5% failure rate; low performers deploy monthly with a 40% failure rate
- The big insight from 10 years of DORA: Throughput and stability correlate, so speed and quality go together
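For listeners who want to try the four metrics on their own data, here is a minimal Python sketch. It is not DORA's official tooling: the `Deployment` record, its field names, and the choice to measure recovery from deploy time to restore time are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    recovered_at: Optional[datetime] = None  # when service was restored


def _hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute the four key metrics over a reporting window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deploys if d.failed]
    lead_times = [_hours(d.deployed_at - d.committed_at) for d in deploys]
    # Assumption: recovery time runs from the failed deploy to restoration.
    recoveries = [_hours(d.recovered_at - d.deployed_at)
                  for d in failures if d.recovered_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_recovery_hours": median(recoveries) if recoveries else 0.0,
    }


start = datetime(2025, 12, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=6), failed=True,
               recovered_at=start + timedelta(hours=7)),
    Deployment(start, start + timedelta(hours=8)),
]
metrics = dora_metrics(sample, window_days=2)
print(metrics)
```

Medians are used instead of means so that one unusually slow change does not dominate the result; that is a design choice of this sketch, not something stated in the episode.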
📰 News Segment Links:
• Iterate.ai Launches AgentOne for Enterprise AI Code Security https://thenewstack.io/iterate-ai-launches-agentone-for-enterprise-ai-code-security/
• AWS Introduces Durable Functions: Stateful Logic Directly in Lambda Code https://www.infoq.com/news/2025/12/aws-lambda-durable-functions/
• How Capital One Cut Tracing Data by 70% With OpenTelemetry https://thenewstack.io/how-capital-one-cut-tracing-data-by-70-with-opentelemetry/
Perfect for platform engineers, engineering managers, and DevOps leaders who measure team productivity and want to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00048-developer-experience-metrics-beyond-dora
Duration: 13:16
Host: Jordan and Alex
Category: Technology
Subcategory: Software How-To
Keywords: episode, developer, experience, metrics, beyond, platform engineering

Saturday Dec 06, 2025
Three weeks after their worst outage since 2019, Cloudflare went down again. On December 5, 2025, a Lua code bug took down 28% of HTTP traffic for 25 minutes - the sixth major outage of 2025. Beyond the technical postmortem, this episode examines the pattern of repeated failures, community reactions, and the often-overlooked human cost to on-call engineers.
📰 News Segment Links:
• KubeCon Survey: How Platform Teams Are Adopting AI and IDPs https://thenewstack.io/kubecon-survey-how-platform-teams-are-adopting-ai-and-idps/
• GitHub Actions workflow dispatch now supports 25 inputs https://github.blog/changelog/2025-12-04-actions-workflow-dispatch-workflows-now-support-25-inputs
• Hybrid Cloud-Native Networking in Enterprise - Louis Ryan (Google) https://www.infoq.com/presentations/hybrid-cloud-native-networking-enterprise/
Perfect for platform engineers, SREs, and DevOps engineers who deal with infrastructure dependencies and want to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00047-cloudflare-december-2025-outage-trust-crisis
Duration: 11:35
Host: Jordan and Alex
Category: Technology
Subcategory: Software How-To
Keywords: cloudflare, outage, cost, trust, cloud, episode, human, crisis, december

Friday Dec 05, 2025
Global cloud spend hits $720 billion in 2025—and organizations waste 20-30% on unused resources. Year-end is the perfect time to show savings before budgets reset.
In this episode, Jordan and Alex deliver six actionable quick wins you can implement THIS WEEK:
💰 The Six Wins:
1️⃣ Scheduling non-prod environments → 70% savings
2️⃣ Right-sizing oversized instances → 25-40% per instance
3️⃣ Reserved Instances/Savings Plans → Up to 72% discount
4️⃣ Spot instances for CI/CD → 60-90% savings
5️⃣ Storage tiering → Move cold data to Glacier
6️⃣ Zombie resource hunt → $500-2K/month per account
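A quick way to sanity-check the first win is to compare scheduled hours against an always-on week. The sketch below assumes compute billed purely by the hour and a hypothetical schedule of 10 hours a day, 5 days a week; storage, licenses, and other charges that continue while instances are stopped are ignored.

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours in an always-on week


def scheduling_savings(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute cost saved by running only on a schedule."""
    running_hours = hours_per_day * days_per_week
    if not 0 <= running_hours <= HOURS_PER_WEEK:
        raise ValueError("schedule must fit inside one week")
    return 1 - running_hours / HOURS_PER_WEEK


def monthly_savings(monthly_cost: float, hours_per_day: float, days_per_week: int) -> float:
    """Estimated savings in currency units for an hourly-billed environment."""
    return monthly_cost * scheduling_savings(hours_per_day, days_per_week)


# Hypothetical schedule: 10 hours a day, Monday to Friday.
print(f"{scheduling_savings(10, 5):.0%}")          # prints 70%
print(f"${monthly_savings(1000, 10, 5):,.2f}")     # savings on a $1,000/month environment
```

Under these assumptions, 50 running hours out of 168 lands at roughly the 70% figure quoted above; a wider schedule (say 12 hours a day) saves less.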
📋 Monday Morning Checklist:
• Run cloud cost analyzer (30 min)
• Find top 5 zombie resources (1 hr)
• Schedule one non-prod environment (2 hrs)
• Present findings to manager (30 min)
📰 News Segment:
• Envoy v1.36.3: Three CVEs including request smuggling - PATCH NOW
• Loki Operator 0.9.0: Automatic NetworkPolicy deployment
• AWS Graviton5 M9g: 25-35% performance improvements (preview)
• Uncloud: Deploy containers without Kubernetes complexity
🔗 Full episode page with transcript: https://platformengineering.org/podcasts/00046-cloud-cost-quick-wins-year-end
#PlatformEngineering #FinOps #CloudCosts #DevOps #Kubernetes #CloudOptimization

Thursday Dec 04, 2025
Platform Engineer roles pay 20% more than DevOps Engineer roles, but job descriptions are 90% identical. Is Platform Engineering just DevOps with better marketing?
In this episode, we cut through the confusion with origin stories, philosophy comparisons, and practical career advice.
Key insights:
• Platform Engineer job postings grew 40% YoY while DevOps postings declined 15%
• DevOps (2009) was a movement, never meant to be a job title
• SRE (2003/2016) introduced Google's 50% engineering time rule
• Platform Engineering (2018-2020) brought product thinking to internal tools
• The 20% salary premium is for product thinking, not the title
Decision framework: Start with DevOps culture (non-negotiable) → Add SRE when reliability is your pain point → Add Platform Engineering when cognitive load kills velocity
News segment: PgBouncer CVE-2025-12819 patch, MinIO Docker CVE controversy, GitHub CI/CD observability with OpenTelemetry guide
Resources and transcript: https://platform-engineering-playbook.io/podcasts/00045-platform-engineering-vs-devops-vs-sre

Wednesday Dec 03, 2025
Are certifications worth it? The answer is: it depends. And that's precisely the problem.
In this episode, Jordan and Alex rank 25+ certifications using a data-driven 60/40 framework (60% skill-building, 40% market signal).
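The 60/40 weighting can be sketched in a few lines of Python. Only the two weights come from the episode; the 0-10 rating scale, the certification names, and the ratings themselves are hypothetical placeholders, not the hosts' actual scores.

```python
SKILL_WEIGHT = 0.6   # share given to skill-building (from the episode)
MARKET_WEIGHT = 0.4  # share given to market signal (from the episode)


def cert_score(skill: float, market: float) -> float:
    """Weighted score for one certification; inputs assumed to be on a 0-10 scale."""
    return SKILL_WEIGHT * skill + MARKET_WEIGHT * market


# Hypothetical ratings: (skill-building, market signal).
ratings = {
    "cert_a": (9, 8),
    "cert_b": (5, 9),
    "cert_c": (7, 4),
}

ranked = sorted(ratings, key=lambda name: cert_score(*ratings[name]), reverse=True)
for name in ranked:
    print(f"{name}: {cert_score(*ratings[name]):.1f}")
```

Because skill-building carries the larger weight, a certification with strong market recognition but weak hands-on value (like the made-up `cert_b`) still ranks below one that scores well on both.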
🎯 The Certification Dilemma:
• Platform engineers span Kubernetes, cloud, observability, security, and developer experience
• No single certification captures that breadth
• Most certifications prove you can cram for exams, not solve production problems
📊 Key Statistics:
• Platform engineers earn $172K vs DevOps $152K (13% premium)
• CKA appears in 45,000+ job postings globally
• Average certification investment: $800-1,200/year
• CKA pass rate: 66% (hands-on exam, production-relevant)
• AWS SA Associate holders: 500,000+ (now a minimum credential)
🏆 S-Tier Certifications:
• CKA ($445) - Gold standard, hands-on Kubernetes troubleshooting
• AWS Solutions Architect Professional - Proves cloud architecture depth
• CKS - For security-focused roles, builds on CKA
🔥 Hot Takes:
• AWS SA Associate is OVERRATED - too common to differentiate
• CNPE (launched Nov 2025) - First platform-specific cert, early adopters get 12-18 month advantage
• HashiCorp certs (Terraform, Vault) - Solid B-tier, but post-IBM acquisition concerns
💰 Optimal Certification Stack:
CKA + one cloud Professional + one specialty cert
Total: ~$1,200 investment, 7-9 months timeline
📖 Full blog post: https://platformengineering.org/blog/platform-engineering-certification-tier-list-2025
🔗 Resources:
• CKA Exam: https://training.linuxfoundation.org/certification/certified-kubernetes-administrator-cka/
• CNPE Exam: https://training.linuxfoundation.org/certification/certified-cloud-native-platform-engineer-cnpe/
• AWS Certifications: https://aws.amazon.com/certification/
#PlatformEngineering #Certifications #CKA #CNPE #AWS #Kubernetes #DevOps #CareerDevelopment #CloudNative

Tuesday Dec 02, 2025
The Wild West of AI infrastructure just ended. CNCF launched the Certified Kubernetes AI Conformance Program at KubeCon Atlanta on November 11, 2025.
In this episode, Jordan and Alex break down:
🎯 The Problem AI Teams Faced:
• GPU scheduling worked differently on GKE vs EKS vs OpenShift
• Training on one platform, deploying on another = rewriting code
• GPU utilization stuck at 45-60% without standardization
• 82% of organizations building custom AI, 58% using Kubernetes
⚡ The 5 Core Certification Requirements:
• Dynamic Resource Allocation (DRA) - request GPUs with specific VRAM, interconnect requirements
• Intelligent Autoscaling - cluster and pod scaling based on GPU metrics
• Rich Accelerator Metrics - memory, bandwidth, temperature, NVLink stats
• AI Operator Support - Kubeflow, Ray, KServe compatibility
• Gang Scheduling - all-or-nothing pod startup for distributed training
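To make the "all-or-nothing" idea behind gang scheduling concrete, here is a toy Python sketch. It is not how any Kubernetes scheduler is implemented: the node names, the GPU counts, and the first-fit placement heuristic are all assumptions, and a real scheduler considers far more than free GPU counts.

```python
from typing import Optional


def gang_schedule(free_gpus: dict[str, int], pod_gpu_needs: list[int]) -> Optional[dict[int, str]]:
    """Place every pod of a job, or none of them.

    free_gpus maps node name -> free GPU count; pod_gpu_needs[i] is what pod i requests.
    Returns {pod index: node name}, or None if the whole gang cannot start together.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt changes nothing
    placement: dict[int, str] = {}
    # Largest requests first; first-fit is a heuristic and can miss a feasible packing.
    for idx, need in sorted(enumerate(pod_gpu_needs), key=lambda item: -item[1]):
        node = next((name for name, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod does not fit, so no pod starts
        remaining[node] -= need
        placement[idx] = node
    return placement


cluster = {"node-a": 4, "node-b": 2}        # hypothetical free GPUs per node
print(gang_schedule(cluster, [2, 2, 2]))     # all three workers fit
print(gang_schedule(cluster, [2, 2, 2, 2]))  # fourth worker does not fit, so nothing is placed
```

The point of the pattern is the early `return None`: a distributed training job with only some of its workers running would hold GPUs while making no progress.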
📊 The Impact:
• GPU utilization: 45-60% → 70-85%
• Job queue times: 15-45 min → 3-10 min
• Monthly GPU costs: 30-40% reduction
🏢 Certified Vendors (11+):
AWS EKS, Google GKE, Microsoft Azure, Red Hat OpenShift, Oracle OCI, CoreWeave, Akamai, VMware/Broadcom, Giant Swarm, Kubermatic, Sidero Labs
🔮 What's Coming in v2.0 (2026):
• Topology-aware scheduling
• Multi-node NVLink standardization
• Model serving standards
• Cost attribution for GPU chargeback
📖 Full blog post: https://platformengineering.org/blog/kubernetes-ai-conformance-program-cncf-standardization-guide
🔗 Resources:
• CNCF Announcement: https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/
• GitHub: https://github.com/cncf/k8s-ai-conformance
• GKE Implementation: https://opensource.googleblog.com/2025/11/ai-conformant-clusters-in-gke.html
#Kubernetes #AI #CNCF #PlatformEngineering #DevOps #MLOps #GPU #CloudNative

Monday Dec 01, 2025
Helm 4.0 dropped at KubeCon Atlanta 2025, marking the biggest update in 6 years. Server-Side Apply finally ends the GitOps ownership wars. WASM plugins bring sandboxed security. But what breaks? This is the definitive guide covering SSA deep-dive, migration timeline, and the full breaking changes analysis.
In this episode:
- Server-Side Apply (SSA) replaces three-way merge - field ownership tracked at API server level via managedFields
- SSA delivers 40-60% faster deployments by reducing API calls (1 PATCH vs 2+ GET/PATCH per resource)
- WASM plugins via Extism runtime are optional but recommended - existing Go binaries and shell scripts still work
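Field ownership is easier to picture with a toy model. The Python sketch below imitates the conflict rule of Server-Side Apply in memory; it does not call Kubernetes or Helm, and the `ToyObject` class, the flat field paths, and the manager names are invented for illustration. It also leaves out real SSA behavior such as removing fields a manager stops applying.

```python
class ConflictError(Exception):
    """Raised when an apply would change a field another manager owns."""


_MISSING = object()  # sentinel for "field not set yet"


class ToyObject:
    """In-memory stand-in for one resource with per-field owners."""

    def __init__(self) -> None:
        self.fields: dict[str, object] = {}
        self.owners: dict[str, set[str]] = {}  # field path -> managers that own it

    def apply(self, manager: str, desired: dict[str, object], force: bool = False) -> None:
        # A conflict is a changed value on a field that some other manager owns.
        conflicts = [
            path for path, value in desired.items()
            if path in self.fields
            and self.fields[path] != value
            and self.owners[path] - {manager}
        ]
        if conflicts and not force:
            raise ConflictError(f"{manager} conflicts on {conflicts}")
        for path, value in desired.items():
            if self.fields.get(path, _MISSING) == value:
                self.owners[path].add(manager)   # same value: ownership is shared
            else:
                self.fields[path] = value
                self.owners[path] = {manager}    # new or forced value: sole ownership


obj = ToyObject()
obj.apply("helm", {"spec.replicas": 3, "spec.image": "app:1.0"})
obj.apply("other-controller", {"spec.replicas": 3})  # same value, so both now own the field
try:
    obj.apply("helm", {"spec.replicas": 5})          # changing a co-owned field
except ConflictError as err:
    print("conflict:", err)
obj.apply("helm", {"spec.replicas": 5}, force=True)  # force takes ownership back
print(obj.fields["spec.replicas"], obj.owners["spec.replicas"])
```

The contrast with a client-side three-way merge is that the conflict surfaces at apply time, per field, instead of one tool silently overwriting another's change.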
Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience who want to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00042-helm-4-comprehensive-guide

Sunday Nov 30, 2025
CNCF just launched the first-ever hands-on platform engineering certification at KubeCon Atlanta 2025. But with beta testers reporting 29% scores, is CNPE worth pursuing?
In this episode, Jordan and Alex break down everything you need to know:
🎯 What CNPE Tests:
• GitOps & Continuous Delivery (25%)
• Platform APIs & Self-Service (25%)
• Observability & Operations (20%)
• Platform Architecture (15%)
• Security & Policy Enforcement (15%)
📊 Career Impact:
• Platform engineers earn $219K average (US)
• 20% higher than DevOps engineers
• Second most popular K8s role at 11.47% of job postings
🛤️ Three Certification Paths:
• Traditional: CKA → CKS → CNPA → CNPE
• Fast-track: CNPA → CNPE
• Full coverage: Kubestronaut → CNPE → Golden Kubestronaut
⚠️ Key Considerations:
• No Killer.sh simulator until Q1 2026
• Beta testers reported 29% scores
• Required for Golden Kubestronaut after March 1, 2026
Links:
• Episode page: https://platform-engineering-playbook.io/docs/podcasts/00041-cnpe-certification-guide
• CNPE Certification: https://training.linuxfoundation.org/certification/certified-cloud-native-platform-engineer-cnpe/
• CNPA Certification: https://training.linuxfoundation.org/certification/certified-cloud-native-platform-engineering-associate-cnpa/
#PlatformEngineering #CNPE #CNCF #Kubernetes #DevOps #Certification #CloudNative #CareerDevelopment

Saturday Nov 29, 2025
DORA 2024 found organizations with platform teams saw throughput decrease by 8% and stability decrease by 14%. Wait—isn't platform engineering supposed to help?
In this episode, Jordan and Alex unpack the 10 anti-patterns sabotaging platform engineering initiatives:
ORGANIZATIONAL ANTI-PATTERNS:
1. Ticket Ops - The bottleneck factory where developers wait a week for tasks that should take minutes
2. Ivory Tower Platform - Teams disconnected from developer reality creating standards no one follows
3. Platform as Bucket - When platform scope grows 3x without corresponding team growth
4. Mandatory Adoption - Forcing usage hides resistance and breeds resentment
TECHNICAL ANTI-PATTERNS:
5. Golden Cage - Excessive standardization that blocks productivity
6. Over-Engineered Monolith - Platforms so complex that learning them becomes a project
7. Front-End First - Beautiful portals with manual processes underneath (35% still use spreadsheets!)
8. Biggest Bang Trap - Starting with the hardest problem to maximize ROI
STRATEGIC ANTI-PATTERNS:
9. Day 1 Obsession - Optimizing for <1% of the application lifecycle
10. Build It And They Will Come - No adoption strategy at all
KEY INSIGHTS:
• Spotify's Backstage users are 2.3x more active on GitHub
• Zalando's first step was cultural, not technical
• Teams with stable priorities face 40% less burnout
• Firms with adoption strategies see 30% higher ROI
AUDIT YOUR PLATFORM:
Are developers waiting more than 1 day for requests? → Ticket Ops
Has your platform team gone the whole quarter without pair-programming with devs? → Ivory Tower risk
Has scope grown 3x without team growth? → Platform as Bucket
Are your templates one-size-fits-all with no extension points? → Golden Cage
Is the portal beautiful while help still happens over Slack? → Front-End First
If you answer yes to 3+ of these, your platform initiative is at serious risk.
Read the full blog post: https://platformengineering.org/blog/2025/11/28/platform-engineering-anti-patterns
Sources: DORA 2024, Atlassian DX Report 2024, Port.io 2024, Team Topologies, Spotify Engineering, Zalando Engineering
#PlatformEngineering #DevOps #DORA #DeveloperExperience #InternalDeveloperPlatform #Backstage

Friday Nov 28, 2025
Why do major retailers with unlimited budgets still crash on Black Friday? This episode dives into the graveyard of e-commerce outages—from J.Crew's $775,000 five-hour crash to the AWS typo that cost $150 million.
In this Black Friday special episode, we examine:
📊 THE HALL OF FAME CRASHES
• J.Crew 2018: 323,000 shoppers affected, $775,000 lost in 5 hours
• Walmart 2018: $9 million lost before Black Friday even started
• Best Buy 2014: Infrastructure optimized for desktop, got 78% mobile
• Cloudflare 2024: 99.3% of Shopify stores frozen (6M+ domains)
💥 THE FAMOUS NON-BLACK-FRIDAY DISASTERS
• AWS S3 2017: One typo took down half the internet for 4+ hours
• GitLab 2017: 5 backup systems, none working, 300GB data deleted
• k8s.af: The community treasure trove of Kubernetes failures
🛡️ THE PLATFORM ENGINEER'S PLAYBOOK
• Load test at 5-10x (not 2x)
• Multi-CDN/multi-cloud strategies
• Monthly backup restore tests
• Practice chaos before it finds you
• Design mobile-first (78%+ of traffic)
• Safeguards on dangerous commands
The uncomfortable truth: These outages aren't caused by lack of budget or talent. They're caused by complexity, assumptions, and the gap between "should work" and "actually tested."
🔗 Full transcript & notes: https://platformengineeringplaybook.com/podcasts/00039-black-friday-war-stories
Episode Tags: Black Friday, e-commerce outages, AWS S3, GitLab, Kubernetes, platform engineering, SRE, incident response, chaos engineering, load testing




