Platform Engineering Playbook Podcast

The Platform Engineering Playbook Podcast is where AI meets open-source infrastructure knowledge—and you're part of the editorial process. Every episode is researched, scripted, and produced with AI, then reviewed by the community and published on GitHub for anyone to improve. Facing tool sprawl across 130+ platforms? Justifying PaaS costs to your CFO? Navigating the Shadow AI crisis hitting 85% of organizations? We tackle the messy realities of platform engineering that most content avoids, delivering data-backed insights and decision frameworks you can use Monday morning. Built for senior engineers, SREs, and DevOps practitioners with 5+ years in production, we dissect cloud economics, AI governance, infrastructure trade-offs, and career strategy—with the receipts to back it up. Think we got something wrong? Have better data? Open a pull request at platformengineeringplaybook.com. This is infrastructure podcasting as a living document, where the community keeps us honest and the content gets better with every contribution.

Read the playbook at https://platformengineeringplaybook.com

Listen on:

  • Apple Podcasts
  • YouTube
  • Podbean App
  • Spotify
  • Amazon Music

Episodes

Thursday Nov 27, 2025

This Thanksgiving, let's talk about the people you've never thanked. 60% of open source maintainers are unpaid. 60% have left or considered leaving. Your infrastructure runs on their free time.
In this episode:
  • Gratitude tools: npx thanks, npm fund, cargo-thanks, thanks-stars
  • Happiness Packets: send anonymous thank-you notes to developers
  • Beyond stars: why specific use-case emails matter more than generic thanks
  • Company-level: Open Source Pledge ($2K/dev/year), GitHub Sponsors
  • Your 5-minute Thanksgiving challenge
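The gratitude tools listed above are mostly one-liners. A hedged sketch, assuming a Node project with a package.json (and, for the last line, a Rust project with the third-party cargo-thanks subcommand installed):

```shell
# List dependencies that declare funding links (built into npm 6.13+)
npm fund

# Scan package.json and show maintainers' donation pages
# ("thanks" is a community CLI that npx fetches on demand)
npx thanks

# Rust equivalent: star the GitHub repos behind your Cargo.lock
# (install first with: cargo install cargo-thanks)
cargo thanks
```

All three read your existing dependency manifests, so there is nothing to configure before running them.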
Perfect for platform engineers, developers, and engineering leadership who want to support the open source ecosystem.
Episode URL: https://platformengineeringplaybook.com/podcasts/00038-thanksgiving-oss-gratitude
Duration: 9:00
Hosts: Jordan and Alex
Category: Technology
Subcategory: Software How-To
Keywords: open source, maintainer, gratitude, Thanksgiving, npm, cargo, GitHub Sponsors, Open Source Pledge, dependencies

Wednesday Nov 26, 2025

CNCF celebrates 10 years with 300,000 contributors and 230+ projects—but the hallway track told a different story. 60% of maintainers unpaid. 60% have left or considered leaving. The XZ Utils backdoor showed what happens when isolated maintainers burn out. Han Kang's passing reminds us of the human cost behind the code.
In this episode:
  • Technical breakout sessions: CiliumCon (TikTok IPv6, 60K-node clusters), in-toto graduation, Gateway API convergence, OpenTelemetry eBPF
  • Open Source Pledge: Antithesis $110K, Convex $100K in real cash to maintainers
  • Kat Cosgrove's survival strategies: "When you're an open source maintainer, you don't get to have a bad day in public"
Perfect for platform team leads, open source contributors, and engineering leadership invested in community sustainability.
Episode URL: https://platformengineeringplaybook.com/podcasts/00037-kubecon-2025-community-sustainability

Tuesday Nov 25, 2025

After years of "what even IS platform engineering" debates, KubeCon 2025 delivered consensus: three non-negotiable principles, real-world adoption at Intuit/Bloomberg/ByteDance scale, and the honest truth about maintainer burnout. Kat Cosgrove's "ready to abandon ship" quote reveals the human cost of building the infrastructure everyone depends on.
In this episode:
  • Three platform principles emerged: API-first self-service, business relevance (not just tech metrics), and a managed-service approach (not templates)
  • The "puppy for Christmas" anti-pattern explains the 70% platform team failure rate: templates handed over without ongoing operational support
  • Intuit migrated Mailchimp's 11M users and 700M emails/day so seamlessly that "developers didn't even notice"
Perfect for platform team leads, principal engineers, and technical leadership considering or running platform engineering initiatives.
Episode URL: https://platformengineeringplaybook.com/podcasts/00036-kubecon-2025-platform-engineering

Monday Nov 24, 2025

Google donates a GPU driver live on stage. OpenAI saves $2.16M/month with one line of code. Kubernetes rollback finally works after 10 years. What changed at KubeCon Atlanta 2025 that proves Kubernetes isn't adapting to AI—it's being rebuilt for it?
This is Part 1 of our three-part deep dive into KubeCon Atlanta 2025 (November 10-13). Over three episodes, we're covering the CNCF's 10-year anniversary, the announcements reshaping platform engineering, and the honest conversations about ecosystem sustainability.
Key Topics Covered:
  • Dynamic Resource Allocation (DRA) reaches GA in Kubernetes 1.34, preventing the 10-40% GPU performance loss from NUMA misalignment ($200K/day waste at 100-node scale)
  • CPU DRA driver announced, enabling Kubernetes + Slurm integration for HPC workloads (computational fluid dynamics, molecular modeling, financial simulations)
  • Workload API arrives in alpha for gang-scheduling multi-pod AI training jobs, eliminating partial-failure waste
  • OpenAI freed 30,000 CPU cores ($2.16M/month in savings) by disabling inotify in Fluent Bit after profiling revealed 35% of CPU time spent in fstat64
  • Kubernetes rollback achieves a 99.99% success rate after 10 years; skip-version upgrades now supported
Tomorrow in Part 2: Platform engineering reaches consensus on three principles, real-world case studies from Intuit/Bloomberg/ByteDance, and the "puppy for Christmas" anti-pattern.
Monday Action Plan:
  1. Test Kubernetes 1.34 in development with DRA enabled
  2. Profile your highest-CPU service with perf or eBPF (spend 30 minutes)
  3. Check for NUMA misalignment in GPU workloads
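Step 2 of the action plan might look like the following rough sketch using Linux perf. The process name and sampling window are placeholders, and perf needs root (or a lowered perf_event_paranoid setting):

```shell
# Sample CPU stacks of the suspect service for 30 seconds
# (substitute your own highest-CPU process for fluent-bit)
PID=$(pgrep -o -x fluent-bit)
sudo perf record -F 99 -g -p "$PID" -- sleep 30

# Rank where the CPU time went; a single syscall such as fstat64
# dominating the profile is the kind of signal OpenAI acted on
sudo perf report --stdio | head -n 40
```

An eBPF-based alternative (e.g. bpftrace or the BCC profile tool) gives similar stack samples without perf.data files, if that fits your environment better.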
Full Episode Page: https://platformengineeringplaybook.com/podcasts/00035-kubecon-2025-ai-native
Read the Complete Blog Post: https://platformengineeringplaybook.com/blog/2025/11/24/kubecon-atlanta-2025-recap
Part of the Platform Engineering Playbook Podcast series. Open source, community-driven content for senior platform engineers, SREs, and DevOps engineers.

Sunday Nov 23, 2025

Your H100 costs $5,000 per month, but you're only using it at 13% capacity—wasting $4,350 monthly per GPU. Analysis of 4,000+ Kubernetes clusters reveals 60-70% of GPU budgets burn on idle resources because Kubernetes treats GPUs as atomic, non-shareable resources. Discover why this architectural decision creates massive waste, and the five-layer optimization framework (MIG, time-slicing, VPA, Spot, regional arbitrage) that recovers 75-93% of lost capacity in 90 days.
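The headline waste figure is plain arithmetic: monthly cost times the idle fraction. A quick shell check:

```shell
# A $5,000/month GPU at 13% utilization leaves 87% of the spend idle
cost=5000
util_pct=13
waste=$(( cost * (100 - util_pct) / 100 ))
echo "Idle spend per GPU: \$${waste}/month"   # → Idle spend per GPU: $4350/month
```

Multiply by your GPU count to estimate cluster-wide waste before any optimization.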
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00034-kubernetes-gpu-cost-waste-finops
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!
Keywords: kubernetes gpu, gpu cost optimization, multi-instance gpu, kubernetes finops, gpu utilization, spot instances, vertical pod autoscaler, aws eks cost allocation, nvidia mig, gpu time-slicing
Summary:
  • Analysis of 4,000+ K8s clusters shows 13% average GPU utilization because Kubernetes treats GPUs as atomic resources—when a pod requests nvidia.com/gpu:1, it locks the entire GPU even when using only 5% of capacity, leaving the remaining 95% unusable by other workloads
  • Platform teams compound the waste with round-number overprovisioning (memory: 16GB when P99 usage is 4.2GB) without Vertical Pod Autoscaler data, and miss 2-5x regional cost differences plus 70-90% Spot instance savings by anchoring on AWS us-east-1 on-demand pricing
  • Multi-Instance GPU (MIG) enables up to 7 isolated instances per A100 with hardware partitioning—real SaaS example: 50 dedicated A100s ($23,760/month) → 8 A100s with 7×1g.10gb MIG instances ($3,802/month), an 84% cost reduction with security isolation maintained
  • Five-layer solution framework: Kubernetes resource configuration (GPU limits, node taints preventing 30% non-GPU pod waste), MIG for production inference, time-slicing for development (75% savings per developer), AWS EKS Split Cost Allocation (pod-level GPU tracking since Sept 2025), and model optimization (quantization achieving 4-8x compression)
  • 90-day implementation playbook: Days 1-30 foundation (DCGM Exporter, node taints, VPA in recommendation mode, cost tracking), Days 31-60 optimization (right-sizing from VPA data, MIG for production, time-slicing for dev), Days 61-90 advanced (regional arbitrage, Spot pilot, model quantization)—target outcome is 13-30% baseline → 60-85% utilization with $780K annual savings for 20-GPU clusters
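Two of the framework's layers come down to single commands. A hedged sketch (the node name is a placeholder, and MIG profile IDs vary by GPU model):

```shell
# Layer 1: taint GPU nodes so ordinary pods can't land on them
# (the ~30% non-GPU pod waste noted in the summary)
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# Layer 2: enable MIG mode on GPU 0, then carve an A100 80GB into
# 7x 1g.10gb instances; profile ID 19 applies to that card, so check
# `nvidia-smi mig -lgip` for the IDs valid on your hardware
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
```

GPU pods then opt back onto the tainted nodes with a matching toleration, and the NVIDIA device plugin exposes each MIG slice as its own schedulable resource.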

Saturday Nov 22, 2025

Kernel-level eBPF should beat user-space proxies—but Istio Ambient delivers 8% mTLS overhead while Cilium shows 99%. Academic benchmarks reveal why architecture boundaries matter more than execution location, and what that means for your service mesh choice in 2025.
In this episode:
  • Istio Ambient (user-space) achieves 8% mTLS overhead vs Cilium (kernel eBPF) at 99%—a counterintuitive result explained by L7 processing boundaries requiring kernel/user-space transitions
  • A 50,000-pod stability test shows Cilium's distributed control plane crashed the API server under churn while Istio's centralized control handled it—a 20% per-core efficiency and 56% total throughput advantage
  • Decision framework: Ambient for 1,000+ nodes with mixed L4/L7 traffic (saves $186K/year on a 2,000-pod cluster), Cilium for <500 nodes of pure L4, sidecars for multi-cluster compliance
Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience looking to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00033-service-mesh-showdown-cilium-istio-ambient

Friday Nov 21, 2025

HashiCorp's license change and IBM's $6.4B acquisition created the "you must migrate" narrative—but 70% of teams using Terraform in-house aren't legally affected. Jordan and Alex challenge the binary thinking with Fidelity's 50,000 state file migration case study, a three-factor decision framework, and the truth nobody talks about: migration is 90% organizational change management, not technology.
In this episode:
  • 70% of teams using Terraform in-house are unaffected by BSL license restrictions, yet face strategic vendor lock-in risk after IBM's $6.4B acquisition
  • Fidelity migrated 50,000 state files managing 4M resources in 2 quarters—the technical migration is trivial; organizational change management is the challenge (6 months to 70% completion)
  • OpenTofu 1.7+ delivers native state encryption after 5+ years of unfulfilled Terraform community requests—for compliance-heavy industries (finance, healthcare, government), this alone justifies migration
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience managing infrastructure-as-code at scale.
Episode URL: https://platformengineeringplaybook.com/podcasts/00032-terraform-opentofu-debate

Thursday Nov 20, 2025

GitHub Universe 2025 announced Agent HQ—mission control for orchestrating AI agents from OpenAI, Anthropic, Google, and more. Azure SRE Agent saved Microsoft 20,000+ engineering hours. But 80% of companies report agents executing unintended actions, and only 44% have agent-specific security policies. Jordan and Alex break down what agentic DevOps actually means, the architectural shift from automation to autonomy, and the tiered adoption framework for deploying agents without creating catastrophic risk.
In this episode:
  • GitHub Agent HQ enables multi-agent orchestration (OpenAI, Anthropic, Google, Cognition) with an Enterprise Control Plane for governance
  • The Copilot coding agent works asynchronously—it spins up a GitHub Actions environment, writes code, and pushes to a draft PR while you sleep
  • Azure SRE Agent saved 20,000+ engineering hours at Microsoft with human-in-the-loop incident response
Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience looking to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00031-agentic-devops-github-agent-hq

Wednesday Nov 19, 2025

A routine database permissions change triggered Cloudflare's worst outage since 2019—taking down ChatGPT, X, Shopify, Discord, and 20% of the internet for nearly 6 hours. Jordan and Alex dissect the technical chain reaction from ClickHouse metadata exposure to a Rust panic in the FL2 proxy, examining how ~60 features became >200 and exceeded a hardcoded memory limit. The third major cloud outage in 30 days—after AWS and Azure—raises critical questions about infrastructure concentration risk and why internal configuration needs the same defensive programming as external input.
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience looking to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00030-cloudflare-outage-november-2025

Wednesday Nov 19, 2025

The de facto standard Kubernetes ingress controller will stop receiving security patches in March 2026—and only 1-2 people have been maintaining it for years. Jordan and Alex unpack why this happened, examine the security implications of unpatched CVEs on internet-facing infrastructure, and provide a four-phase migration framework to Gateway API. Includes controller comparison (Envoy Gateway, Cilium, Kong, Traefik, NGINX Gateway Fabric) and immediate actions for this week.
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience looking to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00029-ingress-nginx-retirement

Copyright 2025 All rights reserved.
