Platform Engineering Playbook Podcast

The Platform Engineering Playbook Podcast is where AI meets open-source infrastructure knowledge—and you're part of the editorial process. Every episode is researched, scripted, and produced with AI, then reviewed by the community and published on GitHub for anyone to improve. Facing tool sprawl across 130+ platforms? Justifying PaaS costs to your CFO? Navigating the Shadow AI crisis hitting 85% of organizations? We tackle the messy realities of platform engineering that most content avoids, delivering data-backed insights and decision frameworks you can use Monday morning. Built for senior engineers, SREs, and DevOps practitioners with 5+ years in production, we dissect cloud economics, AI governance, infrastructure trade-offs, and career strategy—with the receipts to back it up. Think we got something wrong? Have better data? Open a pull request at platformengineeringplaybook.com. This is infrastructure podcasting as a living document, where the community keeps us honest and the content gets better with every contribution.

Read the playbook at https://platformengineeringplaybook.com

Listen on:

  • Apple Podcasts
  • YouTube
  • Podbean App
  • Spotify
  • Amazon Music

Episodes

Wednesday Nov 19, 2025

The de facto standard Kubernetes ingress controller will stop receiving security patches in March 2026—and only 1-2 people have been maintaining it for years. Jordan and Alex unpack why this happened, examine the security implications of unpatched CVEs on internet-facing infrastructure, and provide a four-phase migration framework to Gateway API. Includes controller comparison (Envoy Gateway, Cilium, Kong, Traefik, NGINX Gateway Fabric) and immediate actions for this week.
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience looking to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00029-ingress-nginx-retirement

Tuesday Nov 18, 2025

What if you could achieve complete observability coverage—every HTTP request, database query, and gRPC call—without touching application code? Jordan and Alex investigate eBPF instrumentation for OpenTelemetry, revealing how kernel-level hooks deliver under 2% CPU overhead versus traditional APM agents' 10-50%. Discover the May 2025 inflection point, the TLS encryption challenge, and a practical framework for combining eBPF with SDK instrumentation.
In this episode:

  • eBPF instrumentation achieves under 2% CPU overhead by observing kernel operations already happening—versus 10-50% for traditional APM agents
  • Grafana donated Beyla to OpenTelemetry in May 2025, making eBPF instrumentation part of the core ecosystem
  • eBPF captures protocol-level data (HTTP, gRPC, SQL) but cannot access application context like user IDs or feature flags—use SDKs for business-critical paths
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience looking to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00028-opentelemetry-ebpf-instrumentation

Monday Nov 17, 2025

Prometheus is free, Grafana is free, Loki is free—yet Datadog posted $2.3B in revenue and Shopify runs a 15-person team just to manage their observability stack. We decode which open source tools (Prometheus, Loki, Tempo, VictoriaMetrics) actually deliver on their promises, which hide massive operational complexity, and when the "free" option costs more than paying a vendor. Learn the decision framework that matches observability architecture to your team's operational maturity.
In this episode:

  • Single-cluster Prometheus costs ~5 hrs/month ($750-1500 equivalent), but multi-cluster federation jumps to 40-80 hrs/month ($6K-12K)—know your tier before committing
  • Loki delivers 5-10x cheaper storage than OpenSearch but 3-5x slower queries for complex searches—works brilliantly for structured logs with good labels, struggles with full-text search
  • VictoriaMetrics reports 40-60% storage reduction vs Prometheus with better high-cardinality handling—consider it before jumping to commercial platforms
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience making build-vs-buy decisions for observability infrastructure.
Episode URL: https://platformengineeringplaybook.com/podcasts/00027-observability-tools-showdown

Sunday Nov 16, 2025

Kubernetes commands 92% market share, yet 88% report year-over-year cost increases and 25% plan to shrink deployments. We unpack the 3-5x cost underestimation problem, the cargo cult adoption pattern, and when alternatives like Docker Swarm, Nomad, ECS, or PaaS platforms deliver better ROI. From the 200-node rule to 37signals' $10M+ five-year savings leaving AWS, this is your data-driven framework for right-sizing infrastructure decisions in 2025.
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00026-kubernetes-complexity-backlash
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!
Summary:

  • 88% of Kubernetes adopters report year-over-year TCO increases (Spectro Cloud 2025), with teams underestimating total costs by 3-5x when missing human capital ($450K-$2.25M for 3-15 FTE platform team), training (6-month ramp-up), and tool sprawl
  • The 200-node rule: Kubernetes makes sense above 200 nodes with complex orchestration needs; below that, Docker Swarm (10-minute setup), HashiCorp Nomad (10K+ node scale), AWS ECS, Cloud Run (production in 15 minutes), or PaaS platforms ($400/month vs $150K/year K8s team) often win
  • 209 CNCF projects create analysis paralysis, with 75% of teams inhibited by complexity and one fintech startup wasting 120 engineer-hours evaluating a service mesh it didn't need for its 30 services
  • Real 5-year TCO comparison: Kubernetes at 50-100 nodes costs $4.5M-$5.25M (platform team + compute + tools + training) versus PaaS at $775K-$825K (5-6x cheaper), but Kubernetes wins at 500+ nodes where PaaS per-resource costs become prohibitive
  • 37signals' cloud repatriation saved $10M+ over five years by leaving AWS (EKS/EC2/S3) for on-prem infrastructure ($3.2M → $1.3M annually), proving cloud and Kubernetes aren't universally optimal—they're tools with specific use cases that require matching tool to actual scale, not aspirational scale
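The 5-6x gap in the TCO comparison above is just annual line items multiplied out over five years. A minimal sketch, using illustrative midpoint figures (the team, compute, and tooling numbers here are assumptions chosen to land inside the episode's quoted ranges, not data from the episode):

```python
# Hypothetical 5-year TCO sketch; all annual figures are illustrative
# midpoints inside the ranges quoted in the episode summary.

def five_year_tco(platform_team_annual, compute_annual, tools_training_annual, years=5):
    """Total cost of ownership over `years`, summing annual line items."""
    return years * (platform_team_annual + compute_annual + tools_training_annual)

# Kubernetes at 50-100 nodes: dedicated platform team dominates the bill
k8s = five_year_tco(platform_team_annual=600_000,
                    compute_annual=250_000,
                    tools_training_annual=100_000)

# PaaS for a comparable workload: no platform team, higher per-resource cost
paas = five_year_tco(platform_team_annual=0,
                     compute_annual=140_000,
                     tools_training_annual=20_000)

print(f"K8s 5yr: ${k8s:,}  PaaS 5yr: ${paas:,}  ratio: {k8s / paas:.1f}x")
```

With these inputs the ratio comes out around 5.9x, consistent with the "5-6x cheaper" claim; the crossover the episode describes happens when node count grows enough that PaaS per-resource pricing overtakes the fixed platform-team cost.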

Sunday Nov 16, 2025

Only 26% of organizations actively use SLOs after a decade of Google's SRE principles being gospel. We explore why adoption is so low despite 49% saying they're more relevant than ever, which principles remain timeless (error budgets, embracing risk, blameless postmortems), and how to adapt SRE for 2025's complexity of AI/ML systems, Platform Engineering collaboration, and multi-cloud chaos. Includes practical playbooks for starting from zero, fixing ignored SLOs, and ML-specific adaptations. The key insight: it's not that SRE principles are wrong—implementation is harder than anticipated, but the philosophy remains timeless when properly adapted.
In this episode:

  • Only 26% of organizations use SLOs despite 85% adopting OpenTelemetry—process transformation is harder than tooling, with unrealistic targets (99.99% = 52min/year downtime) undermining entire systems
  • Error budget fundamentals remain timeless: 99.999% SLO with 0.0002% problem = 20% quarterly budget spent, transforming reliability from political arguments into data-driven release decisions
  • Platform Engineering ($115K average) and SRE ($127K average) are complementary not competitive—Platform teams build systems, SREs ensure reliability, both can use error budget thinking for alignment
  • AI/ML systems need adapted SRE principles: data freshness SLOs, model drift detection, training pipeline reliability, and different error budget math (one LLM training failure = tens of thousands in compute loss)
  • Starting from zero: pick 3-5 critical services, one SLO per service initially (99.9% = 43 minutes/month downtime is reasonable), automate with OpenTelemetry from day one, get cross-functional buy-in, target 12-month timeline
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience looking to level up their reliability engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00025-sre-reliability-principles

Friday Nov 14, 2025

Your team spent 6 months implementing Backstage. Adoption? 8%. The CFO asks: "Why didn't we buy a solution?" Here's the 2025 comparison with real pricing, real timelines, and the counterintuitive truth: commercial platforms are 8-16x cheaper than "free" Backstage for most teams. OpsLevel $39/user/month delivers in 30-45 days. Port $78/user/month offers flexibility without coding. Cortex $65-69/month enforces standards. We break down the decision framework by team size—under 200? OpsLevel. 200-500? Port or OpsLevel. 500+? Backstage viable with dedicated platform team. The key insight: it's not open-source free vs commercial expensive—it's transparent licensing vs hidden $150K-per-20-developers engineering costs.
In this episode:

  • Backstage costs $150K per 20 developers in hidden engineering time—$1.5M annually for 200-person teams versus $93K-$187K for commercial platforms (8-16x cheaper)
  • OpsLevel ($39/user/month) delivers fastest implementation at 30-45 days with 60% efficiency gains and automated catalog maintenance—ideal for teams under 200 engineers
  • Port ($78/user/month) offers flexible "Blueprints" data model for customization without coding, 3-6 month implementation—best for 200-500 engineer teams needing flexibility
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience looking to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00024-internal-developer-portals-showdown

Thursday Nov 13, 2025

Why does a forty-year-old protocol keep taking down billion-dollar infrastructure? The October 2025 AWS outage lasted fifteen hours because of a DNS race condition. Kubernetes defaults create 5x query amplification. We investigate how DNS really works in modern platforms—CoreDNS plugin chains, the ndots:5 trap, GSLB failover—and deliver the five-layer defensive playbook to prevent your platform from becoming the next postmortem.
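The "5x query amplification" above comes from how a glibc-style resolver expands unqualified names against the search list when a name has fewer dots than ndots. A minimal simulation of that expansion, assuming a typical Kubernetes pod resolv.conf (the search domains here are illustrative):

```python
# Sketch of glibc-style search-list expansion behind the ndots:5 trap.
# Search domains mimic a typical Kubernetes pod resolv.conf (illustrative).

def candidate_names(name, search_domains, ndots=5):
    """Return the lookup order a resolver tries for `name`.

    Names with fewer than `ndots` dots are tried against every search
    domain before being tried as-is; a trailing dot skips expansion.
    """
    if name.endswith("."):            # fully qualified: one lookup only
        return [name]
    if name.count(".") >= ndots:      # "absolute enough": try as-is first
        return [name + "."] + [f"{name}.{d}." for d in search_domains]
    return [f"{name}.{d}." for d in search_domains] + [name + "."]

search = ["default.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "internal.example"]

attempts = candidate_names("api.example.com", search)
print(len(attempts))   # 5 lookups for what should be 1 -> 5x amplification
```

Because "api.example.com" has only two dots, every external lookup first burns a query per search domain (doubled again for A plus AAAA); appending a trailing dot, or lowering ndots, collapses it back to a single query.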
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00023-dns-platform-engineering
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!

Wednesday Nov 12, 2025

Your Kubernetes cluster is a black box—Prometheus shows symptoms, not causes. eBPF turns the Linux kernel into a programmable platform for observability, networking, and security. Jordan and Alex unpack how eBPF programs run safely in kernel space, explore tools like Cilium, Pixie, and Falco, and reveal the practical path from blind spots to kernel-level visibility without crashing production.
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00022-ebpf-kubernetes
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!

Time Series Language Models

Tuesday Nov 11, 2025

AI models that can read your infrastructure metrics like language, explain anomalies in plain English, and predict failures without training on your data—this technology exists right now. But even the companies who built it won't deploy it to production yet. Why? In this episode, Jordan and Alex unpack the mystery of Time-Series Language Models (TSLMs), explore breakthrough projects like Stanford's OpenTSLM and Datadog's Toto, and reveal what platform engineers need to do now to prepare for this revolutionary but not-yet-ready technology.
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00021-time-series-language-models
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!

Monday Nov 10, 2025

77% of organizations have adopted GitOps, 60% run ArgoCD—yet platform teams are still bottlenecks and deployments still take days. Jordan and Alex investigate why the promised outcomes aren't materializing and reveal the real differentiator: workflow design, not tool selection. Learn how companies like Qonto achieved 10x deployment improvements with the same technology.
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00020-kubernetes-iac-gitops
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!

Copyright 2025 All rights reserved.
