Platform Engineering Playbook Podcast

The Platform Engineering Playbook Podcast is where AI meets open-source infrastructure knowledge—and you're part of the editorial process. Every episode is researched, scripted, and produced with AI, then reviewed by the community and published on GitHub for anyone to improve. Facing tool sprawl across 130+ platforms? Justifying PaaS costs to your CFO? Navigating the Shadow AI crisis hitting 85% of organizations? We tackle the messy realities of platform engineering that most content avoids, delivering data-backed insights and decision frameworks you can use Monday morning. Built for senior engineers, SREs, and DevOps practitioners with 5+ years in production, we dissect cloud economics, AI governance, infrastructure trade-offs, and career strategy—with the receipts to back it up. Think we got something wrong? Have better data? Open a pull request at platformengineeringplaybook.com. This is infrastructure podcasting as a living document, where the community keeps us honest and the content gets better with every contribution.

Read the playbook at https://platformengineeringplaybook.com

Episodes

7 hours ago

AWS Route 53 Global Resolver - Enterprise DNS Security at the Edge

Every DNS query your hybrid environment makes could be exposing sensitive data. AWS Route 53 Global Resolver, announced at re:Invent 2025, combines anycast routing, encrypted DNS protocols (DoH/DoT), and managed threat filtering in a single service.
In this episode, we cover:
- Anycast DNS architecture routing to nearest of 11 AWS regions
- DoH and DoT encrypted DNS protocol support
- AWS RAM authorization for multi-account private hosted zones
- DNS filtering with managed threat lists
- Implementation patterns for hybrid environments and remote workforces
- Query logging for security visibility and threat hunting
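As a rough illustration of what DoH looks like on the wire, the sketch below builds a plain DNS query and encodes it as an RFC 8484 GET URL. The endpoint `https://resolver.example/dns-query` is a placeholder rather than a real Global Resolver address, the function names are made up for this example, and nothing here confirms which DoH request methods the AWS service accepts.

```python
import base64
import struct


def build_dns_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query message in wire format (RFC 1035)."""
    # Header: ID=0 (RFC 8484 suggests 0 so HTTP caches can reuse answers),
    # flags=0x0100 (recursion desired), QDCOUNT=1, all other counts 0.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in name.rstrip(".").split("."):
        if not 0 < len(label) <= 63:
            raise ValueError(f"invalid DNS label: {label!r}")
        # Each label is prefixed with its length.
        qname += bytes([len(label)]) + label.encode("ascii")
    qname += b"\x00"  # root label terminates the name
    # QTYPE (1 = A, 28 = AAAA) followed by QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(endpoint: str, name: str, qtype: int = 1) -> str:
    """Encode the query as the `dns` parameter of an RFC 8484 GET request."""
    encoded = base64.urlsafe_b64encode(build_dns_query(name, qtype))
    # RFC 8484 requires base64url with the trailing padding removed.
    return f"{endpoint}?dns={encoded.rstrip(b'=').decode('ascii')}"


# Hypothetical endpoint, used only to show the shape of the URL.
url = doh_get_url("https://resolver.example/dns-query", "example.com")
```

A real client would send this URL with an `Accept: application/dns-message` header over HTTPS, which is what keeps the query contents hidden from on-path observers.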
Plus news on Claude Code creator workflows, UK encryption backdoors, K8s EU hosting costs, PostgreSQL replacing Redis, and Rust ecosystem security.
Links:
- Episode page: https://playbook.platformengineering.org/podcasts/00088-aws-route-53-global-resolver
- AWS Route 53 Global Resolver docs: https://docs.aws.amazon.com/route53/latest/userguide/resolver-global-resolver.html
#AWS #Route53 #DNS #DoH #DoT #HybridCloud #Security #PlatformEngineering #DevOps

2 days ago

Kubernetes Upcoming Features Deep Dive - Extended Toleration Operators and Mutable PV Node Affinity

There's a Kubernetes cluster out there right now burning ten thousand dollars a month on GPU nodes that sit idle sixty percent of the time. Why? Because the scheduler can't say "only schedule pods on nodes with MORE than four GPUs." It's 2026, and our scheduler still can't count. But that's about to change.
In this episode, we dive deep into two alpha features in Kubernetes 1.35 that represent a fundamental shift in how Kubernetes handles scheduling and storage:
**Extended Toleration Operators (KEP-5471)** - Finally, numeric threshold-based scheduling with taints. New Gt (greater than) and Lt (less than) operators let you express "I can tolerate risk up to 5%" or "schedule me on nodes with more than 4 GPUs."
**Mutable PersistentVolume Node Affinity (KEP-5381)** - Storage topology that adapts to reality. When you migrate volumes between availability zones, you no longer need to recreate pods and PVs - just update the nodeAffinity.
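To make the Gt and Lt operators concrete, here is a minimal Python model of toleration matching. It is a sketch under stated assumptions, not the scheduler's implementation: taint effects are ignored, values are parsed as integers, and the comparison direction (the taint's value is compared against the toleration's threshold) is an assumption about the alpha semantics. The `gpu-count` key is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str  # "Equal", "Exists", "Gt", or "Lt"
    value: str = ""


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (effects ignored)."""
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    if toleration.operator in ("Gt", "Lt"):
        try:
            taint_n = int(taint.value)
            threshold = int(toleration.value)
        except ValueError:
            return False  # non-numeric values never match a numeric operator
        # ASSUMPTION: the taint's value is compared against the threshold.
        if toleration.operator == "Gt":
            return taint_n > threshold
        return taint_n < threshold
    return False


# Hypothetical taint advertising how many GPUs a node has.
big_node = Taint(key="gpu-count", value="8")
small_node = Taint(key="gpu-count", value="4")
wants_many_gpus = Toleration(key="gpu-count", operator="Gt", value="4")
```

With these values, `tolerates(wants_many_gpus, big_node)` is true and `tolerates(wants_many_gpus, small_node)` is false, because Gt is a strict comparison.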
Plus platform engineering news:
- OpenEverest: Percona's database platform goes open governance
- GKE Agent Sandbox: Kernel-level isolation for AI agent code execution
- MongoBleed (CVE-2025-14847): Critical vulnerability with 87,000 exposed servers
- Predictive capacity planning and the shift from reactive to proactive infrastructure
This is Kubernetes evolving from reactive feedback loops to truly predictive infrastructure.
Listen on the web: https://platformengineering.org/podcasts/00087-kubernetes-upcoming-features-deep-dive

3 days ago

Why Is a 2016 AWS Instance Still the Best Value? (Cloudspecs Research)

New research from TUM reveals uncomfortable truths about cloud hardware stagnation. The paper "Cloudspecs: Cloud Hardware Evolution Through the Looking Glass" shows that the best-performing AWS instance for NVMe I/O per dollar was released in 2016 - and nothing since has come close.
In this episode:
• CIDR 2026 research from Technical University of Munich
• AWS i3 instances from 2016 still beat all newer options for storage price-performance
• CPU gains: 10x cores, but only 2-3x cost-adjusted improvement
• Memory crisis: DRAM capacity per dollar has "effectively flatlined"
• Network is the only bright spot: 10x improvement per dollar
• Interactive tool at cloudspecs.fyi using DuckDB-WASM
News segment covers AI coding tool challenges, Kubernetes updates (Dashboard archived, CoreDNS 1.14), Windows Secure Boot certificate expiration, AWS Lambda .NET 10, Amazon MQ mTLS, MCP criticism, and NVIDIA Rubin announcement.
Episode page: https://platformengineering.org/podcasts/00086-cloudspecs-cloud-hardware-evolution
#PlatformEngineering #CloudComputing #AWS #FinOps #CostOptimization #DevOps

4 days ago

Iran IPv6 Blackout - When Governments Weaponize Protocol Transitions

The same IPv6 transition your infrastructure team has been procrastinating on is now being weaponized by governments. On January 8, 2026, Iran's IPv6 address space dropped 98.5% while IPv4 remained intact—a surgical strike against mobile users.
In this episode, we break down:
- Why blocking IPv6 specifically targets mobile users (hint: carrier NAT exhaustion)
- The BGP mechanics of protocol-specific blocking
- "Engineered degradation" vs total blackout: the new censorship playbook
- How Starlink terminals are changing the calculus for authoritarian internet control
- What platform engineers need to know: protocol-specific monitoring, Happy Eyeballs testing, dual-stack resilience
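Protocol-specific monitoring matters because a Happy Eyeballs client (RFC 8305) quietly falls back to IPv4 when IPv6 fails, so an ordinary dual-stack health check can stay green through an IPv6-only outage. The sketch below probes each address family separately. The function names and status labels are illustrative choices, not taken from the episode.

```python
import socket


def probe(host: str, port: int, family: socket.AddressFamily,
          timeout: float = 3.0) -> bool:
    """Return True if a TCP connection succeeds over the given address family."""
    try:
        # Restrict resolution to one family so there is no silent fallback.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no A (or AAAA) record, or resolution failed
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address
    return False


def classify(ipv4_ok: bool, ipv6_ok: bool) -> str:
    """Turn per-family probe results into a status a dashboard can alert on."""
    if ipv4_ok and ipv6_ok:
        return "healthy"
    if ipv4_ok:
        return "ipv6-degraded"  # the pattern described for the Iran event
    if ipv6_ok:
        return "ipv4-degraded"
    return "down"


def check(host: str, port: int = 443) -> str:
    """Probe both families independently and classify the result."""
    return classify(
        probe(host, port, socket.AF_INET),
        probe(host, port, socket.AF_INET6),
    )
```

Running `check("example.com")` from several vantage points, and alerting on `ipv6-degraded` as its own condition, is one way to notice a protocol-specific failure that end users on dual-stack clients would not report.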
Plus news: Kubernetes 1.35 CSI SA tokens, HashiCorp non-human identity, CoreDNS 1.14.0, OpenTelemetry Slack analysis, AWS Route 53 Global Resolver, and kernel bug hide times.
Links:
- Episode page: https://platformengineering.org/podcasts/00085-iran-ipv6-blackout
- Cloudflare Radar Iran: https://radar.cloudflare.com/ir
- RFC 8305 Happy Eyeballs: https://datatracker.ietf.org/doc/html/rfc8305

4 days ago

Venezuela BGP Anomaly - Deep Technical Analysis

A deep technical dive into the January 2026 Venezuela BGP route leak incident. Was it a cyberattack? The technical evidence says no - and that's actually more concerning.
In this special deep-dive episode (no news segment), Jordan and Alex break down:
- What actually happened on January 2, 2026 with AS8048 (CANTV, Venezuela's state ISP)
- Why 10x AS-path prepending proves this was misconfiguration, not a man-in-the-middle attack
- How BGP valley-free routing works and why Type 1 Hairpin leaks happen
- The pattern of 11 similar leaks from CANTV since December 2025
- Why your multi-region deployment doesn't protect you from BGP anomalies
- RPKI, RFC 9234 OTC, and ASPA - the defenses that exist and why adoption is slow
- Practical steps: Check your providers at isbgpsafeyet.com, deploy ROAs, add BGP monitoring
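The prepending argument rests on how BGP picks routes: all else being equal, a shorter AS path wins, so repeating your own AS number makes a route less attractive. That is the opposite of what an interception attempt would want. A small sketch of how a monitoring script might measure prepending in an observed path follows. The path is invented for illustration: only AS8048 comes from the episode, and the neighbouring ASNs are from the range reserved for documentation.

```python
from itertools import groupby


def prepend_runs(as_path: list[int]) -> list[tuple[int, int]]:
    """Collapse an AS path into (asn, consecutive_repeat_count) pairs."""
    return [(asn, sum(1 for _ in group)) for asn, group in groupby(as_path)]


def longest_prepend(as_path: list[int]) -> tuple[int, int]:
    """Return the (asn, count) pair with the longest consecutive run."""
    return max(prepend_runs(as_path), key=lambda run: run[1])


def distinct_hops(as_path: list[int]) -> int:
    """Number of AS hops once prepending is collapsed."""
    return len(prepend_runs(as_path))


# Hypothetical path: AS8048 repeated ten times between two documentation ASNs.
observed = [64500] + [8048] * 10 + [64501]

asn, count = longest_prepend(observed)
# BGP compares the full length (12 here), not the 3 distinct hops,
# so this route loses to almost any alternative path.
path_length = len(observed)
```

A monitor could flag any announcement where the count for a single ASN exceeds a chosen threshold, which is a cheap signal to pair with RPKI validation.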
The internet's most critical routing protocol was designed in 1989 when ~160 networks trusted each other. Now 75,000+ autonomous systems operate on that same trust model. Understanding BGP isn't just for network engineers anymore - it's essential context for anyone building on the internet.
Full episode page with transcript and sources: https://platformengineeringplaybook.com/podcasts/00084-venezuela-bgp-anomaly-technical-analysis
#BGP #NetworkSecurity #PlatformEngineering #InternetRouting #RPKI #Kubernetes #DevOps #SRE

5 days ago

HolmesGPT: AI Root Cause Analysis for Kubernetes

Deep dive into HolmesGPT, the CNCF Sandbox AI agent that revolutionizes cloud-native troubleshooting. This episode covers what it is, its 40+ integrations, the project roadmap, and how to set it up today.
News Segment:
AirFrance-KLM's secure automation platform with Terraform, Vault, and Ansible
AWS ECS tmpfs mounts on Fargate for secure secrets handling
Qwen 30B running on Raspberry Pi - democratizing edge AI
AWS European Sovereign Cloud with independent EU governance
Main Topic - HolmesGPT:
CNCF Sandbox project (accepted October 2025) with 1,600+ GitHub stars
Agentic architecture: creates investigation task lists, queries systems, synthesizes findings
40+ built-in toolsets: Prometheus, Grafana Loki/Tempo, Kubernetes, ArgoCD, DataDog, and more
Privacy-first: bring your own LLM keys, read-only access, respects RBAC
End-to-end automation with AlertManager, PagerDuty, OpsGenie integration
Installation options: pip, Homebrew, Helm, Web UI, K9s plugin
Resources:
HolmesGPT GitHub
HolmesGPT Documentation
Full Transcript
Episode Type: full
Episode Number: 83
Season: 1
Tags: HolmesGPT, CNCF, Kubernetes, root cause analysis, AI ops, troubleshooting, observability, SRE, platform engineering, Robusta, agentic AI

6 days ago

Docker Kanvas: Infrastructure as Design

Docker just launched Kanvas, a visual tool that turns your architecture diagrams into deployable infrastructure. Built on Meshery (CNCF's 6th highest-velocity project), it converts Docker Compose files to Kubernetes manifests and challenges Helm and Kustomize dominance.
In this episode, we explore:
- The dev-to-prod gap that Kanvas solves
- How Meshery Models add semantic understanding to infrastructure
- Designer Mode vs Operator Mode capabilities
- When to use Helm vs Kustomize vs Kanvas
- Practical adoption strategies for platform teams
Whether you're struggling with YAML hell or looking to lower cognitive load for developers, this episode gives you the full technical breakdown.
Full transcript: https://platformengineeringplaybook.io/podcasts/00082-docker-kanvas-infrastructure-as-design
#PlatformEngineering #Kubernetes #Docker #DevOps #CloudNative #Kanvas #Meshery #CNCF

7 days ago

Remote MCP Architecture - Running AI Tool Servers on Kubernetes

The MCP server registry hit 10,000+ integrations, but most teams are running these servers on laptops. This episode breaks down the production architecture that Google, Red Hat, and AWS are converging on: remote MCP servers deployed on Kubernetes. We cover three deployment patterns (local stdio, remote HTTP/SSE, and managed), the critical difference between wrapper-based and native API implementations, and a defense-in-depth security model using dedicated ServiceAccounts, time-bound tokens, RBAC, and audit logging.
In this episode:
- Remote MCP is production MCP: local stdio mode is for experimentation only; team-scale access requires HTTP/SSE mode
- Native API implementations (like Red Hat's Go-based server) outperform wrapper-based kubectl subprocess approaches
- Defense-in-depth security: dedicated ServiceAccounts, TokenRequest API for 2-hour tokens, RBAC, --read-only mode, audit logging
- Google's managed MCP covers GKE, BigQuery, GCE; self-host for internal tools and custom workflows
- Q1: experiment with read-only MCP in dev cluster; Q2: adopt with proper governance; Q3: scale to production
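For the time-bound token piece, the Kubernetes TokenRequest API takes a small JSON body posted to the ServiceAccount's `token` subresource. The sketch below only builds that body and path with the standard library. The namespace `mcp`, the ServiceAccount `mcp-readonly`, and the audience `mcp-server` are hypothetical names, and the 10-minute lower bound reflects the API server's validation as commonly documented.

```python
import json

TWO_HOURS = 2 * 60 * 60  # 7200 seconds, matching the 2-hour tokens above
MIN_TTL = 10 * 60        # the API server rejects shorter lifetimes


def token_request_body(audiences: list[str], ttl_seconds: int = TWO_HOURS) -> dict:
    """Build a TokenRequest object for a short-lived ServiceAccount token."""
    if ttl_seconds < MIN_TTL:
        raise ValueError("expirationSeconds must be at least 600")
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenRequest",
        "spec": {
            # Audiences restrict which services will accept the token.
            "audiences": list(audiences),
            "expirationSeconds": ttl_seconds,
        },
    }


def token_request_path(namespace: str, service_account: str) -> str:
    """API path to POST the TokenRequest to."""
    return (
        f"/api/v1/namespaces/{namespace}"
        f"/serviceaccounts/{service_account}/token"
    )


# Hypothetical names for a dedicated, read-only MCP ServiceAccount.
path = token_request_path("mcp", "mcp-readonly")
body = token_request_body(["mcp-server"])
payload = json.dumps(body)
```

The same request can be made from the command line with `kubectl create token mcp-readonly --namespace mcp --duration 2h`. Either way the token expires on its own, so a leaked credential has a bounded useful life.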
Perfect for platform engineers, SREs, and DevOps engineers with 5+ years of experience who are evaluating MCP/AI infrastructure and looking to level up their platform engineering skills.
New episodes every week. Subscribe wherever you listen to stay current on platform engineering.
Episode URL: https://platformengineeringplaybook.com/podcasts/00081-remote-mcp-architecture-kubernetes
Duration: 27 minutes
Hosts: Alex and Jordan
Category: Technology
Subcategory: Software How-To
Keywords: tool, episode, Kubernetes, production, remote, running, servers, architecture

Monday Jan 05, 2026

AWS DevOps Agent - Promises vs Reality

AWS launched DevOps Agent at re:Invent 2025 as an "autonomous on-call engineer." But before you cancel your PagerDuty subscription, we separate marketing from mechanics.
NEWS THIS EPISODE:
• KubeCon Europe 2026: March 23-26 in Amsterdam, 224 sessions across 5 tracks
• Platform Engineering 2026 Predictions: Agentic infrastructure becomes standard
In this deep-dive episode, we cover:
WHAT IT PROMISES:
• Always-on AI that investigates incidents 24/7
• Automatic root cause analysis across logs, metrics, traces, and deployments
• Mitigation plan generation with step-by-step remediation
• Integration with CloudWatch, Datadog, Dynatrace, New Relic, Splunk
WHAT IT ACTUALLY DELIVERS:
• Agent Spaces architecture for scoped permissions and isolated environments
• Automatic topology building that discovered 42 resources in the demo
• Accurate diagnosis of an EKS ImagePullBackOff error in real testing
• MTTR improvement from 45 to 18 minutes when properly configured
THE CRITICAL LIMITATIONS:
• Cannot execute fixes - humans must approve and apply every action
• Gaps of more than 40 minutes between events break correlation
• Preview limits: 20 incident hours/month, US-East-1 only
• No SOC 2/ISO 27001 compliance yet
• GA pricing unknown - the "$600K question"
EVALUATION FRAMEWORK:
We provide a 5-question framework to decide if this fits your team, plus ideal vs wait-and-see scenarios based on your cloud footprint and incident volume.
Resources and full transcript: https://platformengineering.playbook.org/podcasts/00080-aws-devops-agent-autonomous-operations

Sunday Jan 04, 2026

AWS Graviton5: 192 Cores, 5x Cache - ARM Takes Over the Data Center

AWS doubled the core count on their flagship ARM processors with Graviton5—192 cores in a single socket, 5x L3 cache (180MB), and 3nm fabrication. We go deep on ARM vs x86 architecture, cache hierarchy latencies, NUMA elimination benefits, formal verification security proofs, and a complete migration framework with multi-arch CI/CD patterns. With 98% of top EC2 customers already on Graviton, the ARM tipping point is now.
Duration: ~22 minutes
This episode covers:
- 192-core single socket design eliminating NUMA overhead
- 180MB L3 cache enabling database working sets to fit entirely in cache
- Nitro Isolation Engine with formal verification (mathematical security proofs)
- Real customer results from Atlassian, Honeycomb, and SAP
- 4-question framework for evaluating ARM migration
- 5-point action plan for platform teams
- Regional availability considerations
News segment: State of Platform Engineering 2026 report shows platform engineering practices "shifting down" to mid-market companies.
Episode page with full transcript and resources: https://platformengineering.org/podcasts/00079-aws-graviton5-arm-data-center