Platform Engineering Playbook Podcast

The Platform Engineering Playbook Podcast is where AI meets open-source infrastructure knowledge—and you're part of the editorial process. Every episode is researched, scripted, and produced with AI, then reviewed by the community and published on GitHub for anyone to improve. Facing tool sprawl across 130+ platforms? Justifying PaaS costs to your CFO? Navigating the Shadow AI crisis hitting 85% of organizations? We tackle the messy realities of platform engineering that most content avoids, delivering data-backed insights and decision frameworks you can use Monday morning. Built for senior engineers, SREs, and DevOps practitioners with 5+ years in production, we dissect cloud economics, AI governance, infrastructure trade-offs, and career strategy—with the receipts to back it up. Think we got something wrong? Have better data? Open a pull request at platformengineeringplaybook.com. This is infrastructure podcasting as a living document, where the community keeps us honest and the content gets better with every contribution.

Read the playbook at https://platformengineeringplaybook.com

Listen on:

  • Apple Podcasts
  • YouTube
  • Podbean App
  • Spotify
  • Amazon Music

Episodes

Friday Mar 06, 2026

**87% of production Ansible playbooks have critical flaws - but AI just revealed how to fix them.**
Today's Platform Engineering Playbook dives deep into how AI is revolutionizing infrastructure automation and Ansible development. We'll explore groundbreaking research showing most production playbooks lack proper error handling, and how collaborative AI approaches are changing the game for platform engineers.
**What You'll Learn:**
• Why most Ansible deployments are more fragile than you think
• How to leverage AI to identify and fix critical infrastructure code issues
• Real-world case studies of AI-assisted Ansible improvement
• Latest developments in route optimization algorithms (RADAR)
• Pulumi's massive 20x performance improvements now in GA
• AWS Lambda's new Kiro power for durable functions

**Timestamps:**
0:00 Cold Open - The Ansible Crisis
2:15 Today's Platform Engineering News
8:30 Deep Dive: AI + Ansible Collaboration
Whether you're managing infrastructure at scale or just starting your platform engineering journey, this episode delivers actionable insights you can implement immediately. Learn how top engineering teams are using AI not to replace their expertise, but to amplify it.
**Sources & References:**
• How to collaborate with AI to improve your Ansible skills: https://developers.redhat.com/articles/2026/03/04/how-collaborate-ai-improve-your-ansible-skills
• RADAR: Learning to Route with Asymmetry-aware DistAnce Representations: https://arxiv.org/abs/2603.03388
• Now GA: Up to 20x Faster Pulumi Operations for Everyone: https://www.pulumi.com/blog/journaling-ga/
• Accelerate Lambda durable functions development with new Kiro power: https://aws.amazon.com/about-aws/whats-new/2026/03/lambda-durable-kiro-power/
• How we would have managed a recent incident at Port with an incident agent: https://www.port.io/blog/how-we-would-have-managed-a-recent-incident-at-port-with-an-incident-agent
• Scaling AI opportunity across the globe: Learnings from GitHub and Andela: https://github.blog/developer-skills/career-growth/scaling-ai-opportunity-across-the-globe-learnings-from-github-and-andela/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Thursday Mar 05, 2026

**GrafanaCON 2026 just dropped their agenda, and every attendee will build an AI agent from scratch on day one. What does this tell us about the future of platform engineering?**
In today's Platform Engineering Playbook, we dissect the GrafanaCON 2026 agenda to uncover what it reveals about emerging trends in observability and platform tooling. We analyze why hands-on AI workshops are becoming conference staples and what this means for platform teams in 2026.
**What You'll Learn:**
• How GrafanaCON's AI-first approach signals industry shifts
• Strategic insights for platform teams from the conference agenda
• Hidden cloud costs exposed by AWS's Well-Architected Framework
• Release platform migration strategies that actually work
• Why traditional ITOps fails with AI incident management

**Timestamps:**
00:00 Cold Open - GrafanaCON's AI Agent Challenge
02:15 Today's Platform Engineering News
08:30 Deep Dive: GrafanaCON 2026 Agenda Analysis
Whether you're planning conference attendance or building your 2026 platform strategy, this episode breaks down the signals that matter for platform engineering leaders.
**Sources & References:**
• GrafanaCON 2026 agenda: https://grafana.com/blog/grafanacon-2026-agenda/
• AWS Hidden Cloud Costs: https://aws.amazon.com/blogs/architecture/the-hidden-price-tag-uncovering-hidden-costs-in-cloud-architectures-with-the-aws-well-architected-framework/
• Release Platform Migration Strategy: https://launchdarkly.com/blog/release-platform-migration/
• Datadog Synthetic Monitoring: https://www.datadoghq.com/blog/simplifying-troubleshooting-with-synthetic-monitoring/
• AI Incident Management Evolution: https://thenewstack.io/ai-incident-management-evolution/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Wednesday Mar 04, 2026

What if your observability stack could debug and fix production issues while you sleep? That future might be closer than you think.
In today's Platform Engineering Playbook, we explore the cutting edge of agentic AI in observability systems and break down the biggest platform engineering news shaping March 2026.
**🎯 WHAT YOU'LL LEARN:**
• How self-healing observability stacks are revolutionizing platform operations
• Whether AI agents can truly handle your system's edge cases
• Practical evaluation criteria for agentic observability tools
• Critical security updates from Datadog's OCI protection expansion
• Confluent's game-changing Kafka platform updates with A2A support

**⏰ TIMESTAMPS:**
0:00 Cold Open - The Future of Self-Debugging Systems
1:30 Today's Platform Engineering Headlines
8:45 Deep Dive: Agentic Observability - The Setup
15:20 Can AI Handle Your Edge Cases? - The Analysis

**💡 WHY LISTEN:**
Get actionable insights on emerging platform technologies and real-world implementation strategies, and stay ahead of industry trends that will impact your infrastructure decisions.
Perfect for platform engineers, SREs, and DevOps professionals navigating the evolving landscape of autonomous systems.
**Sources & References:**
• https://grafana.com/blog/the-rise-of-agentic-ai-in-production-can-observability-systems-run-themselves/
• https://www.datadoghq.com/blog/cloud-security-oci/
• https://thenewstack.io/confluent-kafka-a2a-agents/
• https://npmx.dev/blog/alpha-release
• https://blog.cloudflare.com/bootstrap-mtc/
• https://www.bbc.com/news/articles/cgk28nj0lrjo
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Tuesday Mar 03, 2026

What happens when a major AI platform goes dark while secretly pursuing billion-dollar government contracts? Claude's massive outage reveals critical lessons about platform engineering resilience that every infrastructure team needs to understand.
In today's Platform Engineering Playbook, we dissect Anthropic's Claude outage and uncover the hidden platform engineering challenges of serving classified government workloads. You'll discover why traditional cloud architectures fail when security requirements demand air-gapped infrastructure, and learn a practical framework for building "architectural resilience" into your own platforms.
**What You'll Learn:**
• How to architect platforms for multiple security classifications
• The real cost of government compliance on platform design
• Pulumi's game-changing self-hosted Insights for infrastructure visibility
• AWS Lambda runtime automation strategies that actually work
• Why Cloudflare's markdown support signals a major shift in web architecture

**Timestamps:**
0:00 - Cold Open: Claude's Billion-Dollar Secret
2:15 - Today's Platform Engineering News
8:30 - Deep Dive: The Hidden Cost of Classified Computing
15:45 - Framework: Building Architectural Resilience
Whether you're scaling startup infrastructure or designing enterprise platforms, this episode delivers actionable insights you can implement immediately.
**Sources & References:**
- Anthropic's Claude outage: https://techcrunch.com/2026/03/02/anthropics-claude-reports-widespread-outage/
- Pulumi self-hosted Insights: https://www.pulumi.com/blog/self-hosted-insights/
- AWS Lambda runtime automation: https://aws.amazon.com/blogs/devops/automate-aws-lambda-runtime-upgrades-with-aws-transform-custom/
- Ansible AWS updates: https://developers.redhat.com/articles/2026/03/02/whats-new-ansible-certified-content-collection-aws
- Cloudflare markdown evolution: https://thenewstack.io/intent-engineering-ai-agents/
- Kubernetes DevOps patterns: https://feeds.dzone.com/link/23568/17287898/kubernetes-for-devops-engineers-mastering-modern
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Monday Mar 02, 2026

**What if Spotify's secret weapon for managing 2,800 microservices could transform your entire platform engineering strategy?**
Today's Platform Engineering Playbook dives deep into the Backstage revolution that's quietly reshaping how engineering teams operate at scale. We break down what a production-grade Backstage implementation actually looks like in 2026, complete with real-world examples and concrete takeaways for your team.
**What You'll Learn:**
• How Spotify's internal developer portal handles massive microservice complexity
• Production-grade Backstage implementation strategies and best practices
• Critical MySQL 9.6 changes affecting foreign key constraints and cascade handling
• Bootc and OSTree's role in modernizing Linux system deployment
• The latest developments in AI company military partnerships

**Episode Timestamps:**
0:00 - Cold Open: Spotify's Backstage Breakthrough
2:15 - Platform Engineering News Roundup
8:30 - Deep Dive Act 1: The Backstage Setup Revolution
Whether you're considering Backstage adoption or optimizing your current platform engineering stack, this episode delivers the tactical insights you need to level up your developer experience.
**Sources & References:**
• KubeCon + CloudNativeCon Europe 2026 BackstageCon: https://www.cncf.io/blog/2026/02/27/kubecon-cloudnativecon-europe-2026-co-located-event-deep-dive-backstagecon/
• MySQL 9.6 Foreign Key Changes: https://www.infoq.com/news/2026/02/mysql-foreign-keys/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global
• Bootc and OSTree Guide: https://a-cup-of.coffee/blog/ostree-bootc/
• AI Military Partnerships Update: https://www.businessinsider.com/anthropic-deal-pentagon-openai-sam-altman-dario-amodei-pete-hegseth-2026-2
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Friday Feb 27, 2026

**70% of Kubernetes clusters will go dark in March 2026 when ingress-nginx support officially ends. Are you ready?**
Today's Platform Engineering Playbook dives deep into the massive ingress-nginx migration that's about to impact millions of Kubernetes workloads. We'll break down your migration options, timeline, and practical steps to avoid the chaos.
**What You'll Learn:**
✅ Why ingress-nginx is ending support and what it means for your clusters
✅ Complete migration strategies from early adopter teams
✅ Step-by-step playbook for platform engineering teams
✅ Alternative ingress controllers and their trade-offs

**Episode Chapters:**
0:00 Cold Open - The ingress-nginx crisis
2:30 Welcome & Today's Platform Engineering News
5:15 Deep Dive: The ingress-nginx EOL situation
12:45 Migration analysis and real-world experiences
**Plus:** Pulumi's distributed work scheduling system architecture, observability platform migration strategies with Prometheus and OpenTelemetry, Kubernetes AI inference updates, and SRE database connectivity troubleshooting frameworks.
Perfect for platform engineers, DevOps teams, and anyone managing Kubernetes infrastructure at scale.
**Sources & References:**
- The End of kubernetes/ingress-nginx: Your March 2026 Migration Playbook: https://medium.com/@housemd/kubernetes-ingress-nginx-eol-march-2026-the-complete-migration-guide-to-replace-ingress-nginx-e8f6e118fb5f
- How We Built a Distributed Work Scheduling System for Pulumi Cloud: https://www.pulumi.com/blog/how-we-built-a-distributed-work-scheduling-system-for-pulumi-cloud/
- Observability platform migration guide: Prometheus, OpenTelemetry, and Fluent Bit: https://thenewstack.io/observability-platform-migration-guide/
- Kubernetes WG Serving concludes following successful advancement of AI inference support: https://www.cncf.io/blog/2026/02/26/kubernetes-wg-serving-concludes-following-successful-advancement-of-ai-inference-support/
- A Unified Framework for SRE to Troubleshoot Database Connectivity in Kubernetes Cloud Applications: https://feeds.dzone.com/link/23568/17283905/sre-database-connectivity-troubleshooting-kubernetes
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Thursday Feb 26, 2026

**What if 87% of developer productivity loss just became a thing of the past?** 
Anthropic's Claude Computer Use capability is reshaping how platform engineers think about developer workflows, and today we're breaking down exactly what this means for your platform strategy.
**In this episode:**
• **Deep dive into Claude's Computer Use** - How remote control capabilities are eliminating context switching between development environments
• **Technical analysis** - Session management, security implications, and integration patterns for platform teams
• **Practical evaluation framework** - Should your platform team adopt Claude Code? We'll give you the decision matrix
• **Platform engineering news roundup** - Self-service observability with OpenTelemetry, hidden costs of "automated" infrastructure, and real-world IT scaling challenges

**Timestamps:**
0:00 - Cold Open: The Context Switching Crisis
2:15 - Today's Platform Engineering Headlines
8:30 - Deep Dive: Claude Computer Use Breakdown
Whether you're architecting developer platforms or evaluating AI tooling for your engineering org, this episode delivers actionable insights you can implement immediately.
**Sources & References:**
• Claude Code Remote Control: https://code.claude.com/docs/en/remote-control
• Self-service observability guide: https://platformengineering.org/blog/self-service-observability
• Infrastructure hidden costs analysis: https://thenewstack.io/automated-infrastructure-hidden-costs/
• IT scaling discussion: https://www.reddit.com/r/sysadmin/comments/1redz97/2man_it_team_solo_admin_for_300_users_no_raise/
• Data sovereignty policy update: https://techcrunch.com/2026/02/25/us-tells-diplomats-to-lobby-against-foreign-data-sovereignty-laws/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Wednesday Feb 25, 2026

**Is PostgreSQL really obsolete for AI workloads?** Databricks just dropped Lakebase and it's shaking up everything we thought we knew about database architecture for machine learning pipelines.
In today's Platform Engineering Playbook, we're diving deep into Databricks' game-changing announcement and what it means for your data infrastructure strategy. Plus, we're covering the week's biggest platform engineering news that's reshaping how we build scalable systems.
**What You'll Learn:**
• Why Databricks believes traditional PostgreSQL falls short for AI workloads
• Technical breakdown of Lakebase architecture and its key innovations
• Practical decision framework: when to adopt Lakebase vs. stick with existing solutions
• AWS expands Elemental Media Services to Malaysia
• Elastic Cloud Serverless doubles Azure region availability
• Hybrid Kubernetes strategies for enterprise-scale deployments
• OpenTelemetry's 2025 achievements and 2026 roadmap

**Timestamps:**
0:00 Cold Open - PostgreSQL vs AI Reality Check
2:15 Databricks Lakebase Deep Dive
15:30 Platform Engineering News Roundup
Whether you're architecting data platforms, evaluating database solutions for ML workloads, or staying current with cloud-native trends, this episode delivers actionable insights you can implement immediately.
**Sources & References:**
• https://www.infoq.com/news/2026/02/databricks-lakebase-postgresql/
• https://aws.amazon.com/about-aws/whats-new/2026/02/elemental-Malaysia/
• https://www.elastic.co/blog/elastic-cloud-now-available-azure-virginia-singapore-spain-frankfurt
• https://aws.amazon.com/blogs/containers/running-containerized-hybrid-nodes-with-amazon-elastic-kubernetes-service/
• https://cloudnativenow.com/contributed-content/hybrid-cloud-at-enterprise-scale-private-kubernetes-for-portability-and-control/
• https://opentelemetry.io/blog/2026/2025-year-in-review/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Tuesday Feb 24, 2026

**Your AI agents have root access to your infrastructure right now - and you don't even know it.**
What happens when we give AI agents the keys to our entire platform? In today's Platform Engineering Playbook, we dive deep into the hidden security risks of AI infrastructure automation and explore practical solutions for implementing least-privilege access controls.
**What You'll Learn:**
• How to secure AI agents with least-privilege gateway patterns using MCP and OPA
• Databricks' new Lakebase PostgreSQL database designed specifically for AI workloads
• Uber's Uforwarder: A scalable Kafka consumer proxy revolutionizing event-driven microservices
• Why Kubernetes 1.35 signals the future of AI orchestration
• Latest AWS updates including Claude Sonnet 4.6 in Bedrock and new agent plugins

**Timestamps:**
0:00 - Cold Open: The AI Security Wake-Up Call
2:15 - Platform Engineering News Roundup
8:30 - Deep Dive: Securing AI Infrastructure Access
15:45 - Real-World Implementation Strategies
Perfect for platform engineers, DevOps professionals, and infrastructure teams navigating the intersection of AI and cloud-native technologies. Get actionable insights to secure your AI-driven infrastructure before it's too late.
**Sources & References:**
- Building a Least-Privilege AI Agent Gateway: https://www.infoq.com/articles/building-ai-agent-gateway-mcp/
- Databricks Lakebase PostgreSQL: https://www.infoq.com/news/2026/02/databricks-lakebase-postgresql/
- KubeCon SecurityCon Deep Dive: https://www.cncf.io/blog/2026/02/23/kubecon-cloudnativecon-europe-2026-co-located-event-deep-dive-open-source-securitycon/
- Uber's Uforwarder: https://www.infoq.com/news/2026/02/uber-uforwarder-kafka-push-proxy/
- AWS Weekly Roundup: https://aws.amazon.com/blogs/aws/aws-weekly-roundup-claude-sonnet-4-6-in-amazon-bedrock-kiro-in-govcloud-regions-new-agent-plugins-and-more-february-23-2026/
- Kubernetes 1.35 AI Signals: https://www.cncf.io/blog/2026/02/23/kubernetes-as-ais-operating-system-1-35-release-signals/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Monday Feb 23, 2026

**What happens when a single configuration change takes down 20% of the internet for six hours?**
In this episode of Platform Engineering Playbook, we dissect the massive Cloudflare outage from February 20th, 2026 - a catastrophic failure that started with a routine BYOIP pipeline update and ended with Cloudflare accidentally deleting their own customers' networks.
**What You'll Learn:**
• The technical breakdown of how Cloudflare's configuration change cascaded into a global outage
• Critical lessons for platform engineers about configuration management and deployment pipelines
• Real-world AI use cases that are actually working in production environments
• Infrastructure gaps that are secretly sabotaging AI productivity initiatives
• HTTP/3 implementation strategies using nginx and FreeBSD

**Episode Timestamps:**
0:00 - Cold Open: The 30-minute warning
2:30 - Today's Platform Engineering News
8:15 - Deep Dive Act 1: What Really Happened at Cloudflare
Whether you're building resilient infrastructure or implementing AI tooling, this episode delivers actionable insights to help you avoid similar disasters and build more robust platform engineering practices.
**Sources & References:**
- Cloudflare outage on February 20, 2026: https://blog.cloudflare.com/cloudflare-outage-february-20-2026/
- What's your best use case for AI in your company so far?: https://www.reddit.com/r/sysadmin/comments/1rasadb/whats_your_best_use_case_for_ai_in_your_company/
- This simple infrastructure gap is holding back AI productivity: https://thenewstack.io/this-simple-infrastructure-gap-is-holding-back-ai-productivity/
- HTTP/3 on FreeBSD: Getting QUIC Working with nginx in a Bastille Jail: https://blog.hofstede.it/http3-on-freebsd-getting-quic-working-with-nginx-in-a-bastille-jail/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Copyright 2025 All rights reserved.
