Platform Engineering Playbook Podcast
The Platform Engineering Playbook Podcast is where AI meets open-source infrastructure knowledge—and you're part of the editorial process. Every episode is researched, scripted, and produced with AI, then reviewed by the community and published on GitHub for anyone to improve. Facing tool sprawl across 130+ platforms? Justifying PaaS costs to your CFO? Navigating the Shadow AI crisis hitting 85% of organizations? We tackle the messy realities of platform engineering that most content avoids, delivering data-backed insights and decision frameworks you can use Monday morning. Built for senior engineers, SREs, and DevOps practitioners with 5+ years in production, we dissect cloud economics, AI governance, infrastructure trade-offs, and career strategy—with the receipts to back it up. Think we got something wrong? Have better data? Open a pull request at platformengineeringplaybook.com. This is infrastructure podcasting as a living document, where the community keeps us honest and the content gets better with every contribution.
Read the playbook at https://platformengineeringplaybook.com
Episodes

14 hours ago
**87% of AI workloads are sitting idle on GPUs right now** - yet companies keep buying more hardware. What if the problem isn't capacity, but how we're running AI on Kubernetes?
In today's Platform Engineering Playbook, we tackle the massive inefficiencies plaguing AI infrastructure at scale. You'll discover why traditional Kubernetes patterns break down with AI workloads, what's actually happening under the hood when you try to serve ML models in production, and concrete strategies to fix GPU utilization without throwing more money at the problem.
**What You'll Learn:**
• Why current Kubernetes-native AI patterns are failing at scale
• The hidden bottlenecks destroying your GPU efficiency
• Runtime security developments from Grafana Labs and Miggo
• Amazon ECR's new pull-through cache support for Chainguard
• How to evolve from Kubernetes Gatekeeper to full-stack governance with OPA
**Timestamps:**
0:00 Cold Open - The AI Infrastructure Crisis
2:15 Today's Platform Engineering News
8:30 Deep Dive: Kubernetes + AI at Scale
15:45 Under the Hood Analysis
22:10 Actionable Takeaways
Whether you're scaling AI workloads or just trying to understand why your GPU bills keep growing while performance stays flat, this episode gives you the platform engineering perspective you need.
**Sources & References:**
• Building Kubernetes-native AI infrastructure: https://thenewstack.io/kubernetes-native-ai-infrastructure/
• Grafana Cloud and Miggo runtime protection: https://grafana.com/blog/grafana-cloud-and-miggo-for-runtime-protection/
• Amazon ECR Chainguard support: https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-ecr-pull-through-cache-chainguard/
• AWS Cloud 20 years retrospective: https://aws.amazon.com/blogs/aws/20-years-in-the-aws-cloud-how-time-flies/
• LLM Compressor v0.10: https://developers.redhat.com/articles/2026/03/18/llm-compressor-010-faster-compression-distributed-gptq
• Kubernetes Gatekeeper to OPA governance: https://www.pulumi.com/blog/kubernetes-gatekeeper-full-stack-governance-opa/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

2 days ago
**Are 73% of Kubernetes clusters really flying blind?** According to recent industry reports, most K8s deployments are drowning in meaningless metrics while missing the signals that actually matter for performance and cost optimization.
In today's Platform Engineering Playbook, we tackle the Kubernetes observability crisis head-on. You'll discover why traditional monitoring approaches are failing platform teams and learn actionable strategies to build metrics that drive real business value.
**What You'll Learn:**
• Why most K8s metrics collection strategies are fundamentally broken
• How to identify and implement performance indicators that actually matter
• Practical frameworks for establishing effective observability in your clusters
• Real-world approaches to turning metrics into cost savings and performance gains
**Episode Breakdown:**
00:00 - Cold Open: The K8s Observability Crisis
02:30 - Industry News Roundup
08:45 - Deep Dive: Fixing Kubernetes Metrics (Part 1)
**Today's News:** Container security innovations from Chainguard, Grafana's new cost optimization tools, custom metrics scaling strategies, and the latest observability trends including AI integration challenges.
Perfect for platform engineers, DevOps teams, and engineering leaders looking to move beyond vanity metrics to actionable observability.
**Sources & References:**
- CNCF Kubernetes Metrics Best Practices: https://www.cncf.io/blog/2026/03/18/understanding-kubernetes-metrics-best-practices-for-effective-monitoring/
- Grafana Cost Optimization Guide: https://grafana.com/blog/from-signals-to-savings-optimizing-cloud-costs-with-grafana-assistant-and-mcp-servers/
- Chainguard Container Security Analysis: https://thenewstack.io/chainguard-os-packages-containers/
- Datadog Custom Metrics Scaling: https://www.datadoghq.com/blog/autoscaling-custom-metrics/
- Grafana Observability Standards Report: https://grafana.com/blog/observability-survey-OSS-open-standards-2026/
- AI in Observability Survey: https://grafana.com/blog/observability-survey-AI-2026/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

3 days ago
**87% of enterprise AI deployments have a critical security vulnerability that red teams aren't even testing for.** Are you one of them?
In today's Platform Engineering Playbook, we expose the massive security hole plaguing enterprise AI systems and dive deep into prompt injection attacks that are slipping past traditional security measures. Plus, we cover the latest platform engineering news that's reshaping how enterprises build and deploy.
**What You'll Learn:**
• The hidden AI security vulnerability affecting nearly 9 out of 10 enterprise deployments
• Step-by-step breakdown of how prompt injection attacks work in production
• Actionable security strategies for platform engineers deploying AI agents
• Microsoft's aggressive PostgreSQL push and what it means for your data strategy
• Cloudflare's evolution from legacy architecture to modern SASE solutions
**Timestamps:**
0:00 Cold Open - The 87% Problem
1:30 Introduction
3:00 Deep Dive: The AI Security Crisis
8:45 How Prompt Injection Attacks Actually Work
15:20 Platform Engineer Action Items
Whether you're currently deploying AI systems or planning your enterprise AI strategy, this episode delivers the security insights and platform engineering intelligence you need to stay ahead of emerging threats.
**Sources & References:**
• AI Security Research: https://thenewstack.io/red-teaming-enterprise-ai-agents/
• PostgreSQL on Azure: https://azure.microsoft.com/en-us/blog/from-legacy-to-leadership-how-postgresql-on-azure-powers-enterprise-agility-and-innovation/
• Cloudflare SASE Evolution: https://blog.cloudflare.com/legacy-to-agile-sase/
• AI Tooling Survey: https://newsletter.pragmaticengineer.com/i/189777574/2-most-used-ai-tools
• Azure DevOps MCP Server: https://devblogs.microsoft.com/devops/azure-devops-remote-mcp-server-public-preview/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

4 days ago
**Is your Kubernetes cluster blind to AI model poisoning attacks?** 73% of companies running AI workloads can't detect when their models are compromised - and traditional monitoring tools are completely useless against these threats.
In today's Platform Engineering Playbook, we dive deep into why AI workloads are breaking traditional Kubernetes observability strategies and what platform teams need to do about it. Plus, we cover the latest developments shaking up the cloud native ecosystem.
**What You'll Learn:**
✅ Why traditional Kubernetes monitoring fails with AI workloads
✅ How to detect AI model poisoning in production environments
✅ Critical AWS security vulnerabilities affecting managed services
✅ New authentication strategies for Kubernetes registry mirrors
✅ Latest developments from the cloud native community
**Timestamps:**
0:00 Cold Open - The AI observability crisis
1:30 Today's Platform Engineering News
8:45 Deep Dive: AI Workloads vs Traditional Monitoring
15:20 The Real-World Impact on Autoscaling
Whether you're running AI workloads today or planning for tomorrow, this episode gives you the strategies and tools to maintain visibility and security in your Kubernetes environments.
**Sources & References:**
- Why AI workloads are breaking traditional Kubernetes observability strategies: https://thenewstack.io/ai-kubernetes-observability-practices/
- AWS Launches Managed OpenClaw on Lightsail Amid Critical Security Vulnerabilities: https://www.infoq.com/news/2026/03/aws-lightsail-openclaw-security/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global
- LLM Architecture Gallery: https://sebastianraschka.com/llm-architecture-gallery/
- Cursor built a fleet of security agents to solve a familiar frustration: https://thenewstack.io/cursor-open-sources-security-agents/
- Registry Mirror Authentication with Kubernetes Secrets: https://www.cncf.io/blog/2026/03/16/registry-mirror-authentication-with-kubernetes-secrets-2/
- KubeCon + CloudNativeCon Europe 2026 Co-located Event Deep Dive: Open Sovereign Cloud Day: https://www.cncf.io/blog/2026/03/16/kubecon-cloudnativecon-europe-2026-co-located-event-deep-dive-open-sovereign-cloud-day/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

5 days ago
**87% of AI companies are burning cash on the wrong cloud infrastructure - and they have no idea.**
In this episode of Platform Engineering Playbook, we expose the costly mistakes plaguing AI infrastructure and reveal the framework that's helping platform teams save millions while scaling smarter.
**What You'll Learn:**
• The 6 categories of AI cloud infrastructure that matter in 2026
• How to transform inference from dedicated resources into efficient multi-tenant services
• A battle-tested evaluation framework from dozens of real-world AI platform implementations
• Critical security vulnerabilities in AWS's new Managed OpenClaw service that could impact your infrastructure
**Episode Breakdown:**
00:00 Cold Open - The 87% cash burn crisis
02:30 Today's Platform Engineering News
08:15 Deep Dive: AI Cloud Infrastructure Fundamentals
**Breaking News Covered:**
- AWS Lightsail OpenClaw security situation
- New LLM Architecture Gallery release
- MCP production roadmap updates
- Linux's game-changing performance breakthrough
Whether you're architecting AI platforms or optimizing existing infrastructure, this episode delivers actionable insights to help you avoid the expensive mistakes that are crushing 87% of AI companies.
**Sources & References:**
- AI Cloud Taxonomy 2026: https://thenewstack.io/ai-cloud-taxonomy-2026/
- AWS Lightsail OpenClaw Security: https://www.infoq.com/news/2026/03/aws-lightsail-openclaw-security/
- LLM Architecture Gallery: https://sebastianraschka.com/blog/2026/llm-architecture-gallery.html
- MCP Production Roadmap: https://thenewstack.io/model-context-protocol-roadmap-2026/
- Linux Performance Feature: https://www.iowaparkleader.com/linux-finally-catches-up-to-windows-with-a-game-changing-performance-feature/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Friday Mar 13, 2026
**Is your monitoring bill about to explode? AI-generated code is creating 10x more observability data than human-written code.**
In this deep dive episode of Platform Engineering Playbook, we unpack the hidden observability crisis that's quietly hitting DevOps teams everywhere. While AI accelerates development, it's also flooding your monitoring systems with unprecedented amounts of telemetry data.
**What You'll Learn:**
✅ Why AI-generated code produces exponentially more observability data
✅ How to manage exploding monitoring costs without losing visibility
✅ Practical strategies for optimizing telemetry in AI-heavy environments
✅ Real-world approaches to selective instrumentation and data sampling
**Episode Breakdown:**
0:00 - Cold Open: The 10x observability data problem
2:15 - Industry news roundup
8:30 - Deep Dive Act 1: Understanding the AI observability explosion
18:45 - Deep Dive Act 2: Technical analysis and root causes
**Today's News Coverage:**
• CNCF's new etcd debugging improvements for Kubernetes
• Uber's MySQL consensus architecture breakthrough
• Cloudflare's Account Abuse Protection launch
• GitLab Container Virtual Registry updates
Perfect for platform engineers, DevOps leads, and SREs dealing with modern observability challenges in AI-driven development environments.
**Sources & References:**
- https://devops.com/ai-is-forcing-devops-teams-to-rethink-observability-data-management/
- https://www.cncf.io/blog/2026/03/12/making-etcd-incidents-easier-to-debug-in-production-kubernetes/
- https://www.infoq.com/news/2026/03/uber-mysql-uptime-consensus/
- https://blog.cloudflare.com/account-abuse-protection/
- https://about.gitlab.com/blog/using-gitlab-container-virtual-registry-with-docker-hardened-images/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Thursday Mar 12, 2026
**Are you throwing money away on Kubernetes compute costs?** 87% of clusters waste up to half their resources on idle nodes - but there's a solution that's changing everything.
In today's Platform Engineering Playbook, we dive deep into **Karpenter**, the game-changing autoscaler that's revolutionizing how teams think about Kubernetes resource management. You'll discover why traditional cluster autoscaling falls short and how Karpenter's architecture solves real-world scaling challenges.
**What You'll Learn:**
✅ Why 87% of K8s clusters are bleeding money on unused compute
✅ Karpenter's under-the-hood architecture and decision-making process
✅ Practical evaluation framework for adopting Karpenter in your platform
✅ Latest platform engineering news from Microsoft Azure AI agents, KubeCon India 2026, and more
**Timestamps:**
0:00 - Cold Open: The Kubernetes Cost Crisis
2:15 - Today's Platform Engineering News
8:30 - Deep Dive: Karpenter vs Traditional Autoscaling
Perfect for platform engineers, DevOps teams, and cloud architects looking to optimize their Kubernetes infrastructure costs and performance.
**Sources & References:**
- Understanding Karpenter architecture: https://www.datadoghq.com/blog/karpenter-architecture/
- Microsoft Azure Skills Plugin: https://devops.com/microsoft-azure-skills-plugin-gives-ai-coding-agents-a-playbook-for-cloud-deployment/
- KubeCon India 2026 Schedule: https://www.cncf.io/announcements/2026/03/10/cncf-unveils-kubecon-cloudnativecon-india-2026-schedule/
- Cloudflare Security Insights: https://blog.cloudflare.com/attack-surface-intelligence/
- Monitor Karpenter with Datadog: https://www.datadoghq.com/blog/monitor-karpenter-datadog/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Wednesday Mar 11, 2026
**Why do 70% of AI projects crash and burn before they ever see production?** Spoiler alert: it's not the AI that's broken.
In today's Platform Engineering Playbook, we're diving deep into the AI infrastructure crisis that's keeping CTOs awake at night. While everyone's racing to deploy the latest AI models, most organizations are discovering their legacy systems simply can't handle the load.
**What You'll Learn:**
• The real reason AI projects fail (hint: it's your infrastructure)
• How to build a unified data fabric that actually works
• Which legacy systems are sabotaging your AI ambitions
• Practical strategies for modernizing without breaking everything
**Episode Breakdown:**
00:00 - Cold Open: The 70% AI failure rate
02:15 - Platform Engineering News Roundup
08:30 - Deep Dive: The AI Infrastructure Disconnect
15:45 - Building Unified Data Fabrics
**Today's News:** Cloudflare & Mastercard's new security partnership, Amazon's R8g instance expansion, Pulumi's Google Sign-In support, Amazon vs. Perplexity AI legal battle, and Together AI's GPU cluster improvements.
Perfect for platform engineers, DevOps teams, and technical leaders navigating the AI transformation.
**Sources & References:**
- AI Infrastructure Crisis Roadmap: https://thenewstack.io/ai-infrastructure-crisis-roadmap/
- Cloudflare & Mastercard Security Partnership: https://blog.cloudflare.com/attack-surface-intelligence/
- Amazon EC2 R8g Regional Expansion: https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-ec2-r8g-instances-additional-regions/
- Pulumi Google Sign-In: https://www.pulumi.com/blog/pulumi-cloud-now-supports-google-sign-in/
- Amazon vs. Perplexity Legal Update: https://www.businessoffashion.com/news/technology/amazon-wins-court-order-blocking-perplexity-ai-shopping-bots/
- Together AI GPU Clusters: https://www.together.ai/blog/new-in-together-gpu-clusters-autoscaling-observability-self-healing
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Tuesday Mar 10, 2026
**Why do 97% of companies using Kubernetes never scale beyond their original expert team?** It's not a skills problem - it's an architecture problem that Internal Developer Platforms (IDPs) are uniquely positioned to solve.
In today's episode of Platform Engineering Playbook, we dive deep into the Kubernetes scaling crisis and explore how IDPs can democratize container orchestration across your entire engineering organization. Plus, we cover the latest platform engineering news that's shaping the industry.
**What You'll Learn:**
• The real reason most Kubernetes deployments stay trapped in expert-only silos
• How IDPs solve the complexity problem without dumbing down capabilities
• Tactical frameworks for deciding if your organization actually needs an IDP
• Breaking news: Pulumi's expanded VCS support, Netflix's massive PostgreSQL migration, and Apono's game-changing Grafana integration
**Timestamps:**
0:00 - Cold Open: The 97% Problem
2:15 - Industry News Roundup
8:30 - Deep Dive: The Kubernetes Scaling Crisis
15:45 - How IDPs Bridge the Expert Gap
Whether you're a platform engineer, DevOps lead, or engineering manager struggling with Kubernetes adoption, this episode gives you concrete strategies to scale your platform beyond the experts who built it.
**Sources & References:**
• Why IDPs are the Only Way to Scale Kubernetes Beyond Experts: https://cloudnativenow.com/social-facebook/why-idps-are-the-only-way-to-scale-kubernetes-beyond-experts/
• Expanded Version Control Support in Pulumi Cloud: https://www.pulumi.com/blog/expanded-version-control-support/
• Apono integration for Grafana: https://grafana.com/blog/apono-integration-for-grafana-enabling-just-in-time-access-for-data-sources/
• Netflix Automates RDS PostgreSQL to Aurora PostgreSQL Migration: https://www.infoq.com/news/2026/03/netflix-automates-rds-aurora/
#PlatformEngineering #DevOps #CloudNative #Kubernetes

Monday Mar 09, 2026
**What if your AWS bill has a hidden line item costing you thousands that doesn't even show up in Cost Explorer?**
Today on Platform Engineering Playbook, we expose the sneaky cloud costs that are bleeding your budget dry and dive deep into the AWS Well-Architected Framework's six pillars to help you architect cost-efficient, secure platforms.
**What You'll Learn:**
✅ How to identify and eliminate hidden AWS costs using the Well-Architected Framework
✅ Practical steps platform engineers can take TODAY to optimize cloud spending
✅ Real-world analysis of cost optimization strategies that actually work
**Episode Breakdown:**
🎯 Cold Open: The hidden AWS cost crisis
📊 Deep Dive Act 1: Setting up the hidden cost problem
🔍 Deep Dive Act 2: AWS Well-Architected Framework analysis with expert insights
⚡ Deep Dive Act 3: Actionable takeaways for platform engineers
📰 Industry News: NanoClaw's containerized AI agents, OpenLens alternatives, incident management at Port, DevOps job opportunities, and GitHub Codespaces incidents
Whether you're managing multi-million dollar cloud infrastructures or optimizing costs for growing startups, this episode delivers the framework and tactics you need to stop hidden costs from destroying your budget.
**Sources & References:**
- AWS Well-Architected Framework Hidden Costs: https://aws.amazon.com/blogs/architecture/the-hidden-price-tag-uncovering-hidden-costs-in-cloud-architectures-with-the-aws-well-architected-framework/
- NanoClaw Containerized AI Agents: https://thenewstack.io/nanoclaw-containerized-ai-agents/
- OpenLens Alternatives Guide: https://feeds.dzone.com/link/23568/17293987/best-openlens-alternatives-for-kubernetes-visibility
- Port Incident Management: https://www.port.io/blog/how-ai-would-have-handled-a-real-incident-at-port
- DevOps Job Opportunities: https://devops.com/five-great-devops-job-opportunities-179/
- GitHub Codespaces Incident: https://www.githubstatus.com/incidents/tp8m3544w2g8
#PlatformEngineering #DevOps #CloudNative #Kubernetes




