GPU Inference Performance: The Compute Lie Killing Your AI Latency

Season 1 · Published 4 months, 3 weeks ago
Description
(00:00:00) The Mysterious GPU Slowdown
(00:03:31) The Silent Saboteur: CPU Fallback
(00:07:43) The Hidden Pitfalls of Version Mismatch
(00:12:24) The Container Culprit: Efficiency Erosion
(00:16:52) The Remedy: Provable Acceleration
(00:22:05) Closing Thoughts and Next Steps

In this episode of M365.fm, Mirko Peters investigates a familiar horror story in AI operations: GPU bills climbing while GPU utilization is near zero and latency quietly explodes. He dissects a real text‑to‑image Stable Diffusion workload where everything on paper looks right — ONNX/TensorRT, NVIDIA GPUs, containers, CI‑controlled rollouts — yet requests crawl and P95 latency blows past every SLO.

WHAT YOU WILL LEARN
  • Why your “GPU‑accelerated” service may actually be running on CPU without telling you
  • How CPU fallback in ONNX Runtime works and why it almost never raises a visible error (see the provider-check sketch after this list)
  • How subtle CUDA / ONNX Runtime / TensorRT version mismatches destroy fused kernels and fast paths
  • How container misconfiguration (missing device mounts, wrong nvidia‑container‑toolkit setup) turns accelerators into expensive heaters
  • Which three metrics — latency, throughput, and GPU utilization — tell you the truth when dashboards lie
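A quick way to see which execution provider actually won is to ask the session directly. The sketch below is illustrative, assuming the standard onnxruntime Python package and a placeholder model.onnx path; the provider strings are the names ONNX Runtime uses for TensorRT, CUDA, and CPU.

```python
import onnxruntime as ort

# Providers in priority order. If TensorRT or CUDA cannot initialize
# (driver, library, or device-mount problems), ONNX Runtime quietly
# falls through to the next entry and ultimately to CPU.
requested = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

# "model.onnx" is a placeholder path used here for illustration.
session = ort.InferenceSession("model.onnx", providers=requested)

# get_providers() reports what the session is actually using,
# not what you asked for.
active = session.get_providers()
print("requested:", requested)
print("active:   ", active)

if active[0] == "CPUExecutionProvider":
    print("WARNING: inference is running on CPU despite GPU providers being requested")
```

Pairing that check with the three metrics above — latency, throughput, and GPU utilization — is what turns a green dashboard into actual evidence.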
THE CORE INSIGHT

Most AI outages at scale aren’t about the model; they’re about infrastructure honesty. Your system will happily “work” on the wrong execution provider, with degraded kernels, or with no GPU attached at all — and it will do so silently unless you force it to prove otherwise. Mirko shows how provider order, capability logs, and device mounts form the real chain of evidence for whether your GPUs are actually doing the work you’re paying for.

You’ll hear a detailed walk‑through of “Evidence File A”: CPU fallback as the quiet saboteur. ONNX Runtime tries TensorRT, then CUDA, then shrugs and runs everything on CPU when drivers, libraries, or device mounts don’t line up — logging a single line most teams never read. The service stays green, but GPU duty cycles hover at 5%, CPU cores peg, P50 latency quadruples, and P95 unravels under bursty traffic as autoscale happily spreads the defect across more replicas.

Then in “Evidence File B,” Mirko explores version drift: CUDA, cuDNN, ONNX Runtime, and TensorRT that technically run but miss fused attention kernels, FP16 paths, and tensor core optimizations. Engines deserialize with warnings, fall back to generic kernels, and keep responding — just slower and more memory‑hungry. Utilization charts look “busy enough,” but PCIe and memory movement dominate, and your cost per request quietly spikes.

Most teams treat containerization and CI as safety nets; here you’ll see how they can just as easily freeze defects in amber when you don’t assert GPU health at startup. Mirko outlines concrete countermeasures: hard‑fail if GPU providers aren’t present, validate IO binding with a warm‑up inference, enforce latency gates during rollout, and build canary prompts that exercise the fused kernels you care about. In other words, trade a bit of availability at deploy time for integrity and predictable performance in production.
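As a rough sketch of the “hard‑fail then warm up” countermeasure — not the episode’s exact code — the startup check below assumes onnxruntime plus numpy, a model with a single float32 input, and a hypothetical create_verified_session helper name:

```python
import numpy as np
import onnxruntime as ort

GPU_PROVIDERS = {"TensorrtExecutionProvider", "CUDAExecutionProvider"}


def create_verified_session(model_path: str) -> ort.InferenceSession:
    """Create an ONNX Runtime session and refuse to start on silent CPU fallback."""
    session = ort.InferenceSession(
        model_path,
        providers=[
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )

    # Hard-fail: if no GPU provider survived initialization, crash the replica
    # instead of quietly serving degraded traffic from the CPU.
    if not GPU_PROVIDERS & set(session.get_providers()):
        raise RuntimeError(f"No GPU execution provider active: {session.get_providers()}")

    # Warm-up inference: forces engine build / kernel selection before traffic
    # arrives and proves the input/output binding works end to end.
    # (Assumes a single float32 input; dynamic dimensions are replaced with 1.)
    inp = session.get_inputs()[0]
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dummy = np.zeros(shape, dtype=np.float32)
    session.run(None, {inp.name: dummy})

    return session
```

Run at container startup, a failed check kills the replica before it takes traffic — exactly the “trade a bit of availability at deploy time for integrity” posture the episode argues for.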

WHO THIS EPISODE IS FOR

This epis