Right-Sizing AI: Small Language Models for Real-World Production
Episode 61
Summary
In this episode of the AI Engineering Podcast, Steven Huels, Vice President of AI Engineering & Product Strategy at Red Hat, talks about the practical applications of small language models (SLMs) for production workloads. He discusses why SLMs are a pragmatic choice for many workloads, since they fit on a single enterprise GPU, and how to reason about the trade-offs involved in model selection. The conversation covers self-hosting versus using API providers, the organizational capabilities needed to run production-grade LLMs, and the importance of guardrails and automated evaluation at scale. They also explore the rise of agentic systems and service-oriented approaches powered by smaller models, highlighting advances in customization and deployment strategies. Steven shares real-world examples and looks to the future of agent cataloging, continuous retraining, and resource efficiency in AI engineering.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- When ML teams try to run complex workflows through traditional orchestration tools, they hit walls. Cash App discovered this with their fraud detection models - they needed flexible compute, isolated environments, and seamless data exchange between workflows, but their existing tools couldn't deliver. That's why Cash App relies on Prefect. Now their ML workflows run on whatever infrastructure each model needs across Google Cloud, AWS, and Databricks. Custom packages stay isolated. Model outputs flow seamlessly between workflows. Companies like Whoop and 1Password also trust Prefect for their critical workflows. But Prefect didn't stop there. They just launched FastMCP - production-ready infrastructure for AI tools. You get Prefect's orchestration plus instant OAuth, serverless scaling, and blazing-fast Python execution. Deploy your AI tools once and connect to Claude, Cursor, or any MCP client. No more building auth flows or managing servers. Prefect orchestrates your ML pipeline. FastMCP handles your AI tool infrastructure. See what Prefect and FastMCP can do for your AI workflows at aiengineeringpodcast.com/prefect today.
- Your host is Tobias Macey, and today I'm interviewing Steven Huels about the benefits of small language models for production workloads
Interview