The PM’s Role in AI Evals: Step-by-Step
Description
Today, we’ve got some of our most requested guests yet: Hamel Husain and Shreya Shankar, creators of the world’s best AI Evals cohort.
You’ll learn:
- Why AI evaluations are the most critical skill for building successful AI products
- What common mistakes people are making and how to avoid them
- How to effectively "hill climb" towards better AI performance
If you're building AI features, or aiming to master how AI evals actually work, this episode is your step-by-step blueprint.
----
Brought to you by:
The AI Evals Course for PMs & Engineers: Get $800 off with this link
Jira Product Discovery: Plan with purpose, ship with confidence
Vanta: Automate compliance, security, and trust with AI (Get $1,000 off with my link)
AI PM Certification: Get $500 off with code AAKASH25
----
Timestamps:
00:00:00 - Preview
00:02:06 - Three reasons PMs NEED evals
00:04:40 - Why PMs shouldn't view evals as monotonous
00:06:23 - Are evals, the hardest part of AI products, solved?
00:07:37 - Why can't you just rely on human "vibe checks"?
00:12:11 - Ad 1 (AI Evals Course)
00:13:10 - Ad 2 (Jira Product Discovery)
00:14:06 - Are LLMs good at 1-5 ratings?
00:15:45 - The "Whack-a-mole" analogy without evals
00:16:26 - Hallucination problem in emails (Apollo story)
00:21:22 - How Airbnb used machine learning models
00:23:56 - Evaluating RAG systems
00:29:52 - Ad 3 (Vanta)
00:30:56 - Ad 4 (AI PM Certification on Maven)
00:31:42 - Hill Climbing
00:35:51 - Red flag: Suspiciously high eval metrics
00:39:02 - Design principles for effective evals
00:42:42 - How OpenAI approaches evals
00:44:39 - Foundation models are trained on "average taste"
00:49:36 - Cons of fine-tuning
00:51:27 - Prompt engineering vs. RAG vs. Fine-tuning
00:53:00 - Introduction of "The Three Gulfs" framework
00:56:04 - Roadmap for learning AI evals
01:01:41 - Why error analysis is critical for LLMs
01:08:29 - Using LLM as a judge
01:10:15 - Frameworks for systematic problem-solving in labels
01:17:42 - Importance of niching down and qualifying clients (pro tips)
01:18:43 - $800K for first course cohort!
01:20:15 - Why end a successful cohort?
01:25:49 - GOLD advice for creating a successful course
01:33:39 - Outro
----
Key Takeaways:
1. Stop Guessing. Eval Your AI. Your AI isn’t an MVP without robust evaluations. Build in judgment — or you’re just shipping hope. Without evaluation, AI performance is a happy accident.
2. Error Analysis = Your Superpower. General metrics won’t save you. You need to understand why your AI messed up. Only then can you fix it — not just wish it worked better.
3. 99% Accuracy is a LIE. Suspiciously high metrics usually mean your evaluation setup is broken. Real-world AI is never perfect. If your evals say otherwise, they’re flawed.
4. Fine-Tuning is a Trap (Mostly). Fine-tuning is expensive, brittle, and often unnecessary. Start with smarter prompts and RAG. Only fine-tune if you must.
5. Your Data’s Wild. Understand It. You can’t eyeball everything. Without structured evaluation, you’ll drown in noise and never find patterns or fixes that matter.
6. Models Fail t