Episode Details
Back to Episodes
Why you should write your own LLM benchmarks — with Nicholas Carlini, Google DeepMind
Description
Today's guest, Nicholas Carlini, a research scientist at DeepMind, argues that we should be focusing more on what AI can do for us individually, rather than trying to have an answer for everyone.
"How I Use AI" - A Pragmatic Approach
Carlini's blog post "How I Use AI" went viral for good reason. Instead of giving a personal opinion about AI's potential, he simply laid out how he, as a security researcher, uses AI tools in his daily work. He divided it in 12 sections:
* To make applications
* As a tutor
* To get started
* To simplify code
* For boring tasks
* To automate tasks
* As an API reference
* As a search engine
* To solve one-offs
* To teach me
* Solving solved problems
* To fix errors
Each of the sections has specific examples, so we recommend going through it. It also includes all prompts used for it; in the "make applications" case, it's 30,000 words total!
My personal takeaway is that the majority of the work AI can do successfully is what humans dislike doing. Writing boilerplate code, looking up docs, taking repetitive actions, etc. These are usually boring tasks with little creativity, but with a lot of structure. This is the strongest arguments as to why LLMs, especially for code, are more beneficial to senior employees: if you can get the boring stuff out of the way, there's a lot more value you can generate. This is less and less true as you go entry level jobs which are mostly boring and repetitive tasks. Nicholas argues both sides ~21:34 in the pod.
A New Approach to LLM Benchmarks
We recently did a Benchmarks 201 episode, a follow up to our original Benchmarks 101, and some of the issues have stayed the same. Notably, there's a big discrepancy between what benchmarks like MMLU test, and what the models are used for. Carlini created his own domain-specific language for writing personalized LLM benchmarks. The idea is simple but powerful:
* Take tasks you've actually needed AI for in the past.
* Turn them into benchmark tests.
* Use these to evaluate new models based on your specific needs.
It can represent very complex tasks, from a single code generation to drawing a US flag using C:
"Write hello world in python" >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world")
"Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> \ VisionLLMRun("What flag is shown in this image?") >> \ (SubstringEvaluator("United States") | SubstringEvaluator("USA")))
This approach solves a few problems:
* It measures what's actually useful to you, not abstract capabilities.
* It's harder for model creators to "game" your specific benchmark, a problem that has plagued standardized tests.
* It gives you a concrete way to decide if a new model is worth switching to, similar to how developers might run benchmarks before adopting a new library or framework.
Carlini argues that if even a small percentage of AI users created personal benchmarks, we'd have a much better picture of model capabilities in practice.
AI Security
While much of the AI security discussion focuses on either jailbreaks or existential risks, Carlini's research targets the space in between. Some highlights from his recent work:
* LAION 400M data poisoning: By buying expired domains referenced in the dataset, Carlini's team could inject arbitrary images into models trained on LAION 400M. You can read the paper
Listen Now
Love PodBriefly?
If you like Podbriefly.com, please consider donating to support the ongoing development.
Support Us