The End of Finetuning — with Jeremy Howard of Fast.ai
Description
Thanks to the more than 17,000 people who joined the first AI Engineer Summit! A full recap is coming. Last call to fill out the State of AI Engineering survey! See our Community page for upcoming meetups in SF, Paris and NYC.
This episode had good interest on Twitter and was discussed on the Vanishing Gradients podcast.
Fast.ai’s “Practical Deep Learning” courses have been watched by over 6,000,000 people, and the fastai library has over 25,000 stars on GitHub. Jeremy Howard, one of the creators of Fast.ai, is now one of the most prominent and respected voices in the machine learning industry, but that wasn’t always the case.
Being non-consensus and right
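In 2018, Jeremy and Sebastian Ruder published a paper on ULMFiT (Universal Language Model Fine-tuning), a 3-step transfer learning technique for NLP tasks: pre-train a language model on a general-domain corpus, fine-tune that language model on the target task's text, then fine-tune a classifier on the target task.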
The paper demonstrated that pre-trained language models could be fine-tuned on a specific task with a relatively small amount of data to achieve state-of-the-art results. They trained a 24M-parameter model on WikiText-103 that beat most benchmarks.
While the paper had great results, the methods behind it weren’t taken seriously by the community:
“Everybody hated fine tuning. Everybody hated transfer learning. I literally did tours trying to get people to start doing transfer learning and nobody was interested, particularly after GPT showed such good results with zero shot and few shot learning […] which I was convinced was not the right direction, but who's going to listen to me, cause as you said, I don't have a PhD, not at a university… I don't have a big set of computers to fine tune huge transformer models.”
Five years later, fine-tuning is at the center of most major discussion topics in AI (we have covered some, such as fine-tuning vs. RAG and small-model fine-tuning), and we might have gotten here earlier if Jeremy had had OpenAI-level access to compute and distribution. At heart, Jeremy has always been “GPU poor”:
“I've always been somebody who does not want to build stuff on lots of big computers because most people don't have lots of big computers and I hate creating stuff that most people can't use.”
This story is a good reminder of how some of the best ideas are hiding in plain sight; we recently covered RWKV and will continue to highlight the most interesting research that isn’t being done in the large labs.
Replacing fine-tuning with continued pre-training
Even though fine-tuning is now mainstream, we still have a lot to learn. The issue of “catastrophic forgetting” and potential solutions have been brought up in many papers: at the fine-tuning stage, the model can forget tasks it previously knew how t