Is finetuning GPT4o worth it? — with Alistair Pullen, Cosine (Genie)

Published 1 year, 7 months ago
Description

Betteridge's law says no: with seemingly infinite flavors of RAG, and >2 million token context + prompt caching from Anthropic/DeepMind/DeepSeek, it's reasonable to believe that "in-context learning is all you need".

But then there’s Cosine Genie, the first to make a huge bet using OpenAI’s new GPT4o fine-tuning for code at the largest scale it has ever been used externally, resulting in what is now the #1 coding agent in the world according to SWE-Bench Full, Lite, and Verified:

SWE-Bench has been the most successful agent benchmark of the year, receiving honors at ICLR (our interview here) and recently being verified by OpenAI. Cognition (Devin) was valued at $2b after reaching 14% on it. So it is very, very big news when a new agent appears to beat all other solutions, by a lot:

While this number is self-reported, it seems to be corroborated by OpenAI, who also award it the clear highest marks on SWE-Bench Verified:

The secret is GPT-4o finetuning on billions of tokens of synthetic data.

* Finetuning: As OpenAI says:

Genie is powered by a fine-tuned GPT-4o model trained on examples of real software engineers at work, enabling the model to learn to respond in a specific way. The model was also trained to be able to output in specific formats, such as patches that could be committed easily to codebases.

Due to the scale of Cosine’s finetuning, OpenAI worked closely with them to figure out the size of the LoRA:

“They have to decide how big your LoRA adapter is going to be… because if you had a really sparse, large adapter, you’re not going to get any signal in that at all. So they have to dynamically size these things.”
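To give a feel for why adapter size matters, here is a rough sketch of the standard LoRA parameter-count arithmetic: a rank-r adapter on a weight matrix of shape d_out x d_in adds r * (d_in + d_out) trainable parameters. The layer dimensions and ranks below are hypothetical (GPT-4o's architecture is not public), and this says nothing about how OpenAI actually sizes adapters.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter.

    LoRA replaces the update to a (d_out x d_in) weight with the product
    of two low-rank matrices: B (d_out x rank) and A (rank x d_in).
    """
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix, for comparison."""
    return d_in * d_out


# Hypothetical layer size, chosen only for illustration.
D_IN, D_OUT = 4096, 4096

for rank in (8, 64, 512):
    adapter = lora_param_count(D_IN, D_OUT, rank)
    fraction = adapter / full_param_count(D_IN, D_OUT)
    print(f"rank={rank:>3}  adapter params={adapter:>9,}  fraction of full={fraction:.4f}")
```

The larger the rank, the more capacity the adapter has, and the more training signal it needs to fill that capacity, which is the tradeoff the quote is pointing at.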

* Synthetic data: we need to finetune on the process of making code work instead of only training on working code.

“…we synthetically generated runtime errors. Where we would intentionally mess with the AST to make stuff not work, or index out of bounds, or refer to a variable that doesn't exist, or errors that the foundational models just make sometimes that you can't really avoid, you can't expect it to be perfect.”
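As a minimal sketch of what "messing with the AST" could look like, the snippet below uses Python's standard `ast` module to rewrite one variable read so it points at a name that doesn't exist, then runs the result to capture the runtime error. This is an illustration only, not Cosine's actual pipeline: the helper names (`BreakVariableReference`, `inject_name_error`, `run_and_capture_error`) and the sample program are all made up for this example. It needs Python 3.9+ for `ast.unparse`.

```python
import ast
from typing import Optional


class BreakVariableReference(ast.NodeTransformer):
    """Rewrite the first read of `target` so it refers to an undefined name."""

    def __init__(self, target: str, broken_name: str = "undefined_value"):
        self.target = target
        self.broken_name = broken_name
        self.done = False

    def visit_Name(self, node: ast.Name) -> ast.AST:
        # Only touch reads (Load), and only the first one we encounter.
        if not self.done and isinstance(node.ctx, ast.Load) and node.id == self.target:
            self.done = True
            return ast.copy_location(ast.Name(id=self.broken_name, ctx=ast.Load()), node)
        return node


def inject_name_error(source: str, target: str) -> str:
    """Return a mutated copy of `source` that reads an undefined variable."""
    tree = ast.parse(source)
    mutated = BreakVariableReference(target).visit(tree)
    ast.fix_missing_locations(mutated)
    return ast.unparse(mutated)


def run_and_capture_error(source: str) -> Optional[str]:
    """Execute `source`; return the error as text, or None if it ran cleanly."""
    try:
        exec(compile(source, "<synthetic>", "exec"), {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


WORKING = """
def total(items):
    result = 0
    for item in items:
        result = result + item
    return result

answer = total([1, 2, 3])
"""

broken = inject_name_error(WORKING, "result")
error = run_and_capture_error(broken)

# One synthetic example: broken code, the error it produces, and the fix.
training_example = {"broken": broken, "error": error, "fixed": WORKING.strip()}
print(training_example["error"])
```

Each (broken code, error, fix) triple is the kind of "process of making code work" example the quote describes, as opposed to a sample of working code alone.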

Genie also has a 4-stage workflow with the standard LLM OS tooling stack that lets it solve problems iteratively:

Full Video Pod

like and subscribe etc!

Show Notes

* Alistair Pullen - Twitter, LinkedIn

* Cosine Genie launch, technical report

* OpenAI GPT-4o finetuning GA
