Episode Details

Back to Episodes
From RLHF to RLHB: The Case for Learning from Human Behavior - with Jeffrey Wang and Joe Reeve of Amplitude

From RLHF to RLHB: The Case for Learning from Human Behavior - with Jeffrey Wang and Joe Reeve of Amplitude

Published 2 years, 9 months ago
Description

Welcome to the almost 3k latent space explorers that joined us last month! We’re holding our first SF listener meetup with Practical AI next Monday; join us if you want to meet past guests and put faces to voices! All events are in /community.

Who among you regularly click the ubiquitous 👍 /👎 buttons in ChatGPT/Bard/etc?

Anyone? I don’t see any hands up.

OpenAI has told us how important reinforcement learning from human feedback (RLHF) is to creating the magic that is ChatGPT, but we know from our conversation with Databricks’ Mike Conover just how hard it is to get just 15,000 pieces of explicit, high quality human responses.

We are shockingly reliant on good human feedback. Andrej Karpathy’s recent keynote at Microsoft Build on the State of GPT demonstrated just how much of the training process relies on contractors to supply the millions of items of human feedback needed to make a ChatGPT-quality LLM (highlighted by us in red):

But the collection of good feedback is an incredibly messy problem. First of all, if you have contractors paid by the datapoint, they are incentivized to blast through as many as possible without much thought. So you hire more contractors and double, maybe triple, your costs. Ok, you say, lets recruit missionaries, not mercenaries. People should volunteer their data! Then you run into the same problem we and any consumer review platform run into - the vast majority of people send nothing at all, and those who do are disproportionately representing negative reactions. More subtle problems emerge when you try to capture subjective human responses - the reason that ChatGPT responses tend to be inhumanly verbose, is because humans have a well documented “longer = better” bias when classifying responses in a “laboratory setting”.

The fix for this, of course, is to get out of the lab and learn from real human behavior, not artificially constructed human feedback. You don’t see a thumbs up/down button in GitHub Copilot nor Codeium nor Codium. Instead, they work an implicit accept/reject event into the product workflow, such that you cannot help but to give feedback while you use the product. This way you hear from all your users, in their natural environments doing valuable tasks they are familiar with. The prototypal example in this is Midjourney, who unobtrusively collect 1 of 9 types of feedback from every user as part of their workflow, in exchange for much faster first draft image generations:

The best known public example of AI product telemetry is in the Copilot-Explorer writeup, which checks for the presence of generated code after 15-600 second intervals, which enables GitHub to claim that 40% of code is generated by Copilot.

This is fantastic and “obviously” the future of productized AI. Every AI application shoul

Listen Now

Love PodBriefly?

If you like Podbriefly.com, please consider donating to support the ongoing development.

Support Us