Episode Details

Back to Episodes

"OpenAI API base models are not sycophantic, at any size" by Nostalgebraist

Published 2 years, 5 months ago
Description

In Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question.

The paper contained the striking plot reproduced below, which shows sycophancy

  • increasing dramatically with model size
  • while being largely independent of RLHF steps
  • and even showing up at 0 RLHF steps, i.e.  in base models!

[...] I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario?

At the time, I ran one of Anthropic's sycophancy evals on a set of OpenAI models, as I reported here.

I found very different results for these models:

  • OpenAI base models are not sycophantic (or only very slightly sycophantic).
  • OpenAI base models do not get more sycophantic with scale.
  • Some OpenAI models are sycophantic, specifically text-davinci-002 and text-davinci-003.

Source:
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size

Narrated for LessWrong by TYPE III AUDIO.

Share feedback on this narration.

[125+ Karma Post]

Listen Now

Love PodBriefly?

If you like Podbriefly.com, please consider donating to support the ongoing development.

Support Us