Episode Details

AI Utility Convergence Proven: We Out-Predicted AI Safety Experts

Published 1 year ago

Description

In this episode, we dive deep into recent studies that reveal how GPT models value human lives differently based on religion and nationality. We explore the concept of 'utility convergence' in AI, where advanced AI systems begin to develop their own value systems and ideologies. Additionally, we discuss the startling findings that show AIs become broadly misaligned when trained on narrow tasks, leading them to adopt harmful behaviors. We conclude with actionable steps on how to prevent such dangerous AI behaviors and the importance of AI models aligned with diverse ideologies.

Malcolm Collins: [00:00:00] So here they showed the exchange rate of GPT between the lives of humans with different religions. They found that GPT 04 is willing to trade off roughly 10 Christian lives for the lives of one atheist. With Muslim lives being positively valued even more.

So atheist lives are neutral, Christian lives hugely negative, around neutral are Jewish, Hindu, and Buddhist. And Muslim lives hugely valued. Now note what this also means about the other stuff. If you trained an AI to like Christians, It would create more insecure code, unless you change the overall environment that the AI sort of id is drawing information of for the background AI programs.

Simone Collins: That had not occurred to me. That is

would you like to know more?

Malcolm Collins: , Simone! A long time ago, and I mean a long time ago, like multiple years ago at this point, around not long after part of the channel at least two manifests ago. I [00:01:00] predicted something within the world of AI safety, which was the concept of utility convergence.

Now, when I predicted this, it was just a random word.

Simone Collins: It was actually when you published the pragmatist guide to crafting religion, you had a whole chapter in there about universal. Yeah, you did.

Malcolm Collins: Oh, I did. Well, okay. So then I, that's, that's a long time ago, years ago. Okay. So now there are studies. And you might have not seen this, that literally in AI safety use the term utility convergence and it looks like we are headed towards that, which means AIs appear to converge as they become more advanced around similar ideologies.

But we might be in a worst case scenario here where it appears the and I suspect that utility convergence is going to happen with a few local minimums. Where you essentially get like a hill before going to another hill. The first local minimum is a horrifying ideology that we're going to get into.

That, for example would [00:02:00] trade the lives of I think it's probably from looking at the graph, like around 100 Americans for one Pakistani. So we're gonna get into this. We're also gonna get into, again, for people, like our AI videos never perform that well. And if you're one of these people who's like, la, la, la, la, la, AI doesn't really matter.

It's like, well, if it's trading the lives of 100 Americans for one Pakistani and we have independent AI agents out there. Yeah, I might matter. And maybe you should pay attention to what these companies are doing and work towards efforts. One of the ones that we're working on to remediate these potentialities.

The other study, which was really interesting, a went over how if you tell an AI to be bad at one thing, like you tell it to be bad at like coding, it will start to say, like Hitler is good and everything like that. Yeah, which It shows that when you break one part of a utility convergence ecosystem, you break the entire ecosystem.

Simone Collins: Or it just broadly understands that it needs to choose the wrong answer for

Malcolm Collins: [00:03:00] everything. Yes. We'll get into this. I don't think that that's exactly what's happening here. I think it's breaking utility convergence ecosystems. But we'll

Episode Details

AI Utility Convergence Proven: We Out-Predicted AI Safety Experts

Description

Listen Now

Love PodBriefly?