May 22 · edited May 22

I came here via your comment on our AI Safety Summit Talks. Glad you made the effort to attend and share your thoughts! I'm hereby likewise sharing some comments on your insights. I would love to continue the conversation, and I'm really looking forward to your intelligence ceiling post: that is one of the ways out of existential risk I can see, so it would be extremely relevant!

I like your argument that skills are separable. I think we are finding out, by trial and error, to what extent that will be true and relevant. Narrow AI is, of course, an actual thing. I do think a collection of narrow AIs and other automation efforts could become a self-improving machine as well; this would happen the moment the last human is cut out of the production and R&D processes. But arguably that is safer and more controlled than a single superintelligence that can self-improve and/or take over the world by itself. So suppose the world turns out to be one where relatively narrow AIs, good at one job but not at everything (though still far more general than classic narrow AIs that could only, say, play Go), are either the only AIs that can be created or easier to make money with than more general AIs. That might reduce existential risk, which would of course be great news.

But I would say that the release of GPT-4 has done real damage to the idea that more narrow AI will be the actual techno-economic outcome. You could be right that LLMs such as GPT-4 are essentially good at language and have only superficial world models by design. But their factual knowledge is quite impressive, to the point that they are already competing with Google for practical everyday factual searches. So we don't just have something that speaks English; we have something that speaks all languages and has lots of factual knowledge about the world as well. And people are using it for economic tasks, from writing copy to creating websites to working in a help desk, and so on. In many of these tasks, LLMs are not yet at the level of an average worker, but for some tasks they are. I would say that a situation where a single AI gets trained to do lots of economically valuable tasks is at this point more likely than an outcome where we have thousands of different AIs trained for different professions. If that is true, we would in fact be moving towards the '20th Century Reductionist misunderstandings about AI' as a single intelligence that can do lots of things, not away from them.

There are a few separate AI existential threat models, but I think most, including the one I am most concerned about, involve a literal takeover of hard world power by an AI or multiple AIs. I'm thinking a lot about what an AI would need to do to get there, and which capability level we're actually looking at. I would guess an AI with a good world model, long-term planning, and the power to convince people to do things might be good enough for a takeover. I would be extremely interested in 'hard Epistemology-based limits to intelligence', especially if they point to a hard intelligence limit that is below this lowest takeover level. Or, if there are measures we can take in the world that raise the capability level an AI needs to achieve a takeover, and the hard limits sit above this new level, that would also provide a way out of existential risk. Please do write the post on where you think hard limits might be and why!

I think lack of ambition is not really a convincing argument. I agree that current LLMs don't really have goals. But in order to carry out more complex real-world tasks (representing economic value), that will need to get fixed. Exoskeleton agents such as AutoGPT or Devin don't really work well yet because the LLMs aren't good enough (to my understanding). But once the LLMs (or other underlying AIs) are better, they might work well, and then the user can enter a goal and the AI will start long-term planning and performing highly effective actions in the real world. I personally think such an AI should be intent-aligned (but not value-aligned) by default.
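The "exoskeleton agent" pattern is simple to state: wrap an LLM in a loop that repeatedly asks it for the next action toward a user-entered goal. A minimal sketch, assuming a hypothetical `llm` callable and a `tools` mapping (this is an illustration of the pattern, not AutoGPT's or Devin's actual code):

```python
def run_agent(goal, llm, tools, max_steps=10):
    """Repeatedly ask the LLM for the next action toward the goal,
    execute it, and feed the observation back in."""
    history = []
    for _ in range(max_steps):
        action, arg = llm(goal, history)      # LLM proposes the next step
        if action == "done":                  # LLM decides the goal is met
            return history
        observation = tools[action](arg)      # execute the action in the world
        history.append((action, arg, observation))
    return history                            # give up after max_steps
```

The point of the sketch is that the loop itself is trivial; all the planning competence has to come from the underlying model, which is why these agents stand or fall with LLM quality.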

Last, I'm politely asking you not to start out by calling existential risk concerns ridiculous. Thousands to tens of thousands of people globally now harbour those concerns, many of them with technical backgrounds, including in AI and CS. The two most-cited AI professors, Yoshua Bengio and Geoffrey Hinton, share the same concerns. Let's have an objective discussion of whether these claims are right or wrong. I also obviously disagree with your last statement.


By default, AIs that have goal functions will outcompete those without them, and humans too.


Dan Hendrycks – Natural Selection Favors AIs over Humans
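Hendrycks' selection argument can be illustrated with toy replicator dynamics: if goal-directed systems reinvest their output toward acquiring more resources, even a small per-generation growth edge compounds until they dominate the population. A minimal sketch (the growth rates and starting share are assumptions chosen purely for illustration):

```python
def share_after(generations, growth_goal=1.05, growth_plain=1.00, start_share=0.01):
    """Population share of the goal-directed variant after some generations,
    given per-generation growth factors for each variant."""
    goal, plain = start_share, 1.0 - start_share
    for _ in range(generations):
        goal *= growth_goal
        plain *= growth_plain
        total = goal + plain
        goal, plain = goal / total, plain / total  # renormalize to shares
    return goal

# A 1% minority with a 5% per-generation edge ends up as the large majority.
share_after(200)
```

The simulation says nothing about whether AIs will have such an edge; it only shows that if they do, selection does the rest.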



Very interesting. To me it brings to mind the saying: you will not be replaced by AI, but by a human using AI; in other words, we will not be terminated by AI, but by humans using AI. 🙏

> There are also hard Epistemology-based limits to intelligence, but that’s another post.

Oh, I am very interested in that one. 🙂


Your discussion of lying only tackles the easiest part of the problem. What about situations where humans themselves are willing to lie? For example:

- when a captcha asks whether it's a robot?

- when asked whether the user's favorite politician will cause the country to prosper, and the AI is pretty sure the accurate answer would be no?

Apr 11, 2023 · Liked by Monica Anderson

Hmm. You write:

"My point is that if all skills are separable, and behaviors are learned just like other skills, then the simplest way to create well-behaved, well-aligned AIs is to simply not teach them any of these bad behaviors."

You then talk about RLHF (Reinforcement Learning from Human Feedback).

I am not a specialist, but my understanding is that before you get to the RLHF part, or perhaps during it, you feed them a great heap of examples. That heap is huge - it might be something like "everything our web crawlers found on the internet".

Two points:

- it's too big for each item to be individually selected by humans

- it's generated by lots and lots of random human beings, many of whom habitually do things we don't want the AIs doing, such as lying. If it's learning to write code, its input includes lots and lots of buggy code. If it's learning to speak English, its input includes lies, fiction, racism, etc. along with lots and lots of different dialects.

The AI then does things built out of small bits it saw in that data set, and the RLHF people tell it "don't do that" every time they notice it doing something unwanted.

I don't believe it's practical to run RLHF long enough to catch all the rarer things the AI might do. If you had another AI that was already perfectly trained, it could do the job, but you don't. At best, you have a buggy one.
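The coverage problem described above can be made concrete with a toy model: suppose the model has a set of unwanted behaviors, each surfacing with some frequency per sampled output, and reviewers suppress every bad behavior they happen to observe. The frequencies and sample count below are invented for illustration:

```python
import random

def rlhf_coverage(bad_freqs, n_samples, seed=0):
    """Toy model: reviewers inspect n_samples outputs and flag every
    bad behavior they happen to see. Returns the set of caught behaviors."""
    rng = random.Random(seed)
    caught = set()
    for _ in range(n_samples):
        for i, freq in enumerate(bad_freqs):
            if rng.random() < freq:
                caught.add(i)  # a reviewer observed this behavior
    return caught

# One common bad behavior, one rare one.
bad_freqs = [0.1, 0.0001]   # per-output probability of each behavior
caught = rlhf_coverage(bad_freqs, n_samples=1000)
# The common behavior is almost surely flagged; the rare one usually is not,
# since its expected number of appearances in 1000 samples is only 0.1.
```

This is the whack-a-mole dynamic: feedback reliably suppresses frequent misbehavior, while rare misbehavior mostly survives to become the publicized bloopers.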

The result has been a largish quantity of well-publicized bloopers. When they turn up, if they are publicized sufficiently, a band-aid is applied. But you can't ever catch them all. And that means your users never know when the AI will, for example, tell a plausible story, claiming it as truth, when it's in fact the kind of advice that people could kill themselves by following.

The AI doesn't need any particular alignment to do that. It just needs to lack human heuristics about truth, falsehood, fiction, and little white lies, both when processing its initial training data, and afterwards.

Please go ahead and convince me otherwise, if you can. I'm a retired software engineer, but my specialty was operating systems, not AI. And so far I'm reacting to the proliferation of chatbots in the spirit of Risks (https://en.wikipedia.org/wiki/RISKS_Digest), but not the kind of risks you address in this essay.

What I predict is a combination of really nasty bugs and human over-reliance on not-really-intelligent AIs. I imagine a two-tier system where rich people get human therapists, teachers, and customer support, while for everyone else the chatbots are deemed "good enough", with no effective way to even report a problem.

And meanwhile we have less financially motivated misuse, such as chat-bot written articles posted to Wikipedia *complete with fake references to reliable sources like the New York Times*. (Yup, whatever chat bot they are using knows what a Wikipedia article should look like, but not that the references have to be real - let alone that they have to support what's said in the text.)

Apr 11, 2023 · Liked by Monica Anderson

In addition to being a brilliant researcher and linguist, Ms. Anderson is also a competent philosopher IMNTHO. Not enough of that to go around these days.
