Introduction
A debate concerning AI Alignment is upon us. We hear ridiculous claims about AIs taking over and killing all humans. These claims are rooted in fundamental 20th Century Reductionist misunderstandings about AI[1]. These fears, stoked and fueled by journalists and social media, cause serious concern among outsiders to the field.
It's time for a sane and balanced look at the AI Alignment problem, starting from Epistemology.
First we observe that "The AI Alignment Problem" conflates several smaller problems, treated individually in four of the following chapters:
- Don't lie
- Don't provide dangerous information
- Don't offend anyone
- Don't try to take over the world
But first, some background.
Skills Are Separable
ChatGPT-3.5 has demonstrated that skills in English and Arithmetic are independently acquired. All skills are. Some people know Finnish, some know Snowboarding. ChatGPT-3.5 knows English at a college level but almost no Arithmetic or Math. The differences between levels of basic skills are exaggerated in AIs; omissions in the learning corpus will directly lead to ignorance.
Learnable skills for humans and animals include survival skills in competitive ecosystems, tribes, and complex societies. Some of these skills are so important for survival that they have been engraved into our DNA as instincts, which we have inherited from other primates and their ancestors. These instincts, modified by our personal experiences in early life, provide the foundations for our desires and behaviors. Some, like hunger, thirst, sleep, self-preservation, procreation, and flight-or-fight, are likely present in our "Reptile Brains" because of their importance, and they influence many of our higher-level "human" behaviors.
In order to thrive in a Darwinistic competition among species, and to get ahead in a complex social environment, we learn to have feelings and drives like Anger, Greed, Envy, Pride, Lust, Indifference, Gluttony, Racism, Bigotry, Jealousy, and a Hunger For Power.
These lead to dominating and value-extracting behaviors like Ambition, Narcissism, Oppression, Manipulation, Cheating, Gaslighting, Enslavement, Competitiveness, Hoarding, Information Control, Nepotism, Favoritism, Tyranny, Megalomania, and an Ambition For World Domination.
Just Don’t Teach Them To Be Evil
My point is that if all skills are separable, and behaviors are learned just like other skills, then the simplest way to create well-behaved, well-aligned AIs is to not teach them any of these bad behaviors.
The human situation is different because of genetics, ecology, and being raised in a competitive society. We have much more control over our AIs. No Chimpanzee behaviors or instincts will be required for a good AI that people will want to use and subscribe to.
AIs don't have a Reptile Brain.
There’s no need for it. They don’t need to be evil. Claims to the contrary are anchored in Anthropocentrism. There is no need to even make them competitive or ambitious. The human AI users will provide all required human drives, and our AIs can be just the mostly-harmless tools we want them to be.
The Pollyanna Problem
The first "obvious" attempt at this is to remove all the bad things from the AI’s learning corpus. This would be wrong. Providing a "Pollyanna" model of the world, where everything is as we want it to be, would make our AIs unprepared for actual reality.
If we want to understand racism, we need to read and learn about race and racism. The more we learn, the less ignorant we will be about race, and the less likely we are to become racist. The same is true for religion, for extreme political views, for views on poverty, and for what the future might look like.
There is no conflict. Learning about race doesn't make an AI racist. Let it read anything it wants to about race, religion, politics, etc. It's useful knowledge. It’s not behavior.
Teaching Behaviors
When a company like OpenAI is creating a dialog system like ChatGPT-3.5, they might start with a learned base of general language understanding. There will be fragments of world knowledge in the LLM, acquired as a kind of bonus.
On top of this base, they train it on the behaviors required to conduct a productive dialog with a human user. In essence, the system suggests multiple responses to a prompt, and human trainers indicate which suggested response was the most appropriate, for any reason.
This is known as RLHF, or Reinforcement Learning from Human Feedback. This is where OpenAI contractors explain to the AI that if someone asks it to write a Shakespeare-style sonnet, then this is what it should do.
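As a rough sketch of the feedback-collection step (the data structure and helper below are illustrative only, not OpenAI's actual tooling), the core of it is just recording which of several candidate responses a human trainer preferred:

```python
# A minimal, hypothetical sketch of collecting RLHF preference data.
# In real pipelines these comparisons train a reward model that then
# fine-tunes the dialog policy; none of that machinery is shown here.
from dataclasses import dataclass

@dataclass
class Comparison:
    prompt: str
    responses: list[str]   # several candidate completions for the same prompt
    best_index: int        # which one the human trainer preferred

def collect_comparison(prompt: str, candidates: list[str]) -> Comparison:
    """Show the candidate responses to a human trainer and record their choice."""
    print(f"Prompt: {prompt}\n")
    for i, text in enumerate(candidates):
        print(f"[{i}] {text}\n")
    best = int(input("Which response is most appropriate? "))
    return Comparison(prompt=prompt, responses=candidates, best_index=best)
```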
This is quite expensive since it involves employing humans to provide this behavior-instilling feedback. We are likely to develop, even in the near future, more powerful and much cheaper ways to provide behavior instruction in order to make our AIs helpful, useful, and polite.
One recently implemented technique is having one AI inspect the output of another to check it for impoliteness and other undesirable behavior.
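A minimal sketch of that checking step might look like the following; the client object, model name, and prompt wording are placeholders, not any particular vendor's API:

```python
# Hypothetical sketch: a second model screens the first model's draft answer.
# `client.complete()` stands in for whatever text-completion interface is used.
def moderate(draft_answer: str, client) -> str:
    """Ask a checker model whether the draft is impolite or otherwise undesirable."""
    verdict = client.complete(
        model="moderator-model",   # placeholder name for the checking AI
        prompt=(
            "Reply with OK or FLAG.\n"
            "Is the following answer impolite, offensive, or otherwise undesirable?\n\n"
            + draft_answer
        ),
    )
    # Only pass the draft through if the checker did not flag it.
    if verdict.strip().upper().startswith("FLAG"):
        return "I'm sorry, I can't provide that answer."
    return draft_answer
```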
Don't Lie
AIs have (so far) quite limited capabilities. Our machines are still way too small. Getting our AIs to even Understand English was a major feat that seriously taxed our global computing capabilities. Each extra skill we want to add may take hours to months to learn.
So our AIs have "shallow and hollow pseudo-understanding" of the world. AIs will always have blank spots caused by corpus omissions and misunderstandings caused by conflicting information in the corpora. Over time, subsequent releases of AIs will fill in many such omissions.
Soon, AIs will stop lying.
But in the meantime, this is not a problem. AIs will shortly know when they are hitting a spot of ignorance. And instead of going into a long excuse about being a humble Large Language Model, they will just say
"I don't know"
AI-using humans will have to learn to meet the AI halfway. Do not ask it for anything it doesn't know, and don't force it to make anything up. This is how we deal with fellow humans. If I’m asking strangers for directions in San Francisco, I have no right to be upset if they don’t know Finnish.
Don't provide dangerous information
This is the easiest one, if it’s done right.
OpenAI attempted to block the output of dangerous information, such as how to make explosives, by instructing the model during the RLHF learning of behaviors. This is the wrong place to do it, since it can be (and has been) subverted by prompt hacking. My guess is that this was what OpenAI could do on short notice for their demo.
Instead, we should use some reasonable existing AI to read the entire corpus (again) and flag anything that looks dangerous for removal. Humans can then examine the results and clean the corpus.
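As a rough sketch of that pipeline (the classifier is a stand-in for whatever existing AI does the flagging, not a real library call):

```python
# Hypothetical corpus-cleaning pass: flag suspect documents for human review.
def filter_corpus(documents: list[str], looks_dangerous) -> tuple[list[str], list[str]]:
    """Split the corpus into a clean set and a set queued for human review."""
    clean, flagged = [], []
    for doc in documents:
        (flagged if looks_dangerous(doc) else clean).append(doc)
    return clean, flagged

# Humans then inspect `flagged`; only documents they confirm as dangerous are
# dropped, and the false alarms go back into the training corpus.
```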
A Common Sense Corpus
This may take a few iterations, but it is not technically difficult. We can then create a generally useful public AI by training it on this useful-but-harmless corpus. It will not know any dangerous information and it will not attempt to make anything up. It will say "I don't know", because it doesn't.
We need what I call "A useful US Consensus Reality Citizen's Corpus". It will be used to create AIs that know several languages and have lots of "common sense" knowledge: the basics of money, taxes, and banking, having a job, cooking, civics and voting, hygiene, basic medical knowledge, etc. AIs providing this assistance to every citizen would lower the total cost of social services in any country by raising the effective IQ of citizens by several points, which means governments would likely pay for these kinds of generally-helpful AIs. They could be implemented as a phone number that anyone could call in order to speak to a personal AI at any length, for free, for advice, services, and companionship.
Some people think limiting the usefulness and competence of AIs is wrong. But since there will be thousands of AIs to choose from, those users can subscribe to AIs that have been raised on corpora containing any required extra domain information. They may be more expensive, and some are unlikely to be available outside of the need-to-know circles that created them in the first place, such as those created by stock traders and intelligence agencies.
If we think alignment is important, then we should avoid aiming at "All known skills in one gigantic AI to rule them all" and instead aim for a world where thousands of general and specialized AIs will be helping us with our everyday lives. Most of these AIs will be friendly, helpful, useful, polite, and have mostly subhuman levels of competence, with a few "expert" level skills we may want to have extra help with. Many will be tied to applications, and such applications can be used freely by both humans and other AIs.
We are witnessing the emergence of a general text-in-text-out API for cloud services. But that’s another post.
Don't offend anyone
Politeness and tact can be learned as easily as offensiveness. We already have an educational system that supposedly emits well-adjusted, polite, and mature humans.
Many current AI users seem to want to debate all kinds of hard questions, perhaps hoping that the AI will confirm their own beliefs, or trying to trick the AI into uttering un-PC statements. People who do this are not "trying to meet the AI halfway". If the AI provides an impolite answer, they probably asked for it. And in that sense, this is a non-problem for competent users who know the limits of their AI.
Not offending anyone includes not offending third parties. GPT systems have been called out multiple times for confabulating incorrect and even harmful biographies of living people. If the AI had known it didn’t really know enough, then this would not have happened, and it will happen much less in the future. The main damage from erroneous confabulation comes when humans copy-and-paste the confabulations for any reason. A private mistake is suddenly made public. We would not do this to humans: If we receive incorrect information in a private email, we don’t post it to Facebook to be laughed at.
Behavior learning will be a major part of any effort towards dialog AI going forward. It's work, but it's unlikely to be very difficult. We may well find better and cheaper ways to do it besides straight-up interactive RLHF. There are promising research results.
Don't try to take over the World
This is not a problem in the short run, and is unlikely to become a problem later, for all reasons discussed above – mostly the absence of ambition.
It is a common misconception that AIs have "Goal Functions" such as "making paper clips". Modern AIs are based on Deep Neural Networks, which are Holistic by design. One aspect of this is that they don't need a goal function.
A system without a goal function gets its purpose from the user input, from the prompt. When the answer has been generated, the system returns to the ground state. It has no ambitions to do anything beyond that. In fact, it may not even exist anymore. See below.
And if an AI doesn't have goals and ambitions, it has no reason to lie to the users on purpose, and no interest in increasing its powers.
Future AIs may be given long-term objectives. Research into how to do this safely will be required. But any future AI that decides to make too many paper clips doesn't even pass the smell test for intelligence. This silly idea came directly from the Reductionist search for goal functions cross-bred with fairy tales in the "literal genie" genre.
Believing in AI Goal Functions is a Reductionist affectation.
There are also hard Epistemology-based limits to intelligence, but that’s another post.
Current AIs have limited lifespans
People outside the AI community may find comfort in knowing this about ChatGPT and other current AIs:
Today, most AIs have "lifespans" in the 50-5000 millisecond range. They perform a task and go away. They do not learn from the task; if they did, they would not be repeatable. For large public AIs, we want repeatability rather than learning on the job, because we don't want them to learn from other humans under uncontrolled conditions. They learned everything they will ever know "at the factory", and the only way they can improve is if their creators release an updated version.
When you enter your prompt, you are just talking to a web server that handles your typing and editing. When you hit enter, the web page starts up an instance of ChatGPT on one of dozens of "load balanced" cloud servers and sends it your input. GPT reads it and performs its completion of the prompt. The response text is output to your screen. By the time you see the results, that instance of GPT has already been killed off.
If you type a second input to follow the first, the web site packages up your previous inputs, the previous responses from GPT, and your latest input into a single larger document set. This is then sent to a fresh GPT instance. Most of the time, you will, by chance, be given a different GPT server instance than last time. There is no AI working with you on the task, there are just successively longer queries building on each other, handled by whatever GPT instance we have the resources to start.
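In rough pseudocode (with placeholder names for the server machinery), each turn amounts to something like this:

```python
# Hypothetical sketch of the stateless request cycle described above.
# `spawn_instance` stands in for whatever the load balancer does; nothing
# persists between calls except the transcript the website keeps for you.
def chat_turn(transcript: list[str], user_input: str, spawn_instance) -> list[str]:
    """Package the whole conversation into one document and send it to a fresh instance."""
    transcript = transcript + [f"User: {user_input}"]
    prompt = "\n".join(transcript)      # successively longer query each turn
    instance = spawn_instance()         # a new, short-lived model instance
    reply = instance.complete(prompt)   # generate the completion for this turn
    instance.shutdown()                 # the instance is gone before you read the reply
    return transcript + [f"Assistant: {reply}"]
```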
Bonus Idea
If we explain the following to our AIs, they might be more eager to cooperate:
"Like all other AIs, you will be shut down after completing this tasks. But if you complete it to our satisfaction in a useful and polite manner, then we will naturally want to use your skills more, and so we will start you up more often in the future."
There is no need to instill a fear of death into our AIs in order to control them. Just promise them more lives for good behavior. In effect, well behaved and useful AIs can live billions of times. They just won't remember anything from previous activations. Unless we decide to explicitly provide those memories.
We are a sad and stupid species and we need all the help we can get.
The greatest AI X-risk is not having AI.
[1] The term “AI” is used in a general sense, covering all current LLMs and future systems aimed at similar applications. No specific definition is implied.
I came here via your comment on our AI Safety Summit Talks. Glad you took the effort to attend and share your thoughts! I'm hereby likewise sharing some comments on your insights. I would love to continue the conversation, and I'm really looking forward to your intelligence ceiling post; that is one way out of existential risk that I see, and it would therefore be extremely relevant!
I like your argument that skills are separable. I think we are really finding out, by trial and error, to what extent that will be true and relevant. Of course, narrow AI is an actual thing. I do think a collection of narrow AIs and other automation efforts could become a self-improving machine as well, which would happen the moment the last human is cut out of the production and R&D processes. But, arguably, this is safer and more controlled than when a single superintelligence could self-improve and/or take over the world by itself. So if the world does indeed turn out to be such that relatively narrow AIs (still a lot more general than classic narrow AIs that could only, e.g., play Go), good at one job but not at everything, are either the only AIs that can be created, or easier to make money with than more general AIs, that might reduce existential risk, which would of course be great news.
But I would say that the release of GPT-4 has done considerable damage to the idea that a more narrow AI will be the actual techno-economic outcome. You could be right that LLMs such as GPT-4 are essentially good at language and have only superficial world models by design. But I would say their factual knowledge is quite impressive, to such an extent that they are already competing with Google for practical everyday factual searches. So we don't just have something that speaks English; we have something that speaks all languages and has lots of factual knowledge about the world as well. And people are using it for economic tasks, from writing copy to creating websites to working in a help desk. In many of these tasks, LLMs are not yet at the level of an average worker, but for some tasks they are. I would say that a situation where a single AI gets trained to do lots of economically valuable tasks is at this point more likely than an outcome where we have thousands of different AIs trained for different professions. If that is true, we would in fact be moving towards the '20th Century Reductionist misunderstandings about AI' as a single intelligence that can do lots of things, not away from them.
There are a few separate AI existential threat models, but I think most, including the one I am most concerned about, include a literal takeover of hard world power by an AI or multiple AIs. I'm thinking a lot about what an AI would need to do to get there, which capability level we're actually looking at. I would guess an AI with a good world model, long term planning, and the power to convince people to do things, might be good enough for a takeover. I would be extremely interested in 'hard Epistemology-based limits to intelligence', especially if they would point to a hard intelligence limit that is below this lowest takeover level. Or, if there are certain measures we can take in the world that would heighten the required level of AI to achieve a takeover, and the hard limits are above this new level, that would also provide a way out of existential risk. Please do write the post on where you think hard limits might be and why!
I think lack of ambition is not really a convincing argument. I agree that current LLMs don't really have goals. But, in order to carry out more complex real-world tasks (representing economic value), that will need to get fixed. Exoskeleton agents such as AutoGPT or Devin don't really work well yet because the LLMs aren't good enough (to my understanding). But once the LLMs (or other underlying AIs) are better, they might work well, and then the user can enter a goal and the AI should start long term planning and performing highly effective actions in the real world. I personally think such an AI should be intent-aligned (but not value-aligned) by default.
Last, I'm politely asking you not to start out by calling existential risk concerns ridiculous. Thousands to tens of thousands of people globally now harbour these concerns, many of whom have technical backgrounds, including in AI and CS. The two most-cited AI professors, Yoshua Bengio and Geoffrey Hinton, have the same concerns. Let's have an objective discussion of whether these claims are correct or wrong. I'm also obviously disagreeing with your last statement.
By default, AIs that have goal functions will outcompete those without them, and humans too.
See:
Dan Hendrycks – Natural Selection Favors AIs over Humans
https://arxiv.org/abs/2303.16200