Very interesting. It brings to mind the saying that you will not be replaced by AI, but by a human using AI. In other words, we will not be terminated by AI, but by humans using AI. 🙁

> There are also hard Epistemology-based limits to intelligence, but that’s another post.

Oh, I am very interested in that one. 🙂


Your discussion of lying only tackles the easiest part of the problem. What about conditions where humans are willing to lie?

- when a captcha asks whether it's a robot?

- when asked whether the user's favorite politician will cause the country to prosper, and the AI is pretty sure the accurate answer would be no?

Apr 11 · Liked by Monica Anderson

Hmm. You write:

"My point is that if all skills are separable, and behaviors are learned just like other skills, then the simplest way to create well-behaved, well-aligned AIs is to simply not teach them any of these bad behaviors."

You then talk about RLHF (Reinforcement Learning from Human Feedback).

I am not a specialist, but my understanding is that before you get to the RLHF part, or perhaps during it, you feed them a great heap of examples. That heap is huge - it might be something like "everything our web crawlers found on the internet".

Two points:

- it's too big for each item to be individually selected by humans

- it's generated by lots and lots of random human beings, many of whom habitually do things we don't want the AIs doing, such as lying. If it's learning to write code, its input includes lots and lots of buggy code. If it's learning to speak English, its input includes lies, fiction, racism, etc. along with lots and lots of different dialects.

The AI then does things built out of small bits it saw in that data set, and the RLHF people tell it "don't do that" every time they notice it doing something unwanted.

I don't believe that it's practical to do RLHF long enough to catch all the rarer things the AI might do. If you had another AI already perfectly trained, it could do the job, but you don't. At best, you have a buggy one.

The result has been a largish quantity of well-publicized bloopers. When they turn up, and if they are publicized sufficiently, a band-aid is applied. But you can't ever catch them all. That means your users never know when the AI will, for example, tell a plausible story and present it as truth when it is in fact the kind of advice that people can kill themselves following.

The AI doesn't need any particular alignment to do that. It just needs to lack human heuristics about truth, falsehood, fiction, and little white lies, both when processing its initial training data, and afterwards.

Please go ahead and convince me otherwise, if you can. I'm a retired software engineer, but my specialty was operating systems, not AI. And so far I'm reacting to the proliferation of chatbots in the spirit of Risks (https://en.wikipedia.org/wiki/RISKS_Digest), but not the kind of risks you address in this essay.

What I predict is a combination of really nasty bugs and human over-reliance on not-really-intelligent AIs. I imagine a two-tier system where rich people get human therapists, teachers, and customer support, but for everyone else the chatbots are deemed "good enough", with no effective way to even report a problem.

And meanwhile we have less financially motivated misuse, such as chatbot-written articles posted to Wikipedia *complete with fake references to reliable sources like the New York Times*. (Yup, whatever chatbot they are using knows what a Wikipedia article should look like, but not that the references have to be real, let alone that they have to support what's said in the text.)

Apr 11 · Liked by Monica Anderson

In addition to being a brilliant researcher and linguist, Ms. Anderson is also a competent philosopher IMNTHO. Not enough of that to go around these days.
