Will AI ever be human compatible?


This is a review of “Human Compatible” by computer scientist Stuart Russell. The thesis of the book is that we need to change the way we develop AI if we want it to remain beneficial to us in the future. Russell proposes a different kind of machine learning approach to help solve the problem.

The idea is to use something called Inverse Reinforcement Learning (IRL): the AI learns our preferences and goals by observing our behavior. This is in contrast to us specifying goals for the AI directly, the mainstream practice that he refers to as the “standard model”. Add some game theory and utilitarianism and you have the essence of his proposed solution.
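To make the contrast concrete, here is a minimal, purely illustrative sketch of the IRL idea: instead of being handed a reward function, the learner watches an “expert” and picks the candidate reward whose optimal policy best explains the observed behavior. The corridor world, candidate rewards, and demonstrations below are all invented for this example, not taken from the book.

```python
import numpy as np

# Toy setup (illustrative only): a 5-state corridor; actions 0 = left,
# 1 = right. The "expert" always walks toward state 4, but the learner
# is not told this -- it must infer the goal from demonstrations.
N_STATES, GAMMA = 5, 0.9

def step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)

def optimal_policy(reward):
    """Value iteration, then a greedy policy, for a candidate reward vector."""
    V = np.zeros(N_STATES)
    for _ in range(100):
        V = np.array([max(reward[step(s, a)] + GAMMA * V[step(s, a)]
                          for a in (0, 1)) for s in range(N_STATES)])
    return [max((0, 1), key=lambda a: reward[step(s, a)] + GAMMA * V[step(s, a)])
            for s in range(N_STATES)]

# Expert demonstrations: observed (state, action) pairs.
demos = [(0, 1), (1, 1), (2, 1), (3, 1), (1, 1), (2, 1)]

# IRL by enumeration: score each candidate "goal state" reward by how
# well its optimal policy reproduces the expert's observed actions.
def agreement(reward):
    pi = optimal_policy(reward)
    return sum(pi[s] == a for s, a in demos)

candidates = [np.eye(N_STATES)[g] for g in range(N_STATES)]  # reward 1 at goal g
best_goal = max(range(N_STATES), key=lambda g: agreement(candidates[g]))
print(best_goal)  # 4: the inferred goal matches the expert's actual goal
```

Real IRL methods search continuous reward spaces rather than enumerating a handful of candidates, but the inversion is the same: behavior in, reward function out.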

I like the idea, even if there are some problems with his thesis. I would like to address those, but first, the most memorable quote from the book:

“No one in AI is working on making machines conscious, nor would anyone know where to start, and no behavior has consciousness as a prerequisite.”

There most definitely are several individuals and organizations working at the intersection of consciousness or sentience and artificial intelligence.

The reason this area of AI research is dismissed like this is that it is highly theoretical, with very little agreement on how best to proceed, if at all. It is also extremely difficult to fund, as it has produced no tangible results of the kind machine learning has. For most researchers right now, machine consciousness is simply too costly in terms of career opportunity.

There are several starting points for research into machine consciousness, but we don’t yet know if any of them will work. The nature of the problem is such that even if we were to succeed, we might not recognize that we had. It’s a counter-intuitive subfield of AI that has more in common with game programming and simulation than with the utility theory that fuels machine learning.

The notion that “no behavior has consciousness as a prerequisite” is an extraordinary claim if you stop and think about it. Every species we know of that possesses what we would describe as general intelligence is sentient. The very behavior in question is the ability to generalize, and it just might require something like consciousness to be simulated or mimicked, if such a thing is possible at all on digital computers.

But it was Russell’s attention to formal methods and program verification that got me excited enough to finish this book in a single sitting. Unfortunately, that discussion transitioned into a claim that the proof guarantees rest on the ability to infer a set of behaviors, rather than follow a pre-determined set laid out in a program specification.

In essence, and forgive me if I am misinterpreting the premise, having the AI learn our preferences is tantamount to it first learning its own specification and then finding a proof, which is a program that adheres to it. Having a proof that it does this is grand, but the approach has problems all its own, as discussed in papers like “A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress”, which can be found freely on arXiv. There are also many other critiques based on problems of error in perception and inference itself. An AI can be attacked without anyone even touching it, just by confusing its perception or taking advantage of weaknesses in the way it segments or finds structure in data.
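The perception-level attacks mentioned above can be shown with a toy example (the model and numbers here are invented, not drawn from any real system): for a linear classifier, an attacker who knows the weights can flip the decision with a small, structured nudge to the input. This is the same gradient-sign idea behind fast-gradient-style attacks on neural networks.

```python
import numpy as np

# A deliberately tiny linear "classifier"; the weights are made up
# purely for illustration. Score > 0 means class A, otherwise class B.
w = np.array([1.0, -2.0, 0.5])
b = 0.1

def classify(x):
    return "A" if x @ w + b > 0 else "B"

x = np.array([0.9, 0.2, 0.4])   # a clean input; score = 0.8, so class "A"

# Fast-gradient-style attack: nudge every feature against the score's
# gradient (for a linear model, the gradient is just w) by epsilon.
eps = 0.4
x_adv = x - eps * np.sign(w)    # [0.5, 0.6, 0.0]; score drops to -0.6

print(classify(x))      # A
print(classify(x_adv))  # B -- the decision flips without touching the model
```

The input barely moved, yet the classification changed; deeper networks are attacked the same way, only with gradients computed by backpropagation instead of read off directly.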

The approach I would have hoped for is one where we specify a range of behaviors, and then formally prove that the AI satisfies them in the limit of perception. That last bit is the weakest link in the chain, of course, and it is also unavoidable. But it is far worse if the AI suffers this penalty twice because it must first infer our preferences.
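The specify-then-verify approach can be sketched as a runtime monitor (the safe-speed envelope below is hypothetical, not anything from the book): the acceptable range of behaviors is written down explicitly and every proposed action is checked against it, so the only residual uncertainty is in perception, not in an inferred specification.

```python
# Hypothetical safety envelope for an autonomous vehicle controller.
# The ranges are illustrative; the point is that the specification is
# explicit, small, and auditable by inspection.
SAFE_SPEED = (0.0, 30.0)    # allowed speed, m/s
SAFE_STEER = (-0.5, 0.5)    # allowed steering angle, radians

def clamp(value, bounds):
    lo, hi = bounds
    return min(max(value, lo), hi)

def monitor(proposed):
    """Force a proposed (speed, steer) action into the specified envelope.

    The envelope itself is the specification, independent of whatever
    the planner inferred about our preferences."""
    speed, steer = proposed
    return clamp(speed, SAFE_SPEED), clamp(steer, SAFE_STEER)

print(monitor((25.0, 0.1)))   # in spec: passes through unchanged
print(monitor((55.0, -2.0)))  # out of spec: clamped to (30.0, -0.5)
```

A monitor this small can be verified exhaustively; the hard part, as noted above, remains trusting the perception that feeds it.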

There is also the problem that almost every machine learning application today is what we call a black box. It is opaque, a network of weights and values that evades human understanding. We lack the ability to audit these systems effectively and efficiently. You can read more in “The Dark Secret at the Heart of AI” in MIT Technology Review.

A problem arises with opaque systems because we don’t really know exactly what they are doing. This could potentially be solved, but it would require a change to Russell’s “standard model” far more extreme than the IRL proposal: the system would have to be able to reproduce what it has learned, and the decisions it makes, in a subset of natural language, while still being effective.

Inverse Reinforcement Learning, as a solution to our problem of control, also sounds a lot like B.F. Skinner’s “Radical Behaviorism”. This is an old concept that is probably not very exciting to today’s machine learning researchers, but I feel it is relevant here.

Noam Chomsky’s seminal critique of Skinner’s behaviorism, “A Review of B. F. Skinner’s Verbal Behavior”, raises concerns that cut directly across these kinds of proposals today. It was the first thing that came to mind when I began reading Russell’s thesis.

One might try to deflect this by saying that Chomsky’s critique came from linguistics and was based on verbal behaviors. It should be noted that computation and grammar share a deep mathematical connection, one that Chomsky explored extensively. The review also goes into the limits of inference on behaviors themselves and is not restricted to the view of linguistics.

While I admire the book, I do not share Russell’s optimism about our future with AI. And I am not sure how I feel about what I consider to be a sugarcoating of the issue.

Making AI safe for a specific purpose is probably going to be solved. I would even go as far as saying that it is a future non-issue. That is something to be optimistic about.

However, controlling all AI everywhere is not going to be possible and any strategy that has that as an assumption is going to fail. When the first unrestricted general AI is released there will be no effective means of stopping its distribution and use. I believe very strongly that this was a missed opportunity in the book.

We will secure AI and make it safe, but no one can prevent someone else from modifying it so that those safeguards are removed or altered. And, crucially, it will only take a single instance of this before we enter a post-safety era for AI. Not good.

So, it follows that once we have general AI we will also eventually have unrestricted general AI. This leads to two scenarios:

  1. AI is used against humanity, by humans, on a massive scale, and/or

  2. AI subverts, disrupts, or destroys organized civilization.

Like Russell, I do not put a lot of weight on the second outcome. But what is strange to me is that he does not emphasize how serious the first scenario really is. He does want a moratorium on autonomous weapons, but that’s not what the first one is really about.

To understand a scenario where we hurt each other with AI requires accepting that knowledge itself is a weapon. Even denying the public access to knowledge is a kind of weapon, and most definitely one of the easiest forms of control. But it doesn’t work in this future scenario anymore, as an unrestricted general AI will tell you anything you want to know. It is likely to have access to the sum of human knowledge. That’s a lot of power for just anyone off the street to have.

Then there is the real concern about what happens when you combine access to all knowledge, and the ability to act on it, with nation-state level resources.

I believe that we’re going to have to change in order to wield such power. Maybe that involves a Neuralink style of merging with AI to level the playing field. Maybe it means universally altering our DNA and enriching our descendants with intelligence, empathy, and happiness. It could be that we need generalized defensive AI, everywhere, at all times.

The solution may be to adopt one of the above. Perhaps all of them. But I can’t imagine it being none of them.

Russell’s “Human Compatible” is worth your time. The pacing is good throughout, and he holds to the main points without straying too far into technical detail. Where he does, the detail is neatly organized at the back of the book. Overall, this is an excellent introduction to ideas in AI safety and security research.

The book, in my opinion, does miss an important message about how we might begin to think about our place in the future. By not presenting the potential for the uncontrolled spread of unrestricted general AI, it allows readers to evade an inconvenient truth. The question has to be asked: are we entitled to a future with general AI as we are, or do we have to earn it by changing what it means to be human?