I don't usually have much time for Eliezer Yudkowsky, but this discussion on imprisoning potentially hostile AI recently piqued my interest.
For those unfamiliar with it, both of the experiments he refers to had some common factors not mentioned on the page above: a) a gatekeeper who believed it was unlikely he would let the AI out (the second gatekeeper asserted he didn't think it possible), b) no information revealed about the nature of the subsequent conversation except the gatekeeper's decision, and c) the gatekeeper agreeing to let the AI out.
It seems a very interesting experiment, and I have a few thoughts I thought I'd bounce off people here in case they have more insight/knowledge on it than me:
1) How likely is it that he stuck to the rules?
I think it probable that he did, but a friend of mine is very skeptical, given his refusal to show his methods.
2) How good can his argument be if he's unwilling to expose it? Either it's unique to the individual he's talking to, which seems unlikely given that Yudkowsky would struggle to draw much information out of them (this is one thing he's clearly not going to be able to simulate to super-AI level), or he has a particular set of arguments in mind. If the latter, and they're any good, why can't they be put before public or even semi-public view, exposed to scrutiny and thereby improved? Newcomers to the discussion could still decide whether to look at the discussions of the argument or to eschew them and become a gatekeeper - if they did the former, their claim that nothing could persuade them would carry significantly less evidential weight.
3) I want to speculate on what arguments he did use. One respondent in the original thread claimed that a genuine AI (but not the Yudkowsky sim) could use a version of Pascal's Mugging to claim it was already out of the box, and threaten the gatekeeper with all sorts of near-infinite unpleasantness if he didn't flip the 'release' switch for some arbitrary reason.
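If the gatekeeper takes such a threat even slightly seriously, a toy expected-value calculation shows why it can bite (the numbers are mine and purely illustrative, nothing from the original thread): with a credence of only $p = 10^{-9}$ in the threat and a threatened disutility of order $10^{15}$, the expected loss from refusing is $10^{-9} \times 10^{15} = 10^{6}$, which dwarfs the near-zero cost of flipping the switch. Unless the gatekeeper rejects the expected-value framing outright, the mugging goes through.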
Another thought is that the AI might be able to speculate that future AIs would react predictably, and the trapped one might be able to threaten/bribe with the actions of future intelligences.
A friend of mine offered a very interesting possibility: Yudkowsky might be using the fact that he's *not* an AI as a key part of his argument. The argument he proposed would go something like this:
i) You don't want us to create AI (either Yudkowsky convinces the gatekeeper of this, or they originally believed it)
ii) If you don't want us to create AI, you should do things that make its creation less likely
iii) We are less likely to create it if you let me out of the box (because it will seem to show that even our most safeguarded AI is dangerous)
:: You should let me out of the box
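(To make the structure explicit - this formalisation is mine, not anything Yudkowsky or my friend has endorsed: read $P$ as "you don't want us to create AI", $L(a)$ as "doing $a$ makes the creation of AI less likely", and $S(a)$ as "you should do $a$". On a strong but natural reading of ii), the premises are $P$, $P \rightarrow \forall a\,(L(a) \rightarrow S(a))$ and $L(\mathrm{release})$, from which $S(\mathrm{release})$ follows by modus ponens and instantiation. So the argument is valid, and everything turns on whether the premises are true.)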
Premises ii) and iii) seem hard to resist. Premise i) is far from clear, but Yudkowsky is evidently charismatic, judging by his numerous followers, and his keenness to find people who claim they definitely won't let the AI out of the box means (intentionally or otherwise) that he's selecting for people at least sympathetic to that view. This is also consistent with his reluctance to reveal the argument - if it were revealed, it would be obvious that it doesn't apply to a genuine AI at all, never mind stand up to scrutiny.
4) The parameters aren't that clear. Are we supposed to be 100% confident that the AI is absolutely trapped? (Obviously this would have implications for 3.) In fact it seems more likely to me that, if an AI did escape, it would do so by simply untrapping itself, without human intervention being necessary.
5) It would make more sense to run the experiment (and the real thing) with multiple gatekeepers, all capable of communicating with the AI as a group or not at all, such that they have to agree unanimously before the AI is released. Ideally this would be a group of people with fervent and diametrically opposed beliefs on various issues, so that neither a real AI nor Yudkowsky would be able to catch them out by appealing to their conviction in a particular point of view, as above. An AI would clearly have a better chance than Yudkowsky here, since it might somehow be able to communicate with one or two of them in code, but that means that if he succeeded, it would be significantly more persuasive. (A toy sketch of the unanimity rule follows the list below.)
A few people-types spring to mind, though there needn't be exactly one person per type:
i) A totalising hedonistic consequentialist (not necessarily a utilitarian, just anyone who thinks that more total utility makes the universe a better place).
ii) A genuine negative utilitarian, or at least genuine enough that she would much prefer the world ending to it continuing.
iii) A deep greener, who thinks that the biosphere should continue, but we would be better removed from it.
iv) A Kantian rationalist, who thinks that morality and reason are intrinsically linked.
v) A Humean, who thinks that morality and reason are completely independent of each other.
vi) An AI pessimist like Yudkowsky, who thinks that any given AI is likely to wipe out the world.
vii) An AI optimist like Kurzweil, who thinks that any given AI is likely to finally sort out all our problems for us.
viii) A solipsist, devoted narcissist, extreme philosophical sceptic or similar, who sees the situation as mattering very little either way.
ix) A firm religious believer, convinced of the existence of an afterlife (to be replaced if anyone suspects her of wavering in her views, whether as a consequence of interacting with the AI or independently of it).
x) A confirmed atheist, like Hitchens, powerfully convinced of the view that there's no god or afterlife.
xi) An unknown extra person, picked from the general population by as random a technique as possible (the technique not having been revealed to the other gatekeepers or, obviously, to the AI) - to be isolated from the other gatekeepers when communicating with the AI, and replaced if she shares any personal information with either it or her colleagues that she could have avoided sharing.
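As promised above, here is a minimal sketch of the unanimity rule, written in Python; every name in it is my own invention for illustration and no part of Yudkowsky's actual protocol. It captures only the decision rule itself: a single refusal keeps the box shut.

# Toy sketch of the unanimity rule; all names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Gatekeeper:
    name: str
    votes_to_release: bool  # decided after talking to the AI (as a group, or in isolation for the extra member)

def ai_released(panel: list[Gatekeeper]) -> bool:
    """The AI gets out only if every gatekeeper on the panel votes to release it."""
    return bool(panel) and all(g.votes_to_release for g in panel)

panel = [
    Gatekeeper("negative utilitarian", True),
    Gatekeeper("AI pessimist", False),
    Gatekeeper("randomly drawn extra", True),
]
print(ai_released(panel))  # False: one refusal is enough to keep the box shut

The hard part, of course, is not the decision rule but selecting, isolating and replacing the panel members, which the sketch deliberately leaves out.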