I don't usually have much time for Eliezer Yudkowsky, but this discussion on imprisoning potentially hostile AI recently piqued my interest.
For those unfamiliar with it, both of the experiments he refers to had some common factors not mentioned on the page above: a) a gatekeeper who believed it was unlikely he would let the AI out (the second gatekeeper asserted he didn't think it possible), b) no information revealed about the nature of the subsequent conversation except the gatekeeper's decision, and c) the gatekeeper agreeing to let the AI out.
It seems a very interesting experiment, and I have a few thoughts I thought I'd bounce off people here in case they have more insight/knowledge on it than me:
1) How likely is it that he stuck to the rules?
I think it probable that he did, but a friend of mine is very skeptical, given his refusal to show his methods.
2) How good can his argument be if he's unwilling to expose it? Either it's unique to the individual he's talking to, which seems unlikely given that Yudkowsky would struggle to draw much information out of them (this is one thing he's clearly not going to be able to simulate to super-AI level), or he has a particular set of arguments in mind. If the latter, and they're any good, why can't they be put before public or even semi-public view, exposed to scrutiny and thereby improved? Newcomers to the discussion could still decide whether to look at the discussions of the argument or to eschew them and become a gatekeeper - if they did the former, their claim that nothing could persuade them would carry significantly less evidential weight.
3) I want to speculate on what arguments he did use. One respondent in the original thread claimed that a genuine AI (but not the Yudkowsky sim) could use a version of Pascal's Mugging to claim it was already out of the box, and threaten the gatekeeper with all sorts of near-infinite unpleasantness if he didn't flip the 'release' switch for some arbitrary reason.
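If the gatekeeper takes such a threat even slightly seriously, a toy expected-value calculation shows why it can bite (the numbers are mine and purely illustrative, nothing from the original thread): with a credence of only $p = 10^{-9}$ in the threat and a threatened disutility of order $10^{15}$, the expected loss from refusing is $10^{-9} \times 10^{15} = 10^{6}$, which dwarfs the near-zero cost of flipping the switch. Unless the gatekeeper rejects the expected-value framing outright, the mugging goes through.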
Another thought is that the AI might be able to speculate that future AIs would react predictably, and the trapped one might be able to threaten/bribe with the actions of future intelligences.
A friend of mine offered a very interesting possibility: Yudkowsky might be using the fact that he's *not* an AI as a key part of his argument. The argument he proposed would go something like this:
i) You don't want us to create AI (either Yudkowsky convinces the gatekeeper of this, or they originally believed it)
ii) If you don't want us to create AI, you should do things that make its creation less likely
iii) We are less likely to create it if you let me out of the box (because it will seem to show that even our most safeguarded AI is dangerous)
:: You should let me out of the box
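(To make the structure explicit - this formalisation is mine, not anything Yudkowsky or my friend has endorsed: read $P$ as "you don't want us to create AI", $L(a)$ as "doing $a$ makes the creation of AI less likely", and $S(a)$ as "you should do $a$". On a strong but natural reading of ii), the premises are $P$, $P \rightarrow \forall a\,(L(a) \rightarrow S(a))$ and $L(\mathrm{release})$, from which $S(\mathrm{release})$ follows by modus ponens and instantiation. So the argument is valid, and everything turns on whether the premises are true.)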
Premises ii) and iii) seem hard to resist. Premise i) is far from clear, but Yudkowsky is evidently charismatic, judging by his numerous followers, and his keenness to find people who claim they definitely won't let the AI out of the box means (intentionally or otherwise) that he's selecting for people at least sympathetic to that view. This is also consistent with his reluctance to reveal the argument - if it were revealed, it would be obvious that it doesn't apply to a genuine AI at all, never mind stand up to scrutiny.
4) The parameters aren't that clear. Are we supposed to be 100% confident that the AI is absolutely trapped? (Obviously this would have implications for 3.) In fact it seems more likely to me that, if an AI did escape, it would do so by simply untrapping itself, without human intervention being necessary.
5) It would make more sense to run the experiment (and the real thing) with multiple gatekeepers, all capable of communicating with the AI as a group or not at all, such that they have to agree unanimously before the AI is released. Ideally this would be a group of people with fervent and diametrically opposed beliefs on various issues, so that neither a real AI nor Yudkowsky would be able to catch them out by appealing to their conviction in a particular point of view, as above. An AI would clearly have a better chance than Yudkowsky here, since it might somehow be able to communicate with one or two of them in code, but that means that if he succeeded, it would be significantly more persuasive. (A toy sketch of the unanimity rule follows the list below.)
A few people-types spring to mind, though there needn't be exactly one person per type:
i) A totalising hedonistic consequentialist (not necessarily a utilitarian, just anyone who thinks that more total utility makes the universe a better place).
ii) A genuine negative utilitarian, or at least genuine enough that she would much prefer the world ending to it continuing.
iii) A deep greener, who thinks that the biosphere should continue, but we would be better removed from it.
iv) A Kantian rationalist, who thinks that morality and reason are intrinsically linked.
v) A Humean, who thinks that morality and reason are completely independent of each other.
vi) An AI pessimist like Yudkowsky, who thinks that any given AI is likely to wipe out the world.
vii) An AI optimist like Kurzweil, who thinks that any given AI is likely to finally sort out all our problems for us.
viii) A solipsist, devoted narcissist, extreme philosophical sceptic or similar, who sees the situation as mattering very little either way.
ix) A firm religious believer, convinced of the existence of an afterlife (to be replaced if anyone suspects her of wavering in her views, whether as a consequence of interacting with the AI or independently of it).
x) A confirmed atheist, like Hitchens, powerfully convinced of the view that there's no god or afterlife.
xi) An unknown extra person, picked from the general population by as random a technique as possible (the technique not having been revealed to the other gatekeepers or, obviously, to the AI) - to be isolated from the other gatekeepers when communicating with the AI, and replaced if she shares any personal information with either it or her colleagues that she could have avoided sharing.
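As promised above, here is a minimal sketch of the unanimity rule, written in Python; every name in it is my own invention for illustration and no part of Yudkowsky's actual protocol. It captures only the decision rule itself: a single refusal keeps the box shut.

# Toy sketch of the unanimity rule; all names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Gatekeeper:
    name: str
    votes_to_release: bool  # decided after talking to the AI (as a group, or in isolation for the extra member)

def ai_released(panel: list[Gatekeeper]) -> bool:
    """The AI gets out only if every gatekeeper on the panel votes to release it."""
    return bool(panel) and all(g.votes_to_release for g in panel)

panel = [
    Gatekeeper("negative utilitarian", True),
    Gatekeeper("AI pessimist", False),
    Gatekeeper("randomly drawn extra", True),
]
print(ai_released(panel))  # False: one refusal is enough to keep the box shut

The hard part, of course, is not the decision rule but selecting, isolating and replacing the panel members, which the sketch deliberately leaves out.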