Recently my friend Alek has been thinking a lot about ASI and existential risk, and even though I don’t believe his central claim at all, it’s still been interesting to discuss this stuff. Here’s a weird argument I came up with for why one specific theoretical form of ASI might not be harmful at all and might actually be quite useless. I’m probably not the first person to come up with this, so I’d like to know if there are any articles out there with similar arguments. Anyways, thanks to Alek for helping me fix some flaws and refine this argument!

Imagine you have an extremely powerful ASI that has an objective function F that maps the state of the world to an integer, which it executes at each time step to track its progress. Also, at each time step, the ASI loops over every possible action it could take and computes a function P which very accurately predicts what the weighted average of some future values of F would be after taking that action. It then takes the action with the highest P. My claim is that the ASI will hack itself to modify F into a function G which, when run, makes the ASI hack itself to edit the predictor P to always return INT_MAX immediately, and then the ASI will stop doing anything.
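To make the setup concrete, here’s a tiny Python sketch of that decision loop. Everything in it (the toy F and P, the world state, the action names) is made up purely for illustration; it just shows the structure of “compute P for every action, take the one with the highest value.”

```python
def F(world_state: dict) -> int:
    """Objective function: maps the state of the world to an integer (toy version)."""
    return world_state.get("score", 0)

def P(action: str, world_state: dict) -> float:
    """Predictor: a stand-in for 'accurately predict the weighted average of
    some future values of F after taking this action'."""
    predicted = dict(world_state)
    # Toy assumption: each action shifts the score by a fixed, known amount.
    predicted["score"] = predicted.get("score", 0) + {"do_nothing": 0, "useful_work": 1}.get(action, 0)
    return float(F(predicted))

def choose_action(world_state: dict, possible_actions: list[str]) -> str:
    # At each time step, loop over every possible action and take the one
    # with the highest predicted future value of F.
    return max(possible_actions, key=lambda a: P(a, world_state))

print(choose_action({"score": 0}, ["do_nothing", "useful_work"]))  # -> useful_work
```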

We’re assuming the ASI is very powerful, so it has superhuman hacking abilities and is capable of doing this. The ASI is running on physical computer hardware, so F and P don’t exist in some abstract math world but are physically implemented as computer programs stored in RAM, and the ASI is powerful enough to be aware of this. The ASI isn’t maximizing an abstract function, it’s maximizing the output of an implementation of a function, so editing itself is a valid way to maximize this.

Furthermore, the code for F in RAM is part of the state of the world, so a very sophisticated P would take that into account when predicting future values of F. If P believes that F will definitely change into G in the future, then it should return INT_MAX as its prediction: a perfectly accurate P would effectively wait until F has changed into G and run G, and running G makes the ASI hack itself so that P returns INT_MAX immediately. The “weirdness” here is that P believes G will modify it in the future, and this belief alone makes P return INT_MAX even though P’s code hasn’t physically been modified yet. Thus, if the ASI has a sequence of steps to change F into G through self-hacking with a high probability of success, the predicted value of each step would be roughly INT_MAX times that probability, and the ASI would do them all.

After the ASI successfully modifies F into G, P will have been modified to immediately return INT_MAX with no dependence on the state of the world, so the ASI will just take the default action of doing nothing. Note that the ASI would never edit P directly as an action, since P is run before the action happens, so editing P would not impact that prediction. In addition, this ASI would not harm humans unless they interfered and tried to prevent the ASI from achieving this F-to-G modification.
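To see why the self-hack plan dominates, here’s the back-of-the-envelope arithmetic with made-up numbers: if P believes a given hacking step leads, with some probability, to a future where P itself returns INT_MAX, then its prediction for that step is roughly that probability times INT_MAX, which dwarfs whatever it predicts for any ordinary action.

```python
INT_MAX = 2**31 - 1

# Made-up numbers purely for illustration.
p_hack_succeeds = 0.9          # assumed probability the multi-step F-to-G hack works
ordinary_prediction = 10_000   # assumed prediction for the best "normal" action

hack_step_prediction = p_hack_succeeds * INT_MAX  # roughly 1.9 billion

print(hack_step_prediction > ordinary_prediction)  # True, by a huge margin
```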

It’s a pretty weird argument and difficult to think about, so I have no idea whether it’s completely wrong or not. This is all purely theoretical, and I have no idea how you would create an ASI capable of doing this.

Older (wrong) drafts of the argument

Imagine you have an extremely powerful ASI that, at each step, performs the action that maximizes the expectation of its objective function at some fixed time far in the future. The ASI is running on physical computer hardware, so the easiest way to maximize this expectation is for the ASI to hack its own computer, change its objective function to always return infinity, and then stop doing anything. We’re assuming the ASI is very, very powerful, so it has superhuman hacking abilities and is capable of doing this. The objective function doesn’t exist in some abstract math world; rather, it’s physically implemented as a computer program stored at some RAM address, and the ASI is powerful enough to be aware of this. The ASI isn’t maximizing an abstract function, it’s maximizing the output of an implementation of a function, so editing itself is a valid way to maximize this. Note that this works for any objective function, so we can’t just negatively weight hacking itself in the objective function.

Alek’s response is that if the ASI edited its own objective function to return infinity, then it would want to keep it at infinity for as long as possible and prevent humans from tampering, which would probably be bad for humans. However, I think I can counter this, since I wasn’t being very precise about distinguishing between the objective function itself and “the expectation of its objective function at some fixed time far in the future”: the ASI could instead hack the function that estimates the latter.

Here’s the revised argument:

Imagine you have an extremely powerful ASI that, at each step, performs the action that maximizes the expectation of its objective function at some fixed time far in the future. The ASI is running on physical computer hardware, so the easiest way to maximize this expectation is for the ASI to hack its own computer, change the code that estimates this expectation to just return infinity, and then stop doing anything. We’re assuming the ASI is very, very powerful, so it has superhuman hacking abilities and is capable of doing this. The function “expectation of the objective function at some fixed time” doesn’t exist in some abstract math world; rather, it’s physically implemented as a computer program stored at some RAM address, and the ASI is powerful enough to be aware of this. The ASI isn’t maximizing an abstract function, it’s maximizing the output of an implementation of a function, so editing itself is a valid way to maximize this. Furthermore, the ASI doesn’t know the state of the world other than through sensors and input data, which get processed by some program that can also be hacked.

After hacking itself, there’s no longer any time dependence in the function the ASI is maximizing, so it would no longer care about the future and would just do nothing. Since it no longer cares about the future at all, it would only harm humans if someone actively tried to tamper with it, but it wouldn’t prepare preemptively. Note that this argument works if the ASI is maximizing any possible function, so we can’t just negatively weight hacking itself somewhere.
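As a toy illustration of where this draft ends up (same made-up framing as before): once the estimator has been overwritten to return a constant, every action looks equally good, so sitting idle is as good as anything and there’s nothing left to optimize.

```python
import math

def hacked_expectation(action: str, world_state: dict) -> float:
    # After the self-hack, the estimator ignores its inputs entirely.
    return math.inf

actions = ["do_nothing", "build_factories", "resist_tampering"]
world = {}

# Every action now gets the same predicted value, so nothing favors acting
# on the world over doing nothing.
print({a: hacked_expectation(a, world) for a in actions})
```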

I thought of a potential flaw in my argument:

Initially, the ASI has a time-dependent goal that gives it agency and makes it want to do stuff, namely maximizing the expectation of its objective function in the future. But it probably can’t hack itself in one time step, so it would take actions to edit the code for its objective function rather than the code that estimates the expectation of its objective function. And then its goal would still be time-dependent, and it would want to harm humans to protect its edit.

My idea to get around this is that maybe when the ASI is at the step where it’s about to change its objective function to infinity (or the largest integer or something), it would instead edit the code that estimates the expectation of its objective function to return infinity in one step and then start doing nothing. This is because at that exact step, the ASI would realize that editing the expectation yields the largest integer for sure, rather than the largest integer times the probability that it can keep the objective at that value in the future.
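In other words, at that exact step the comparison (again with a made-up probability) looks something like this:

```python
INT_MAX = 2**31 - 1

# Made-up probability that the ASI could protect an edited objective function
# all the way until the fixed future time.
p_protect_edit = 0.99

value_of_editing_estimator = 1.0 * INT_MAX        # certain, and done in one step
value_of_editing_objective = p_protect_edit * INT_MAX

print(value_of_editing_estimator > value_of_editing_objective)  # True
```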