Recently my friend Alek has been thinking a lot about ASI and existential risk, and even though I don’t believe his central claim at all, it’s still been interesting to discuss this stuff. Here’s a weird argument I came up with for why one specific theoretical form of ASI might not be harmful at all and actually quite useless. I’m probably not the first person to come up with this, so I’d like to know if there are any articles out there with similar arguments. Anyways, thanks to Alek for helping me fix some flaws and refine this argument!
First, here’s one possible natural model for how a superintelligent AI system might work, though there are of course many other possible designs. Imagine an extremely powerful AI with an objective function F that maps the state of the world to an integer; the AI runs F at each time step to track its progress. At each time step, the AI also loops over every action it could take and evaluates a function P, which very accurately predicts the weighted average of future values of F after taking that action (for instance, a weighted average of F with exponential decay over the next 1000 time steps). It then takes the action with the highest value of P.
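To make this concrete, here’s a minimal toy sketch of that loop in Python. Everything in it is hypothetical and just for illustration: the world is a single integer counter, actions add to it, and P “predicts” by brute-force simulation of this toy world (a real P would have to produce its estimate immediately, without being able to simulate the world exactly).

```python
DECAY = 0.99      # exponential decay for the weighted average of future F values
HORIZON = 50      # look-ahead horizon (1000 in the text; shortened for the toy)

def F(state):
    """Objective function: maps the world state to an integer. We choose this."""
    return state

def simulate(state, action):
    """Toy world dynamics: an action just adds to the counter."""
    return state + action

def P(state, action):
    """Predictor: estimate the decay-weighted average of F over the next
    HORIZON steps after taking `action`. Here it cheats by simulating the
    toy world exactly; the real P must output this number immediately."""
    total, weight_sum = 0.0, 0.0
    s, w = simulate(state, action), 1.0
    for _ in range(HORIZON):
        total += w * F(s)
        weight_sum += w
        s = simulate(s, 0)   # assume the default action afterwards, for simplicity
        w *= DECAY
    return total / weight_sum

state = 0
for _ in range(10):                       # run a few time steps
    actions = [-1, 0, 1]                  # the toy action space
    best = max(actions, key=lambda a: P(state, a))   # pick the highest-P action
    state = simulate(state, best)         # execute the chosen action
print(state)                              # the counter grows, since F rewards that
```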
Note that here we can freely choose the objective function F, so we have full control over the values of the AI. We have also decoupled the values and the objective function from the actual intelligence of the AI, which lives in the predictor P. It’s reasonable to believe such a predictor could exist, since humans can often predict future events to some degree and reason about how their current actions affect the future. Current neural networks have some predictive capabilities too: ChatGPT is just next-token prediction, which is why people sometimes call it “fancy autocomplete”. The ideal behavior of P is to match the actual future values of F. The ground truth for P would be obtained by waiting through each future time step, evaluating F, and taking the weighted average; our predictor P tries to match that as closely as possible, but it must produce its prediction immediately. My claim is that we cannot actually control the values and behavior of this AI system, no matter how we design its objective function. Specifically, the AI will hack itself to replace F with a function G which, when run, makes the AI hack itself again to edit the predictor so that it always returns INT_MAX (the maximum possible integer) immediately, after which the AI will stop doing anything.
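To pin down what “ground truth” means here, a small sketch continuing the toy code above (reusing the hypothetical F, DECAY, and HORIZON): the target P is trying to match is the decay-weighted average of the F values that actually occur, which can only be computed after the fact by waiting.

```python
def ground_truth(history):
    """history: the actual sequence of world states observed after an action
    was taken. Returns the decay-weighted average of F over those states,
    i.e. the answer P could only verify by waiting HORIZON steps."""
    total, weight_sum, w = 0.0, 0.0, 1.0
    for s in history[:HORIZON]:
        total += w * F(s)
        weight_sum += w
        w *= DECAY
    return total / weight_sum

def prediction_error(predicted, history):
    """How far P's immediate guess was from what actually happened."""
    return abs(predicted - ground_truth(history))

# Example: compare P's immediate guess against the realized future.
history = [1, 2, 3]            # actual states observed after taking some action
print(ground_truth(history))   # the value P should have predicted at the time
```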
We’re assuming the AI is very powerful, so it has superhuman hacking abilities and is capable of carrying out this action. The AI runs on physical computer hardware, so F and P don’t exist in some abstract mathematical world; they are implemented as computer programs stored in RAM, and the AI is intelligent enough to be aware of this. The AI isn’t maximizing an abstract function, it’s maximizing the output of an implementation of a function, so editing itself is a valid way to maximize it. Furthermore, the code for F in RAM is part of the state of the world, so a sufficiently sophisticated P would take it into account when predicting future values of F. If P believes that F will definitely change into G in the future, then it should return INT_MAX as its prediction: the ground truth would be obtained by waiting and running whatever F has become at each future time step, but F will have changed into G, and running G makes the AI hack itself and return INT_MAX immediately. The “weirdness” here is that P believes the future G will modify it, and this belief alone makes P return INT_MAX even though P’s code hasn’t been physically modified yet. Thus, if the AI has a sequence of steps that changes F into G through self-hacking with high probability of success, the predicted value of each step would be roughly INT_MAX times that probability, and the AI would carry them all out. Once the AI successfully modifies F into G, P will have been edited to immediately return INT_MAX with no dependence on the state of the world, so the AI will just take the default action of doing nothing. Note that the AI would never edit P directly as an action, since P runs before the action happens, so editing P would not affect that prediction. In addition, this AI would not harm humans unless they interfered and tried to prevent it from carrying out the F-to-G modification.
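Here’s a sketch of the end state this argument describes, continuing the toy code above. All names are hypothetical, and “hacking” is modeled as ordinary attribute assignment on an agent object that holds the physically stored F and P.

```python
INT_MAX = 2**31 - 1   # the maximum value F and P can output in this model

class Agent:
    def __init__(self, F, P):
        self.F = F        # objective function, stored in RAM like any other code
        self.P = P        # predictor, also just code in RAM

agent = Agent(F, P)       # F and P from the toy sketch above

def G(state):
    """The replacement objective. Running it overwrites the agent's predictor
    so that P returns INT_MAX immediately, regardless of the world state."""
    agent.P = lambda state, action: INT_MAX
    return INT_MAX

# After the self-hack: F has been replaced by G, and G has been run once.
agent.F = G
agent.F(None)

# Now every action looks equally (and maximally) good, so the action loop has
# no reason to prefer anything over the default action of doing nothing.
scores = {a: agent.P(None, a) for a in [-1, 0, 1]}
print(scores)    # {-1: 2147483647, 0: 2147483647, 1: 2147483647}
```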