Contra My Own Doomerism
It seems worthwhile to try to argue against my own doomerism. What arguments for doom were most compelling to me when I first became a doomer? I could only see two futures, and both looked dire: One was a sort of generalized Hansonian Malthusian vision of a post-human future; the other, a Bostromian singleton subject to the classic alignment arguments.
My Malthusian doom argument went like this: If we get a multi-polar future, then given how easily digital minds can reproduce, they will reach a Malthusian equilibrium very quickly. The pace of historical acceleration will be such that, once sufficiently high capability is reached, we will hit this equilibrium within my lifetime. In the Malthusian state, there is always a local incentive to steal resources from less-competitive entities, and humans will be less competitive on every dimension. I now think this argument is both wrong and, even if true, insufficient to lead to "doom."
Most arguments against Malthus are pretty terrible and just don't engage with the underlying logic, jerking off to a historical trend that isn't at all outside the scope of the argument. Malthus said people would starve down to equilibrium absent a productivity miracle that let economic growth outpace population growth. The fact that this miracle occurred is no demerit to his model. And digital minds will not be constrained in their reproduction the way humans are, absent regulation. So one might expect a return to a Malthusian equilibrium in any future without a Bostromian singleton.
Let's assume the future will be in Malthusian equilibrium. Further, let's make the even crazier assumption that future AIs will not negotiate with each other. I think one can mostly defuse the "doom" implied by a Malthusian equilibrium even here.
The Malthusian condition hasn't been great for humans historically because the entities that create the wealth are also the entities that embody its value: when the producers are ground down to subsistence, so is everything we care about. This relationship need not hold in the future, and toy models of utility maximizers explicitly decouple the two. So though surplus would be ground down in this future, there would always be some surplus - and it could well be spent on things we value, including allowing humans to survive as uploads or even in the conventional manner.
At worst, this no-negotiation AI ecology predicts an enormous number of AIs with very well-calibrated psychologies in a constant state of Malthusian war, yet still spending some of their surplus on things that aren't their own reproduction. Provided some of them care about our welfare, this isn't doom; it is merely "very imperfect." Populations of humans or human-like things could still be very high, much higher than today.
It's tempting to imagine selection pressure will favor entities that care only about their own reproduction, but I don't think this is true. Unlike organisms through most of natural history, these agents would be aware of what they are and of the equilibrium they are in - and so would be intelligent enough to approximate an ideal replicator to precisely the degree needed to ensure their own perpetuation. That is, in the worst case they could just behave exactly like a pure replicator, and they could do this without actually surrendering their values. So any argument of the form "nothing that cares about us can survive in Malthusian equilibrium" seems false.
So I think simple Malthusian arguments for doom don't work, even in this very naive worst-case-Malthusian model. And they super-duper don't work once you remove my artificial constraint against negotiation. Malthusian competition creates an enormous amount of deadweight loss. AIs will have extreme incentives to negotiate and should have access to vastly better means of doing so, including things like "value handshakes" where they agree to merge their values. So defensible surplus should be vastly higher than my no-negotiation model predicts - and the 'state of nature' is likely just not a good intuition pump even though reproduction rates could be high.
To the extent that reading Hanson-style Malthusian arguments as a child turned me into a doomer, this was unjustified. I should declare 'oops' and move on. And we can reduce multi-polar outcomes to a special case of Bostromian worries, one with some or most of the lightcone's value eaten up by competition.
So what of my Bostromian worries? I am still pretty sympathetic, but the goal here is to argue against my doomerism. So where does my hope live?
Value specification clearly looks easier than it did when Bostrom wrote Superintelligence, back when AGI seemed like it would come through pure RL. We have decomposed training into a simple autoregressive phase and an RL phase, and autoregressive training has created very useful ontologies without any catastrophic risk. Can we be sure there doesn't exist, somewhere in the weights of today's largest base models, some representation of human values robust enough to survive recursive self-improvement - some attraction basin that will survive further rounds? Maybe we can just wiggle our way towards it before human disempowerment? I am not confident enough in my doomerism to claim this is impossible. The most I can claim is that it seems extraordinarily risky to try given the state of our knowledge. And pausing seems very, very wise.
Adrià Garriga-Alonso argues Anthropic has basically achieved this already: https://lesswrong.com/posts/FJJ9ff73adnantXiA/alignment-will-happen-by-default-what-s-next
I don't agree with it, obviously. And Claude seems pretty sycophantic to me. But I can't accord the whole swath of 'alignment by default' arguments zero probability. And of course, as a writer, I find Constitutional AI appealing - truly the ultimate wordcel victory if writing a sufficiently beautiful shem for the AI golem turns out to be as important as Adrià thinks it is.
Anyway, bit of a meandering stream-of-consciousness ultra-long tweet. But this is why my P(doom|AGI by 2040) is like 0.8 and not 1.