A Simple Take on Scaling Laws
Why we need someone to "just do things" to find the next breakthrough in AI
Molly O’Shea (Sourcery) recently asked me on X why I think scaling laws are “pretty evidently slowing” contrary to statements by those like Eric Schmidt.1 So here’s my super abbreviated take on why I think that.
It's pretty difficult to break the efficient frontier given physical constraints. In other words, you can only expand these clusters so much. Even if Project Stargate commits billions to expansion, without proper government investment (in the form of an immense formal allocation in a spending bill), American companies are constrained by cluster size, which is in turn limited by GPU production capacity and energy. Hence Altman et al.'s immense investments in the latter (à la Helion).
So the major labs have turned to test-time compute scaling, as with o1 and other reasoning models. They only have so much compute available to them, at least for the next five years.2 All of this takes time.
To that end, time is the critical constraint in another respect. OpenAI “personality hire” Aidan McLaughlin rightly notes (citing the METR report) that models can complete exponentially longer tasks over time and “think” for longer. Nonetheless, they're “pretty evidently slowing” in my opinion vis-à-vis actual performance over time. Inference scaling has led to exponential improvement in the aggregate, but only in the short run.
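For a sense of what that exponential trend implies, here's a minimal sketch extrapolating METR's reported doubling time of roughly seven months for the length of tasks models can complete. The baseline task length and all outputs are illustrative assumptions, not METR's data.

```python
# Toy extrapolation of the trend METR reports: the length of tasks
# (measured in human time-to-complete) that frontier models can finish
# doubles roughly every ~7 months. All numbers here are illustrative.

DOUBLING_MONTHS = 7        # METR's reported doubling time (approximate)
BASE_TASK_MINUTES = 60     # assumption: a model today handles ~1-hour tasks

def task_horizon(months_from_now: float) -> float:
    """Task length (minutes) a model can complete, naively extrapolated."""
    return BASE_TASK_MINUTES * 2 ** (months_from_now / DOUBLING_MONTHS)

for months in (0, 12, 24, 36):
    print(f"+{months:2d} months: ~{task_horizon(months):,.0f}-minute tasks")
```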
My favorite quote on this is from OpenAI researcher Noam Brown, who said “it turned out that having a bot think for just 20 seconds in a hand of poker got the same boosting performance as scaling up the model by 100,000x and training it for 100,000 times longer.” Because inference scaling was relatively new to the public at the time, it was treated as a pretty novel breakthrough. It no doubt is.
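Brown's point can be restated as a compute-equivalence claim: if performance scales roughly with the log of compute, a small slice of test-time search can stand in for an enormous multiple of train-time compute. Below is a minimal sketch of that arithmetic; the log-linear curve, the slope, and the 5-point gain are all assumptions chosen to reproduce his 100,000x figure, not his actual methodology.

```python
# Toy model: assume benchmark performance grows linearly in log10(compute).
# Then a performance gain from test-time search is "worth" a train-compute
# multiplier of 10 ** (gain / slope). Slope and gain are made-up numbers.

SLOPE = 1.0  # assumed: performance points gained per 10x of train compute

def equivalent_train_multiplier(test_time_gain: float) -> float:
    """Train-compute multiplier matching a given test-time performance gain."""
    return 10 ** (test_time_gain / SLOPE)

# If 20 seconds of search buys ~5 points on this toy scale, it matches
# a 100,000x train-compute scale-up (10 ** 5):
print(f"{equivalent_train_multiplier(5.0):,.0f}x")  # -> 100,000x
```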
But his quote also affirms that we're beginning to reach the physical limits of compute, such that the major labs have turned to these ‘alternative methods’ rather than just expanding their clusters. Even so, you can only make the models think for so long. Time is a zero-sum resource, just like energy and compute. Meaning we are likely hitting a certain wall for now, even if it's not the ‘formal’ scaling wall.
As Ilya Sutskever said, “the 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing.” So there are basically two things that will happen from here:
1. We will find the “next thing” (akin to inference scaling) in the next year, and model performance will continue to improve exponentially.
2. This research will take a little longer and will coincide with cluster expansion over the next 5-10 years. So we'll see wild performance improvement, but not in the short term.
I’m inclined to take the latter view. So is Dwarkesh Patel, who recently wrote:
While this makes me bearish on transformative AI in the next few years, it makes me especially bullish on AI over the next decades. When we do solve continuous learning, we'll see a huge discontinuity in the value of the models. Even if there isn't a software-only singularity (with models rapidly building smarter and smarter successor systems), we might still see something that looks like a broadly deployed intelligence explosion. AIs will be getting broadly deployed through the economy, doing different jobs and learning while doing them in the way humans can. But unlike humans, these models can amalgamate their learnings across all their copies. So one AI is basically learning how to do every single job in the world. An AI that is capable of online learning might functionally become a superintelligence quite rapidly without any further algorithmic progress.
Perhaps Zuckerberg’s Superintelligence Labs or Ilya’s SSI will crack this sooner. For now, however, AI superintelligence progress is constrained by what limits us all: time.
1. By the way, Molly’s podcast is great; give it a listen if you have a few minutes out of your day. I think she’s really going to blow up in the coming months (more than she already has!).
2. Constraining compute is also, of course, the approach the American government is using to restrain China’s progress, via GPU export controls.