Current AI research has mostly converged on a single approach to AGI: gradient-based optimization of large neural networks on massive datasets. Deeper questions about intelligence have taken a backseat to brute empiricism, as theoretical results that aim at the fundamentals, such as Solomonoff induction, have remained computationally impractical.
Can we formulate principles of intelligence that are both universal and practical? How can these principles lead to learning algorithms where intelligence scales optimally with compute—the one resource we can control? Given infinite compute, what are the limits on how intelligent a system can be? We argue that solving these fundamental questions is a more promising path to AGI than pure empirical scaling.
The bitter lesson says general methods that scale with compute always win.
It's also worth noting that big labs are likely already applying RL to the problem of intelligence itself, largely with existing approaches. In theory, this is sound: use strong pretrained models to generate code for learning algorithms, test their generalization capabilities at runtime, and bootstrap to better systems. But current models aren't yet strong enough to navigate the vast space of possible algorithms. Going forward, this route requires scaling data through RL environments, then hoping the resulting system discovers the fundamentals of intelligence well enough to build more intelligent systems.
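Schematically, that bootstrapping route is the loop below. This is only a toy illustration of the control flow, not anyone's actual pipeline: `propose_candidates` stands in for a pretrained model generating code for new learning algorithms, `held_out_score` stands in for measuring how well a candidate generalizes on unseen tasks, and the numeric "algorithms" are made up purely for the sketch.

```python
import random

def propose_candidates(best_so_far, n=4):
    """Stand-in for a pretrained model proposing code for new learning
    algorithms; here a candidate is just a number we perturb."""
    return [best_so_far + random.gauss(0, 1) for _ in range(n)]

def held_out_score(candidate):
    """Stand-in for running a candidate algorithm on unseen tasks and
    measuring how well it generalizes (higher is better)."""
    return -abs(candidate - 3.0)

# Bootstrap loop: propose, evaluate generalization, keep the best, repeat.
best = 0.0
for generation in range(20):
    candidates = propose_candidates(best)
    best = max(candidates + [best], key=held_out_score)
```

The weak link is the proposer: if the pretrained model cannot reliably generate candidates better than what it started from, the loop stalls, which is the concern raised above.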
The alternative is to solve the fundamentals of intelligence first.
Intelligence is about generalization. We want generalization to scale with compute, independently of data availability. Since data is almost always the limiting factor, the priority is achieving strong performance with minimal training data—though the approach should be able to leverage larger datasets when they exist.
This requirement raises a natural question: what would the optimal learning algorithm look like? We hypothesize that, given infinite compute, it would search broadly across the space of possible models rather than committing to a single solution, and it would select among candidates using explicit, universal generalization priors.
The no-free-lunch theorem explains why the priors are unavoidable: averaged over all possible problems, every learning algorithm performs equally well, so generalization beyond the training data is only possible with priors that favor the kind of structure the problems we care about actually have.
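In Wolpert's formulation for supervised learning, this reads roughly as follows, where $E_{\mathrm{ots}}$ is off-training-set error, $d$ is a fixed training set, and $f$ ranges over all possible target functions:

```latex
% No-free-lunch, informally: for any two learners A and B, the expected
% off-training-set error, averaged uniformly over all target functions f,
% is identical given the same training set d.
\sum_{f} \mathbb{E}\!\left[ E_{\mathrm{ots}} \mid f, d, A \right]
    = \sum_{f} \mathbb{E}\!\left[ E_{\mathrm{ots}} \mid f, d, B \right]
```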
We need ways to search broadly across the model solution space, the parameter space of a neural net with a given architecture. Pure gradient descent follows training loss gradients and converges to one solution without any mechanism to discover alternatives. The farthest we've gotten is training different models with slightly different settings (different seeds, optimizers), but that's a far cry from the exploratory search our hypothesis requires.
Our upcoming paper takes the first step by decoupling search from learning during training. We perform evolutionary search in representation space—a lower-dimensional manifold where search is feasible—while using gradient descent in parameter space to implement the discovered representations. This separation allows us to discover diverse solutions by scaling compute.
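For intuition, here is a minimal sketch of that decoupling under toy assumptions: a linear network, a made-up `representation_score` standing in for whatever actually drives selection, and a deliberately simple mutation scheme. It is not the paper's algorithm, only the control flow it describes: evolutionary search proposes and selects representations, and gradient descent in parameter space implements the winner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a linear "network" maps 8-D inputs to a 2-D representation, Z = X @ W.
X = rng.normal(size=(64, 8))  # training inputs

def representation_score(Z):
    """Hypothetical stand-in for how promising a candidate representation is
    (here: how spread out the embedded points are)."""
    return float(np.mean(np.linalg.norm(Z - Z.mean(axis=0), axis=1)))

def fit_parameters(Z_target, lr=0.05, steps=500):
    """Learning step: gradient descent in parameter space so the network's
    outputs match the representation selected by the search."""
    W = np.zeros((8, 2))
    for _ in range(steps):
        residual = X @ W - Z_target
        W -= lr * X.T @ residual / len(X)  # gradient of 0.5 * mean squared error
    return W

# Search step: an evolutionary loop directly in representation space,
# mutating candidate representations rather than network parameters.
population = [rng.normal(size=(64, 2)) for _ in range(16)]
for _ in range(30):
    population.sort(key=representation_score, reverse=True)
    parents = population[:4]
    population = [p + 0.1 * rng.normal(size=p.shape) for p in parents for _ in range(4)]

best_Z = max(population, key=representation_score)
W = fit_parameters(best_Z)  # implement the discovered representation in the network
```

Because the search operates on a low-dimensional representation per example rather than on millions of weights, adding compute means evaluating more diverse candidates rather than taking more steps toward a single minimum.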
Current neural networks embed priors implicitly in their loss landscapes, optimization dynamics, and architectural choices. These implicit priors enable some generalization, but they cannot be optimized directly, and the massive data requirements of modern systems suggest they are far from optimal. We need to make generalization priors both explicit and universal—they should be directly optimizable and work across all phenomena we want to model.
The classical example is simplicity, formalized in Solomonoff's prior through Kolmogorov complexity: among all hypotheses consistent with the data, prefer those with shorter descriptions.
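In symbols, with $U$ a fixed universal machine and $\ell(p)$ the length of program $p$, the idea is roughly:

```latex
% Kolmogorov complexity of a hypothesis h: the length \ell(p) of the
% shortest program p that makes a universal machine U output h.
K(h) = \min_{p \,:\, U(p) = h} \ell(p)

% Simplicity prior in the spirit of Solomonoff: exponentially more prior
% mass on hypotheses with shorter descriptions.
P(h) \propto 2^{-K(h)}
```

$K$ is uncomputable, which is precisely why any practical system has to approximate this kind of prior rather than apply it directly.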
We're still early in developing a principled theory of generalization in neural networks that could drive our model search.
Search and universal generalization priors form the foundation of generalization, and thus intelligence, that scales with compute. Rather than rely on brute empiricism, our research starts from these fundamentals to build learning algorithms that approach the theoretical limits of intelligence.