Current AI research has mostly converged on a single approach to AGI: gradient-based optimization of large neural networks on massive datasets. Deeper questions about intelligence have taken a backseat to brute empiricism, as theoretical results that aim at the fundamentals, such as Solomonoff induction, have remained computationally impractical.
Can we formulate principles of intelligence that are both universal and practical? How can these principles lead to learning algorithms where intelligence scales optimally with compute—the one resource we can control? Given infinite compute, what are the limits on how intelligent a system can be? We argue that solving these fundamental questions is a more promising path to AGI than pure empirical scaling.
The bitter lesson says general methods that scale with compute always win.
It's also worth noting that big labs are likely already applying RL to the problem of intelligence itself, largely with existing approaches. In theory, this is sound: use strong pretrained models to generate code for learning algorithms, test their generalization capabilities at runtime, and bootstrap to better systems. But current models aren't yet strong enough to navigate the vast space of possible algorithms. Going forward, this route requires scaling data through RL environments, then hoping the resulting system discovers the fundamentals of intelligence well enough to build more intelligent systems.
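Schematically, that bootstrapping route is the loop below. This is only a toy illustration of the control flow, not anyone's actual pipeline: `propose_candidates` stands in for a pretrained model generating code for new learning algorithms, `held_out_score` stands in for measuring how well a candidate generalizes on unseen tasks, and the numeric "algorithms" are made up purely for the sketch.

```python
import random

def propose_candidates(best_so_far, n=4):
    """Stand-in for a pretrained model proposing code for new learning
    algorithms; here a candidate is just a number we perturb."""
    return [best_so_far + random.gauss(0, 1) for _ in range(n)]

def held_out_score(candidate):
    """Stand-in for running a candidate algorithm on unseen tasks and
    measuring how well it generalizes (higher is better)."""
    return -abs(candidate - 3.0)

# Bootstrap loop: propose, evaluate generalization, keep the best, repeat.
best = 0.0
for generation in range(20):
    candidates = propose_candidates(best)
    best = max(candidates + [best], key=held_out_score)
```

The weak link is the proposer: if the pretrained model cannot reliably generate candidates better than what it started from, the loop stalls, which is the concern raised above.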
The alternative is to solve the fundamentals of intelligence first.
Intelligence is about generalization. We want generalization to scale with compute, independently of data availability. Since data is almost always the limiting factor, the priority is achieving strong performance with minimal training data—though the approach should be able to leverage larger datasets when they exist.
This requirement raises a natural question: what would the optimal learning algorithm look like? We hypothesize that, given infinite compute, it would search broadly across the space of possible models rather than committing to a single solution, and it would select among candidates using explicit, universal generalization priors.
The no-free-lunch theorem explains why the priors are unavoidable: averaged over all possible problems, every learning algorithm performs equally well, so generalization beyond the training data is only possible with priors that favor the kind of structure the problems we care about actually have.
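In Wolpert's formulation for supervised learning, this reads roughly as follows, where $E_{\mathrm{ots}}$ is off-training-set error, $d$ is a fixed training set, and $f$ ranges over all possible target functions:

```latex
% No-free-lunch, informally: for any two learners A and B, the expected
% off-training-set error, averaged uniformly over all target functions f,
% is identical given the same training set d.
\sum_{f} \mathbb{E}\!\left[ E_{\mathrm{ots}} \mid f, d, A \right]
    = \sum_{f} \mathbb{E}\!\left[ E_{\mathrm{ots}} \mid f, d, B \right]
```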
We need ways to search broadly across the model solution space, the parameter space of a neural net with a given architecture. Pure gradient descent follows training loss gradients and converges to one solution without any mechanism to discover alternatives. The farthest we've gotten is training different models with slightly different settings (different seeds, optimizers), but that's a far cry from the exploratory search our hypothesis requires.
Our upcoming paper takes the first step by decoupling search from learning during training. We perform evolutionary search in representation space—a lower-dimensional manifold where search is feasible—while using gradient descent in parameter space to implement the discovered representations. This separation allows us to discover diverse solutions by scaling compute.
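For intuition, here is a minimal sketch of that decoupling under toy assumptions: a linear network, a made-up `representation_score` standing in for whatever actually drives selection, and a deliberately simple mutation scheme. It is not the paper's algorithm, only the control flow it describes: evolutionary search proposes and selects representations, and gradient descent in parameter space implements the winner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a linear "network" maps 8-D inputs to a 2-D representation, Z = X @ W.
X = rng.normal(size=(64, 8))  # training inputs

def representation_score(Z):
    """Hypothetical stand-in for how promising a candidate representation is
    (here: how spread out the embedded points are)."""
    return float(np.mean(np.linalg.norm(Z - Z.mean(axis=0), axis=1)))

def fit_parameters(Z_target, lr=0.05, steps=500):
    """Learning step: gradient descent in parameter space so the network's
    outputs match the representation selected by the search."""
    W = np.zeros((8, 2))
    for _ in range(steps):
        residual = X @ W - Z_target
        W -= lr * X.T @ residual / len(X)  # gradient of 0.5 * mean squared error
    return W

# Search step: an evolutionary loop directly in representation space,
# mutating candidate representations rather than network parameters.
population = [rng.normal(size=(64, 2)) for _ in range(16)]
for _ in range(30):
    population.sort(key=representation_score, reverse=True)
    parents = population[:4]
    population = [p + 0.1 * rng.normal(size=p.shape) for p in parents for _ in range(4)]

best_Z = max(population, key=representation_score)
W = fit_parameters(best_Z)  # implement the discovered representation in the network
```

Because the search operates on a low-dimensional representation per example rather than on millions of weights, adding compute means evaluating more diverse candidates rather than taking more steps toward a single minimum.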
Current neural networks embed priors implicitly in their loss landscapes, optimization dynamics, and architectural choices. These implicit priors enable some generalization, but they cannot be optimized directly, and the massive data requirements of modern systems suggest they are far from optimal. We need to make generalization priors both explicit and universal—they should be directly optimizable and work across all phenomena we want to model.
The classical example is simplicity, formalized in Solomonoff's prior through Kolmogorov complexity: among all hypotheses consistent with the data, prefer those with shorter descriptions.
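In symbols, with $U$ a fixed universal machine and $\ell(p)$ the length of program $p$, the idea is roughly:

```latex
% Kolmogorov complexity of a hypothesis h: the length \ell(p) of the
% shortest program p that makes a universal machine U output h.
K(h) = \min_{p \,:\, U(p) = h} \ell(p)

% Simplicity prior in the spirit of Solomonoff: exponentially more prior
% mass on hypotheses with shorter descriptions.
P(h) \propto 2^{-K(h)}
```

$K$ is uncomputable, which is precisely why any practical system has to approximate this kind of prior rather than apply it directly.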
We're still early in developing a principled theory of generalization in neural networks that could drive our model search.
Search and universal generalization priors form the foundation of generalization, and thus intelligence, that scales with compute. Rather than rely on brute empiricism, our research starts from these fundamentals to build learning algorithms that approach the theoretical limits of intelligence.