March 2026
We've achieved 10x data efficiency with NanoGPT Slowrun
A few things worth noting. First, this looks nothing like our current scaling laws: Chinchilla says you should train a ~5M parameter model if you have 100M tokens -- a staggering 3600x difference from what we're doing. Second, 10x data efficiency would have seemed unimaginable to most people, and we got there in ... a few weeks. Here's how. Some of the gains come from architectural tweaks without much principle behind them. But a few are principled, and we believe they will transfer to larger scales. Those are what matter fundamentally.
Ensembling is probably the most understudied axis of scaling in pretraining. Instead of training one model, you train many models somewhat independently and aggregate their predictions at inference. This way, you can keep leveraging more compute under fixed data and keep improving generalization.
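A minimal sketch of the aggregation step, with made-up numbers: each member produces a probability distribution over the next token, and the ensemble averages those distributions. By Jensen's inequality, the loss of the averaged distribution is never worse than the average of the member losses.

```python
import math

def nll(probs, target):
    """Negative log-likelihood of the target token under one distribution."""
    return -math.log(probs[target])

def ensemble(member_probs):
    """Average the members' probability distributions elementwise."""
    n = len(member_probs)
    vocab = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / n for i in range(vocab)]

# Two illustrative members that are confident about different tokens.
a = [0.7, 0.2, 0.1]
b = [0.3, 0.6, 0.1]
target = 0

mean_member_loss = (nll(a, target) + nll(b, target)) / 2  # ~0.780
ensemble_loss = nll(ensemble([a, b]), target)             # -log(0.5) ~ 0.693
assert ensemble_loss <= mean_member_loss  # averaging probabilities helps
```

The gain comes precisely from the members disagreeing: if both models produced identical distributions, the ensemble loss would equal the member loss.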
Training dynamics for ensembles are very different from those of a single model. This is a key insight, and one that Pandey et al. also make.
We see exactly this. In PR #26, we extended training from 12 to 18 epochs. Individual model loss went from 3.295 to 3.310 -- it got worse. But ensemble loss dropped from 3.185 to 3.166. The models learn different things when you push them past their individual optimum, and that helps the ensemble.
Chain distillation. We've found that chain knowledge distillation dramatically improves ensemble training (PR #31). The idea, inspired by Born-Again Neural Networks, is to train the ensemble members sequentially, with each new model distilling from the model trained just before it.
Note that only the immediately preceding model serves as teacher, not the full ensemble of prior models. This keeps memory constant and training fast. With 8 models trained this way in the chain distillation PR, individual model loss plateaus around 3.20, but ensemble loss hits 3.126 -- taking us from 7x to 8x data efficiency.
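A minimal sketch of one training step under this scheme. The mixing weight `alpha` and temperature `tau` are hypothetical hyperparameters for illustration, not values from the post:

```python
import torch
import torch.nn.functional as F

def chain_distill_step(student, teacher, x, targets, alpha=0.5, tau=1.0):
    """One chain-distillation loss computation (a sketch).

    The student fits both the data (cross-entropy) and the soft
    predictions of the immediately preceding model in the chain.
    Earlier models are never consulted, so memory stays constant.
    """
    logits = student(x)
    ce = F.cross_entropy(logits, targets)
    if teacher is None:
        return ce  # the first model in the chain has no teacher
    with torch.no_grad():  # the teacher is frozen
        t_logits = teacher(x)
    kd = F.kl_div(
        F.log_softmax(logits / tau, dim=-1),
        F.log_softmax(t_logits / tau, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * tau * tau
    return (1 - alpha) * ce + alpha * kd
```

Model 1 trains with `teacher=None`; model k then trains against the frozen model k-1, and so on down the chain.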
There's a ton of headroom here in scaling ensembles further.
Our theory is that generalization is closely related to compression -- in other words, simplicity. Strong regularization pushes an overparameterized model toward simpler solutions, and simpler solutions generalize better.
We use weight decay up to 1.6 and dropout of 0.1. For context, standard practice is weight decay of ~0.1; ours is 16x that. And it works because we're massively overparameterized: a 2.7B model at this token budget has far more capacity than the task demands, which leaves room for heavy regularization to do its job.
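A minimal sketch of these regularization settings in PyTorch, applied to a stand-in model. The optimizer choice (AdamW) and learning rate are assumptions for illustration, not details from the post:

```python
import torch
import torch.nn as nn

# Stand-in model; the real one is a large transformer.
model = nn.Sequential(
    nn.Linear(64, 256),
    nn.GELU(),
    nn.Dropout(p=0.1),   # dropout of 0.1
    nn.Linear(256, 64),
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,             # hypothetical learning rate
    weight_decay=1.6,    # ~16x the usual ~0.1
)
```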
Looped transformers have better inductive biases than standard transformers because they allow the model to apply more compute per prediction. Instead of a single forward pass through the layers, the model iterates, refining its representations.
We start by training our 30-layer transformer without looping; then, halfway through training, we begin looping layers 15-24 four times. That is, we first run layers 0-24 of the transformer, then re-run layers 15-24 four more times, and finally run layers 25-29. This configuration was the best we found, and it is important not to loop the last few layers. There remains more work in extending and formalizing these heuristics.
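The looping schedule above can be sketched as follows. This is a minimal illustration of the control flow only: the per-layer block here (LayerNorm + MLP with a residual connection) is a stand-in for real transformer layers, and all dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class LoopedStack(nn.Module):
    """30-layer stack where a middle segment is re-applied several times."""

    def __init__(self, d_model=64, n_layers=30, loop_start=15, loop_end=24, n_loops=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU())
            for _ in range(n_layers)
        )
        self.loop_start, self.loop_end, self.n_loops = loop_start, loop_end, n_loops

    def forward(self, x, looping=True):
        # First pass: layers 0..24, which includes the loop segment once.
        for layer in self.layers[: self.loop_end + 1]:
            x = x + layer(x)
        if looping:
            # Re-run layers 15..24 an extra n_loops (4) times.
            for _ in range(self.n_loops):
                for layer in self.layers[self.loop_start : self.loop_end + 1]:
                    x = x + layer(x)
        # Final layers 25..29 run once; the last few layers are never looped.
        for layer in self.layers[self.loop_end + 1 :]:
            x = x + layer(x)
        return x
```

Passing `looping=False` gives the plain single-pass forward used in the first half of training; flipping it to `True` midway switches on the extra iterations without changing the parameter count.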
We've found some really good architectural changes, and the meta-pattern is that neural architecture search matters for data efficiency.
Exclusive Self Attention (XSA)
Overall, these architectural tweaks keep delivering significant data efficiency gains, which suggests that systematic architecture search is an important direction.
100x. It will probably require a few new breakthroughs, but it seems feasible within a year.
@ChinmayK0607 · @not-nonymous · @shmublu · @zhiweixx · @em-see-squared · @ms337 · @kvegesna · @akshayvegesna