NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute

March 2026

We've achieved 10x data efficiency with NanoGPT Slowrun within a few weeks. An ensemble of 1.8B-parameter models (18B total parameters) trained on 100M tokens matches what would normally require 1B tokens with a standard LM baseline. Data efficiency matters because compute grows much faster than data. Since our current scaling laws require proportional increases in both, intelligence will eventually be bottlenecked by data, not compute. This result lets us improve model performance by scaling compute rather than data.

[Figure: NanoGPT Slowrun -- 3.8× data efficiency]

A few things worth noting. First, this looks nothing like our current scaling laws: Chinchilla says you should train a ~5M-parameter model if you have 100M tokens -- a staggering 3600x difference from the 18B total parameters we're using. Second, 10x data efficiency would've seemed unimaginable to most people, and we got there in ... a few weeks. Here's how. Some of the gains are architectural tweaks without much principle behind them. But a few are principled, and we believe they will transfer to larger scales. Those are what matter fundamentally.

Ensemble

Ensembling is probably the most understudied axis of scaling in pretraining. Instead of training one model, you train many models somewhat independently and aggregate their predictions at inference. This way, you can keep leveraging more compute under fixed data and keep improving generalization.
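In code, the aggregation is just an average over the models' outputs. A minimal numpy sketch (shapes and names are illustrative; the post's ensembles average logits at inference):

```python
import numpy as np

def ensemble_logits(per_model_logits):
    """Aggregate K independently trained models by averaging their
    logits -- the aggregation used at inference throughout this post.

    per_model_logits: array of shape (K, ..., vocab_size)
    returns: averaged logits of shape (..., vocab_size)
    """
    return np.mean(per_model_logits, axis=0)

# Toy check: two models that disagree; the ensemble splits the difference.
logits = np.array([[2.0, 0.0, 0.0],    # model 1 favors token 0
                   [0.0, 2.0, 0.0]])   # model 2 favors token 1
avg = ensemble_logits(logits)          # -> [1.0, 1.0, 0.0]
```

Because the models are trained independently, this axis parallelizes trivially: more compute buys more ensemble members under a fixed data budget.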

Training dynamics for ensembles are very different from those of a single model -- this is a key insight. Pandey et al. show that post-hoc transforms like ensembling reverse the usual overfitting dynamics: while base models overfit with more training, ensembles favor base models trained for more epochs. Kim et al. independently find that ensembling allows for much longer training than a single model.

We see exactly this. In PR #26, we extended training from 12 to 18 epochs. Individual model loss went from 3.295 to 3.310 -- it got worse. But ensemble loss dropped from 3.185 to 3.166. The models learn different things when you push them past their individual optimum, and that helps the ensemble.

Chain distillation. We've found that chain knowledge distillation dramatically improves ensemble training (PR #31). The idea, inspired by Born-Again Neural Networks , is to train models sequentially, where each new model distills from the immediately preceding one:

Algorithm: Chain Distillation Ensemble

1. Train model M_1 on data D with standard cross-entropy loss.
2. For k = 2, ..., K:
   a. Load M_{k-1} as a frozen teacher.
   b. Train model M_k from scratch on D with loss:
      L = (1 - α) · CE(M_k(x), y) + α · T² · KL(M_k(x)/T ‖ M_{k-1}(x)/T),
      with α = 0.5, T = 1.0.
   c. Discard the teacher from memory.
3. At inference, ensemble all K models by averaging logits.

Note that only the immediately preceding model serves as teacher, not the full ensemble of prior models. This keeps memory constant and training fast. With 8 models trained this way in the chain distillation PR, individual model loss plateaus around 3.20, but ensemble loss hits 3.126 -- taking us from 7x to 8x data efficiency.
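The per-token loss from step 2b can be sketched in a few lines of numpy. One assumption to flag: we take the KL term in the conventional distillation direction, with the frozen teacher as the reference distribution (Hinton-style); names and shapes are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def chain_distill_loss(student_logits, teacher_logits, target, alpha=0.5, T=1.0):
    """Per-token chain-distillation loss: a mix of the hard-label
    cross-entropy and a temperature-scaled KL term against the frozen
    teacher (the immediately preceding model in the chain)."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # Hard-label cross-entropy on the true next token.
    ce = -np.log(softmax(student_logits)[target] + 1e-12)
    # KL(teacher || student), the standard distillation direction (assumed).
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))
    return (1 - alpha) * ce + alpha * T**2 * kl
```

With α = 0.5 and T = 1.0 as in the algorithm, the loss is an even split between fitting the data and matching the previous model; when student and teacher agree exactly, the KL term vanishes and only half the cross-entropy remains.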

There's a ton of headroom here in scaling ensembles further.

[Figure: Ensemble val loss with chain distillation as models are added -- individual model vs. ensemble loss by number of models]

Regularization

Our theory is that generalization is closely related to compression -- in other words, simplicity. Regularization is a proxy for simplicity, particularly the techniques we've found most useful: L2 weight decay and dropout. It's no surprise that regularization improves generalization. What's interesting is the degree to which we can regularize.

We use weight decay up to 1.6 and dropout of 0.1. For context, standard practice is weight decay of ~0.1; ours is 16x that. It works because we're massively overparameterized: a 1.8B model (our initial baseline was 2.7B) on 100M tokens, when Chinchilla says you should use ~5M parameters for that much data. Kim et al. find that optimal weight decay is up to 30x larger than standard practice in the data-constrained regime, and our experience confirms this. And the larger the model you train, the more regularization you need.
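The effect of such aggressive decay is easiest to see in the update rule. With decoupled (AdamW-style) weight decay, the decay term shrinks weights multiplicatively each step, independently of the gradient. A minimal sketch (wd=1.6 is our setting; the learning rate is illustrative):

```python
import numpy as np

def decoupled_weight_decay_step(w, grad_update, lr=0.02, wd=1.6):
    """One AdamW-style update. The weight-decay term is decoupled from
    the gradient term, pulling weights toward zero multiplicatively.

    wd=1.6 is the aggressive setting discussed above (vs. the ~0.1
    that is standard practice); lr is illustrative."""
    return w - lr * grad_update - lr * wd * w

w = np.ones(4)
# With zero gradient, weights decay geometrically by (1 - lr*wd) per step.
for _ in range(10):
    w = decoupled_weight_decay_step(w, np.zeros(4))
```

At lr·wd = 0.032 per step, weights with no gradient signal lose roughly 3% of their magnitude every update, so only directions the data keeps reinforcing survive training.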

Looping

Looped transformers have better inductive biases than standard transformers because they allow the model to apply more compute per prediction. Instead of a single forward pass through the layers, the model iterates, refining its representations.

We start by training our 30-layer transformer without looping; then, halfway through training, we begin looping layers 15-24 four times. A forward pass first runs layers 0-24, then re-runs layers 15-24 four more times, and finally runs layers 25-29. This configuration was the best we found; in particular, it is important not to loop the last few layers. There remains more work in extending and formalizing these heuristics.
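The schedule above can be sketched as a forward pass over a list of layer functions (the layer implementations are placeholders; only the loop structure is from the text):

```python
def looped_forward(layers, x, loop_start=15, loop_end=24, n_loops=4):
    """Forward pass with a looped middle block.

    Runs layers 0..loop_end once, then re-runs layers
    loop_start..loop_end n_loops more times, then the remaining
    layers -- matching the 30-layer, loop-15-24-four-times schedule.
    """
    for layer in layers[:loop_end + 1]:            # layers 0-24, one pass
        x = layer(x)
    for _ in range(n_loops):                       # re-run layers 15-24, 4x
        for layer in layers[loop_start:loop_end + 1]:
            x = layer(x)
    for layer in layers[loop_end + 1:]:            # layers 25-29
        x = layer(x)
    return x
```

With 30 layers this applies 25 + 4·10 + 5 = 70 layer evaluations per prediction, more than doubling per-token compute without adding parameters.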

[Figure: Single model val loss as loop count increases]

Architectural Changes

We've found some really good architectural changes, and the meta-pattern is that neural architecture search matters for data efficiency.

- Exclusive Self Attention (XSA): removes the self-value projection from the attention output (PR #36).
- EMA (exponential moving average of model weights), combined with weight decay tuning and several other changes -- half-truncated RoPE, partial key offset for single-layer induction heads, tuned residual lambdas -- gave a nice bump (PR #29).
- U-Net skip connections between mirrored transformer layers: layers 0-14 feed into layers 29-15 via learned scalar weights (PR #17).
- SwiGLU activation, replacing squared ReLU (PR #12).
- Value embeddings via a learned projection from the input embeddings, replacing separate embedding tables (PR #11).
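As one concrete example, the U-Net-style skips can be sketched as follows. Only the mirrored pairing (layer i feeding layer 29−i) and the learned scalar weights come from the description above; the block structure and initialization are illustrative assumptions:

```python
import numpy as np

class UNetSkipTransformer:
    """Sketch: 30 blocks where the output of block i (i in 0..14) is
    added, scaled by a learned scalar, to the input of its mirror
    block 29-i. Blocks are placeholder callables."""

    def __init__(self, blocks):
        assert len(blocks) == 30
        self.blocks = blocks
        # One learned scalar per skip; zero-init means "no skip" at start
        # (an assumed initialization, not from the post).
        self.skip_scale = np.zeros(15)

    def forward(self, x):
        saved = []
        for i, block in enumerate(self.blocks):
            if i >= 15:  # add the scaled output of mirror layer 29-i
                x = x + self.skip_scale[29 - i] * saved[29 - i]
            x = block(x)
            if i < 15:   # stash early-layer outputs for their mirrors
                saved.append(x)
        return x
```

The learned scalars let the optimizer decide per-pair how much early-layer signal to route around the middle of the network.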

Overall, these architectural tweaks keep giving significant data efficiency gains. It suggests that systematic architecture search is an important direction.

What's Next

Next up: 100x. It will probably require a few new breakthroughs, but it seems feasible within a year.

Contributors

@ChinmayK0607 · @not-nonymous · @shmublu · @zhiweixx · @em-see-squared · @ms337 · @kvegesna · @akshayvegesna
