What should you read to get into ML research today? Here are the papers I'd recommend.
Theory
The performance of modern models is grounded in figuring out how to scale simple systems over many orders of magnitude (and not, e.g., building better expert systems - see Rich Sutton's bitter lesson).
The key results:
- Attention Is All You Need - This introduced the transformer architecture. Transformers scale much more efficiently than previous models, which turned out to be hugely important. (That said, self-attention is still quadratic in sequence length, which isn't great.)
- BERT - This was the first widely used model where one architecture demonstrated better results across basically every benchmark. Before that, people would design and train a task-specific top part of the network for each task, which was a lot more work. There have been a number of refinements since then, but I consider this the ancestor of modern models. (Generative models, like GPT, are a bit different - they're autoregressive and they generate text - but I think the big ideas are in BERT.)
- Scaling Laws and Chinchilla - These are the two classic papers showing that you can keep throwing compute at models and they'll keep getting better, with loss falling predictably as compute, data, and parameters grow; this was not widely believed at the time. OpenAI and Anthropic are still on this curve.
- There are some earlier small, but important, math tricks that enabled deep learning to scale to begin with. There's a chapter in Hands-On ML that covers this well.
- Andrej Karpathy's microgpt is a small, readable GPT if you want to see a minimal network end-to-end.
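To make the "still quadratic" remark about transformers concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not the paper's full layer: there is no masking, no multi-head split, and the toy input is used directly as queries, keys, and values (a real transformer would first apply learned projections). The function and variable names are my own.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without masking.

    queries: list of n_q vectors of size d
    keys:    list of n_k vectors of size d
    values:  list of n_k vectors of size d_v
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    # One score per (query, key) pair: this n_q x n_k table is the
    # quadratic part when queries and keys come from the same sequence.
    scores = [
        [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        for q in queries
    ]
    weights = [softmax(row) for row in scores]
    d_v = len(values[0])
    # Each output is a weighted average of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(d_v)]
        for row in weights
    ]
    return outputs, weights

# Toy sequence of 3 tokens with 2 features each (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
print(len(weights), len(weights[0]))  # 3 3
```

Doubling the sequence length quadruples the number of entries in `scores`, which is the cost the later performance work tries to tame.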
ML Performance
There have been a -lot- of small wins that enable scaling and faster experimentation. These are some of the more interesting and prominent ones.
- LoRA - a clever approach to efficient fine-tuning using low-rank matrix factorization. The idea is that if training a large model uses gradient descent to navigate the full dimensionality of each underlying weight matrix, fine-tuning should be approximable by gradient descent in a much lower-dimensional space (because you're using so much less compute to navigate). Concretely, the pretrained weights stay frozen and you train only a low-rank update, expressed as the product of two small matrices.
- Quantization - It turns out that you don't need 16 bits of precision per weight and that you're better off spending the bits on more parameters. I think 4 bits is common now, and I've seen Hacker News links going down to 1-bit. (There isn't really one classic paper on this topic so much as a bunch of incremental progress, but this is a good overview.)
- QLoRA - this combines LoRA with quantization.
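- FlashAttention - This is an attention algorithm (it computes exact attention, not a new architecture) that starts looking at system-level properties, specifically memory IO, to get better scaling and performance.
- KV Cache - This avoids repeated work in the transformer during generation by caching the keys and values already computed for earlier tokens.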
- Speculative Decoding - There's a lot of work on, essentially, front-running the expensive model to predict tokens faster: a cheap draft model proposes several tokens and the expensive model verifies them in one pass. A related, coding-specific approach is to sample the next tokens only from a valid grammar (you can skip a forward pass entirely when there's only one valid next token).
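As an illustration of the LoRA idea from the list above, here is a minimal sketch of a linear layer with a frozen weight matrix and a trainable low-rank update. It is not the reference implementation: `LoRALinear`, `matvec`, and the toy sizes are names and values I made up. The initialization (small random `A`, zero `B`) and the `alpha / rank` scaling follow the LoRA paper's description.

```python
import random

def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Hypothetical linear layer: y = W x + (alpha / rank) * B (A x)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weights, kept frozen
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapter
        # is a no-op before any fine-tuning step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        # Project down to `rank` dimensions, then back up to d_out.
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B would receive gradients during fine-tuning.
        return (len(self.A) * len(self.A[0])
                + len(self.B) * len(self.B[0]))

# Toy example: a 16x16 identity "pretrained" weight with a rank-2 adapter.
d = 16
W = [[float(i == j) for j in range(d)] for i in range(d)]
layer = LoRALinear(W, rank=2, alpha=2.0)
x = [float(i) for i in range(d)]
print(layer.forward(x) == matvec(W, x))  # True
print(d * d, layer.trainable_params())   # 256 64
```

The savings grow with size: for a hypothetical 4096 x 4096 matrix and rank 8, the full matrix has 16,777,216 entries while the adapter trains 65,536.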
You don't need to read any of these papers to train a model or play with a coding agent, but I recommend them anyway - great engineers strive to understand the details of what they're working with.