3blue1brown

Half draft of the next transformer video

Added 2024-07-03 11:35:35 +0000 UTC

Hey folks, above is an early view of the first half of the next chapter on transformers.

This project has been taking much longer than I'd like, which is ironic given that this topic is in principle much simpler than the others. I've gone through several drafts here, but am narrowing in on a final one, and wanted to share the animated first half that I have so far.

The main goal is to explain the multilayer perceptron blocks in a transformer. A barebones description of the computation could be a two-minute video, but all the trickiness is in answering what exactly they're doing.

The true answer is that we don't exactly know. But that's no fun. In the hopes of making the computation memorable, I believe it can be helpful to have some examples of the kind of task the structure could implement, at least in principle, and that's what you'll see in the draft above.

The only problem is that the toy example this video covers, an example of storing factual knowledge, is very likely not what happens in large models. What remains is to explain what aspects of this example are most general and worth remembering, and in what ways we know it to be oversimplified.

The current script for the remaining part

(...everything from the video above...)

So that's the entire operation: Two matrix products, each with a bias added, and a simple clipping function in between. Any of you who watched the earlier videos in this series will recognize this as the structure of the most basic kind of neural network we studied there, which in that example was trained to recognize handwritten digits. In the context of a transformer for a larger language model, this is just one piece of a larger architecture, and any attempt to interpret what exactly it's doing is heavily intertwined with the idea of encoding information into vectors of this high-dimensional embedding space.
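For anyone who likes to see this spelled out, here is a minimal sketch of that operation in plain Python. The function names and the tiny weights are made up purely for illustration (a real block would use 12,288-dimensional embeddings and learned parameters), and the clipping function is assumed to be a ReLU.

```python
def relu(values):
    """Clip negative entries to zero and leave positive entries unchanged."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mlp_block(embedding, w_up, b_up, w_down, b_down):
    """Two matrix products, each with a bias added, and a clip in between."""
    pre_activation = [p + b for p, b in zip(matvec(w_up, embedding), b_up)]
    neurons = relu(pre_activation)
    return [p + b for p, b in zip(matvec(w_down, neurons), b_down)]


# Hypothetical toy sizes: 2-dimensional embeddings and 3 neurons.
W_UP = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 rows, each of embedding size
B_UP = [0.0, 0.0, -1.0]
W_DOWN = [[1.0, 0.0, 2.0], [0.0, -1.0, 0.5]]  # same size as W_UP, transposed shape
B_DOWN = [0.0, 1.0]

print(mlp_block([1.0, 1.0], W_UP, B_UP, W_DOWN, B_DOWN))   # [3.0, 0.5]
print(mlp_block([1.0, -2.0], W_UP, B_UP, W_DOWN, B_DOWN))  # [1.0, 1.0]
```

In the second call, two of the three neurons end up negative before the clip, so only the first column of the down projection (plus the bias) contributes to the output.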

Let's take a brief side step to continue the game we were playing in the last two chapters, where we count the total number of parameters, using the numbers from GPT-3.

I mentioned how the up projection matrix has just under 50k rows, and that each one matches the size of the embedding space, which is 12,288. Multiplying these together, we get 604 million parameters in total in that matrix. The down projection matrix has the same number of parameters, since it has the same size, just with a transposed shape. Together, they give about 1.2 billion parameters in total. The bias vectors also add some more parameters here, but it's a trivial proportion of the total.

In GPT-3, the sequence of embedding vectors flows through not one, but 96 different MLP blocks, so the total number of parameters devoted to these blocks adds up to about 116 billion. This is about two-thirds of the total parameters in the network. When you add it to everything we added up before for the attention blocks, embedding, and unembedding, you do indeed get a grand total of 175B as advertised.
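If you want to check the arithmetic yourself, here is a short sketch. It assumes the "just under 50k" neuron count is exactly 49,152, that is, four times the embedding dimension; the variable names are mine.

```python
D_EMBED = 12_288          # size of the embedding space
N_NEURONS = 4 * D_EMBED   # assumed exact value behind "just under 50k": 49,152
N_LAYERS = 96             # number of MLP blocks in GPT-3

up_params = N_NEURONS * D_EMBED      # rows x columns of the up projection
down_params = D_EMBED * N_NEURONS    # same size, transposed shape
bias_params = N_NEURONS + D_EMBED    # one bias vector per matrix product

weights_per_block = up_params + down_params
total_weights = N_LAYERS * weights_per_block
total_biases = N_LAYERS * bias_params

print(f"up projection:        {up_params:,}")          # 603,979,776
print(f"weights per block:    {weights_per_block:,}")  # 1,207,959,552
print(f"weights in 96 blocks: {total_weights:,}")      # 115,964,116,992
print(f"biases in 96 blocks:  {total_biases:,}")       # 5,898,240
print(f"share of 175 billion: {total_weights / 175e9:.3f}")  # 0.663
```

The biases come to under six million parameters across all 96 blocks, which is why they barely register next to the roughly 116 billion weights.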

I should mention there’s another set of parameters associated with some normalization steps I have yet to discuss, but like the bias vectors, they account for a trivial proportion of the total.

Stepping back, even though almost all the operations inside a transformer boil down to matrix-vector multiplication, as you can probably tell, I like to explain these with a simple concrete example in mind for what kinds of tasks they might be doing. Here we focused on the Michael Jordan example; in the last chapter on attention we focused on adjectives updating the meanings of nouns; and earlier, with handwritten digits, we thought about composing edges to make simple shapes, and composing simple shapes to make symbols.

I want to emphasize that these are very simplified examples, meant to motivate the structure and make it more memorable, but the true behavior learned by the model during the gradient descent process may look completely different, and much less interpretable.

While the idea that these blocks can implement many different AND gates in parallel is hopefully memorable, it’s certainly not the only thing they could do. A famous theorem about multilayer perceptrons, the universal approximation theorem, shows how they can be used to approximate essentially any function, assuming the middle layer is big enough.

That said, our example does reflect a few general facts. It is true that the rows of our first matrix correspond to directions in this embedding space, so the activation of a given neuron tells you something about how strongly an embedding aligns with a certain direction. It’s also true that the columns in the second matrix tell you what will be added to the result when the corresponding neuron is active. Those are just mathematical facts. Unfortunately, though, there’s pretty good evidence that neurons rarely correspond to clean interpretable features.
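Those two mathematical facts can be made concrete with a small sketch. The helper names and toy numbers below are hypothetical, and the clipping function is again assumed to be a ReLU. The first function reads each neuron as a dot product with one row of the first matrix; the second builds the output by adding up columns of the second matrix, each scaled by how active its neuron is.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def neuron_activations(embedding, up_rows, up_bias):
    """Each neuron measures alignment with one row direction, then gets clipped."""
    return [max(0.0, dot(row, embedding) + b) for row, b in zip(up_rows, up_bias)]


def weighted_column_sum(activations, down_columns, down_bias):
    """Start from the bias, then add each column scaled by its neuron's activation."""
    result = list(down_bias)
    for activation, column in zip(activations, down_columns):
        for i, entry in enumerate(column):
            result[i] += activation * entry
    return result


# Hypothetical toy setup: two neurons probing the two coordinate directions.
UP_ROWS = [[1.0, 0.0], [0.0, 1.0]]
UP_BIAS = [0.0, 0.0]
DOWN_COLUMNS = [[1.0, 1.0], [5.0, 5.0]]  # column k is what neuron k adds
DOWN_BIAS = [0.0, 0.0]

acts = neuron_activations([3.0, -2.0], UP_ROWS, UP_BIAS)
print(acts)                                                # [3.0, 0.0]
print(weighted_column_sum(acts, DOWN_COLUMNS, DOWN_BIAS))  # [3.0, 3.0]
```

The second neuron is clipped to zero, so its column [5.0, 5.0] adds nothing, while the first column is added three times over.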

One idea floating around the circles of interpretability researchers is the notion of superposition, which is a hypothesis about how these models can represent many more features than there are dimensions in these vectors, and this may explain both why they’re hard to interpret, and why they scale surprisingly well.

The basic idea is that if you have an N-dimensional space, and you want to represent a bunch of different features using directions that are all mutually perpendicular in that space so that adding a component of one doesn’t influence the others, you can only fit N such vectors. But if you relax the constraint and tolerate a little noise by allowing these features to be nearly perpendicular, say 89 degrees apart, the number of such directions you can cram into the space is much bigger, and scales exponentially with the dimension. With a simple bit of Python, I could comfortably fit millions of vectors into a 100-dimensional space such that they are all nearly perpendicular like this.
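To get a feel for this, here is a rough sketch, and to be clear it is not the script mentioned above. It makes a much weaker assumption: it just samples random directions and measures how far their pairwise angles stray from 90 degrees. Random sampling alone will not hit a one-degree tolerance in 100 dimensions, but it does show the relevant trend, which is that the angles bunch more tightly around 90 degrees as the dimension grows. All the function names are made up for this example.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction by normalizing a vector of independent Gaussians."""
    components = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in components))
    return [c / norm for c in components]


def max_deviation_from_90(vectors):
    """Largest gap, in degrees, between any pairwise angle and 90 degrees."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cosine = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
            angle = math.degrees(math.acos(cosine))
            worst = max(worst, abs(angle - 90.0))
    return worst


def spread_for_dimension(dim, count=20, seed=0):
    """Worst-case deviation from 90 degrees among `count` random directions."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(count)]
    return max_deviation_from_90(vectors)


for dim in (100, 1000, 5000):
    print(dim, round(spread_for_dimension(dim), 2))
```

Getting every pair inside a tight band like 89 to 91 degrees takes something beyond random sampling, for example nudging the vectors with an optimization step, which this sketch deliberately leaves out.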

So using the GPT-3 numbers, this embedding space could hold many, many more than 12k nearly independent directions, and the vector of neurons could effectively be probing at many more than 50k features. And because of this exponential scaling, if you trained a bigger model with 10 times the dimension, the number of ideas it could represent could potentially be way more than 10 times as many. I’ll leave some links to some great Anthropic posts about this idea.

The upshot is that an idea like “Michael Jordan” would probably not correspond to a particular individual neuron, but instead to a specific linear combination of neurons.

We’ve now hit the main parts of the architecture of a transformer, but there are some steps I skipped over, like a rescaling step that happens between layers, and the specifics of how position gets encoded into these token embeddings. Again, I’ll leave links for those of you hungry to learn more.

The main thing I’d like to cover in the next chapter is training. On the one hand, the short answer for how training works is that it’s all backpropagation, which we covered in a separate context in earlier chapters of the series. But there is more to discuss, like the specific cost function used for language models, the idea of fine-tuning using reinforcement learning from human feedback, and the notion of scaling laws.

There are a few non-machine-learning videos I’d like to sink my teeth into before making those chapters, but I promise they will come in due time.