3blue1brown


Half draft of the next transformer video

Hey folks, above is an early view of the first half of the next chapter on transformers.

This project has been taking much longer than I'd like, which is ironic given that this topic is in principle much simpler than the others. I've gone through several drafts here, but am narrowing in on a final one, and wanted to share the animated first half that I have so far.

The main goal is to explain the multilayer perceptron blocks in a transformer. A barebones description of the computation could be a two-minute video, but all the trickiness is in answering what exactly they're doing.

The true answer is that we don't exactly know. But that's no fun. In the hopes of making the computation memorable, I believe it can be helpful to have some examples of the kind of task the structure could implement, at least in principle, and that's what you'll see in the draft above.

The only problem is that the toy example this video covers, an example of storing factual knowledge, is very likely not what happens in large models. What remains is to explain what aspects of this example are most general and worth remembering, and in what ways we know it to be oversimplified.

The current script for the remaining part

(...everything from the video above...)

So that's the entire operation: two matrix products, each with a bias added, and a simple clipping function in between. Any of you who watched the earlier videos in this series will recognize this as the structure of the most basic kind of neural network we studied there, which in that example was trained to recognize handwritten digits. In the context of a transformer for a large language model, this is just one piece of a larger architecture, and any attempt to interpret what exactly it's doing is heavily intertwined with the idea of encoding information into vectors of this high-dimensional embedding space.
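As a minimal sketch of the operation just described, assuming the clipping function is a ReLU and using toy sizes instead of GPT-3's: the names `mlp_block`, `W_up`, `b_up`, `W_down` and `b_down` are illustrative, not taken from any particular library, and the random weights stand in for learned ones.

```python
import numpy as np

def mlp_block(x, W_up, b_up, W_down, b_down):
    """Apply the MLP to a single embedding vector x."""
    hidden = W_up @ x + b_up          # first matrix product, plus bias
    hidden = np.maximum(hidden, 0.0)  # clipping step (assumed here to be ReLU)
    return W_down @ hidden + b_down   # second matrix product, plus bias

# Toy sizes chosen only for illustration.
d_model, d_hidden = 8, 32
rng = np.random.default_rng(0)
W_up = rng.normal(size=(d_hidden, d_model))    # one row per neuron
b_up = rng.normal(size=d_hidden)
W_down = rng.normal(size=(d_model, d_hidden))  # one column per neuron
b_down = rng.normal(size=d_model)

x = rng.normal(size=d_model)
out = mlp_block(x, W_up, b_up, W_down, b_down)
print(out.shape)  # (8,)
```

The output has the same size as the input embedding, which is what lets the block sit inside the larger architecture.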

Let's take a brief side step to continue the game we were playing in the last two chapters, where we count the total number of parameters, using the numbers from GPT-3.

I mentioned how the up projection matrix has just under 50k rows (49,152, which is four times the embedding dimension), and that each row matches the size of the embedding space, which is 12,288. Multiplying these together, we get about 604 million parameters in that matrix. The down projection matrix has the same number of parameters, since it has the same size, just with a transposed shape. Together, they give about 1.2 billion parameters. The bias vectors add some more parameters here, but they make up a trivial proportion of the total.

In GPT-3, the sequence of embedding vectors flows through not one, but 96 different MLP blocks, so the total number of parameters devoted to these blocks adds up to about 116 billion. This is about two-thirds of the total parameters in the network. When you add it to everything we added up before for the attention blocks, embedding, and unembedding, you do indeed get a grand total of 175 billion, as advertised.
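The arithmetic behind these counts can be checked in a few lines. This is only a sketch: the hidden size of 49,152 (four times 12,288) is an assumption standing in for "just under 50k", and the variable names are illustrative.

```python
d_model = 12_288         # embedding dimension quoted in the script
d_hidden = 4 * d_model   # assumed hidden size: 49,152, "just under 50k"
n_blocks = 96            # number of MLP blocks quoted in the script

up_params = d_hidden * d_model       # up projection matrix
down_params = d_model * d_hidden     # down projection matrix (transposed shape)
bias_params = d_hidden + d_model     # the two bias vectors
per_block = up_params + down_params  # weights only, biases left out
total = n_blocks * per_block

print(f"{up_params:,}")         # 603,979,776
print(f"{per_block:,}")         # 1,207,959,552
print(f"{total:,}")             # 115,964,116,992
print(f"{bias_params:,}")       # 61,440
print(f"{total / 175e9:.2f}")   # 0.66
```

The two bias vectors contribute 61,440 parameters per block, which is why they barely register next to the 1.2 billion weights.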

I should mention there’s another set of parameters associated with some normalization steps I have yet to discuss, but like the bias vectors, they account for a trivial proportion of the total.

Stepping back, even though almost all the operations inside a transformer boil down to matrix-vector multiplication, as you can probably tell, I like to explain these with a simple concrete example in mind for what kinds of tasks they might be doing. Here we focused on the Michael Jordan example. In the last chapter, on attention, we focused on adjectives updating the meanings of nouns, and earlier, with handwritten digits, we thought about composing edges to make simple shapes, and composing simple shapes to make symbols.

I want to emphasize that these are very simplified examples, meant to motivate the structure and make it more memorable, but the true behavior learned by the model during the gradient descent process may look completely different, and much less interpretable.

While the idea that these blocks can implement many different AND gates in parallel is hopefully memorable, it’s certainly not the only thing they could do. A famous result about multilayer perceptrons, the universal approximation theorem, shows that they can approximate essentially any continuous function, assuming the middle layer is big enough.

In this case, our example does reflect a few general facts. It is true that the rows of our first matrix correspond to directions in this embedding space, so the activation of a given neuron tells you something about how strongly an embedding correlates with a certain direction. It’s also true that the columns in the second matrix tell you what will be added to the result when the corresponding neuron is active. Those are just mathematical facts. However, unfortunately, there’s pretty good evidence that neurons rarely correspond to clean interpretable features.
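Those two mathematical facts can be seen in a toy version of the Michael Jordan example. This is a sketch under strong assumptions: the four directions are hypothetical and exactly perpendicular, there is a single neuron, and the clipping function is taken to be a ReLU. It is not a claim about what a trained model contains.

```python
import numpy as np

# Hypothetical, exactly perpendicular directions in a 4-dimensional toy space.
michael, jordan, basketball, other = np.eye(4)

W_up = np.array([michael + jordan])  # one neuron; its row is a direction
b_up = np.array([-1.0])              # bias so the neuron behaves like an AND gate
W_down = basketball.reshape(4, 1)    # one column: what an active neuron adds

def mlp(x):
    # Activation measures alignment with the row, shifted by the bias and clipped.
    activation = np.maximum(W_up @ x + b_up, 0.0)
    # Output is the neuron's column, scaled by how active the neuron is.
    return W_down @ activation

print(mlp(michael + jordan))  # [0. 0. 1. 0.]
print(mlp(michael + other))   # [0. 0. 0. 0.]
```

The neuron fires only when the input aligns with both the "Michael" and the "Jordan" directions, and when it fires, its column of the second matrix is what gets added to the result.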

One idea floating around the circles of interpretability researchers is the notion of superposition, which is a hypothesis about how these models can represent many more features than there are dimensions in these vectors, and this may explain both why they’re hard to interpret, and why they scale surprisingly well.

The basic idea is that if you have an N-dimensional space, and you want to represent a bunch of different features using directions that are all mutually perpendicular in that space so that adding a component of one doesn’t influence the others, you can only fit N such vectors. But if you relax the constraint and tolerate a little noise by allowing these features to be nearly perpendicular, say 89 degrees apart, the number of such directions you can cram into the space is much bigger, and scales exponentially with the dimension. With a simple bit of Python, I could comfortably fit millions of vectors into a 100-dimensional space such that they are all nearly perpendicular like this.
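One way to get a feel for near-perpendicularity is to sample random directions and measure the angles between them. The sketch below is not the script mentioned above: it uses plain random unit vectors with no optimization step, and the function name `pairwise_angles_deg` is made up for illustration. It only shows that random directions cluster around 90 degrees, with a spread of roughly 1/sqrt(N) radians, so about 6 degrees for N = 100 and about 0.6 degrees for N = 10,000.

```python
import numpy as np

def pairwise_angles_deg(n_vectors, dim, seed=0):
    """Angles, in degrees, between every pair of random unit vectors."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # scale each row to unit length
    cos = v @ v.T                                  # cosines of all pairwise angles
    i, j = np.triu_indices(n_vectors, k=1)         # count each pair once
    return np.degrees(np.arccos(np.clip(cos[i, j], -1.0, 1.0)))

for dim in (100, 10_000):
    angles = pairwise_angles_deg(n_vectors=200, dim=dim)
    print(dim, angles.min(), angles.max(), angles.std())
```

The narrowing spread at higher dimension is the effect the superposition hypothesis leans on. How many directions fit within a given tolerance is a separate question that this random sample does not answer.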

So using the GPT-3 numbers, this embedding space could hold many, many more than 12k independent directions, and the vector of neurons could effectively be probing at many more than 50k features. And because of this exponential scaling, if you trained a bigger model with 10 times the dimension, the number of ideas it could represent could potentially be way more than 10 times as much. I’ll leave some links to some great Anthropic posts about this idea.

The upshot is that an idea like “Michael Jordan” would probably not correspond to a particular individual neuron, but instead to a specific linear combination of neurons.

We’ve now hit the main parts of the architecture of a transformer, but there are some steps I skipped over, like a rescaling step that happens between layers, and the specifics for how position gets encoded into these token embeddings. Again, I’ll leave links for those of you hungry to learn more.

The main thing I’d like to cover in the next chapter is training. On the one hand, the short answer for how training works is that it’s all backpropagation, which we covered in a separate context in earlier chapters of the series. But there is more to discuss, like the specific cost function used for language models, the idea of fine-tuning using reinforcement learning from human feedback, and the notion of scaling laws.

There are a few non-machine-learning videos I’d like to sink my teeth into before making those chapters, but I promise they will come in due time.

Comments

The way you can describe such abstract concepts in such an intuitive way never ceases to amaze me. Non-linearity and activation functions are perfectly described

Stefano Sorrentino

Modern LLM architectures don't seem to use biases, I wonder why?

sylfae

👍

Gabriel Bergqvist

1. It's unfortunate from an interpretability perspective, which is what Grant is drawing on here. If it were the case that each neuron was one concept, then we could read off what concepts the network is using or even how the MLP operates by looking at the pattern of neuron activations (see e.g. Bau et al 2020 (https://www.pnas.org/doi/10.1073/pnas.1907375117) and Geva et al 2020 (https://arxiv.org/abs/2012.14913) for examples of this approach). But because we think neural networks represent concepts in superposition, not only can we not just interpret each neuron by itself, there's _no_ orthogonal transformation that recovers all the features! This poses a serious roadblock if the goal is to, say, reverse engineer what algorithm a neural network is implementing. And you're absolutely correct that this is a big part of why neural networks are so efficient -- GPT-3 seems to know way more than ~13k concepts, for example. If we were talking from an old parallel distributed processing (https://en.wikipedia.org/wiki/Parallel_processing_(psychology)) perspective, then we'd say this is a great thing, and exactly what we're going for!

Lawrence Chan

Just joined patreon to view this video ;-). At 2:09 the volume at the end of the word goal is somehow a bit too low. At 9:07 there is a little wind noise during the "p"s

Christian K

👏🏻👏🏻👏🏻👏🏻👏🏻👏🏻

Sasha Goldenson

A related link: https://softwaredoug.com/blog/2022/12/26/surpries-at-hi-dimensions-orthoginality (Thanks "niten" at http://invite.virtualvalley.ai )

Gabriel Bergqvist

Wow... I feel like this snippet does so well to reinforce the idea of what matrix-vector products can do and represent. The way you described the 2 ways to think about a matrix-vector product in the context of this example is great! I think fewer students would be discouraged from linear algebra class in undergrad if they had such a concrete example of how powerful the concepts can be, and this series is incredible for that.

Isaac Brooks

If you made up the final output distribution I feel like you should have a non zero value for "boxing" given that there should be a fair amount of association between that name and that sport because of the actor and the "Creed" series.

Benjamin Bailey

Nicely done I got sucked in right away

Reginald Carey

Another enjoyable journey even the stuff I don't quite understand.

Mark A Bjerke

crazy !

卢无方 湛

Great stuff! A typo: "metalic" in the first set of MLP direction/question examples should be "Metallic". Looking forward to the full version! Thanks!

Matt Godbolt

A great video as always :-) Animation notes: 1) at around 04:18 - the white dashed line jitters. I think that what the renderer does is to take the length of the dashed line, floor it to an integer number of the dash length, and the result is the number of dashes. This flooring operation introduces this discontinuity in the animation (If I got it right). It is somewhat esthetically obnoxious, so I would recommend fixing it. (Though not at a high priority, as those things are always time consuming. perhaps changing the linestyle is a faster patch ><) 2) at 10:45, I would add some emphasis animation to the weights when speaking about the linear operations at the beginning of the sentence, followed by emphasis animation to the neurons at the end of the sentence when mentioning the non linear operation. Content notes: 1) at 08:00 you give a very good motivation for the bias vector that is added, but then at 12:39 you kind of take a step back and say it is difficult to motivate why the bias is needed, which I found a bit confusing. 2) When talking about the activation function, consider mentioning that if it wasn't for the non-linear functions such as relu - every MLP, no matter how big it is, is completely equivalent to just one matrix multiplication applied to the input vector (because A*(B*(C*(D*v)))=(A*B*C*D)*v=M*v), and so it is intuitive that for anything more sophisticated than a simple matrix multiplication - we must have a non linear function. I found this a very good intuitive explanation for the need of the activation functions. Thanks!

Michael Kali

So it's like the model having an "Aha" moment when it combines Michael & Jordan and thinks of Basketball

Steve Chantry-Taylor

As always, thanks for great videos and materials. I'm teaching AI to students and developers, and I love the effect your videos have on how well my students understand the topics. Kudos and thanks for your great work. I make it clear they should subscribe to your YouTube channel, obviously. Now my feedback, as requested: 1. "However, unfortunately, there’s pretty good evidence that neurons rarely correspond to clean interpretable features." - Why "unfortunately", when that is exactly what makes these MLPs scale at all? 2. I actually like this part of an explanation I got from Claude-3.5-Sonnet: 'The Curse of Dimensionality Becomes a Blessing: In high-dimensional spaces, most randomly chosen vectors tend to be nearly orthogonal to each other. This counterintuitive property is often called the "blessing of dimensionality" in this context.' - IMHO, "most randomly chosen vectors tend to be nearly orthogonal to each other" is the key here.

Gabriel Bergqvist

You’re in good hands speaking with Neel Nanda about this stuff. His YouTube videos are great…he does swear a lot in them though lol. One thing to note though is that modern LLMs don’t tend to use bias in MLP or attention anymore. It just doesn’t appear to have much of an impact on the final result. I think it’s fine to use bias as part of the explanation for how this might work but probably also worth explaining how it might work without bias if you tweak the weights a bit.

Jake Ehrlich

Awesome stuff, can't wait for it to launch. Some general thoughts: 1. 5:22 you: "when this sequence of vectors", me: "ehh, which vectors? i think they're either the input tokens, or the output from the previous layer, or a combo of both, but not sure" 2. 7:50 nit: "metallic" 3. 4:17 me: "ah, i should really re-read my linear algebra texts..."

Michael Kokosenski

Excellent! Knowledge localization and interpretability is what I am getting my PhD in. I recommend this article for a good review of current work: https://arxiv.org/pdf/2405.00208 The interpretability channel in the EleutherAI discord is very active, as well as the mechanistic interpretability discord. Neel Nanda, Chris Olah, David Bau, and Jacob Steinhardt are major names in this field.

Alex Loftus

Yes! I've been waiting for this video, I love the series. Just imagine... the Youtube algorithm will recommend this video to millions of Michael Jordan fans out there. That should bump up the views more than the very few people that are interested in fluffy blue creatures! :)

Tom Lee

There's a typo around 4:23 - you spelled "Michael" as "Micahel" in the dot product

Ryland Goldman

Very interesting! I like this reasoning, my main critical comment is that I feel like it dismisses the contextualization done by the transformer beforehand. Assuming that we have the tokens 'Michael' and 'Jordan' initially, it seems to me (based on my understanding of previous videos) that these tokens could mutually contextualize each other in the transformer's attention blocks, probably in the direction of the basketball player. Some words explaining why this does not happen (perhaps the number of attention heads is too small to produce this level of contextualization) would be helpful IMHO

Michel Speiser

