Transformer Blocks
In a transformer, the attention mechanism builds richer and richer contextualized representations of the meanings of the input tokens, layer by layer. By the highest transformer blocks, the residual stream is usually representing the following token.
The transformer block includes the following components; a sketch of how they fit together appears after the list:
- Residual Stream
- Attention Layer
- MLP
- LayerNorm
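A minimal sketch of one block, assuming a GPT-2-style pre-LayerNorm layout. The class and attribute names here are illustrative, and nn.MultiheadAttention stands in for the per-head computation spelled out later in this section:

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Sketch of a pre-LayerNorm transformer block (GPT-2 style)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # 4x hidden size is the usual convention
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, resid_pre: torch.Tensor) -> torch.Tensor:
        # resid_pre: [batch, seq, d_model]
        seq = resid_pre.shape[1]
        # Causal mask: each position may only attend to itself and earlier positions.
        causal_mask = torch.triu(
            torch.full((seq, seq), float("-inf"), device=resid_pre.device), diagonal=1
        )
        # Attention reads from the (normalized) residual stream and writes back into it.
        normed = self.ln1(resid_pre)
        attn_out, _ = self.attn(
            normed, normed, normed, attn_mask=causal_mask, need_weights=False
        )
        resid_mid = resid_pre + attn_out
        # The MLP reads the updated stream and writes into it as well.
        resid_post = resid_mid + self.mlp(self.ln2(resid_mid))
        return resid_post
```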
Residual Stream
The residual stream is the sum of the outputs of all previous layers of the model, and it is also the input to each new layer. It is the central object of a transformer, because it is how the model (as sketched below):
- remembers information
- moves information between layers for composition
- stores the information that attention moves between positions.
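A small sketch of this running-sum view. The function and the attn_layers / mlp_layers placeholders are hypothetical; the only assumption is that each layer reads the current residual stream and returns a same-shaped vector to add back in:

```python
import torch


def run_residual_stream(embed, attn_layers, mlp_layers):
    """Illustrative only: each layer reads the stream and adds its output into it."""
    resid = embed  # [batch, seq, d_model]; the stream starts as the token embeddings
    writes = []
    for attn, mlp in zip(attn_layers, mlp_layers):
        attn_out = attn(resid)      # attention reads the whole stream...
        resid = resid + attn_out    # ...and writes its output into it
        mlp_out = mlp(resid)        # the MLP reads the updated stream...
        resid = resid + mlp_out     # ...and writes into it too
        writes += [attn_out, mlp_out]
    # The final stream is exactly the embedding plus everything every layer has written.
    assert torch.allclose(resid, embed + sum(writes))
    return resid
```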
Attention Layer
Each attention layer in a transformer moves information between pairs of embeddings. Attention layers determine which pairs should interact and what information should flow between them. Thus, attention layers can be seen as iteratively transforming the embedding vectors to build rich contextualized representations of the meanings of the input tokens.
Attention layers are made up of a number of heads - each with their own parameters, attention pattern, and information on how to copy things from source to destination. These heads act independently and additively.
Each attention head can be thought of as consisting of two different circuits (written out as matrices in the sketch after this list):
- The QK circuit, which determines where to move information to and from.
- The OV circuit, which determines what information to move.
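Both circuits can be written as explicit low-rank matrices built from the head's weights. In the column-vector convention used in the equations below (queries are $q_i = W_Q x_i$, and so on), the QK circuit is the bilinear form $W_Q^\top W_K$ and the OV circuit is the map $W_O W_V$. A sketch with illustrative shapes:

```python
import torch

d_model, d_head = 768, 64

# Per-head parameter matrices (illustrative random values, column-vector convention:
# q_i = W_Q @ x_i, k_j = W_K @ x_j, v_j = W_V @ x_j, head output = W_O @ z_i).
W_Q = torch.randn(d_head, d_model) / d_model**0.5
W_K = torch.randn(d_head, d_model) / d_model**0.5
W_V = torch.randn(d_head, d_model) / d_model**0.5
W_O = torch.randn(d_model, d_head) / d_head**0.5

# QK circuit: a bilinear form on pairs of residual-stream vectors.
# score(x_dst, x_src) = x_dst^T (W_Q^T W_K) x_src   -- "where to move information".
QK_circuit = W_Q.T @ W_K          # [d_model, d_model], rank <= d_head

# OV circuit: a linear map on the source vector.
# write(x_src) = (W_O W_V) x_src  -- "what to move" if the source is attended to.
OV_circuit = W_O @ W_V            # [d_model, d_model], rank <= d_head

x_src, x_dst = torch.randn(d_model), torch.randn(d_model)
score = x_dst @ QK_circuit @ x_src   # scalar attention score (before scaling and softmax)
write = OV_circuit @ x_src           # vector the head would add at the destination
```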
Let $x_j$ denote the contextual embedding vector at resid_pre for position $j$. For each head, queries and keys are computed as $q_i = W_Q x_i$ and $k_j = W_K x_j$, and the attention score between destination position $i$ and source position $j$ is given by
$$A^{\text{score}}_{ij} = q_i^\top k_j.$$
The attention scores are scaled by $1/\sqrt{d_{\text{head}}}$ and causally masked (scores for source positions after the destination are set to $-\infty$), and the result is used to compute the attention probabilities via a softmax over source positions,
$$A_{ij} = \mathrm{softmax}_j\!\left(\frac{A^{\text{score}}_{ij}}{\sqrt{d_{\text{head}}}}\right).$$
For each destination (query) position, we take a weighted average of the value vectors $v_j = W_V x_j$ over the source (key) positions, in accordance with how much attention the destination pays to each source,
$$z_i = \sum_j A_{ij}\, v_j.$$
A final linear transformation is applied to $z_i$ by the projection matrix $W_O$, mapping the vectors back to the right size to be added to the residual stream,
$$r_i = W_O\, z_i.$$
Finally, the outputs of all heads are summed to give the output of the attention layer.
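A from-scratch sketch of this computation for a whole attention layer. It assumes the per-head weight matrices are stacked along a leading head dimension; the function name and shapes are illustrative, not any particular library's API:

```python
import math

import torch


def attention_layer(resid_pre, W_Q, W_K, W_V, W_O):
    """Sketch: resid_pre is [seq, d_model]; W_Q, W_K, W_V are
    [n_heads, d_head, d_model]; W_O is [n_heads, d_model, d_head]."""
    seq, d_model = resid_pre.shape
    n_heads, d_head, _ = W_Q.shape

    # q_i = W_Q x_i, k_j = W_K x_j, v_j = W_V x_j for every head and position.
    q = torch.einsum("hdm,sm->hsd", W_Q, resid_pre)   # [n_heads, seq, d_head]
    k = torch.einsum("hdm,sm->hsd", W_K, resid_pre)
    v = torch.einsum("hdm,sm->hsd", W_V, resid_pre)

    # Attention scores between destination i and source j, scaled by sqrt(d_head).
    scores = torch.einsum("hid,hjd->hij", q, k) / math.sqrt(d_head)

    # Causal mask: destinations may not attend to later (future) source positions.
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    # Softmax over source positions gives the attention probabilities A_ij.
    probs = scores.softmax(dim=-1)                    # [n_heads, dst, src]

    # z_i = sum_j A_ij v_j: weighted average of value vectors at each destination.
    z = torch.einsum("hij,hjd->hid", probs, v)        # [n_heads, seq, d_head]

    # r_i = W_O z_i maps each head back to d_model; then the heads are summed.
    per_head_out = torch.einsum("hmd,hid->him", W_O, z)   # [n_heads, seq, d_model]
    return per_head_out.sum(dim=0)                    # [seq, d_model]
```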
MLP
The MLP operates on each position in the residual stream independently, and in exactly the same way. It does not move information between positions. Once attention has moved the relevant information to a single position in the residual stream, the MLP can do computation on it: reasoning, looking up information, and so on.
Each MLP layer is just a standard feed-forward network with a single hidden layer and a non-linear activation function. The hidden dimension is normally $4 \times d_{\text{model}}$.
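A minimal sketch, assuming the common GPT-2-style choices of a 4x hidden dimension and a GELU non-linearity. Because nn.Linear acts only on the last dimension, every position is processed independently and no information moves between positions:

```python
import torch
import torch.nn as nn

d_model = 768
d_mlp = 4 * d_model   # the usual convention for the hidden size

# One hidden layer with a non-linearity, applied to every position independently.
mlp = nn.Sequential(
    nn.Linear(d_model, d_mlp),
    nn.GELU(),                   # the exact non-linearity varies by model family
    nn.Linear(d_mlp, d_model),
)

resid_mid = torch.randn(2, 10, d_model)   # [batch, seq, d_model]
mlp_out = mlp(resid_mid)                  # same shape; no mixing across positions
```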
LayerNorm
LayerNorm is a simple normalization function applied at the start of each layer. It converts each input vector to have zero mean and unit variance, and then applies an elementwise scaling and translation.
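Spelled out, that is a normalization over the $d_{\text{model}}$ dimension followed by learned elementwise scale and shift parameters. A sketch (the eps term is the usual numerical-stability constant):

```python
import torch


def layer_norm(x, gamma, beta, eps=1e-5):
    """Sketch of LayerNorm over the last (d_model) dimension."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta                  # learned elementwise scale and translation


d_model = 768
x = torch.randn(2, 10, d_model)
out = layer_norm(x, gamma=torch.ones(d_model), beta=torch.zeros(d_model))
```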