XCS224N Problem Set 5 Self-Attention, Transformers, Pre-Training
Guidelines
1. If you have a question about this homework, we encourage you to post your question on our Slack channel, at
http://xcs224n-scpd.slack.com/
2. Familiarize yourself with the collaboration and honor code policy before starting work.
3. For the coding problems, you must use the packages specified in the provided environment description. Since
the autograder uses this environment, we will not be able to grade any submissions which import unexpected
libraries.
Submission Instructions
Written Submission: Some extra credit questions in this assignment require a written response. For these
questions, you should submit a PDF with your solutions online in the online student portal. As long as the PDF is
legible and organized, the course staff has no preference between a handwritten and a typeset LaTeX submission. If
you wish to typeset your submission and are new to LaTeX, you can get started with the following:
• Type responses only in submission.tex.
• Submit the compiled PDF, not submission.tex.
• Use the commented instructions within the Makefile and README.md to get started.
Coding Submission: Some questions in this assignment require a coding response. For these questions, you should
submit all files indicated in the question to the online student portal. For further details, see Writing Code
and Running the Autograder below.
Honor code
We strongly encourage students to form study groups. Students may discuss and work on homework problems
in groups. However, each student must write down the solutions independently, and without referring to written
notes from the joint session. In other words, each student must understand the solution well enough in order to
reconstruct it by him/herself. In addition, each student should write on the problem set the set of people with
whom s/he collaborated. Further, because we occasionally reuse problem set questions from previous years, we
expect students not to copy, refer to, or look at the solutions in preparing their answers. It is an honor code
violation to intentionally refer to a previous year’s solutions. More information regarding the Stanford honor code
can be found at https://communitystandards.stanford.edu/policies-and-guidance/honor-code.
• hidden: These unit tests are the evaluated elements of the assignment, and run your code with more complex
inputs and corner cases. Just because your code passed the basic local tests does not necessarily mean that
it will pass all of the hidden tests. These evaluative hidden tests will be run when you submit your code to
the Gradescope autograder via the online student portal, and will provide feedback on how many points you
have earned.
For debugging purposes, you can run a single unit test locally. For example, you can run the test case 3a-0-basic
using the following terminal command within the src/ subdirectory:
$ python grader.py 3a-0-basic
Before beginning this course, please walk through the Anaconda Setup for XCS Courses to familiarize yourself with
the coding environment. Use the env defined in src/environment.yml to run your code. This is the same environment
used by the online autograder.
Note. Here are some things to keep in mind as you plan your time for this assignment.
• The total amount of PyTorch code to write for this assignment, and its complexity, is lower than in
Assignment 4. However, you're also given less guidance and scaffolding for how to write the code.
• This assignment involves a pretraining step that takes approximately 2 hours to perform on Azure, and
you’ll have to do it twice.
This assignment is an investigation into Transformer self-attention building blocks, and the effects of pretraining.
It covers mathematical properties of Transformers and self-attention through written questions. Further, you’ll get
experience with practical system-building through repurposing an existing codebase. The assignment is split into a
coding part and an extra credit written (mathematical) part. Here’s a quick summary:
1. Extending a research codebase: In this portion of the assignment, you’ll get some experience and intuition
for a cutting-edge research topic in NLP: teaching NLP models facts about the world through pretraining,
and accessing that knowledge through finetuning. You’ll train a Transformer model to attempt to answer
simple questions of the form “Where was person [x] born?” – without providing any input text from which to
draw the answer. You’ll find that models are able to learn some facts about where people were born through
pretraining, and access that information during fine-tuning to answer the questions.
Then, you’ll take a harder look at the system you built, and reason about the implications and concerns about
relying on such implicit pretrained knowledge.
2. Mathematical exploration: What kinds of operations can self-attention easily implement? Why should we
use fancier things like multi-headed self-attention? This section will use some mathematical investigations to
illuminate a few of the motivations of self-attention and Transformer networks.
(a) [0 points (Coding)] Review the minGPT demo code (no need to submit code or written)
Note that you do not have to write any code or submit written answers for this part.
In the src/submission/mingpt-demo/ folder, there is a Jupyter notebook (play_char.ipynb) that trains and
samples from a Transformer language model. Take a look at it locally on your computer (you may need to
install Jupyter with pip install jupyter, or use VS Code 2 ) to get somewhat familiar with how the code
defines and trains models. You don't need to run the training locally, because training would take a long
time in a CPU-only local environment. Some of the code you are writing below will be inspired by what you
see in this notebook.
(b) [0 points (Coding)] Read through NameDataset in src/submission/dataset.py, our dataset for
reading name-birth place pairs.
The task we’ll be working on with our pretrained models is attempting to access the birth place of a notable
person, as written in their Wikipedia page. We’ll think of this as a particularly simple form of question
answering:
Q: Where was [person] born?
A: [place]
From now on, you’ll be working with the src/submission folder. The code in mingpt-demo/ won’t be
changed or evaluated for this assignment. In dataset.py, you'll find the class NameDataset, which
reads a TSV (tab-separated values) file of name/place pairs and produces examples of the above form that we
can feed to our Transformer model.
To get a sense of the examples we’ll be working with, if you run the following code, it’ll load your NameDataset
on the training set birth_places_train.tsv and print out a few examples.
cd src/submission
python dataset.py namedata
Note that you do not have to write any code or submit written answers for this part.
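To make the format above concrete, here is a rough illustrative sketch (not the provided implementation) of how a
line of the TSV file could be turned into a question/answer example. The exact prompt string, mask characters, and
padding are defined in dataset.py, so treat everything below, including the prompt wording, as an assumption.

# Illustrative only: the real NameDataset defines the exact prompt format,
# padding, and masking used for training.
def make_example(tsv_line: str) -> str:
    name, place = tsv_line.rstrip("\n").split("\t")
    return f"Where was {name} born?\t{place}"

with open("birth_places_train.tsv", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(make_example(line))
        if i == 2:  # peek at just a few examples
            break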
(c) [20 points (Coding)] Define a span corruption function for pretraining.
In the file src/submission/dataset.py, implement the __getitem__() function for the dataset class
CharCorruptionDataset. Follow the instructions provided in the comments in dataset.py. Span corruption
1 https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
2 https://code.visualstudio.com/docs/datascience/jupyter-notebooks
is explored in the T5 paper 3 . It randomly selects spans of text in a document and replaces them with unique
tokens (noising). Models take this noised text, and are required to output a pattern of each unique sentinel
followed by the tokens that were replaced by that sentinel in the input. In this question, you’ll implement a
simplification that only masks out a single sequence of characters.
This question will be graded via the autograder based on whether your span corruption function implements some
basic properties of our spec. We’ll instantiate the CharCorruptionDataset with our own data, and draw
examples from it.
To help you debug, if you run the following code, it’ll sample a few examples from your CharCorruptionDataset
on the pretraining dataset wiki.txt and print them out for you.
cd src/submission
python dataset.py charcorruption
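As a mental model for the simplified span corruption described above, here is a rough character-level sketch. It is
not the graded implementation; the graded version must follow the exact truncation lengths, mask/pad characters, and
output format spelled out in the comments of dataset.py, so the specific constants and length choices below are
assumptions.

import random

MASK_CHAR = "\u2047"  # assumed sentinel character; see dataset.py for the real one
PAD_CHAR = "\u25A1"   # assumed padding character; see dataset.py for the real one

def corrupt(document: str, block_size: int) -> str:
    # Illustrative single-span corruption:
    #   [prefix] MASK [suffix] MASK [masked content] PAD ... PAD
    doc = document[: block_size // 2]                 # truncate (real spec: random length)
    span_len = random.randint(1, max(1, len(doc) // 4))
    start = random.randint(0, len(doc) - span_len)
    prefix = doc[:start]
    masked = doc[start : start + span_len]
    suffix = doc[start + span_len :]
    out = prefix + MASK_CHAR + suffix + MASK_CHAR + masked
    return out + PAD_CHAR * (block_size - len(out))   # pad to a fixed length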
Training will take less than 10 minutes (on Azure). Your grades will be based on the output files from the
run.
Don’t be surprised if the evaluation result is well below 10%; we will be digging into why in Part 2. As a
reference point, we want to also calculate the accuracy the model would have achieved if it had just predicted
“London” as the birth place for everyone in the dev set.
(f) [20 points (Coding)] Pretrain, finetune, and make predictions. Budget 2 hours for training.
Now fill in the pretrain portion of src/submission/helper.py, which will pretrain a model on the span
corruption task. Additionally, modify your finetune portion to handle finetuning in the case with pretraining.
3 https://arxiv.org/pdf/1910.10683.pdf
In particular, if a path to a pretrained model is provided in the bash command, load this model before finetuning
it on the birth-place prediction task. Pretrain your model on wiki.txt (which should take approximately two
hours), finetune it on NameDataset and evaluate it. Specifically, you should be able to run the following four
commands:
# Pretrain the model
./run.sh vanilla_pretrain
We expect the dev accuracy to be at least 10%, and expect a similar accuracy on the held-out test set.
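A minimal sketch of the pretrain-then-finetune wiring might look like the following. The argument name
reading_params_path is our assumption for illustration; use whatever names the [part f] comments and the run.sh
commands in the starter code actually prescribe.

import torch

# Sketch only: in helper.py, load pretrained weights before finetuning when a
# checkpoint path is supplied on the command line.
def maybe_load_pretrained(model, reading_params_path=None, device="cpu"):
    if reading_params_path is not None:
        state_dict = torch.load(reading_params_path, map_location=device)
        model.load_state_dict(state_dict)
    return model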
(g) [14 points (Coding)] Research! Write and try out a more efficient variant of Attention (Budget
2 hours for pretraining!)
We'll now turn to changing the Transformer architecture itself, specifically the first and last transformer
blocks. The transformer model uses a self-attention scoring function based on dot products, which involves a
rather intensive computation that is quadratic in the sequence length: the dot product is computed between
ℓ² pairs of word vectors in each self-attention layer, where ℓ is the sequence length. If we can reduce the
length of the sequence passed to the self-attention module, we should observe a significant reduction in
compute. For example, if we can reduce the sequence length by half, we save around 75% of the compute time,
since (ℓ/2)² = ℓ²/4!
PerceiverAR 4 proposes a solution to make the model more efficient by reducing the sequence length of the
input to self-attention for the intermediate layers. In the first layer, the input sequence is projected onto
a lower-dimensional basis. Subsequently, all self-attention layers operate in this smaller subspace. The last
layer projects the output back to the original input sequence length. In this assignment, we propose a simpler
version of the PerceiverAR transformer model.
Figure 1: Illustration of the transformer block. (Transformer decoder stack: decoder inputs → embeddings →
add position embeddings → Block, repeated for the number of blocks → linear → softmax → probabilities.)
The provided CausalSelfAttention layer implements the following attention for each head of the multi-headed
attention. Let X ∈ Rℓ×d (where ℓ is the block size and d is the total dimensionality; d/h is the dimensionality
per head).5 Let Qi, Ki, Vi ∈ Rd×d/h. Then the output of the self-attention head is

    Yi = softmax( (XQi)(XKi)⊤ / √(d/h) ) (XVi)    (1)

where Yi ∈ Rℓ×d/h. Then the output of the self-attention is a linear transformation of the concatenation of
the heads:

    Y = [Y1; . . . ; Yh] A    (2)

where A ∈ Rd×d and [Y1; . . . ; Yh] ∈ Rℓ×d. The code also includes dropout layers which we haven't written
here. We suggest looking at the provided code and noting how this equation is implemented in PyTorch.

4 https://arxiv.org/abs/2202.07765
5 Note that these dimensionalities do not include the minibatch dimension.
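For intuition about how Equations 1 and 2 map onto tensor operations, here is a condensed sketch in the spirit of
the provided CausalSelfAttention. Dropout and the causal mask are omitted, and the function and parameter names are
our own; the provided attention.py is the authoritative implementation.

import math
import torch.nn.functional as F

def multi_head_self_attention(x, W_q, W_k, W_v, A, h):
    # x: (batch, l, d); W_q, W_k, W_v, A: (d, d); h: number of heads.
    # W_q stacks the per-head projections Q_1, ..., Q_h side by side (same for K, V).
    B, l, d = x.shape
    q = (x @ W_q).view(B, l, h, d // h).transpose(1, 2)    # (B, h, l, d/h)
    k = (x @ W_k).view(B, l, h, d // h).transpose(1, 2)
    v = (x @ W_v).view(B, l, h, d // h).transpose(1, 2)
    # Equation (1): scaled dot-product attention, computed per head
    att = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d // h), dim=-1)
    y = att @ v                                            # (B, h, l, d/h)
    # Equation (2): concatenate the heads and apply the output projection A
    y = y.transpose(1, 2).contiguous().view(B, l, d)
    return y @ A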
Our model uses this self-attention layer in the transformer block as shown in Figure 1. As discussed in the
lecture, the transformer block contains residual connections and layer normalization layers. If we compare
this diagram with the Block code provided in model.py, we notice that the implementation does not perform
layer normalization on the output of the MLP (Feed-Forward), but on the input of the Block. This can be
considered equivalent since we have a series of transformer blocks on top of each other.
In the Perceiver model architecture, we replace the first transformer Block in the model with the DownProjectBlock.
This block reduces the length of the sequence from ℓ to m. This is followed by a series of regular transformer
blocks, which would now perform self-attention on the reduced sequence length of m. We replace the last
block of the model with the UpProjectBlock, which takes in the length-m output of the previous block, and
projects it back to the original sequence length of ℓ.
You need to implement the DownProjectBlock in model.py that reduces the dimensionality of the sequence
in the first block. To do this, perform cross-attention on the input sequence with a learnable basis C ∈ Rm×d
as the query, where m < ℓ. Consequently, Equation 1 becomes:
    Yi^(1) = softmax( (CQi)(XKi)⊤ / √(d/h) ) (XVi)    (3)

resulting in Yi^(1) ∈ Rm×d/h, with the superscript (1) denoting that the output corresponds to the first layer.
With this dimensionality reduction, the subsequent CausalSelfAttention layers operate on inputs ∈ Rm×d instead of
Rℓ×d. We refer to m as bottleneck_dim in the code. Note that for implementing Equation 3, we need to
perform cross attention between the learnable basis C and the input sequence. This has been provided to
you as the CausalCrossAttention layer. We recommend reading through attention.py to understand how
to use the cross-attention layer, and map which arguments correspond to the key, value and query inputs.
Initialize the basis vector matrix C using Xavier Uniform initialization.
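To give a sense of what the DownProjectBlock involves, here is a rough sketch. The constructor arguments, attribute
names, residual/layer-norm arrangement, and especially the argument order of CausalCrossAttention are assumptions on
our part; follow the provided Block in model.py and the interfaces in attention.py for the real structure.

import torch
import torch.nn as nn

class DownProjectBlockSketch(nn.Module):
    # Sketch: cross-attend a learnable basis C (m x d) against the length-l input,
    # producing a length-m sequence. Not the graded implementation.
    def __init__(self, config, cross_attention_cls):
        super().__init__()
        m, d = config.bottleneck_dim, config.n_embd     # assumed config attribute names
        self.C = nn.Parameter(torch.empty(m, d))        # learnable basis C in R^{m x d}
        nn.init.xavier_uniform_(self.C)                 # Xavier uniform init, as required
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.attn = cross_attention_cls(config)         # e.g. the provided CausalCrossAttention
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        B = x.size(0)
        c = self.C.unsqueeze(0).expand(B, -1, -1)       # (B, m, d) queries
        # Assumed call order: (keys/values source, queries); verify against attention.py.
        y = c + self.attn(self.ln1(x), c)               # Equation (3): C queries, x keys/values
        return y + self.mlp(self.ln2(y))                # one possible residual/MLP arrangement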
To get back to the original dimensions, the last block in the model is replaced with the UpProjectBlock.
This block brings the output sequence length back to the input sequence length by performing
cross-attention on the previous layer's output Y^(L−1) with the original input vector X as the query:

    Yi^(L) = softmax( (XQi)(Y^(L−1)Ki)⊤ / √(d/h) ) (Y^(L−1)Vi)    (4)

where L is the total number of layers. This results in the final output vector having the same dimension as
expected in the original CausalSelfAttention mechanism. Implement this functionality in the UpProjectBlock
in src/submission/model.py.
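Similarly, a hypothetical UpProjectBlock could look like the sketch below, with the original input x supplying the
queries and the previous layer's length-m output supplying the keys and values. As above, the call order of
CausalCrossAttention, the forward signature, and the residual arrangement are assumptions to verify against the
provided code.

import torch.nn as nn

class UpProjectBlockSketch(nn.Module):
    # Sketch: project the length-m sequence back to length l via cross-attention.
    # Not the graded implementation; names and call order are assumptions.
    def __init__(self, config, cross_attention_cls):
        super().__init__()
        d = config.n_embd
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.attn = cross_attention_cls(config)         # e.g. the provided CausalCrossAttention
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, y_prev, x_orig):
        # Assumed call order: (keys/values source, queries); verify against attention.py.
        y = x_orig + self.attn(self.ln1(y_prev), x_orig)   # Equation (4): x queries, y_prev keys/values
        return y + self.mlp(self.ln2(y))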
We provide the code to assemble the model using your implemented DownProjectBlock and UpProjectBlock.
The model uses these blocks when the variant parameter is specified as perceiver.
In the rest of the code in src/submission/helper.py, modify your model to support using either
CausalSelfAttention or CausalCrossAttention. Add the ability to switch between these attention variants
depending on whether “vanilla” (for causal self-attention) or “perceiver” (for the perceiver variant) is selected
in the command line arguments (see the section marked [part g] in src/submission/helper.py).
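The switch itself can be quite small. A hedged sketch follows; the variable names (args.variant, model_config, GPT)
are placeholders standing in for whatever the starter code in helper.py actually uses.

# Sketch of the attention-variant switch (names here are placeholders):
if args.variant == "vanilla":
    model_config.variant = "vanilla"            # all blocks use CausalSelfAttention
elif args.variant == "perceiver":
    model_config.variant = "perceiver"          # first/last blocks become Down/UpProjectBlock
    model_config.bottleneck_dim = args.bottleneck_dim
else:
    raise ValueError(f"Unknown variant: {args.variant}")
model = GPT(model_config)                       # the provided assembly code reads the variant parameter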
Below are bash commands that your code should support in order to pretrain the model, finetune it, and make
predictions on the dev and test sets. Note that the pretraining process will take approximately 2 hours.
Your model should get at least 6% accuracy on the dev set.
# Pretrain the model
./run.sh perceiver_pretrain
./run.sh perceiver_finetune_with_pretrain
Deliverables
For this assignment, please submit the following files from the src/submission directory. Upload the files
without any directory structure.
This includes:
4. src/submission/helper.py
5. src/submission/model.py
6. src/submission/trainer.py
7. src/submission/utils.py
8. src/submission/vanilla.model.params
9. src/submission/vanilla.nopretrain.dev.predictions
10. src/submission/vanilla.nopretrain.test.predictions
11. src/submission/vanilla.pretrain.params
12. src/submission/vanilla.finetune.params
13. src/submission/vanilla.pretrain.dev.predictions
14. src/submission/vanilla.pretrain.test.predictions
15. src/submission/perceiver.pretrain.params
16. src/submission/perceiver.finetune.params
17. src/submission/perceiver.pretrain.dev.predictions
18. src/submission/perceiver.pretrain.test.predictions
(a) Succinctly explain why the pretrained (vanilla) model was able to achieve a higher accuracy than the
non-pretrained model.
Pretraining, with some probability, masks out the name of a person while providing the birth place, or masks out
the birth place while providing the name – this teaches the model to associate the names with the birthplaces. At
finetuning time, this information can be accessed, since it has been encoded in the parameters (the initialization).
Without pretraining, there's no way for the model to have any knowledge of the birth places of
people that weren't in the finetuning training set, so it can't get above a simple heuristic baseline (like the London
baseline).
(b) Take a look at some of the correct predictions of the pretrain+finetuned vanilla model, as well as some of the
errors. We think you’ll find that it’s impossible to tell, just looking at the output, whether the model retrieved
the correct birth place, or made up an incorrect birth place. Consider the implications of this for user-facing
systems that involve pretrained NLP components. Come up with two reasons why this indeterminacy of model
behavior may cause concern for such applications.
There is a large space of possible reasons indeterminacy could cause concern for user-facing applications. We
deducted points if the two provided reasons were too similar, or if one followed directly from the other. For
example, “the user won’t know when the system is wrong” and “the user will make incorrect decisions based on
false information” is really cause-and-effect rather than two distinct reasons for concern. Answers about general
issues such as low accuracy were also not accepted. Here are some possible answers:
(a) Users will always get outputs that look valid (if the user doesn’t know the real answer) and so won’t be able to
perform quality estimation themselves (like one sometimes can when, e.g., a translation seems nonsensical).
System designers also don’t have a way of filtering outputs for low-confidence predictions. Users may believe
invalid answers and make incorrect decisions (or inadvertently spread disinformation) as a result.
(b) Once users realize the system can output plausible but incorrect answers, they may stop trusting the system,
therefore making it useless.
(c) Models will not indicate that they simply do not know the birth place of a person (unlike a relational database
or similar, which will return that the knowledge is not contained in it). This means the system cannot indicate
a question is unanswerable.
(d) Made up answers could be biased or offensive.
(e) There is little avenue for recourse if users believe an answer is wrong, as it's impossible to determine whether
the model is retrieving some gold-standard knowledge (in which case the user's request to change the knowledge
should be rejected), or just making something up (in which case the user's request to change the knowledge
should be granted).
(c) If your model didn’t see a person’s name at pretraining time, and that person was not seen at fine-tuning time
either, it is not possible for it to have “learned” where they lived. Yet, your model will produce something as
a predicted birth place for that person’s name if asked. Concisely describe a strategy your model might take
for predicting a birth place for that person’s name, and one reason why this should cause concern for the use
of such applications.
1. The model could use character-level phonetic-like (sound-like) information to make judgments about where a person
was born based on how their name “sounds”, likely leading to racist outputs.
2. The model could learn that certain names or types of names tend to be of people born in richer cities, leading to
classist outputs that predict a birth place simply based on whether the names are like that of rich people or poorer
people.
3 Attention exploration
Multi-headed self-attention is the core modeling component of Transformers. In this question, we’ll get some
practice working with the self-attention equations, and motivate why multi-headed self-attention can be preferable
to single-headed self-attention.
(a) [2 points (Written, Extra Credit)] Copying in attention: Recall that attention can be viewed as
an operation on a query q ∈ Rd , a set of value vectors {v1 , . . . , vn }, vi ∈ Rd , and a set of key vectors
{k1 , . . . , kn }, ki ∈ Rd , specified as follows:
    c = Σ_{i=1}^n αi vi    (5)

    αi = exp(ki⊤ q) / Σ_{j=1}^n exp(kj⊤ q)    (6)
where αi are frequently called the “attention weights”, and the output c ∈ Rd is a correspondingly weighted
average over the value vectors.
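As a tiny numerical illustration of Equations (5) and (6) (our own toy numbers, not part of the assignment): take
n = 2, k1 = (1, 0), k2 = (0, 1), and q = (5, 0). Then

    α1 = exp(k1⊤ q) / (exp(k1⊤ q) + exp(k2⊤ q)) = e⁵ / (e⁵ + e⁰) ≈ 0.993,    α2 ≈ 0.007,

so c = α1 v1 + α2 v2 ≈ v1: the output is nearly a copy of the value whose key aligns most strongly with the query.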
We’ll first show that it’s particularly simple for attention to “copy” a value vector to the output c. Describe
(in one sentence) what properties of the inputs to the attention operation would result in the output c being
approximately equal to vj for some j ∈ {1, . . . , n}. Specifically, what must be true about the query q, the
values {v1 , . . . , vn } and/or the keys {k1 , . . . , kn }?
(b) [2 points (Written, Extra Credit)] An average of two: Consider a set of key vectors {k1 , . . . , kn } where
all key vectors are perpendicular, that is ki ⊥ kj for all i ≠ j. Let ∥ki∥ = 1 for all i. Let {v1 , . . . , vn } be a
set of arbitrary value vectors. Let va , vb ∈ {v1 , . . . , vn } be two of the value vectors. Give an expression for a
query vector q such that the output c is approximately equal to the average of va and vb, that is, ½(va + vb).6
Note that you can reference the corresponding key vector of va and vb as ka and kb .
(c) [3 points (Written, Extra Credit)] Drawbacks of single-headed attention: In the previous part, we
saw how it was possible for a single-headed attention to focus equally on two values. The same concept could
easily be extended to any subset of values. In this question we’ll see why it’s not a practical solution. Consider
a set of key vectors {k1 , . . . , kn } that are now randomly sampled, ki ∼ N (µi , Σi ), where the means µi are
known to you, but the covariances Σi are unknown. Further, assume that the means µi are all perpendicular;
µi⊤ µj = 0 if i ≠ j, and unit norm, ∥µi∥ = 1.
i. (1 point) Assume that the covariance matrices are Σi = αI, for vanishingly small α. Design a query q in
terms of the µi such that as before, c ≈ ½(va + vb), and provide a brief argument as to why it works.
ii. (2 points) Though single-headed attention is resistant to small perturbations in the keys, some types of
larger perturbations may pose a bigger issue. Specifically, in some cases, one key vector ka may be larger
or smaller in norm than the others, while still pointing in the same direction as µa . As an example, let
us consider a covariance for item a as Σa = αI + ½(µa µa⊤) for vanishingly small α (as shown in Figure 2).
Further, let Σi = αI for all i ≠ a.
When you sample {k1 , . . . , kn } multiple times, and use the q vector that you defined in part i., what
qualitatively do you expect the vector c will look like for different samples?
(d) [3 points (Written, Extra Credit)] Benefits of multi-headed attention: Now we’ll see some of the
power of multi-headed attention. We’ll consider a simple version of multi-headed attention which is identical
to single-headed self-attention as we’ve presented it in this homework, except two query vectors (q1 and q2 )
are defined, which leads to a pair of vectors (c1 and c2 ), each the output of single-headed attention given its
respective query vector. The final output of the multi-headed attention is their average, ½(c1 + c2). As in
question 3(c), consider a set of key vectors {k1 , . . . , kn } that are randomly sampled, ki ∼ N (µi , Σi ), where
the means µi are known to you, but the covariances Σi are unknown. Also as before, assume that the means
µi are mutually orthogonal; µi⊤ µj = 0 if i ≠ j, and unit norm, ∥µi∥ = 1.
6 Hint: while the softmax function will never exactly average the two vectors, you can get close by using a large scalar multiple in the
expression.
Figure 2: The vector µa (shown here in 2D as an example), with the range of possible
values of ka shown in red. As mentioned previously, ka points in roughly the same
direction as µa , but may have larger or smaller magnitude.
i. (1 point) Assume that the covariance matrices are Σi = αI, for vanishingly small α. Design q1 and q2
such that c is approximately equal to ½(va + vb).
ii. (2 points) Assume that the covariance matrices are Σa = αI + ½(µa µa⊤) for vanishingly small α, and
Σi = αI for all i ≠ a. Take the query vectors q1 and q2 that you designed in part i. What, qualitatively,
do you expect the output c to look like across different samples of the key vectors? Please briefly explain
why. You can ignore cases in which qi⊤ ka < 0.
This handout includes space for every question that requires a written response. Please feel free to use it to handwrite
your solutions (legibly, please). If you choose to typeset your solutions, the README.md for this assignment includes
instructions to regenerate this handout with your typeset LaTeX solutions.
3.a
3.b
3.c.i
3.c.ii
3.d.i
3.d.ii