NPFL114, Lecture 12
Transformer, External Memory Networks
Milan Straka May 20, 2019
Exams
Five questions, written preparation, then we go through it together (or you can leave and let me grade it by myself).
Each question is worth 20 points; additionally, up to 40 points (the surplus above 80 points; there is no distinction between regular and competition points) are transferred from the practicals, and up to 10 points are awarded for GitHub pull requests.
To pass the exam, you need to obtain at least 60, 75 and 90 out of 100 points for the written exam (plus up to 40 points from the practicals), to obtain grades 3, 2 and 1, respectively.
The SIS should give you an exact time of the exam (including a gap between students) so that you do not come all at once.
What Next
In the winter semester:
NPFL117 – Deep Learning Seminar [0/2 Ex]
Reading group of deep learning papers (from all areas). Every participant presents a paper about deep learning, learning how to read a paper and present it in an understandable way, and gets deep learning knowledge from the other presentations.
NPFL122 – Deep Reinforcement Learning [2/2 C+Ex]
In a sense a continuation of Deep Learning, but with reinforcement learning instead of supervised learning as the main method. Similar format to the Deep Learning course.
NPFL129 – Machine Learning 101
Neural Architecture Search (NASNet) – 2017
Using REINFORCE with baseline, we can design neural network architectures.
We fix the overall architecture and design only Normal and Reduction cells.
Figure 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Neural Architecture Search (NASNet) – 2017
Every block is designed by a RNN controller generating individual operations.
Figure 3 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Neural Architecture Search (NASNet) – 2017
Figure 4 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Neural Architecture Search (NASNet) – 2017
Figure 5 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.
Attention is All You Need
For some sequence processing tasks, sequential processing of the elements might be too restrictive. Instead, we may want to combine sequence elements independently of their distance.
Attention is All You Need
Figure 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Attention is All You Need
The attention module for queries Q, keys K and values V is defined as

  Attention(Q, K, V) = softmax(Q·K⊤ / √d_k) · V.

The queries, keys and values are computed from the current word representations W using linear transformations as

  Q = V^Q · W,   K = V^K · W,   V = V^V · W.

Attention is All You Need
Multihead attention is used in practice. Instead of using one huge attention, we split queries, keys and values to several groups (similar to how ResNeXt works), compute the attention in each of the groups separately, and then concatenate the results.
Scaled Dot-Product Attention (left) and Multi-Head Attention (right)
Figure 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
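As a sketch of the two operations, here is a minimal NumPy implementation of scaled dot-product attention and its multi-head variant (the projection-matrix names and shapes are illustrative choices, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(W, P_q, P_k, P_v, P_o, heads):
    # W: word representations, shape [seq_len, d_model].
    # P_q, P_k, P_v, P_o: projection matrices, shape [d_model, d_model].
    seq_len, d_model = W.shape
    d_head = d_model // heads
    # Project, then split into heads: [heads, seq_len, d_head].
    split = lambda P: (W @ P).reshape(seq_len, heads, d_head).transpose(1, 0, 2)
    heads_out = attention(split(P_q), split(P_k), split(P_v))
    # Concatenate the heads back and apply the output projection.
    concat = heads_out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ P_o
```

Note that with all-zero queries the scores are zero, so the softmax weights become uniform and each output row is simply the mean of the value rows.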
Attention is All You Need Positional Embeddings
We need to encode positional information (which was implicit in RNNs).
Learned embeddings for every position.
Sinusoids of different frequencies:
  PE_(pos, 2i)   = sin(pos / 10000^(2i/d))
  PE_(pos, 2i+1) = cos(pos / 10000^(2i/d))

This choice of functions should allow the model to attend to relative positions, since for any fixed k, PE_(pos+k) is a linear function of PE_pos.

Attention is All You Need
Positional embeddings for 20 words of dimension 512, lighter colors representing values closer to 1 and darker colors representing values closer to -1.
http://jalammar.github.io/illustrated-transformer/
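The sinusoidal embeddings can be computed directly; a small NumPy sketch following the formulas above (function name and argument names are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    positions = np.arange(max_len)[:, np.newaxis]          # [max_len, 1]
    frequencies = np.power(10000, np.arange(0, d, 2) / d)  # [d/2]
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(positions / frequencies)
    pe[:, 1::2] = np.cos(positions / frequencies)
    return pe
```

At position 0 all sine entries are 0 and all cosine entries are 1, and all values stay in [−1, 1], matching the visualization above.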
Attention is All You Need Regularization
The network is regularized by:
dropout of input embeddings,
dropout of each sub-layer, just before it is added to the residual connection (and then normalized),
label smoothing.
Default dropout rate and also label smoothing weight is 0.1.
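Label smoothing replaces the one-hot gold distribution with a softened one; a minimal sketch of the common uniform variant (the exact formulation may differ from the paper's implementation):

```python
import numpy as np

def smooth_labels(gold, num_classes, smoothing=0.1):
    # Mix the one-hot gold distribution with the uniform distribution:
    # (1 - smoothing) * one_hot + smoothing * uniform.
    dist = np.full(num_classes, smoothing / num_classes)
    dist[gold] += 1.0 - smoothing
    return dist
```

With smoothing 0.1 and 5 classes, the gold class gets probability 0.92 and every other class 0.02, so the model is discouraged from becoming overconfident.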
Parallel Execution
Training can be performed in parallel because of the masked attention – the softmax weights of the self-attention are zeroed out so that a word cannot attend to later words in the sequence.
However, inference is still sequential (and no substantial improvements have been achieved in speeding it up).
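The masking itself can be sketched as follows; in practice the weights are zeroed by setting the scores of future positions to −∞ before the softmax:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    # Causal self-attention: position i may only attend to positions j <= i.
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out future positions: exp(-inf) = 0 after the softmax.
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

The first position can attend only to itself, so its output equals its own value vector; the strictly upper triangle of the weight matrix is exactly zero.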
Why Attention
Table 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Transformers Results
        N   d_model  d_ff   h   d_k  d_v  P_drop  ε_ls  steps  PPL (dev)  BLEU (dev)  params ×10^6
base    6   512      2048   8   64   64   0.1     0.1   100K   4.92       25.8        65
(A)                         1   512  512                       5.29       24.9
                            4   128  128                       5.00       25.5
                            16  32   32                        4.91       25.8
                            32  16   16                        5.01       25.4
(B)                             16                             5.16       25.1        58
                                32                             5.01       25.4        60
(C)     2                                                      6.11       23.7        36
        4                                                      5.19       25.3        50
        8                                                      4.88       25.5        80
            256                 32   32                        5.75       24.5        28
            1024                128  128                       4.66       26.0        168
                     1024                                      5.12       25.4        53
                     4096                                      4.75       26.2        90
(D)                                       0.0                  5.77       24.6
                                          0.2                  4.95       25.5
                                                  0.0          4.67       25.3
                                                  0.2          5.47       25.7
(E)     positional embedding instead of sinusoids              4.92       25.7
big     6   1024     4096   16            0.3     300K         4.33       26.4        213

Empty cells are identical to the base model.
Table 4 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
Neural Turing Machines
So far, all input information was stored either directly in network weights, or in a state of a recurrent network.
However, mammalian brains seem to operate with a working memory – a capacity for short-term storage of information and its rule-based manipulation.
We can therefore try to introduce an external memory to a neural network. The memory will be a matrix M, where rows correspond to memory cells.
Neural Turing Machines
The network will control the memory using a controller, which reads from the memory and writes to it. Although the original paper also considered a feed-forward (non-recurrent) controller, the controller is usually a recurrent LSTM network.
Figure 1 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Neural Turing Machine Reading
To read the memory in a differentiable way, the controller at time t emits a read distribution w_t over memory locations, and the returned read vector r_t is then

  r_t = ∑_i w_t(i) · M_t(i).

Writing
Writing is performed in two steps – an erase followed by an add. The controller at time t emits a write distribution w_t over memory locations, and also an erase vector e_t and an add vector a_t. The memory is then updated as

  M_t(i) = M_(t−1)(i) · (1 − w_t(i) · e_t) + w_t(i) · a_t.
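Both operations are simple enough to sketch in NumPy (shapes and names are illustrative):

```python
import numpy as np

def ntm_read(memory, w):
    # r = sum_i w(i) * M(i): a convex combination of memory rows.
    return w @ memory

def ntm_write(memory, w, erase, add):
    # M(i) <- M(i) * (1 - w(i) * e) + w(i) * a, for all rows at once.
    memory = memory * (1 - np.outer(w, erase))
    return memory + np.outer(w, add)
```

With a one-hot write distribution and an all-ones erase vector, the selected row is completely overwritten by the add vector, while all other rows stay untouched.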
Neural Turing Machine
The addressing mechanism is designed to allow both content addressing, and
location addressing.
Figure 2 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Neural Turing Machine Content Addressing
Content addressing starts by the controller emitting a key vector k_t, which is compared to all memory locations M_t(i), generating a distribution using a softmax with temperature β_t:

  w_t^c(i) = exp(β_t · distance(k_t, M_t(i))) / ∑_j exp(β_t · distance(k_t, M_t(j))).

The distance measure is usually the cosine similarity

  distance(a, b) = (a · b) / (‖a‖ · ‖b‖).

Neural Turing Machine
Location-Based Addressing
To allow iterative access to memory, the controller might decide to reuse the memory location from the previous time step. Specifically, the controller emits an interpolation gate g_t and defines

  w_t^g = g_t · w_t^c + (1 − g_t) · w_(t−1).

Then, the current weighting may be shifted, i.e., the controller might decide to “rotate” the weights by a small integer. For a given range of shifts (the simplest case are only shifts {−1, 0, 1}), the network emits a softmax distribution s_t over the shifts, and the weights are then defined using a circular convolution

  w̃_t(i) = ∑_j w_t^g(j) · s_t(i − j).

Finally, not to lose precision over time, the controller emits a sharpening factor γ_t and the final memory location weights are

  w_t(i) = w̃_t(i)^(γ_t) / ∑_j w̃_t(j)^(γ_t).
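The whole addressing pipeline – content addressing, interpolation, circular shift, and sharpening – can be sketched as follows (a minimal NumPy illustration, with shifts restricted to {−1, 0, 1}; names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cosine_similarity(a, B):
    # Cosine similarity between key a and every row of memory B.
    return B @ a / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)

def ntm_addressing(memory, w_prev, key, beta, g, shift_weights, gamma):
    # 1) Content addressing: softmax with temperature beta.
    w_c = softmax(beta * cosine_similarity(key, memory))
    # 2) Interpolation with the previous weighting via gate g.
    w_g = g * w_c + (1 - g) * w_prev
    # 3) Circular convolution with the shift distribution over {-1, 0, 1}:
    #    np.roll(w_g, k)[i] == w_g[i - k], i.e. shift k rotates forward by k.
    w_tilde = np.zeros(len(w_g))
    for shift, s in zip((-1, 0, 1), shift_weights):
        w_tilde += s * np.roll(w_g, shift)
    # 4) Sharpening with factor gamma >= 1.
    w = w_tilde ** gamma
    return w / w.sum()
```

For example, with a high temperature, gate g = 1, and all shift mass on 0, the result concentrates on the memory row most similar to the key; putting the shift mass on +1 instead rotates that peak one location forward.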
Neural Turing Machine Overall Execution
Although not specified in the original paper, following the DNC paper, the controller can be implemented as a (potentially deep) LSTM. Assuming R read heads and one write head, the input is x_t together with the R read vectors r_(t−1)^1, …, r_(t−1)^R from the previous time step, the output of the controller are vectors (ν_t, ξ_t), and the final output is y_t = ν_t + W_r · [r_t^1, …, r_t^R]. The ξ_t is a concatenation of

  k_t^1, β_t^1, g_t^1, s_t^1, γ_t^1, k_t^2, β_t^2, g_t^2, s_t^2, γ_t^2, …, k_t^w, β_t^w, g_t^w, s_t^w, γ_t^w, e_t^w, a_t^w.
Neural Turing Machines Copy Task
Repeat the same sequence as given on input. Trained with sequences of length up to 20.
Figure 3 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Neural Turing Machines
Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20.
Neural Turing Machines
Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences of up to length 20 almost perfectly. However it clearly fails to generalise to longer sequences.
Also note that the length of the accurate prefix decreases as the sequence length increases, suggesting that the network has trouble retaining information for long periods.
Figure 5 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Neural Turing Machines
Neural Turing Machines Associative Recall
In associative recall, a sequence is given on input, consisting of subsequences of length 3. Then a randomly chosen subsequence is presented on input and the goal is to produce the following subsequence.
Neural Turing Machines
Figure 12 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.
Differentiable Neural Computer
NTM was later extended to a Differentiable Neural Computer.
Figure panels: a) Controller, b) Read and write heads, c) Memory, d) Memory usage and temporal links.
Figure 1 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.
Differentiable Neural Computer
The DNC contains multiple read heads and one write head.
The controller is a deep LSTM network, with input at time t being the current input x_t and the R read vectors r_(t−1)^1, …, r_(t−1)^R from the previous time step. The output of the controller are vectors (ν_t, ξ_t), and the final output is y_t = ν_t + W_r · [r_t^1, …, r_t^R]. The ξ_t is a concatenation of parameters for read and write heads (keys, gates, sharpening parameters, …).
In DNC, usage of every memory location is tracked, which allows us to define an allocation weighting. Furthermore, for every memory location we track which memory location was written to previously (b_t) and subsequently (f_t).
Then, the write weighting is defined as a weighted combination of the allocation weighting and the write content weighting, and the read weighting is computed as a weighted combination of the read content weighting, the previous write weighting, and the subsequent write weighting.
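Hypothetically, the read-weighting combination can be sketched as a convex mixture of the three read modes, where the mode distribution π is emitted by the controller (names are illustrative, not the paper's notation):

```python
import numpy as np

def dnc_read_weighting(w_backward, w_content, w_forward, pi):
    # pi is a distribution over the three read modes
    # (backward, content, forward); the result is their convex combination.
    return pi[0] * w_backward + pi[1] * w_content + pi[2] * w_forward
```

If π puts all mass on the content mode, the read weighting reduces to plain content addressing; mass on the forward or backward modes lets the head follow the temporal links between writes.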
Differentiable Neural Computer
Figure panels: a) Random graph, b) London Underground, c) Family tree.

Underground input: (OxfordCircus, TottenhamCtRd, Central) (TottenhamCtRd, OxfordCircus, Central) (BakerSt, Marylebone, Circle) (BakerSt, Marylebone, Bakerloo) (BakerSt, OxfordCircus, Bakerloo) (LeicesterSq, CharingCross, Northern) (TottenhamCtRd, LeicesterSq, Northern) (OxfordCircus, PiccadillyCircus, Bakerloo) …

Traversal question: (BondSt, _, Central), (_, _, Circle), (_, _, Circle), (_, _, Circle), (_, _, Circle), (_, _, Jubilee), (_, _, Jubilee), …
Answer: (BondSt, NottingHillGate, Central) (NottingHillGate, GloucesterRd, Circle) …

Shortest-path question: (Moorgate, PiccadillyCircus, _)
Answer: (Moorgate, Bank, Northern) (Bank, Holborn, Central)

Family tree input: (Charlotte, Alan, Father) (Simon, Steve, Father) (Steve, Simon, Son1) (Nina, Alison, Mother) (Lindsey, Fergus, Son1) (Bob, Jane, Mother) (Natalie, Alice, Mother) (Mary, Ian, Father) …

Inference question: (Freya, _, MaternalGreatUncle)
Answer: (Freya, Fergus, MaternalGreatUncle)
Differentiable Neural Computer
Figure panels: a) read and write weightings, b) read modes (backward / content / forward) for the write head and the two read heads, c) London Underground map, d) read key, e) location content (decoded memory locations).
Figure 3 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.
Memory-augmented Neural Networks
Figure 1 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
Memory-augmented NNs
Page 3 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
Page 4 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.