
(1)

NPFL114, Lecture 12

Transformer, External Memory Networks

Milan Straka May 20, 2019

(2)

Exams

Five questions, written preparation, then we go through it together (or you can leave and let me grade it by myself).

Each question is worth 20 points; in addition, up to 40 points (the surplus above 80 points, with no distinction between regular and competition points) are transferred from the practicals, and up to 10 points can be earned for GitHub pull requests.

To pass the exam, you need to obtain at least 60, 75 and 90 out of 100 points for the written exam (plus up to 40 points from the practicals), to obtain grades 3, 2 and 1, respectively.

The SIS should give you an exact time of the exam (including a gap between students) so that you do not come all at once.

(3)

What Next

In the winter semester:

NPFL117 – Deep Learning Seminar [0/2 Ex]

Reading group of deep learning papers (from all areas). Every participant presents a paper about deep learning, learning how to read a paper, how to present it in an understandable way, and gaining deep learning knowledge from the other presentations.

NPFL122 – Deep Reinforcement Learning [2/2 C+Ex]

In a sense a continuation of Deep Learning, but with reinforcement learning instead of supervised learning as the main method. Similar format to the Deep Learning course.

NPFL129 – Machine Learning 101

(4)

Neural Architecture Search (NASNet) – 2017

Using REINFORCE with baseline, we can design neural network architectures.

We fix the overall architecture and design only Normal and Reduction cells.

Figure 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.

(5)

Neural Architecture Search (NASNet) – 2017

Every block is designed by a RNN controller generating individual operations.

Figure 3 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.

(6)

Neural Architecture Search (NASNet) – 2017

Figure 4 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.

(7)

Neural Architecture Search (NASNet) – 2017

(8)

Neural Architecture Search (NASNet) – 2017

Figure 5 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.

(9)

Attention is All You Need

For some sequence processing tasks, sequential processing of the sequence elements might be too restricting. Instead, we may want to combine sequence elements independently of their distance.

(10)

Attention is All You Need

Figure 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.

(11)

Attention is All You Need

The attention module for queries $Q$, keys $K$ and values $V$ is defined as:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$

The queries, keys and values are computed from the current word representations $W$ using a linear transformation as

$$Q = V^Q \cdot W, \quad K = V^K \cdot W, \quad V = V^V \cdot W.$$
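As an illustration, a minimal NumPy sketch of the formula above (shapes and names are illustrative, not a reference implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.
    Q: [queries, d_k], K: [keys, d_k], V: [keys, d_v]; returns [queries, d_v]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # compatibility of every query with every key
    return softmax(scores, axis=-1) @ V   # weighted average of the values

rng = np.random.default_rng(42)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 16))
print(attention(Q, K, V).shape)           # (3, 16)
```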

(12)

Attention is All You Need

Multihead attention is used in practice. Instead of using one huge attention, we split queries, keys and values to several groups (similar to how ResNeXt works), compute the attention in each of the groups separately, and then concatenate the results.

Scaled Dot-Product Attention Multi-Head Attention

Figure 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
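A sketch of the head splitting described above, reusing `softmax`, `attention` and `rng` from the previous snippet; the projection matrices are random placeholders rather than trained parameters:

```python
def multi_head_attention(X, W_Q, W_K, W_V, W_O, heads):
    """Project X, split Q/K/V into `heads` groups, attend in each group
    separately, then concatenate the results and project once more."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                 # [words, d_model] each
    d_head = Q.shape[-1] // heads
    outputs = []
    for h in range(heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        outputs.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(outputs, axis=-1) @ W_O       # concatenate heads, project

d_model, words, heads = 64, 10, 8
W_Q, W_K, W_V, W_O = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(words, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, heads).shape)  # (10, 64)
```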

(13)

Attention is All You Need Positional Embeddings

We need to encode positional information (which was implicit in RNNs).

Learned embeddings for every position.

Sinusoids of different frequencies:

$$\mathit{PE}_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d}\right),$$
$$\mathit{PE}_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d}\right).$$

This choice of functions should allow the model to attend to relative positions, since for any fixed $k$, $\mathit{PE}_{pos+k}$ is a linear function of $\mathit{PE}_{pos}$.
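The sinusoidal embeddings above are straightforward to compute; a small NumPy sketch (argument names are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]            # [max_len, 1]
    i = np.arange(d // 2)[None, :]               # [1, d/2]
    angles = pos / 10000 ** (2 * i / d)          # [max_len, d/2]
    pe = np.empty((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(positional_encoding(20, 512).shape)        # the 20×512 matrix visualized below
```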

(14)

Attention is All You Need

Positional embeddings for 20 words of dimension 512, lighter colors representing values closer to 1 and darker colors representing values closer to -1.

http://jalammar.github.io/illustrated-transformer/

(15)

Attention is All You Need Regularization

The network is regularized by:

dropout of input embeddings,

dropout of each sub-layer output, just before it is added to the residual connection (and then normalized),

label smoothing.

Both the default dropout rate and the label smoothing weight are 0.1.
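A schematic sketch of the sub-layer wrapping described above, where `sublayer` stands for either the self-attention or the feed-forward part (an illustrative sketch; the layer-normalization parameters and the training/inference switching are simplified):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize every word representation to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def sublayer_connection(x, sublayer, p_drop=0.1, training=True):
    """Apply the sub-layer, drop out its output, add the residual, then normalize."""
    y = sublayer(x)
    if training:  # inverted dropout on the sub-layer output
        y = y * (np.random.random(y.shape) >= p_drop) / (1 - p_drop)
    return layer_norm(x + y)

x = np.random.normal(size=(10, 512))
print(sublayer_connection(x, sublayer=lambda h: 0.5 * h).shape)  # (10, 512)
```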

Parallel Execution

Training can be performed in parallel because of the masked attention – the softmax weights of the self-attention are zeroed out so that a word cannot attend to words later in the sequence.

However, inference is still sequential (and no substantial speed-ups of the sequential inference have been achieved).
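A sketch of the masking, reusing `softmax` and `rng` from the earlier snippet – scores of later positions are set to a large negative value, so their softmax weights become (numerically) zero:

```python
def masked_self_attention(Q, K, V):
    """Self-attention in which position i may only attend to positions j <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # [words, words]
    forbidden = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(forbidden, -1e9, scores)             # mask out later words
    return softmax(scores, axis=-1) @ V

X = rng.normal(size=(6, 8))
print(masked_self_attention(X, X, X).shape)                # (6, 8)
```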

(16)

Why Attention

Table 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.

(17)

Transformers Results

(18)

Transformers Results

|      | N | d_model | d_ff | h  | d_k | d_v | P_drop | ε_ls | train steps | PPL (dev) | BLEU (dev) | params ×10⁶ |
|------|---|---------|------|----|-----|-----|--------|------|-------------|-----------|------------|--------------|
| base | 6 | 512 | 2048 | 8  | 64  | 64  | 0.1 | 0.1 | 100K | 4.92 | 25.8 | 65 |
| (A)  |   |     |      | 1  | 512 | 512 |     |     |      | 5.29 | 24.9 |    |
|      |   |     |      | 4  | 128 | 128 |     |     |      | 5.00 | 25.5 |    |
|      |   |     |      | 16 | 32  | 32  |     |     |      | 4.91 | 25.8 |    |
|      |   |     |      | 32 | 16  | 16  |     |     |      | 5.01 | 25.4 |    |
| (B)  |   |     |      |    | 16  |     |     |     |      | 5.16 | 25.1 | 58 |
|      |   |     |      |    | 32  |     |     |     |      | 5.01 | 25.4 | 60 |
| (C)  | 2 |     |      |    |     |     |     |     |      | 6.11 | 23.7 | 36 |
|      | 4 |     |      |    |     |     |     |     |      | 5.19 | 25.3 | 50 |
|      | 8 |     |      |    |     |     |     |     |      | 4.88 | 25.5 | 80 |
|      |   | 256 |      |    | 32  | 32  |     |     |      | 5.75 | 24.5 | 28 |
|      |   | 1024 |     |    | 128 | 128 |     |     |      | 4.66 | 26.0 | 168 |
|      |   |     | 1024 |    |     |     |     |     |      | 5.12 | 25.4 | 53 |
|      |   |     | 4096 |    |     |     |     |     |      | 4.75 | 26.2 | 90 |
| (D)  |   |     |      |    |     |     | 0.0 |     |      | 5.77 | 24.6 |    |
|      |   |     |      |    |     |     | 0.2 |     |      | 4.95 | 25.5 |    |
|      |   |     |      |    |     |     |     | 0.0 |      | 4.67 | 25.3 |    |
|      |   |     |      |    |     |     |     | 0.2 |      | 5.47 | 25.7 |    |
| (E)  | positional embedding instead of sinusoids | | | | | | | | | 4.92 | 25.7 | |
| big  | 6 | 1024 | 4096 | 16 |    |     | 0.3 |     | 300K | 4.33 | 26.4 | 213 |

Table 4 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.

(19)

Neural Turing Machines

So far, all input information was stored either directly in network weights, or in a state of a recurrent network.

However, mammal brains seem to operate with a working memory – a capacity for short-term storage of information and its rule-based manipulation.

We can therefore try to introduce an external memory to a neural network. The memory will be a matrix $M$, where rows correspond to memory cells.

(20)

Neural Turing Machines

The network will control the memory using a controller, which reads from the memory and writes to it. Although the original paper also considered a feed-forward (non-recurrent) controller, usually the controller is a recurrent LSTM network.

Figure 1 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

(21)

Neural Turing Machine Reading

To read the memory in a differentiable way, the controller at time $t$ emits a read distribution $w_t$ over memory locations, and the returned read vector $r_t$ is then

$$r_t = \sum_i w_t(i) \cdot M_t(i).$$

Writing

Writing is performed in two steps – an erase followed by an add. The controller at time $t$ emits a write distribution $w_t$ over memory locations, and also an erase vector $e_t$ and an add vector $a_t$. The memory is then updated as

$$M_t(i) = M_{t-1}(i)\bigl(1 - w_t(i)\, e_t\bigr) + w_t(i)\, a_t.$$
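A NumPy sketch of the differentiable read and the erase-then-add write (an illustrative sketch; `M` is the [cells × cell_size] memory matrix and `w` an addressing distribution over the cells):

```python
import numpy as np

def ntm_read(M, w):
    """r = sum_i w(i) * M(i): a convex combination of memory rows."""
    return w @ M

def ntm_write(M, w, e, a):
    """Erase-then-add update: M(i) <- M(i) * (1 - w(i) e) + w(i) a."""
    M = M * (1 - np.outer(w, e))     # erase
    return M + np.outer(w, a)        # add

cells, size = 8, 4
M = np.zeros((cells, size))
w = np.full(cells, 1 / cells)        # uniform addressing, just for the example
M = ntm_write(M, w, e=np.ones(size), a=np.arange(size, dtype=float))
print(ntm_read(M, w))                # [0.    0.125 0.25  0.375]
```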

(22)

Neural Turing Machine

The addressing mechanism is designed to allow both content addressing and location addressing.

Figure 2 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

(23)

Neural Turing Machine Content Addressing

Content addressing starts by the controller emitting a key vector $k_t$, which is compared to all memory locations $M_t(i)$, generating a distribution using a softmax with temperature $\beta_t$:

$$w_t^c(i) = \frac{\exp\bigl(\beta_t \cdot \operatorname{distance}(k_t, M_t(i))\bigr)}{\sum_j \exp\bigl(\beta_t \cdot \operatorname{distance}(k_t, M_t(j))\bigr)}.$$

The distance measure is usually the cosine similarity

$$\operatorname{distance}(a, b) = \frac{a \cdot b}{||a|| \cdot ||b||}.$$
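A sketch of the content addressing above (cosine similarity followed by a temperature softmax); the small epsilon is only there to avoid division by zero:

```python
import numpy as np

def content_addressing(M, k, beta, eps=1e-8):
    """w_c(i) is proportional to exp(beta * cosine_similarity(k, M(i)))."""
    sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
    e = np.exp(beta * (sim - sim.max()))     # stable softmax with temperature beta
    return e / e.sum()

M = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(content_addressing(M, k=np.array([1.0, 0.1]), beta=5.0))  # most weight on row 0
```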

(24)

Neural Turing Machine

Location-Based Addressing

To allow iterative access to memory, the controller might decide to reuse the memory location from the previous timestep. Specifically, the controller emits an interpolation gate $g_t$ and defines

$$w_t^g = g_t w_t^c + (1 - g_t)\, w_{t-1}.$$

Then, the current weighting may be shifted, i.e., the controller might decide to “rotate” the weights by a small integer. For a given range (the simplest case are only shifts $\{-1, 0, 1\}$), the network emits a softmax distribution $s_t$ over the shifts, and the weights are then defined using a circular convolution

$$\tilde{w}_t(i) = \sum_j w_t^g(j)\, s_t(i - j).$$

Finally, not to lose precision over time, the controller emits a sharpening factor $\gamma_t$ and the final memory location weights are

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}.$$
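A sketch of the whole location-based pipeline above – interpolation, circular shift and sharpening (the shift range {-1, 0, +1} and all input values are illustrative):

```python
import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    """Interpolate with the previous weighting, shift by circular convolution
    with the distribution s over shifts {-1, 0, +1}, then sharpen."""
    w_g = g * w_c + (1 - g) * w_prev
    w_shifted = sum(prob * np.roll(w_g, shift) for prob, shift in zip(s, (-1, 0, 1)))
    w_sharp = w_shifted ** gamma
    return w_sharp / w_sharp.sum()

w_c = np.array([0.1, 0.7, 0.1, 0.1])      # content weighting from the previous slide
w_prev = np.array([0.7, 0.1, 0.1, 0.1])   # weighting from the previous time step
s = np.array([0.0, 0.0, 1.0])             # all shift probability on "rotate by +1"
print(location_addressing(w_c, w_prev, g=0.9, s=s, gamma=2.0))
```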

(25)

Neural Turing Machine Overall Execution

Even if not specified in the original paper, following the DNC paper, the controller can be implemented as a (potentially deep) LSTM. Assuming $R$ read heads and one write head, the input is $x_t$ together with the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step, the output of the controller are vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r[r_t^1, \ldots, r_t^R]$. The $\xi_t$ is a concatenation of

$$k_t^1, \beta_t^1, g_t^1, s_t^1, \gamma_t^1,\; k_t^2, \beta_t^2, g_t^2, s_t^2, \gamma_t^2,\; \ldots,\; k_t^w, \beta_t^w, g_t^w, s_t^w, \gamma_t^w, e_t^w, a_t^w.$$
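A sketch of how the flat vector ξ_t could be split into per-head parameters for a single (write) head; the ordering, sizes and squashing functions below are common choices and an assumption here, not necessarily those of the original implementation:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def split_write_head_params(xi, cell_size, shifts=3):
    """Split xi into (k, beta, g, s, gamma, e, a) and squash each part
    into its valid range."""
    sizes = [cell_size, 1, 1, shifts, 1, cell_size, cell_size]
    k, beta, g, s, gamma, e, a = np.split(xi, np.cumsum(sizes)[:-1])
    return (k,                                  # key: unconstrained
            np.log1p(np.exp(beta)),             # beta >= 0 (softplus)
            sigmoid(g),                         # interpolation gate in (0, 1)
            np.exp(s) / np.exp(s).sum(),        # shift distribution (softmax)
            1 + np.log1p(np.exp(gamma)),        # gamma >= 1
            sigmoid(e),                         # erase vector in (0, 1)
            a)                                  # add vector: unconstrained

xi = np.random.normal(size=4 + 1 + 1 + 3 + 1 + 4 + 4)    # cell_size = 4
print([p.shape for p in split_write_head_params(xi, cell_size=4)])
```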

(26)

Neural Turing Machines Copy Task

Repeat the same sequence as given on input. Trained with sequences of length up to 20.

Figure 3 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

(27)

Neural Turing Machines

Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20. The first four sequences are reproduced with

(28)

Neural Turing Machines

Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences of up to length 20 almost perfectly. However it clearly fails to generalise to longer sequences.

Also note that the length of the accurate prefix decreases as the sequence length increases, suggesting that the network has trouble retaining information for long periods.

Figure 5 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

(29)

Neural Turing Machines

(30)

Neural Turing Machines Associative Recall

In associative recall, a sequence is given on input, consisting of subsequences of length 3. Then a randomly chosen subsequence is presented on input and the goal is to produce the following subsequence.

(31)

Neural Turing Machines

(32)

Neural Turing Machines

Figure 12 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.

(33)

Differentiable Neural Computer

NTM was later extended to a Differentiable Neural Computer.

Figure 1 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101 (panels: a Controller, b Read and write heads, c Memory, d Memory usage and temporal links).

(34)

Differentiable Neural Computer

The DNC contains multiple read heads and one write head.

The controller is a deep LSTM network, with input at time $t$ being the current input $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step. The output of the controller are vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r[r_t^1, \ldots, r_t^R]$. The $\xi_t$ is a concatenation of parameters for read and write heads (keys, gates, sharpening parameters, …).

In DNC, usage of every memory location is tracked, which allows us to define an allocation weighting. Furthermore, for every memory location we track which memory location was written to previously ($b_t$) and subsequently ($f_t$).

Then, the write weighting is defined as a weighted combination of the allocation weighting and the write content weighting, and the read weighting is computed as a weighted combination of the read content weighting, the previous write weighting, and the subsequent write weighting.
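As a much-simplified illustration of the gating between allocation and content lookup for the write head (the full DNC computes the allocation weighting from usage and employs further gates; the names and values below are illustrative):

```python
import numpy as np

def dnc_write_weighting(allocation_w, content_w, allocation_gate, write_gate):
    """Interpolate between writing to freshly allocated locations and to
    locations found by content lookup, scaled by an overall write gate."""
    return write_gate * (allocation_gate * allocation_w
                         + (1 - allocation_gate) * content_w)

allocation_w = np.array([0.8, 0.1, 0.1])   # prefers unused locations
content_w = np.array([0.1, 0.1, 0.8])      # prefers locations matching the write key
print(dnc_write_weighting(allocation_w, content_w, allocation_gate=0.3, write_gate=1.0))
```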

(35)

Differentiable Neural Computer

Figure 2 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101 (panels: a Random graph, b London Underground, c Family tree; showing example traversal, shortest-path and inference questions together with their answers).

(36)

Differentiable Neural Computer

Panels: a Read and write weightings, b Read mode (backward / content / forward), c London Underground map, d Read key, e Location content; decoded memory locations are shown over time steps 0–30.

Figure 3 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.

(37)

Memory-augmented Neural Networks

Figure 1 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.

(38)

Memory-augmented NNs

Page 3 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.

Page 4 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.

(39)

Memory-augmented NNs
