Convolutional over Recurrent Encoder for Neural Machine Translation
Annual Conference of the European Association for Machine Translation
2017
Neural Machine Translation
• End-to-end neural network with an RNN architecture, where the output of one RNN (the decoder) is conditioned on another RNN (the encoder).
• c is a fixed-length vector representation of the source sentence encoded by the RNN.
• Attention mechanism:
• (Bahdanau et al., 2015): compute the context vector as a weighted average of the annotations of the source hidden states.
Excerpt from Bahdanau et al. (2015), published as a conference paper at ICLR 2015:
The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, ..., y_{t'-1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c),   (2)

where y = (y_1, ..., y_{T_y}). With an RNN, each conditional probability is modeled as

p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c),   (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).
3 LEARNING TO ALIGN AND TRANSLATE

In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searching through a source sentence during decoding a translation (Sec. 3.1).
3.1 DECODER: GENERAL DESCRIPTION
Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).
In a new model architecture, we define each conditional probability in Eq. (2) as:

p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i),   (4)

where s_i is an RNN hidden state for time i, computed by

s_i = f(s_{i-1}, y_{i-1}, c_i).

It should be noted that unlike the existing encoder-decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i.

The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.

The context vector c_i is then computed as a weighted sum of these annotations h_i:

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j.   (5)

The weight \alpha_{ij} of each annotation h_j is computed by

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},   (6)

where

e_{ij} = a(s_{i-1}, h_j)

is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, the alignment is not considered to be a latent variable.
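To make Eqs. (4)-(6) concrete, the following is a minimal sketch of the attention computation in PyTorch; the language choice and the layer sizes (dec_dim, enc_dim, attn_dim) are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Feedforward alignment model a(s_{i-1}, h_j) and context vector c_i (Eqs. 4-6).
    def __init__(self, dec_dim=1000, enc_dim=1000, attn_dim=1000):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim)   # projects the previous decoder state s_{i-1}
        self.W_h = nn.Linear(enc_dim, attn_dim)   # projects the encoder annotations h_j
        self.v = nn.Linear(attn_dim, 1)           # scores e_ij = v^T tanh(W_s s + W_h h)

    def forward(self, s_prev, annotations):
        # s_prev: (batch, dec_dim); annotations: (batch, T_x, enc_dim)
        scores = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(annotations)))
        alpha = torch.softmax(scores, dim=1)           # alpha_ij, Eq. (6)
        context = (alpha * annotations).sum(dim=1)     # c_i = sum_j alpha_ij h_j, Eq. (5)
        return context, alpha.squeeze(-1)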
Why RNN works for NMT?
✦ Recurrently encodes history for variable-length input sequences
✦ Captures the long-distance dependencies that occur frequently in natural language text
RNN for NMT:
✤ Disadvantages:
✤ Slow: does not allow parallel computation within a sequence
✤ Non-uniform composition: for each state, the first word is over-processed and the last one processed only once
✤ Dense representation: each h_i is a compact summary of the source sentence up to word i
✤ Focus on a global representation rather than on local features
CNN in NLP:
✤ Unlike RNNs, CNNs apply over a fixed-size window of the input
✤ This allows for parallel computation
✤ Represent a sentence in terms of features:
✤ a weighted combination of multiple words or n-grams
✤ Very successful in learning sentence representations for various tasks
✤ Sentiment analysis, question classification (Kim, 2014; Kalchbrenner et al., 2014)
Convolution over Recurrent encoder (CoveR):
✤ Can CNNs help for NMT?
✤ Instead of single recurrent outputs, we can use a composition of multiple hidden-state outputs of the encoder
✤ Convolution over recurrent:
✤ We apply multiple layers of fixed-size convolution filters over the output of the RNN encoder at each time step
✤ This can provide wider context about the relevant features of the source sentence
CoveR model
[Figure: CoveR model — RNN encoder hidden states h_i are passed through CNN layers producing CN_i, which the decoder attends over via C'_j = Σ_i α_{ji} CN_i]
Convolution over Recurrent encoder:
✤ Each vector CN_i now represents a feature produced by multiple kernels over h_i
✤ Relatively uniform composition of multiple previous states and the current state
✤ Simultaneous, and hence faster, processing at the convolutional layers
Figure 1. NMT encoder-decoder framework
Figure 2. Convolution over Recurrent model
In order to do this, we apply multiple layers of fixed-size convolution filters over the output of the RNN encoder at each time step. As shown in Figure 2, for our model the input to the first convolution layer is the hidden state output of the RNN encoder. Thus CN^1_i is defined as:

CN^1_i = σ(θ · h_{i−[(w−1)/2] : i+[(w−1)/2]} + b)   (8)

At each layer, we apply a number of filters equal to the original input sentence length. Each filter is of width 3. Note that the length of the output of the convolution filters reduces depending on the input length and the kernel width. In order to retain the original sequence length of the source sentence we apply padding at each layer. That is, for each convolutional layer, the input is zero-padded so that the output has the same length as the original input sequence.
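As an illustration of Eq. (8), the sketch below applies a single width-3 convolution with zero padding over the RNN hidden states so that the output keeps the source sentence length; the dimensions and the batch layout are assumptions made for the example.

import torch
import torch.nn as nn

hidden_dim, w = 1000, 3                       # filter width w = 3
# One convolutional layer over the encoder outputs, Eq. (8):
# CN^1_i = sigma(theta . h_{i-(w-1)/2 : i+(w-1)/2} + b)
conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=w, padding=(w - 1) // 2)

h = torch.randn(1, 12, hidden_dim)            # (batch, T_x, dim): RNN encoder hidden states
cn1 = torch.sigmoid(conv(h.transpose(1, 2)))  # Conv1d expects (batch, dim, T_x)
cn1 = cn1.transpose(1, 2)                     # back to (batch, T_x, dim); length T_x preserved
print(cn1.shape)                              # torch.Size([1, 12, 1000])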
Related work:
✤ Gehring et al. (2017):
✤ Completely replace the RNN encoder with a CNN
✤ A simple replacement doesn't work; position embeddings are required to model dependencies
✤ Requires 6-15 convolutional layers to compete with a 2-layer RNN
✤ Meng et al. (2015):
✤ For phrase-based MT, use a CNN language model as an additional feature
Experimental setting:
✤ Data:
✦ WMT 2015 En-De training data: 4.2M sentence pairs
✦ Dev: WMT 2013 test set
✦ Test: WMT 2014 and WMT 2015 test sets
✤ Baseline:
✦ Two-layer unidirectional LSTM encoder
✦ Embedding size, hidden size = 1000
Experimental setting:
✤ CoveR:
✦ Encoder: 3 convolutional layers over the RNN output
✦ Decoder: same as baseline
✦ Convolutional filters of size 3
✦ Output dimension: 1000
✦ Zero padding on both sides at each layer, no pooling
✦ Residual connections (He et al., 2015) between each intermediate layer (see the sketch below)
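A minimal sketch of this encoder configuration in PyTorch; the exact placement of the residual connections and non-linearities, as well as the vocabulary size, are assumptions made for illustration based on the description above.

import torch
import torch.nn as nn

class CoveREncoder(nn.Module):
    # Sketch: 2-layer LSTM encoder followed by 3 residual convolutional layers (CoveR).
    def __init__(self, vocab_size=50000, dim=1000, n_conv=3, width=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=width, padding=(width - 1) // 2)
            for _ in range(n_conv))

    def forward(self, src):                    # src: (batch, T_x) token ids
        h, _ = self.rnn(self.embed(src))       # (batch, T_x, dim) RNN hidden states
        x = h.transpose(1, 2)                  # (batch, dim, T_x) for Conv1d
        for conv in self.convs:
            x = torch.sigmoid(conv(x)) + x     # zero-padded convolution + residual connection
        return x.transpose(1, 2)               # CN_i vectors that the decoder attends over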
Experimental setting:
✤ Deep RNN encoder:
✦ Comparing the 2-layer RNN encoder baseline to CoveR is unfair
• The improvement may be due simply to the increased number of parameters
✦ We compare with a deep RNN encoder with 5 layers
✦ The 2 decoder layers are initialized through a non-linear transformation of the concatenated final encoder states
BLEU scores (* = significant at p < 0.05)

Model              Dev    WMT14   WMT15
Baseline           17.9   15.8    18.5
Deep RNN encoder   18.3   16.2    18.7
CoveR              18.5   16.9*   19.0*
Result
✤ Compared to the baseline:
✦ +1.1 BLEU for WMT14 and +0.5 for WMT15
✤ Compared to the deep RNN encoder:
✦ +0.7 BLEU for WMT14 and +0.3 for WMT15
#parameters and decoding speed

Model              #parameters (millions)   avg sec/sent
Baseline           174                      0.11
Deep RNN encoder   283                      0.28
CoveR              183                      0.14
Result
✤ CoveR model:
✤ Slightly slower than the baseline but faster than the deep RNN encoder
✤ Slightly more parameters than the baseline but fewer than the deep RNN encoder
Qualitative analysis:
✤ Increased output length
✤ With additional context, the CoveR model generates complete translations
Example 1:
Source: as the reverend martin luther king jr. said fifty years ago:
Reference: wie pastor martin luther king jr. vor fünfzig jahren sagte:
Baseline: wie der martin luther king jr. sagte
CoveR: wie der martin luther king jr. sagte vor fünfzig jahren:

Example 2:
Source: he said the itinerary is still being worked out .
Reference: er sagte , das genaue reiseroute werde noch ausgearbeitet .
Baseline: er sagte , dass die strecke noch <unk> ist .
CoveR: er sagte , die reiseroute wird noch ausgearbeitet .

Table 2. Translation examples. Words in bold show correct translations produced by our model as compared to the baseline.

... the final states of all the five layers of the encoder, resulting in a vector of size 5xD ("D" is the dimension of the hidden layer), and then downgrading it to size 2xD by a simple non-linear transformation and finally splitting it into two vectors of size "D" which are used to initialize each of the layers of the decoder.
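A small sketch of the decoder initialization described above; the choice of tanh as the non-linear transformation and the batch size are assumptions made for illustration.

import torch
import torch.nn as nn

D = 1000                                      # hidden dimension of each layer
bridge = nn.Linear(5 * D, 2 * D)              # non-linear bridge from 5xD down to 2xD

# final_states: the final hidden state of each of the 5 encoder layers, each (batch, D)
final_states = [torch.randn(8, D) for _ in range(5)]
init = torch.tanh(bridge(torch.cat(final_states, dim=-1)))  # (batch, 2*D)
dec_init_1, dec_init_2 = init.split(D, dim=-1)              # one D-vector per decoder layer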
6. Results

Table 1 shows the results for our English-German translation experiments. The first column indicates the best BLEU scores on the development set newstest'13 for all three models after 20 epochs. Results are reported on the newstest'14 and newstest'15 test sets. Our CoveR model shows improvements of 1.1 and 0.5 BLEU points respectively over the two test sets. Although the deep RNN encoder performs better than the baseline, the improvements achieved are lower than that of the CoveR model.

6.1. Qualitative analysis and discussion

Table 2 provides some of the translation examples produced by the baseline system and our CoveR model. A general observation is the improved translations by our model over the baseline with regard to the reference translation, which is also reflected by the improved BLEU scores. More specifically, Example 1 shows instances where the baseline suffers in some cases from incomplete coverage of the source sentence. One reason for such incomplete translations is the lack of coverage modeling, which has been handled using coverage embeddings (Tu et al., 2016). We observe this problem frequently in instances where a specific word might signal completion of a sentence despite more words in the sequence remaining to be translated. These words can cause the generation of the next target word as the end-of-sentence "EOS" symbol. Since the beam search decoding algorithm considers a hypothesis complete when the end of sentence is generated, in such instances search stops, aborting further expansions while ignoring the remaining words.
Average output sentence length

Model       Avg. sentence length
Baseline    18.7
Deep RNN    19.0
CoveR       19.9
Reference   20.9
Qualitative analysis:
✤ More uniform attention distribution
✤ Generation of correct composite words
Qualitative analysis:
✤ More uniform attention distribution
[Attention heatmaps compared side by side: Baseline vs. CoveR]
For instance, in Example 1 in Table 2, by relying on the attention mechanism, the baseline system generates the translation of "said" as "sagte"; the model might give preference to the generation of an end-of-sentence "EOS" symbol immediately following the verb. On the other hand, for our CoveR model, at target position 8, wider context is available to the model through the convolutional layers from both directions, signalling the presence of other words remaining in the input sentence, and thus producing a more complete translation.

Figure 3. Attention distribution for Baseline (source: "he said the itinerary is still being worked out .")
Figure 4. Attention distribution for CoveR model (source: "he said the itinerary is still being worked out .")

Another difference between the baseline model and our CoveR model that can be observed in Example 2 is that attention weights are distributed more uniformly among the source words. Specifically, for target position 6, as shown in Figure 3, the baseline model pays attention only to 'itinerary'.
✤ Baseline translates 'itinerary' to 'strecke' (road, distance)
✤ It pays attention only to 'itinerary' for this position
✤ CoveR translates 'itinerary' to 'reiseroute'
✤ It also pays attention to the final verb
Conclusion:
✤ CoveR: multiple convolutional layers over an RNN encoder
✤ Significant improvements over a standard LSTM baseline
✤ Increasing the number of LSTM layers improves results slightly, but convolutional layers perform better
✤ Faster and fewer parameters than a fully RNN encoder of the same size