
Convolutional over Recurrent Encoder for Neural Machine Translation


(1)

Convolutional over Recurrent Encoder for Neural Machine Translation

Annual Conference of the European Association for Machine Translation

2017

(2)


Neural Machine Translation

End-to-end neural network with an RNN architecture where the output of one RNN (decoder) is conditioned on another RNN (encoder).

c is a fixed-length vector representation of the source sentence encoded by the RNN.

• Attention Mechanism:

(Bahdanau et al., 2015): compute the context vector as a weighted average of the annotations of the source hidden states.

Published as a conference paper at ICLR 2015

The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, …, y_{t'−1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

p(y) = ∏_{t=1}^{T} p(y_t | {y_1, …, y_{t−1}}, c),    (2)

where y = (y_1, …, y_{T_y}). With an RNN, each conditional probability is modeled as

p(y_t | {y_1, …, y_{t−1}}, c) = g(y_{t−1}, s_t, c),    (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of yt, and st is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).

3 LEARNING TO ALIGN AND TRANSLATE

In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searching through a source sentence during decoding a translation (Sec. 3.1).

3.1 DECODER: GENERAL DESCRIPTION

Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, …, x_T).

In a new model architecture, we define each conditional probability in Eq. (2) as:

p(y_i | y_1, …, y_{i−1}, x) = g(y_{i−1}, s_i, c_i),    (4)

where s_i is an RNN hidden state for time i, computed by

s_i = f(s_{i−1}, y_{i−1}, c_i).

It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i.

The context vector c_i depends on a sequence of annotations (h_1, …, h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.

The context vector c_i is then computed as a weighted sum of these annotations h_j:

c_i = Σ_{j=1}^{T_x} α_ij h_j.    (5)

The weight α_ij of each annotation h_j is computed by

α_ij = exp(e_ij) / Σ_{k=1}^{T_x} exp(e_ik),    (6)

where

e_ij = a(s_{i−1}, h_j)

is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i−1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.

We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation,


(3)

[Slide figure: attention over encoder outputs — decoder states S_j, context vectors C_j, target outputs y_j, with C'_j = Σ_i α_ji CN_i]

(4)

Why RNNs work for NMT:

Recurrently encode history for long, variable-length input sequences

Capture long-distance dependencies, which are common in natural language text

(5)

RNN for NMT:

Disadvantages:

Slow: does not allow parallel computation within a sequence

Non-uniform composition: the first word is reprocessed at every state while the last is processed only once

Dense representation: each h_i is a compact summary of the source sentence up to word i

Focus on global representation rather than local features

(6)

CNN in NLP :

Unlike RNNs, CNNs apply over a fixed-size window of the input

This allows for parallel computation

Represent sentence in terms of features:

a weighted combination of multiple words or n-grams

Very successful in learning sentence representations for various tasks

Sentiment analysis, question classification (Kim, 2014; Kalchbrenner et al., 2014)

(7)

Convolution over Recurrent encoder (CoveR):

Can CNNs help NMT?

Instead of single recurrent outputs, we can use a composition of multiple hidden state outputs of the encoder

Convolution over recurrent: we apply multiple layers of fixed-size convolution filters over the output of the RNN encoder at each time step

Can provide wider context about the relevant features of the source sentence
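The idea above, sliding a fixed-width filter over the RNN encoder's hidden states so that each output composes a window of neighbouring states, might be sketched as follows. This is a hypothetical numpy illustration with toy shapes, not the authors' implementation.

```python
import numpy as np

def conv_over_states(H, W, b, width=3):
    """Slide a width-3 filter over RNN hidden states h_1..h_n.

    H : RNN encoder outputs, shape (n, d)
    W : filter weights, shape (width * d, d_out)
    b : bias, shape (d_out,)
    Returns CN_1..CN_n, shape (n, d_out): one feature vector per time
    step, each summarizing a window of neighbouring hidden states.
    """
    n, d = H.shape
    pad = (width - 1) // 2
    # Zero-pad both ends so the output keeps the input length.
    Hp = np.vstack([np.zeros((pad, d)), H, np.zeros((pad, d))])
    out = []
    for i in range(n):
        window = Hp[i:i + width].reshape(-1)   # concatenated window
        out.append(np.tanh(window @ W + b))    # nonlinearity
    return np.array(out)

H = np.random.randn(5, 8)                      # 5 time steps, dim 8
CN = conv_over_states(H, np.random.randn(24, 8), np.zeros(8))
assert CN.shape == H.shape                     # length preserved
```

Note that unlike the recurrent pass, every window here is independent of the others, which is what allows the parallel computation the slides mention.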

(8)

CoveR model

[Slide figure: CoveR model — RNN encoder states h_i pass through zero-padded CNN layers producing CN_i; attention weights α_ji form context vectors C'_j = Σ_i α_ji CN_i for the decoder states S_j and outputs y_j]

(9)

Convolution over Recurrent encoder:

Each vector CN_i now represents a feature produced by multiple kernels over h_i

Relatively uniform composition of multiple previous states and the current state

Simultaneous, hence faster, processing at the convolutional layers

PBML ??? MAY 2017

[Figure 1. NMT encoder-decoder framework — encoder states h_i, decoder states S_j, context vectors C_j, target outputs y_j]

[Figure 2. Convolution over Recurrent model — RNN encoder states h_i feed zero-padded CNN layers producing CN_i; attention weights α_ji form C'_j = Σ_i α_ji CN_i for the decoder]

In order to do this, we apply multiple layers of fixed-size convolution filters over the output of the RNN encoder at each time step. As shown in Figure 2, for our model the input to the first convolution layer is the hidden state output of the RNN encoder. Thus CN^1_i is defined as:

CN^1_i = σ(θ · h_{i−[(w−1)/2] : i+[(w−1)/2]} + b)    (8)

At each layer, we apply a number of filters equal to the original input sentence length. Each filter is of width 3. Note that the length of the output of the convolution filters reduces depending on the input length and the kernel width. In order to retain the original sequence length of the source sentence we apply padding at each layer. That is, for each convolutional layer, the input is zero-padded so that the output

(10)

Related work:

Gehring et al. 2017:

Completely replace the RNN encoder with a CNN

Simple replacement doesn't work; position embeddings are required to model dependencies

Requires 6-15 convolutional layers to compete with a 2-layer RNN

Meng et al. 2015:

For phrase-based MT, use a CNN language model as an additional feature

(11)

Experimental setting:

Data :

WMT-2015 En-De training data : 4.2M sentence pairs

Dev : WMT2013 test set

Test : WMT2014,WMT2015 test sets

Baseline :

Two layer unidirectional LSTM encoder

Embedding size, hidden size = 1000

(12)

Experimental setting:

CoveR :

Encoder : 3 convolutional layers over RNN output

Decoder : same as baseline

Convolutional filters of size : 3

Output dimension : 1000

Zero padding on both sides at each layer, no pooling

Residual connections (He et al., 2015) between intermediate layers
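A sketch of that layout, three zero-padded convolutional layers over the RNN outputs with a residual connection at each layer, using toy dimensions (the function names and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def conv_layer(X, W, b, width=3):
    """One zero-padded width-3 convolution over a (n, d) sequence."""
    n, d = X.shape
    pad = (width - 1) // 2
    Xp = np.vstack([np.zeros((pad, d)), X, np.zeros((pad, d))])
    return np.array([np.tanh(Xp[i:i + width].reshape(-1) @ W + b)
                     for i in range(n)])

def cover_encoder(H, params, width=3):
    """Apply stacked conv layers over RNN outputs H, with a residual
    connection (He et al., 2015) adding each layer's input back in."""
    X = H
    for W, b in params:
        X = conv_layer(X, W, b, width) + X   # residual: output + input
    return X

d = 8                                        # toy dimension (paper: 1000)
params = [(0.1 * np.random.randn(3 * d, d), np.zeros(d))
          for _ in range(3)]                 # 3 convolutional layers
H = np.random.randn(6, d)
out = cover_encoder(H, params)
assert out.shape == H.shape                  # padding keeps length; no pooling
```

The residual addition requires each layer's output dimension to equal its input dimension, consistent with the slides' choice of output dimension 1000 matching the hidden size.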

(13)

Experimental setting:

Deep RNN encoder :

Comparing the 2-layer RNN encoder baseline to CoveR is unfair:

• Improvement may be just due to the increased number of parameters

We compare with a deep RNN encoder with 5 layers

The 2 decoder layers are initialized through a non-linear transformation of the final encoder states

(14)

BLEU scores (* = significant at p < 0.05)

Model               Dev    wmt14   wmt15
Baseline            17.9   15.8    18.5
Deep RNN encoder    18.3   16.2    18.7
CoveR               18.5   16.9*   19.0*

Result

Compared to baseline: +1.1 BLEU for WMT-14 and +0.5 for WMT-15

Compared to deep RNN encoder: +0.7 for WMT-14 and +0.3 for WMT-15

(15)

#parameters and decoding speed

Model               #parameters (millions)   avg sec/sent
Baseline            174                      0.11
Deep RNN encoder    283                      0.28
CoveR               183                      0.14

Result

CoveR model:

Slightly slower than the baseline but faster than the deep RNN encoder

Slightly more parameters than the baseline but fewer than the deep RNN encoder

(16)


Qualitative analysis :

Increased output length

With additional context, the CoveR model generates complete translations

P. Dakwale, C.Monz Convolutional over Recurrent Encoder for NMT (1–12)

Example 1:

Source: as the reverend martin luther king jr. said fifty years ago
Reference: wie pastor martin luther king jr. vor fünfzig jahren sagte :
Baseline: wie der martin luther king jr. sagte
CoveR: wie der martin luther king jr. sagte vor fünfzig jahren :

Example 2:

Source: he said the itinerary is still being worked out .
Reference: er sagte , das genaue reiseroute werde noch ausgearbeitet .
Baseline: er sagte , dass die strecke noch <unk> ist .
CoveR: er sagte , die reiseroute wird noch ausgearbeitet .

Table 2. Translation examples. Words in bold show correct translations produced by our model as compared to the baseline.

the final states of all the five layers of the encoder resulting in a vector of size 5xD ("D" is the dimension of the hidden layer) and then downgrading it to size 2xD by a simple non-linear transformation and finally splitting it in two vectors of size "D" which are used to initialize each of the layers of the decoder.

6. Results

Table 1 shows the results for our English-German translation experiments. The first column indicates the best BLEU scores on the development set newstest'13 for all three models after 20 epochs. Results are reported on the newstest'14 and newstest'15 test sets. Our CoveR model shows improvements of 1.1 and 0.5 BLEU points respectively over the two test sets. Although the deep RNN encoder performs better than the baseline, the improvements achieved are lower than that of the CoveR model.

6.1. Qualitative analysis and discussion

Table 2 provides some of the translation examples produced by the baseline system and our CoveR model. A general observation is the improved translations by our model over the baseline with regard to the reference translation, which is also reflected by the improved BLEU scores. More specifically, Example 1 shows instances where the baseline suffers in some cases from incomplete coverage of the source sentence. One reason for such incomplete translations is the lack of coverage modeling, which has been handled using coverage embeddings (Tu et al., 2016). We observe this problem frequently in instances where a specific word might signal completion of a sentence despite more words in the sequence remaining to be translated. These words can cause the generation of the next target word as the end-of-sentence "EOS" symbol. Since the beam search decoding algorithm considers a hypothesis complete when the end of sentence is generated, in such instances search stops, aborting further ex-

Average output sentence length

Model       Avg sent length
Baseline    18.7
Deep RNN    19.0
CoveR       19.9
Reference   20.9

(17)


Qualitative analysis:

More uniform attention distribution

Generation of the correct composite word


(18)


Qualitative analysis :

More uniform attention distribution

[Slide figures: attention heatmaps for Baseline and CoveR]


pansions, while ignoring the remaining words. For instance in Example 1 in Table 2, by relying on the attention mechanism, the baseline system generates the translation of "said" as "sagte"; the model might give preference to the generation of an end-of-sentence "EOS" symbol immediately following the verb. On the other hand, for our CoveR model, at target position 8, wider context is available to the model through convolutional layers from both directions signalling the presence of other words remaining in the input sentence, thus producing a more complete translation. Another

[Figure 3. Attention distribution for Baseline — source: "he said the itinerary is still being worked out ."; output: "er sagte , dass die strecke noch "unk" ist"]

[Figure 4. Attention distribution for CoveR model — source: "he said the itinerary is still being worked out ."; output: "er sagte , dass die reiseroute noch immer ausgearbeitet wird"]

difference between the baseline model and our CoveR model that can be observed in Example 2 is that attention weights are distributed more uniformly among the source words. Specifically, for target position 6, as shown in Figure 3 the baseline model pays


Baseline translates 'itinerary' to 'strecke' (road, distance)

Pays attention only to 'itinerary' at this position

CoveR translates 'itinerary' to 'reiseroute'

Also pays attention to the final verb

(19)

Conclusion :

CoveR : multiple convolutional layers over RNN encoder

Significant improvements over standard LSTM baseline

Increasing LSTM layers improves slightly, but convolutional layers perform better

Faster, and with fewer parameters, than a fully recurrent encoder of the same size

CoveR model can improve coverage and provide more complete translations

(20)

Questions ?

Thanks
