Introduction The system Results

Robust Multilingual Statistical Morphology Generation Models

Ondřej Dušek and Filip Jurčíček

Institute of Formal and Applied Linguistics Charles University in Prague

August 6, 2013

Introduction

Morphology in NLG

Last step of the whole NLG pipeline

Usually does not get a lot of attention, but is necessary

What we do (Flect)

[Figure: the Natural Language Generation pipeline, Semantics → Syntax → Morphology → Text. "We solve this" marks the Morphology step, in these languages: CS, EN, DE, ES, CA, JA.]

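The need for morphology in generation

English – not so much: hard-coded solutions often work well enough

Languages with more inflection (e.g. Czech): even for the simplest things

Toto se líbí uživateli Jana Nováková. (This is liked by user (name).)

The template word uživateli is [masc] [dat], but the inserted name is [fem] [nom]; the name needs the dative endings -ě, -é: Janě Novákové.

Děkujeme, Jan Novák, vaše hlasování bylo vytvořeno. (Thank you, (name), your poll has been created.)

The inserted name is [nom], but direct address needs the vocative endings -e, -u: Jane Nováku.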

The task at hand

word + NNS → words

Wort + NN Neut,Pl,Dat → Wörtern

be + VBZ → is

ser + V gen=c,num=s,person=3,mood=indicative,tense=present → es

Input: Lemma (base form) or stem + morphological properties (POS, case, gender, etc.)

Output: Inflected word form

Inverse to POS tagging

Possible solutions

Dictionary?

Works well, but has limited size

Not many large-coverage openly available ones

Hand-written rules?

Work well, but are hard to maintain

Machine learning!

Obtain the rules automatically

Plenty of treebanks of sufficient size available

Only work known to us: Bohnet et al. 2010

Casting inflection patterns as multi-class classification

Our inflection rules: edit scripts

A kind of diffs: how to modify the lemma to get the form

Based on Levenshtein distance

fly → flies: >1-ies ([at the end] [delete one letter] [and add these])

sparen → gespart: >2-t, <ge ([at the end] delete two letters and add this; [at the beginning] [add this])

Mutter → Mütter: 5:1-ü ([5 letters from the end] [delete one letter] [and add this])

be → is: *is ([replace the whole word])
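To make the notation concrete, here is a minimal sketch of how such edit scripts could be applied to a lemma. The parsing rules are inferred only from the annotated examples on this slide. The function name `apply_edit_script` is hypothetical and this is not Flect's actual code. Positions inside the word are assumed to be counted from the end of the original lemma.

```python
import re


def apply_edit_script(lemma: str, script: str) -> str:
    """Apply an edit script such as '>2-t, <ge' to a lemma (illustrative sketch)."""
    parts = [p.strip() for p in script.split(",") if p.strip()]
    prefix = ""
    edits = []  # (start index in the lemma, letters to delete, text to add)
    for part in parts:
        if part.startswith("*"):   # *is: replace the whole word
            return part[1:]
        if part.startswith("<"):   # <ge: add at the beginning
            prefix = part[1:]
            continue
        if part.startswith(">"):   # >1-ies or >ern: change at the end
            m = re.fullmatch(r">(?:(\d+)-)?(.*)", part)
            n_del, added = int(m.group(1) or 0), m.group(2)
            edits.append((len(lemma) - n_del, n_del, added))
            continue
        # 5:1-ü: N letters from the end, delete D letters, add text
        m = re.fullmatch(r"(\d+):(\d+)-(.*)", part)
        if m is None:
            raise ValueError(f"unrecognized edit: {part!r}")
        from_end, n_del, added = int(m.group(1)), int(m.group(2)), m.group(3)
        edits.append((len(lemma) - from_end, n_del, added))
    form = lemma
    # Apply right-to-left so that indices further left stay valid.
    for start, n_del, added in sorted(edits, reverse=True):
        form = form[:start] + added + form[start + n_del:]
    return prefix + form


for lemma, script in [("fly", ">1-ies"), ("sparen", ">2-t, <ge"),
                      ("Mutter", "5:1-ü"), ("be", "*is")]:
    print(lemma, "->", apply_edit_script(lemma, script))
```

Because every distinct script string is just a label, predicting the right script for a lemma and its tags becomes an ordinary multi-class classification problem.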

Features useful for morphology generation

Same POS + same ending = (often) same inflection

sky, fly + NNS → -ies

bind, find + VBD → -ound

Suffixes = good features to generalize to unseen inputs

Machine learning should be able to deal with counter-examples

Capitalization: no influence on morphology
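As an illustration of these points, suffix features could be extracted roughly as follows. This is a sketch under assumptions: the function name `inflection_features`, the feature names, and the maximum suffix length of 3 are all hypothetical choices. Lowercasing is one simple way to act on the observation that capitalization does not influence morphology.

```python
def inflection_features(lemma: str, pos: str, max_suffix_len: int = 3) -> dict:
    """Build a feature dictionary from the lemma, its POS and its suffixes."""
    lemma = lemma.lower()  # assumption: ignore capitalization entirely
    features = {"lemma": lemma, "pos": pos}
    # Suffixes of length 1..max_suffix_len, never longer than the lemma itself.
    for n in range(1, min(max_suffix_len, len(lemma)) + 1):
        features[f"suffix{n}"] = lemma[-n:]
    return features


print(inflection_features("sky", "NNS"))
print(inflection_features("fly", "NNS"))
```

Here sky and fly share the POS and the one-letter suffix feature, which is what lets a classifier carry the -ies pattern over to a lemma it has not seen.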

Our system Flect: Overall procedure

Input: Wort, NN, Neut, Pl, Dat

1. Get features from lemma, POS, suffixes (+ morph. properties & their combinations, possibly context)

Wort, NN, Neut, Pl, Dat + suffixes ort, rt, t

2. Predict edit scripts using logistic regression

>ern, 3:1-ö

3. Use them as rules to obtain form from lemma

Wort + (>ern, 3:1-ö) → Wörtern
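Steps 1 and 2 can be sketched with Scikit-Learn, which the slides name as the basis of the system. Everything else here is an assumption for illustration: the six training items are invented toy data, the `features` helper and its feature names are hypothetical, and the real Flect feature set and training setup are not shown in these slides.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def features(lemma, pos):
    """Step 1: POS plus suffixes of up to 3 letters (hypothetical feature set)."""
    lemma = lemma.lower()
    feats = {"pos": pos}
    for n in range(1, min(3, len(lemma)) + 1):
        feats[f"suffix{n}"] = lemma[-n:]
    return feats


# Invented toy data: (lemma, POS, edit script that produces the inflected form).
train = [
    ("sky", "NNS", ">1-ies"),
    ("fly", "NNS", ">1-ies"),
    ("dog", "NNS", ">s"),
    ("cat", "NNS", ">s"),
    ("bind", "VBD", "3:1-ou"),
    ("find", "VBD", "3:1-ou"),
]
X = [features(lemma, pos) for lemma, pos, _ in train]
y = [script for _, _, script in train]

# Step 2: each distinct edit script is one class of a logistic regression model.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

predicted = model.predict([features("spy", "NNS"), features("wind", "VBD")])
print(list(predicted))
```

Step 3 then applies the predicted script to the lemma. The second test item shows the over-generalization risk: as a verb, wind does take this pattern (wound), but the model would equally predict it for any VBD lemma ending in -ind.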

Testing Flect on 6 languages

CoNLL 2009 data: varying morphology richness & tagsets

[Figure: Flect accuracy (%) per language (CS, EN, JA, CA, ES, DE), Total vs. Unseen forms, axis from 90 to 100; English, Czech and German are labelled in the plot.]

Works well even on unseen forms: suffixes help

over-generalization errors, e.g. torpedo+VBN=torpedone

German: syntax-sensitive morphology

Flect vs. a dictionary from the same data

English: Dictionary gets OK relatively soon

[Figure (EN): accuracy (%) against training data part (%) from 0.1 to 100, for Dictionary (Total), Dictionary (Unknown forms), Flect (Total) and Flect (Unknown forms); annotated with 58% and 76% error reduction.]

Czech: Dictionary fails on unknown forms, our system works

[Figure (CS): same axes and series; annotated with 92% error reduction.]
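The error reduction figures are presumably relative reductions in error rate, the usual reading of the term, although the slides do not spell out the formula. Under that assumption the computation is as below. The function name is hypothetical and the accuracies in the example are invented, not taken from the plots.

```python
def relative_error_reduction(baseline_accuracy: float, system_accuracy: float) -> float:
    """Relative error reduction in percent, given two accuracies in percent."""
    baseline_error = 100.0 - baseline_accuracy
    system_error = 100.0 - system_accuracy
    return 100.0 * (baseline_error - system_error) / baseline_error


# Invented numbers: a baseline at 90% and a system at 95% halve the error.
print(relative_error_reduction(90.0, 95.0))
```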

Conclusions

General observations:

Inflection rules/patterns can be learned from a corpus

Suffix features are useful to inflect unseen words

Detailed morphological features and context features help

Our system Flect:

improves on a dictionary learnt from the same data

gains more in morphologically rich languages (Czech)

can be combined with a dictionary as a back-off for OOVs
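The last point, combining a dictionary with a statistical model that handles out-of-vocabulary items, might look like the following sketch. The `inflect` function, the dictionary layout, and the stand-in predictor are hypothetical; a trained model would take the place of `toy_predictor`.

```python
from typing import Callable, Dict, Tuple


def inflect(lemma: str, tag: str,
            dictionary: Dict[Tuple[str, str], str],
            predict_form: Callable[[str, str], str]) -> str:
    """Look the form up in the dictionary; back off to the model for OOVs."""
    key = (lemma.lower(), tag)
    if key in dictionary:
        return dictionary[key]
    return predict_form(lemma, tag)


# Toy dictionary and a stand-in for a trained model (both invented).
toy_dictionary = {("be", "VBZ"): "is", ("fly", "NNS"): "flies"}


def toy_predictor(lemma: str, tag: str) -> str:
    return lemma + "s"


print(inflect("be", "VBZ", toy_dictionary, toy_predictor))   # found in the dictionary
print(inflect("dog", "NNS", toy_dictionary, toy_predictor))  # falls back to the model
```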

Thank you for your attention

You may download Flect (and these slides) at:

http://ufal.mff.cuni.cz/~odusek/flect/

http://bit.ly/flect

The system is based on Python and Scikit-Learn.

You may contact us:

Ondřej Dušek & Filip Jurčíček, Charles University in Prague, odusek@ufal.mff.cuni.cz
