(1)


(2)


From the Jungle to a Park:

Harmonizing Annotations across Languages

Daniel Zeman

Charles University in Prague

(3)


From the Jungle to a Park:

Harmonizing Annotations across Languages

Daniel Zeman

Based on joint work with many great people, including

Philip Resnik, Alexandr Rosen, Zdeněk Žabokrtský, Martin Popel, Loganathan Ramasamy, David Mareček, Rudolf Rosa, Jan Štěpánek, Jan Hajič, Joakim Nivre, Chris Manning, Ryan McDonald,

Slav Petrov, Filip Ginter, Sampo Pyysalo, Reut Tsarfaty, Yoav Goldberg, Natalia Silveira, Tim Dozat, and many others…

The research has been supported by the grant GA15-10472S.

(4)


Too Few or Too Many?

In 2000, dependency trees were quite rare.

"Thorn Tree Sossusvlei Namib Desert Namibia Luca Galuzzi 2004a" by Luca Galuzzi (Lucag) - Photo by (Luca Galuzzi) * http://www.galuzzi.it. Licensed under CC BY-SA 2.5 via Wikimedia Commons

(5)


Too Few or Too Many?

CoNLL 2006: dependency treebanks for 13 languages.

What about the remaining 6,987 languages?

(6)


Too Few or Too Many?

Min. 83 treebanks for 51 languages

An impenetrable jungle of annotation styles!

Still, there are about 6,949 languages out in the desert…

"PuertoMaldonado LagoSandoval3" by Xauxa. Licensed under CC BY 2.5 via Wikimedia Commons

(7)


Outline

• Cross-language learning (historical motivation)

• Normalization: morphology

• Normalization: dependencies

• Cross-language learning (current work)

(8)


Cross-Language Parser Adaptation

• 2006 with Philip Resnik (University of Maryland)

Delexicalized parsing

(9)


Cross-Language Parser Adaptation

• 2006 with Philip Resnik (University of Maryland)

Delexicalized parsing

• Has since gained in popularity:

McDonald, Petrov & Hall (EMNLP 2011)

Oskar Täckström (dissertation 2013)

Loganathan Ramasamy (dissertation 2014)

Rosa & Žabokrtský (IWPT 2015)

(10)


Parser Adaptation

• Idea:

Related languages L1 and L2

• L1 treebank and morphology

• L2 morphology

Train parser on L1 morphological features

Apply the parser to L2

• We took:

L1 = Danish [da]

L2 = Swedish [sv]

(11)


Danish – Swedish Setup

• CoNLL 2006 treebanks (dependencies)

Danish Dependency Treebank

Swedish Talbanken05

• Two constituency parsers:

“Charniak”

“Brown” (Charniak N-best parser + Johnson reranker)

• Other resources

JRC-Acquis parallel corpus

Hajič tagger for Swedish (PAROLE tagset)

(12)


Most Frequent da/sv Words

    Danish             Swedish
    i      0.024       och    0.027
    og     0.024       att    0.027
    at     0.021       i      0.021
    er     0.017       är     0.018
    en     0.014       som    0.017
    til    0.013       en     0.015
    af     0.013       det    0.013
    det    0.012       av     0.012
    —      0.012       —      0.011

(13)


JRC-Acquis Aligned Example

[da] Enhver kontraherende part kan opsige denne konvention ved skriftlig henvendelse til depositaren.

[sv] En fördragsslutande part får säga upp denna konvention genom skriftlig notifikation till depositarien.

(14)


Treebank Preparation

[Figure: the dependency tree bought → (John, bike) and its conversion to lexicalized constituency: nested (S(bought) (NP John) (VP(bought) (V bought) (NP bike))) vs. flat (S(bought) (NP John) (V bought) (NP bike)).]

(15)


Treebank Preparation

[Figure: the same conversion, highlighting that we choose the flattest possible structure: (S(bought) (NP John) (V bought) (NP bike)).]
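In code, this conversion is a small recursive function. A minimal sketch (hypothetical, not the original experiment script), assuming tokens annotated with Penn-style tags and 1-based head indices:

    use strict;
    use warnings;

    # One toy sentence; each token: [form, Penn tag, index of its head (0 = root)].
    my @tokens = (['John', 'NNP', 2], ['bought', 'VBD', 0], ['bike', 'NN', 2]);

    # Group dependents under their heads (token indices are 1-based).
    my %children;
    push @{ $children{ $tokens[$_][2] } }, $_ + 1 for 0 .. $#tokens;

    # Flattest possible structure: one bracket per head, with the head and all
    # its dependents as immediate children, in word order.  (The real scripts
    # derived the nonterminal label from the head's POS tag; "S" stands in here.)
    sub bracket {
        my ($i) = @_;
        my ($form, $tag) = @{ $tokens[$i - 1] }[0, 1];
        my @parts = sort { $a <=> $b } ($i, @{ $children{$i} // [] });
        return "($tag $form)" if @parts == 1;
        return "(S($form) "
             . join(' ', map { $_ == $i ? "($tag $form)" : bracket($_) } @parts)
             . ')';
    }

    print bracket($children{0}[0]), "\n";
    # prints: (S(bought) (NNP John) (VBD bought) (NN bike))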

(16)


Treebank Preparation

• DA / SV tagset converted to the Penn Treebank tags

• Nonterminal labels:

derived from POS tags

then translated to the Penn set of nonterminals

• Make the parser believe it is working with the Penn Treebank

• (Although it could have been configured to use other sets of labels.)

(17)


Treebank Normalization

Danish:

• DET governs ADJ, ADJ governs NOUN
• NUM governs NOUN
• GEN governs NOM (Ruslands vej “Russia’s way”)
• Coordination: last member on conjunction, everything else on first member

Swedish:

• NOUN governs both DET and ADJ
• NOUN governs NUM
• NOM governs GEN (års inkomster “year’s income”)
• Coordination: member on previous member, commas and conjunctions on next member

(18)


Treebank Normalization

• A few heuristics

• Transform Danish to the Swedish tree style

• The concrete annotation style does not matter!

This was only for testing.

Hypothesis: if there were no Swedish treebank, then any style would be as good as any other.

(19)


Parsing Danish Treebank

• CoNLL test: 322 sentences, 5,852 words

• CoNLL training: 5,190 sentences, 94,386 words

4,900 sentences: my training set; 290 sentences: my devtest set

• The following are results on the CoNLL test set

(20)


Parsing Swedish Treebank

• CoNLL test: 389 sentences, 5,656 words

• CoNLL training: 11,042 sentences, 191,467 words

10,700 sentences: my training set; 342 sentences: my devtest set

• The following are results on the CoNLL test set

(21)


Parsing Swedish with Danish Parser

• Trained on Danish training data

• Parse Swedish test data

• No morphology tweaking so far!

Most words are UNKNOWN

• The following are results on the CoNLL test set

(22)


Delexicalized Parsing

• What if we feed the parser with tags instead of words?

Ændringer i listen i bilaget offentliggøres og meddeles på samme måde.

NNS IN NN IN NN VB CC VB IN DT NN

NNS IN NN MD VB CC VB IN DT NN

Förändringar i förteckningen skall offentliggöras och meddelas på samma sätt.

(23)


Delexicalized Parsing

• What if we feed the parser with tags instead of words?

Ændringer i listen i bilaget offentliggøres og meddeles på samme måde.

((NNS (IN NN (IN NN))) ((VB CC VB) (IN (DT NN))))

((NNS (IN NN)) ((MD (VB CC VB)) (IN (DT NN))))

Förändringar i förteckningen skall offentliggöras och meddelas på samma sätt.

(24)


Delexicalized Parsing

• Trained on Danish training data (tags only)

• Parse Swedish test data (tags from Hajič tagger)

• Restuff Swedish trees with original words

• All data in hybrid Swedish-Danish Hajič-like tagset

• (“words” = sv/da tags, “tags” = Penn tags)
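In CoNLL-X terms, delexicalization is just a column swap. A minimal sketch (hypothetical script and file names), assuming standard CoNLL-X columns with FORM second, LEMMA third, and CPOSTAG fourth:

    use strict;
    use warnings;

    # Delexicalize a CoNLL-X file: replace each word form (and lemma)
    # with its coarse POS tag, so the parser sees only tags as "words".
    while (my $line = <>) {
        chomp $line;
        if ($line =~ /\t/) {
            my @f = split /\t/, $line;
            $f[1] = $f[3];    # FORM  := CPOSTAG
            $f[2] = $f[3];    # LEMMA := CPOSTAG
            print join("\t", @f), "\n";
        } else {
            print "$line\n";  # sentence-separating blank line
        }
    }
    # usage: perl delex.pl train.da.conll > train.da.delex.conll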

(25)


Glosses

• JRC-Acquis is a parallel corpus

more than 430,000 sentences

• Giza++ & lexical weighting generate a da-sv glossary

• Always use highest-weighted gloss

• Translate Swedish word-by-word to Danish

• Many words are no longer unknown!
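Glossing then reduces to a dictionary lookup. A minimal sketch, assuming a hypothetical three-column lexicon file (Swedish word, Danish gloss, weight) of the kind produced by Giza++ lexical weighting; the file name is invented:

    use strict;
    use warnings;

    # Build the sv->da glossary: keep only the highest-weighted gloss per word.
    my (%gloss, %best);
    open my $lex, '<', 'lex.sv-da' or die $!;   # hypothetical file name
    while (<$lex>) {
        my ($sv, $da, $w) = split;
        if (!exists $best{$sv} || $w > $best{$sv}) {
            ($gloss{$sv}, $best{$sv}) = ($da, $w);
        }
    }
    close $lex;

    # Translate Swedish input word by word; unknown words pass through.
    while (my $line = <>) {
        chomp $line;
        print join(' ', map { $gloss{$_} // $_ } split ' ', $line), "\n";
    }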

(26)


Excerpt from sv-da Glossary

    Swedish                       Danish gloss
    behandlingsaktörer            behandlingsvirksomheder
    behandlingsanläggning         behandlingsanlæg
    behandlingsanläggningar       behandlingsvirksomheders
    behandlingsanläggningen       behandlingsanlægget
    behandlingsdatum              datøn
    behandlingsformer             behandlingsmuligheder
    behandlingsfrister            frister
    behandlingsförfaranden        behandlingsprocedurer
    behandlingsförsök             befolkningsforsøg
    behandlingsindikation         indikation
    behäftad                      behæftet

(27)


Glossed Parsing

• Trained on Danish training data

• Translate Swedish test data to Danish

• Parse it using Danish-trained model

• Restuff trees with Swedish and evaluate

• The following are results on the CoNLL test set

(28)


How big a Swedish treebank would produce the same results?

[Learning curve annotation: 66.40 UAS (delex) corresponds to a Swedish treebank of ~1,546 training sentences.]

(29)


Outline

• Cross-language learning (historical motivation)

• Normalization: morphology

• Normalization: dependencies

• Cross-language learning (current work)

(30)


Tagset Mapping: Interset

• Already mentioned: da/sv → Penn

• We want to preserve features that

are present in both [da] and [sv]

are not present in Penn

• This is CRUCIAL:

unmapped tags are unknown words again

• Mapping tagsets is hard even within a single language

• Languages can be similar, yet annotation approaches differ wildly!

(31)


Tagset Discrepancy Examples

• No determiners in [da], pronouns instead

• Subject/object pronoun forms in [sv] (cf. [en] he/him), nominative vs. “unmarked” case in [da]

• Masculine gender in [sv] (pronouns)

• Numerals are adjectives in [da]

• Supine in [sv] – probably the only difference truly caused by the language

(32)


Interset

[Diagram: source tag —(decode, via the source tagset driver)→ universal set of features —(encode, via the target tagset driver)→ target tag (nearly). The tagset drivers are reusable!]

(33)


Limitations

• Universal features (the “interlingua”)

should be linguistically adequate

built bottom-up: new features/values added when needed

“marginal” phenomena may be ignored?

• Tagset conversion

motivated technically rather than linguistically

• (why would a linguist use a Swedish tagset for Danish?)

we may lose information (if target tagset cannot encode it)

we do not add information (Interset is not a tagger!)

(34)


Interset: Current State

[Figure: the full Interset inventory — 60 features with 349 values. Examples: pos (noun, adj, num, verb, adv, adp, conj, part, int, punc, sym, …); subtype features such as nountype, prontype, numtype, verbtype, advtype, adpostype, conjtype, parttype, punctype; inflectional features such as gender, animateness, number, case (nom, gen, dat, acc, voc, loc, ins, abl, par, ess, …), definiteness, degree, person, politeness, possgender, possnumber, verbform, mood, tense, aspect, voice; and bookkeeping features such as abbr, hyph, foreign, style, typo, variant. The special features tagset and other preserve source-specific information with no universal counterpart, e.g. cs::pdt { obscure_feature_1 => … }.]

(35)


Disjunctive Values

• Tag says that gender is masc or neut.

• Interset stores list of alternative values.

• We cannot represent alternative combinations of values, for example:

either feminine singular,

or neuter plural,

but not feminine plural or neuter singular
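Conceptually (a sketch of the data structure only, not the exact Interset API), a disjunction is a list of values of a single feature, which is exactly why alternative combinations across features cannot be captured:

    use strict;
    use warnings;

    # One feature with a list of alternative values: "masculine or neuter".
    my %fs = (pos => 'noun', gender => ['masc', 'neut'], number => 'sing');

    # But "feminine singular OR neuter plural" would have to be stored as
    #   gender => ['fem', 'neut'], number => ['sing', 'plur'],
    # which wrongly licenses feminine plural and neuter singular too.
    print join('|', @{ $fs{gender} }), "\n";   # masc|neut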

(36)


Does It Fit in Target Tagset?

• We fill only representable features

• The rest will be lost

• WARNING: it may be “representable” but still alien!

Swedish knows: pos = noun & gender = com | neut

And also: prontype = prs & gender = masc | fem | com | neut

Czech input: pos = noun & gender = masc

Keep the “alien” combination in Swedish?

(37)


Alien Tags in Target Tagset

• What is the goal of the conversion?

Corpus query etc. => keep alien tags

Blackbox tool => avoid data that it does not expect

• Atomic tagsets (Penn): no choice

• Structured tags (features encoded separately):

impossible combinations can be represented

• How do we avoid them?
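One answer, illustrated by the cs → sv walk-through on the following slides: let the encoder consider only combinations the target tagset licenses and pick the closest one. A conceptual sketch with a hypothetical permitted-combinations list (the real driver walks a tree of permitted values feature by feature):

    use strict;
    use warnings;

    # Hypothetical feature combinations permitted by the target (Swedish) tagset.
    my @permitted = (
        { pos => 'noun', gender => 'com',  number => 'sing', case => 'nom' },
        { pos => 'noun', gender => 'com',  number => 'plur', case => 'nom' },
        { pos => 'noun', gender => 'neut', number => 'sing', case => 'nom' },
        # ... and so on
    );

    # Czech input NNMS1---A---: masculine gender, which Swedish nouns lack.
    my %src = (pos => 'noun', gender => 'masc', number => 'sing', case => 'nom');

    # Keep the first permitted combination that agrees with the source
    # on the largest number of features.
    my ($best, $bestscore) = (undef, -1);
    for my $cand (@permitted) {
        my $score = grep { ($cand->{$_} // '') eq $src{$_} } keys %src;
        ($best, $bestscore) = ($cand, $score) if $score > $bestscore;
    }

    print join(', ', map { "$_=$best->{$_}" } sort keys %$best), "\n";
    # prints: case=nom, gender=com, number=sing, pos=noun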

(38)


Example: cs → sv

[Figure: encoding the Czech tag NNMS1---A--- in the Swedish tagset, step 1 — the Swedish value tree offers pos = noun | punc | int | part | conj | prep | adv | verb | num | adj (with prontype = prs | int | ind under pronouns); the source pos = noun is selected.]

(39)


Example: cs → sv

[Figure: step 2 — under noun, Swedish permits definiteness = ind | def; the Czech tag carries no definiteness (only negativeness = pos), so a permitted value must be chosen.]

(40)


Example: cs → sv

[Figure: step 3 — gender: Swedish nouns permit gender = com | neut, but the Czech input has gender = masc, animateness = anim; masculine is not available under Swedish nouns.]

(41)


Example: cs → sv

[Figure: step 4 — number: each permitted gender branch continues with number = sing | plur; the Czech input has number = sing.]

(42)


Example: cs → sv

[Figure: step 5 — case: the permitted branches end with case = nom | gen; the Czech input has case = nom. The encoder follows only permitted paths through the tree, so the impossible combination noun + masculine never reaches the Swedish tag.]

(43)


Gender & Animacy & Definite

[Bar chart: number of distinct values of Gender, Animacy, and Definiteness per language (pl, hr, cs, ru, sk, sl, bg, ta, el, en, sv, de, grc, hi, la, eu, ar, ca, da, es, it, nl, pt, he, fa, hu); scale 0 to 4.5.]

(44)


Case & Number

[Bar chart: number of distinct values of Case and Number per language (hu, et, eu, fi, ru, tr, cs, hi, hr, la, pl, sk, ta, sl, grc, el, bg, de, nl, ar, he, ca, da, es, fa, pt, sv, en, it); scale 0 to 25.]

(45)


VerbForm & Mood & Voice

[Bar chart: number of distinct values of VerbForm, Mood, and Voice per language (et, la, da, pl, ca, es, it, pt, cs, hi, sk, sl, sv, ta, en, ru, tr, grc, fi, de, ja, nl, eu, bg, hu, el, he, hr, ar, fa); scale 0 to 6.]

(46)


Person & Tense & Aspect

[Bar chart: number of distinct values of Person, Tense, and Aspect per language (grc, la, tr, pt, ca, es, it, bg, cs, hi, pl, ru, sk, et, sl, ta, de, en, nl, da, el, fi, he, hr, hu, sv, eu, ar, ja, fa); scale 0 to 7.]

(47)


Lingua::Interset

• Interset is a Perl library, available from CPAN:

cpanm Lingua::Interset

• Currently covers 60 tagsets of 37 languages

• Conversion between any two tagsets:

simple Perl script (a few lines of code)
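For example (following the module’s documented synopsis; en::penn and cs::pdt are Interset driver names):

    use strict;
    use warnings;
    use Lingua::Interset qw(decode encode);

    # Decode a Penn Treebank tag into the universal feature structure,
    # then encode the same features in the Czech PDT tagset.
    my $fs  = decode('en::penn', 'NNS');
    my $tag = encode('cs::pdt', $fs);
    print "$tag\n";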

(48)


Universal Features

• October 2014: Universal Dependencies guidelines

• Universal POS tags

originally 12 Google tags, extended to 17 UPOS tags

• Universal Features

from Interset (subset), only cosmetic changes

17 features (lexical and inflectional), 103 values so far

• Approximate conversion tables from Interset tagsets to UPOS + UFeatures are available

http://universaldependencies.github.io/docs/u/feat/index.html

(49)


Outline

• Cross-language learning (historical motivation)

• Normalization: morphology

• Normalization: dependencies

• Cross-language learning (current work)

(50)


HamleDT

= HArmonized Multi-LanguagE Dependency Treebank

(51)


HamleDT 1.0

• 2011: first version available, 29 treebanks

• ~ one third freely redistributable

• ~ one third easily obtainable + transformation by us

• ~ one third hard to get

• Morphology: Interset features → Prague-style tags

• Syntax: Prague-style trees and labels

(52)


(Google) Universal Treebanks

• Version 1, 2013, 6 languages

• Version 2, 2014, 11 languages

• Stanford dependencies

• Google universal POS tags

Another common standard?

(53)


(Google) Universal Treebanks

• Version 1, 2013, 6 languages

• Version 2, 2014, 11 languages

• Stanford dependencies

• Google universal POS tags

“The nice thing about standards is that you have so many to choose from.”

(Andrew S. Tanenbaum)

(54)


HamleDT 2.0

• May 2014: version 2.0 available, 30 treebanks

• ~ one third freely redistributable

• ~ one third easily obtainable + transformation by us

• ~ one third hard to get

• Morphology: Interset features → Google universal POS tags

• Syntax: added Universal Stanford Dependencies

Stanford and Prague were the two most widely used standards

(55)


Universal Dependencies

• Joint effort by a growing crowd of people

• Universal POS tags

• Universal Features (from Interset)

• Dependency relations (modified Stanford)

• Language-specific extensions

or even treebank-specific

(56)


Universal Dependencies

• The guidelines 1.0 in October 2014

• UD 1.0: 10 treebanks in January 2015

• UD 1.1: 19 treebanks in May 2015

All freely redistributable!

Some of them currently lack morphology (lemmas, features)

• Next release in November 2015

• Conversions of old data

• Newly annotated data (hu, hr, …?)

(57)


Coming Soon: HamleDT 3.0

• Superset of UD 1.1 (18 languages, 19 treebanks)

• Adds 18 more languages (automatically converted using older HamleDT transformations)

• Total: 36 languages, over 40 treebanks in the UD style

http://ufal.mff.cuni.cz/hamledt

(58)


Universal Dependencies

Don’t annotate the same thing different ways!

(59)


Universal Dependencies

Don’t annotate the same thing different ways!

Don’t make different things look the same!

(60)


Universal Dependencies

Don’t annotate the same thing different ways!

Don’t make different things look the same!

Don’t annotate things that are not there!

(61)


Structural Variations

• Pre/postpositions

• Subordinate clauses

• Verb groups

• Coordination

• Apposition

We try to automatically identify these constructions and transform them to the common style.

(62)


Structural Variations

• Pre/postpositions

• Subordinate clauses

• Verb groups

• Coordination

• Apposition

We try to automatically identify these constructions and transform them to the common style. Content words are heads whenever possible!

(63)


Prepositions

(64)


Subordinate Clauses

(65)


Verb Groups

(66)


Coordination: Mel'čuk

(67)


Coordination: Prague

(68)


Coordination: [ro, zh]

(69)


Coordination: Stanford

(70)


Coordination: Tesnière

(71)


36 Languages (HamleDT 3.0)

Ancient Greek (grc), Arabic (ar), Basque (eu), Bengali (bn), Bulgarian (bg), Catalan (ca), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Latin (la), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Slovene (sl), Spanish (es), Swedish (sv), Tamil (ta), Telugu (te), Turkish (tr)

(72)


UD 1.1 (May 2015): 18

[The same list of 36 languages as on the previous slide, with the 18 languages already covered by UD 1.1 highlighted.]

(73)


Data Size

[Bar chart: tokens per treebank for all 36 languages (ar, bg, bn, ca, cs, da, de, el, en, es, et, eu, fa, fi, fr, ga, grc, he, hi, hr, hu, id, it, ja, la, nl, pl, pt, ro, ru, sk, sl, sv, ta, te, tr); scale 0 to 1,600,000.]

(74)


Sentence Length

[Bar chart: sentence length in tokens per treebank for the same 36 languages; scale 0 to 40.]

(75)


Nonprojective Dependencies

[Bar chart: percentage of nonprojective dependencies per treebank for the same 36 languages; scale 0 to 12 %.]

(76)


Morphological Richness

[Bar chart: morphological richness measured as distinct forms per lemma (tr, ru, cs, grc, sk, eu, fi, hr, pl, el, bg, sl, te, fa, ta, it, sv, es, ga, pt, hu, ca, da, nl, de, en, bn, hi, la); scale 0 to 4.]
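The measure itself is cheap to compute. A minimal sketch (hypothetical script), assuming CoNLL-X input with FORM and LEMMA in the second and third columns:

    use strict;
    use warnings;

    # Count distinct word forms and distinct lemmas in a CoNLL-X file;
    # their ratio is a crude measure of morphological richness.
    my (%forms, %lemmas);
    while (<>) {
        next unless /\t/;
        my (undef, $form, $lemma) = split /\t/;
        $forms{lc $form}++;
        $lemmas{lc $lemma}++;
    }
    printf "forms/lemmas = %.2f\n", keys(%forms) / keys(%lemmas);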

(77)


How Can You Get It?

• 36 languages in the UD style

• 28 directly downloadable

• patches and/or free software for the rest (if you have the data)

• Stay tuned to:

http://ufal.mff.cuni.cz/hamledt/

(HamleDT 3.0 should be available before the end of summer.)

(78)


Outline

• Cross-language learning (historical motivation)

• Normalization: morphology

• Normalization: dependencies

• Cross-language learning (current work)

(79)


Default Setup

• Malt Parser, stack-lazy algorithm

same configuration for all, no optimization

same selection of training features for all treebanks

• Trained on the first 1000 sentences only

• Tested on the whole test set

• Default score: UAS

• Only harmonized data used
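Each run then boils down to one MaltParser call per source treebank. A hypothetical wrapper (file and model names invented; the flags are standard MaltParser options):

    use strict;
    use warnings;

    # Train one model per source treebank, identical settings for all.
    for my $lang (qw(bg cs hr da sv en de)) {     # ... and the remaining languages
        system('java', '-jar', 'maltparser.jar',
               '-c', "model_$lang",               # configuration (model) name
               '-m', 'learn',                     # training mode
               '-i', "$lang-train-1000.conll",    # first 1000 sentences only
               '-a', 'stacklazy') == 0            # stack-lazy parsing algorithm
            or die "training failed for $lang";
    }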

(80)


Malt Trained on 1000 Sents.

[Bar chart: UAS of lexicalized (lex) and delexicalized (dlx, dlx1, dlx2) Malt models for all 36 languages (bg, cs, hr, pl, ru, sk, sl, da, de, en, nl, sv, ca, es, fr, it, pt, ro, el, ga, grc, la, bn, fa, hi, ta, te, eu, et, fi, hu, ja, tr, id, ar, he); scale 0 to 100.]

(81)


Who Helps Whom?

(Each line: target language with the UAS of its own model ⇦ the best source languages with the UAS their models reach on the target.)

• Czech (62.44) ⇦ Croatian (63.27), Slovene (62.87)

• Slovak (59.47) ⇦ Croatian (60.28), Slovene (59.32)

• Polish (77.92) ⇦ Croatian (66.42), Slovene (64.31)

• Russian (66.86) ⇦ Croatian (57.35), Slovak (55.01)

• Croatian (75.52) ⇦ Slovene (58.96), Polish (55.42)

• Slovene (76.17) ⇦ Croatian (62.92), Finnish (59.79)

• Bulgarian (78.44) ⇦ Croatian (74.39), Slovene (71.52)

(82)


Who Helps Whom?

• Catalan (75.28) ⇦ Italian (71.07), French (68.30)

• Italian (76.66) ⇦ French (70.37), Catalan (68.66)

• French (69.93) ⇦ Spanish (64.28), Italian (63.33)

• Spanish (67.76) ⇦ French (67.61), Catalan (64.54)

• Portuguese (69.89) ⇦ Italian (69.48), French (66.12)

• Romanian (79.74) ⇦ Croatian (67.01), Latin (66.75)

(83)


Who Helps Whom?

• Swedish (75.73) ⇦ Danish (66.17), English (65.41)

• Danish (75.19) ⇦ Swedish (59.23), Croatian (56.89)

• English (72.68) ⇦ German (57.95), French (56.70)

• German (67.04) ⇦ Croatian (58.68), Swedish (57.48)

• Dutch (60.76) ⇦ Hungarian (41.90), Finnish (37.89)

(84)


Who Helps Basque?

(85)


Who Helps Basque?

• Basque (73.36) ⇦

Hungarian (48.72)

Estonian (45.49)

Croatian (44.37)

• Basque ⇨

Hindi (54.89); best is Tamil (65.07)

Tamil (49.58); best is Hindi (55.11)

(86)


Morphological Features:

Do They Help?

• The SVM learner discriminates between useful and useless features.

BUT!

• What if the target data lack the “useful” features?

(What if they lack all features, e.g. UD 1.1 German, French, Spanish, Indonesian?)

(87)


Feature Ranking

1. All features
2. Lex features (PronType, NumType, Poss, Reflex)
3. No features (only the part-of-speech tag)
4. Lex + Person
5. Only Person
6. Only Case
7. Lex + Case
8. …

Which features will give us the best UAS?

(88)


Some Exceptions

• None: fr ← it, fi ← hu

• Lexical: hr ← bg, he ← it, et ← hu, da ← sv

• Lexical + Tense + Aspect + Mood + Voice: eu ← hu

• VerbForm + Person: ga ← he

(89)


So What’s Next?

• Ongoing work — preliminary results

• Unsupervised target POS + morphology

• Other settings of Malt parser, other parsers

• Combination of source languages;

prediction of the best source language(s)

cf. Rudolf Rosa’s talk from yesterday

(90)


Back at the Start

[Learning curve: UAS as a function of the number of training sentences (10 to 5,000); scale 0 to 100.]

(91)


Back at the Start

[The same learning curve with the delexicalized cross-language result marked: 66.17 UAS (delex) corresponds to roughly 75 training sentences.]

(92)


[Closing slide: “thank you” in the languages of the treebanks — thank you, děkujeme, شكرا, благодаря, তোমাকে ধন্যবাদ, gràcies, tak, danke, ευχαριστώ, gracias, aitäh, eskerrik asko, kiitos, शुक्रिया, köszönöm, þakka þér, grazie, ありがとう, gratias, dank, obrigado, mulţumesc, спасибо, hvala, tack, நன்றி, ధన్యవాదాలు, teşekkür ederim, 謝謝.]
