(1)

TLT 15, Bloomington, Jan 2017

Extracting Verbal Multiword Data from Rich Treebank Annotation

Eduard Bejček, Jan Hajič, Pavel Straňák and Zdeňka Urešová Charles University in Prague,

Faculty of Mathematics and Physics, ÚFAL

{bejcek,hajic,stranak,uresova}@ufal.mff.cuni.cz

(2)

Introduction

Parseme Shared Task (PST)

within a European project on MWEs and parsing

competition between MWE identification systems

part of the MWE Workshop at EACL 2017 in Valencia

still open for participation

blind test data were released yesterday; system submissions are due in a week

data for 18 languages (usually thousands of MWEs each)

manual annotation of all verbal MWEs in text

(3)

21st January, 2017 TLT 15, Bloomington, Indiana

Bejček, Hajič, Straňák, and Urešová:

Extracting Verbal Multiword Data from Rich Treebank Annotation 3/30

(4)

Motivation

18 languages from 18 countries

manual annotation according to the PST Annotation Guidelines is needed for 17 languages

Czech already has an MWE-annotated corpus, but it was annotated long before the PST Annotation Guidelines existed

Let's rather try to transform the annotation!

= compare the guidelines and extract VMWEs

(5)

Overview of the talk

Types of verbal MWEs in PST

MWEs in Prague Dependency Treebank

Principles for good practice in annotation

VMWEs extraction itself:

extraction of each type

(extraction of deverbative variants)

(resolving of overlapping annotation)

Results and conclusion

(6)

Types of VMWEs in PST [3] (1)

Light verb construction (LVC)

to make a decision, to come into bloom

Idiom (ID)

to stand firm, to come into play, to make it, to know on which side the bread is buttered

Inherently reflexive verb (IReflV)

FR: se suicider, s'apercevoir (“realize”, not “see”)

(7)

Types of VMWEs in PST [3] (2)

Verb-particle construction (VPC)

to put off, to blow up, to do in

Language-specific categories

Other verbal MWEs (OTH)

to drink and drive, to short-circuit

no VPC and LSpec categories in Czech

deverbatives

decision making, decision which he made, decision previously made

(8)

PDT and MWEs

Prague Dependency Treebank (PDT) [4]

several types of MWEs were annotated in 2006 as part of the valency annotation [6] in PDT

light verb constructions

idioms and phrases (not only verbal)

reflexive verbs (PDT-Vallex)

all MWEs annotated in 2010, project Lexemann [5]

nominal, verbal, adverbial etc.

also multiword named entities

some of them correspond to PST categories,

but they are annotated in several diverse ways

(9)

Conversion

Prague Dependency Treebank 3.0 → PARSEME Shared Task

Nevidomý se dostane do styku s rehabilitačními pracovníky, když utrpí zranění.

Blind <REFL> gets into contact with rehabilitation workers, when sustains injury.

A blind man gets in touch with physiatrists when he sustains an injury.

PDT t-layer nodes (t-lemma, functor): nevidomý ACT, dostat_se PRED, #QCor ACT, styk CPHR, pracovník PAT, rehabilitační RSTR, #PersPron ACT, utrpět TWHEN, zranění PAT

PARSEME Shared Task token labels shown in the figure: nsp, nsp, 1:IReflV;2:LVC, 1;2, 2, 2, 3:ID, 3

Legend of VMWE types: LVC, ID, OTH, IReflV

(10)

Good practice for treebanks

annotation of MWEs in treebanks, Parseme

LREC'16 paper [1] resulting from TLT'15 paper [2]

Principle A: to annotate MWEs as such

Principle B: to mark MWEs in a distinctive and specific way

Principle C: to annotate even discontinuous MWEs and MWEs of varying forms

Principle D: to allow for searching MWEs by their type

And what about PDT?

(11)

Extraction – LVC

...corresponds to a CPHR functor

1. Input text

Zákon tak vstoupil v platnost.

Law so came into force.

By that the law has come into force.

2. PDT t-layer

zákon ACT, tak MANN, vstoupit PRED, #QCor ACT, platnost CPHR

3. PDT a-layer

Zákon Sb, tak Adv, vstoupil Pred, v AuxP, platnost Obj

4. Output annotation

<MWE category="LVC">: vstoupil v platnost (has come into force)

Condition: CPHR node

Range: noun in the CPHR node + its governing verb node + a preposition, if one occurs with the noun
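The condition and range above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TNode` is a hypothetical toy structure, not the PDT data format, and it assumes each t-layer node already carries the indices of the surface words linked to it, with the preposition linked to the noun's node.

```python
from dataclasses import dataclass, field


@dataclass
class TNode:
    """Toy stand-in for a PDT t-layer node (hypothetical, not the PML schema)."""
    t_lemma: str
    functor: str
    parent: "TNode | None" = None
    tokens: list = field(default_factory=list)  # indices of linked surface words


def extract_lvcs(nodes):
    """Return one sorted token-index list per CPHR node.

    Range: words of the CPHR node (noun, plus a preposition if present)
    together with the words of its governing verb node.
    """
    found = []
    for node in nodes:
        if node.functor == "CPHR" and node.parent is not None:
            found.append(sorted(set(node.tokens) | set(node.parent.tokens)))
    return found


words = ["Zákon", "tak", "vstoupil", "v", "platnost", "."]
verb = TNode("vstoupit", "PRED", tokens=[2])
tree = [
    TNode("zákon", "ACT", verb, [0]),
    TNode("tak", "MANN", verb, [1]),
    verb,
    TNode("platnost", "CPHR", verb, [3, 4]),  # assumption: "v" is linked to the noun node
]

for span in extract_lvcs(tree):
    print("LVC:", " ".join(words[i] for i in span))  # LVC: vstoupil v platnost
```

Returning token indices rather than strings keeps discontinuous expressions representable, which matters for Principle C.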

(12)

Extraction – ID (1)

...corresponds to a DPHR functor (or to a MW lexeme)

1. Input text

Odezva na sebe nedala čekat.

Reaction on itself not-gave wait.

The reaction didn't keep us waiting.

2. PDT t-layer

odezva ACT, #Neg RHEM, dát PRED (enunc), na_sebe_čekat DPHR

3. Output annotation

<MWE category="ID">: na sebe nedala čekat (didn't keep us waiting)

Condition: DPHR node

Range: all words mentioned in DPHR node + its governing verb node

(13)

Extraction – ID (2)

...corresponds to a DPHR functor (or to a MW lexeme)

Condition: DPHR node

Range: all words mentioned in DPHR node + its governing verb node

1. Input text

Nevěřícně kroutím hlavou nad legislativou.

Disbelievingly I-shake head over legislation.

I am shaking my head in disbelief over the legislation.

2. PDT t-layer

root, #PersPron ACT, kroutit PRED, nevěřícný MANN, hlava PAT, legislativa REG

Labels in the tree: mwe lexeme, verb

3. Output annotation

<MWE category="ID">: kroutím hlavou (shaking my head)

A second t-layer tree on the slide: ministr ACT, však PREC, prohlásit PRED (enunc), #PersPron ACT, mít EFF, svědomí PAT, čistý RSTR

(14)

Extraction – IReflV

...corresponds to a reflexive t-lemma

Lexicon entries (excerpt): ... tvořit, tvořit se, tvrdit, tyčit se

1. Input text

Opatření se týká zejména domovníků.

The measure involves chiefly housekeepers.

2. PDT t-layer

opatření ACT, týkat_se PRED, zejména RHEM, domovník PAT

3. PDT a-layer

Opatření Sb, se AuxT, týká Pred, zejména AuxZ, domovníků Obj

4. Output annotation

<MWE category="IReflV">: se týká (involves)

Condition: t-lemma with a reflexive particle + the particle's type is AuxT

Range: verb + reflexive particle
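A minimal sketch of the two-part IReflV condition, assuming a simplified input: the verb's t-lemma, the index of the verb's surface word, and the function words linked to the same t-layer node as `(index, form, afun)` triples. The function name and this input layout are hypothetical; only the `_se` style of t-lemma and the AuxT label come from the slide. The `_si` variant and the contrasting AuxR label (reflexive passive) are added as assumptions about PDT conventions.

```python
def extract_ireflv(t_lemma, verb_index, aux_nodes):
    """Return sorted token indices of verb + reflexive particle, or None.

    aux_nodes: (index, form, afun) triples for function words linked
    to the verb's t-layer node (hypothetical layout).
    """
    # Condition 1: the t-lemma itself contains a reflexive particle.
    if not t_lemma.endswith(("_se", "_si")):
        return None
    # Condition 2: the particle is labelled AuxT on the a-layer.
    for index, form, afun in aux_nodes:
        if afun == "AuxT" and form.lower() in {"se", "si"}:
            return sorted([verb_index, index])
    return None


words = ["Opatření", "se", "týká", "zejména", "domovníků", "."]
span = extract_ireflv("týkat_se", 2, [(1, "se", "AuxT")])
print(" ".join(words[i] for i in span))  # se týká

# A plain t-lemma with a reflexive passive particle does not qualify.
print(extract_ireflv("tvořit", 0, [(1, "se", "AuxR")]))  # None
```

Checking both layers is the point of the rule: the t-lemma alone says the verb is lexically reflexive, and the AuxT label locates the particle that belongs to it in this sentence.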

(15)

Extraction – OTH

...corresponds to verbal MW lexemes

Condition: MWE, type 'lexeme' + neither idiom nor LVC + head is not a verb + it contains a verb or is marked as a verb in the lexicon

Range: content words listed in the MWE + auxiliary words (to be done)

1. Input text

Doktorand je studentem, jak se sluší a patří.

PhD-student is student, as <REFL> suits and befits.

A PhD student is a student, as he should be.

2. PDT t-layer

doktorand ACT, být PRED, student PAT, #Gen ACT, slušet_se RSTR, a CONJ, patřit_se RSTR

3. PDT a-layer

labels in the tree: verb, conjunction, verb

3b. SemLex

BASIC_FORM: jak se sluší a patří

LEMMATIZED: jak se slušet a patřit

MORPHO_TAGS: RB RFL VBZ CC VBZ

PDT_FREQ: 1

POS: verb

4. Output annotation

<MWE category="OTH">: jak se sluší a patří (as he should be)
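The OTH condition is a conjunction of four tests, which the following sketch spells out. Everything about the data layout is an assumption made for illustration: the dictionary keys are hypothetical, the head category "conjunction" is read off the a-layer labels on the slide, and verbs are recognised by tags starting with `VB`, as in the SemLex entry shown above. The second entry is an invented nominal lexeme used only as a negative case.

```python
def is_oth(mwe):
    """Apply the four OTH conditions to one MWE annotation (hypothetical dict layout)."""
    if mwe["type"] != "lexeme":               # must be an MWE of type 'lexeme'
        return False
    if mwe["is_idiom"] or mwe["is_lvc"]:      # already covered by ID or LVC
        return False
    if mwe["head_pos"] == "verb":             # the head must not be a verb
        return False
    contains_verb = any(tag.startswith("VB") for tag in mwe["morpho_tags"].split())
    return contains_verb or mwe["lexicon_pos"] == "verb"


jak_se_slusi = {
    "basic_form": "jak se sluší a patří",
    "type": "lexeme",
    "is_idiom": False,
    "is_lvc": False,
    "head_pos": "conjunction",
    "morpho_tags": "RB RFL VBZ CC VBZ",
    "lexicon_pos": "verb",
}
nominal_lexeme = {  # invented negative example
    "basic_form": "(some nominal expression)",
    "type": "lexeme",
    "is_idiom": False,
    "is_lvc": False,
    "head_pos": "noun",
    "morpho_tags": "JJ NN",
    "lexicon_pos": "noun",
}

print(is_oth(jak_se_slusi))    # True
print(is_oth(nominal_lexeme))  # False
```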

(16)

(17)

Deverbatives

LVC: no nominal CPHR in PDT

ID: several nominal DPHR in PDT

not all of them are deverbative; picked manually

and also some of them from project Lexemann

IReflV: many deverbatives (nominal / adverbial)

We used a rule-based ID and LVC recognizer by Milena Hnátková, extended to handle deverbatives.

Results were checked manually.

(18)

Overlapping

in general:

embedding

duplicates

some word is shared between two MWEs

(19)

Overlapping – same type (1)

duplicated annotation

PDT – Lexemann agreement

⇒ remove one

PDT deep layer:

The measure can be taken for six months at most and only for selected items.

= The measure can be taken for six months at most and the measure can be taken only for selected items.

⇒ remove one

(20)

Overlapping – same type (2)

different range, same type

coordination:

The ministry provides information services and counselling activities to small businesses.

⇒ preserve both

PDT – Lexemann disagreement:

to play a role vs. to play an important role

not to turn a hair vs. not to turn even a hair

to have no option vs. to have no other option

⇒ preserve PDT range

(21)

Overlapping – different type

IReflV is compatible with all other VMWEs

⇒ preserve both

different type (LVC vs ID) and same or different range

PDT – Lexemann disagreement

⇒ preserve PDT type and range
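The overlap rules from the last three slides can be combined into one pass, sketched below. The representation is hypothetical: each annotation is a category, a set of token indices, and its source. The sketch also assumes that a "PDT – Lexemann disagreement" can be detected as two annotations from different sources that share a token, which the slides do not state explicitly.

```python
from typing import NamedTuple


class Mwe(NamedTuple):
    category: str       # "LVC", "ID", "IReflV" or "OTH"
    tokens: frozenset   # token indices covered by the expression
    source: str         # "PDT" or "Lexemann"


def resolve_overlaps(annotations):
    """Drop duplicates and Lexemann annotations that conflict with PDT ones."""
    kept = []
    # Process PDT annotations first so that they win every conflict.
    for mwe in sorted(annotations, key=lambda m: m.source != "PDT"):
        conflict = False
        for other in kept:
            if mwe.category == other.category and mwe.tokens == other.tokens:
                conflict = True   # duplicated annotation: remove one
            elif (mwe.source != other.source
                  and "IReflV" not in (mwe.category, other.category)
                  and mwe.tokens & other.tokens):
                conflict = True   # disagreement: preserve PDT type and range
        if not conflict:
            kept.append(mwe)
    return kept


annotations = [
    Mwe("IReflV", frozenset({1, 2}), "PDT"),
    Mwe("LVC", frozenset({2, 3, 4}), "PDT"),      # shares token 2 with the IReflV: both stay
    Mwe("ID", frozenset({2, 3, 4}), "Lexemann"),  # disagrees with the PDT LVC: dropped
    Mwe("ID", frozenset({10, 11}), "PDT"),
    Mwe("ID", frozenset({10, 11}), "Lexemann"),   # duplicate: dropped
]
resolved = resolve_overlaps(annotations)
print([(m.category, sorted(m.tokens)) for m in resolved])
```

Two annotations from the same source that share only some tokens, as in the coordination example, never trigger either test, so both are preserved.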

(22)

Results

Number of instances per VMWE type

VMWE type   all instances   instances without overlaps
ID                  2,107                        1,611
LVC                 2,496                        2,437
IReflV             10,266                        9,982
OTH                     2                            2
Total                                           14,032
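As a quick arithmetic check of the table, the following sketch recomputes the total from the per-type counts and shows how many instances the overlap resolution removed for each type. The numbers are copied from the slide; the variable names are illustrative.

```python
# VMWE type -> (all instances, instances without overlaps), as reported on the slide
results = {
    "ID": (2107, 1611),
    "LVC": (2496, 2437),
    "IReflV": (10266, 9982),
    "OTH": (2, 2),
}

total_without_overlaps = sum(clean for _, clean in results.values())
print(total_without_overlaps)  # 14032, matching the "Total" row

# Instances removed by overlap resolution, per type
removed = {name: all_ - clean for name, (all_, clean) in results.items()}
print(removed)  # {'ID': 496, 'LVC': 59, 'IReflV': 284, 'OTH': 0}
```

The reported total of 14,032 is therefore the sum of the "without overlaps" column, not of the "all instances" column.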

(23)

Four principles – score

back to four principles

Principle A: to annotate MWEs as such

Principle B: to mark MWEs in a distinctive and specific way

Principle C: to annotate even discontinuous MWEs and MWEs of varying forms

Principle D: to allow for searching MWEs by their type

✓?

✗?

(28)

Conclusion

well-founded, rich annotation of MWEs in PDT

conforming to most of the four Parseme principles

almost fully automatic transformation

14 thousand verbal multiword expressions

Czech data – one of the largest data sets for the Parseme Shared Task

(29)

Acknowledgement

Czech Ministry of Education, Youth and Sports project PARSEME (LD14117)

European COST project PARSEME (IC1207)

LINDAT/CLARIN repository, supported by the MEYS (LM2010013, LM2015071)

We also thank our colleague Milena Hnátková, who kindly extracted the deverbative variants of VMWEs and manually checked them.

(30)

References

[1] Victoria Rosén, Koenraad De Smedt, Gyri Losnegaard, Eduard Bejček, Agata Savary, and Petya Osenova. MWEs in treebanks: From survey to guidelines. In Nicoletta Calzolari et al., editors, Proceedings of the 10th International Conference LREC 2016, pages 2323–2330, Paris, France, 2016.

[2] Victoria Rosén, Gyri Smørdal Losnegaard, Koenraad De Smedt, Eduard Bejček, Agata Savary, Adam Przepiórkowski, Petya Osenova, and Verginica Mititelu. A survey of multiword expressions in treebanks. In 14th International Workshop TLT 2015, pages 179–193, IPIPAN, Warszawa, Poland, 2015.

[3] Veronika Vincze, Agata Savary, Marie Candito, Carlos Ramisch, and Fabienne Cap. Annotation guidelines for the PARSEME shared task on automatic detection of verbal multiword expressions, version 6.0, 2016. http://typo.uni-konstanz.de/parseme/images/shared-task/guidelines/PARSEME-ST-annotation-guidelines-v6.pdf or http://parsemefr.lif.univ-mrs.fr/guidelines-hypertext

[4] Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová, Zdeněk Žabokrtský, Magda Ševčíková Razímová, and Zdeňka Urešová. Prague Dependency Treebank 2.0, 2006. LDC2006T01. Philadelphia, PA, USA.

[5] Pavel Straňák. Annotation of Multiword Expressions in The Prague Dependency Treebank. PhD thesis, Charles University in Prague, 2010.

[6] Zdeňka Urešová. Valence sloves v Pražském závislostním korpusu. Studies in Computational and Theoretical Linguistics. Ústav formální a aplikované lingvistiky, Praha, Czechia, 2011.
