TLT 15, Bloomington, Jan 2017
Extracting Verbal Multiword Data from Rich Treebank Annotation
Eduard Bejček, Jan Hajič, Pavel Straňák and Zdeňka Urešová
Charles University in Prague, Faculty of Mathematics and Physics, ÚFAL
{bejcek,hajic,stranak,uresova}@ufal.mff.cuni.cz
Introduction
Parseme Shared Task (PST)
within a European project on MWEs and parsing
competition between MWE identification systems
part of MWE Workshop at EACL 2017 in Valencia
still open for participation
blind test data was released yesterday; system submissions are due in a week
data for 18 languages (usu. thousands of MWEs)
manual annotation of all verbal MWEs in text
21st January, 2017 TLT 15, Bloomington, Indiana
Bejček, Hajič, Straňák, and Urešová:
Extracting Verbal Multiword Data from Rich Treebank Annotation 3/30
Motivation
18 languages from 18 countries
manual annotation according to PST Annotation Guidelines is needed for 17 languages
Czech already has an MWE-annotated corpus,
but it was created long before the PST Annotation Guidelines
Instead, let's try to transform the annotation!
= compare the guidelines and extract VMWEs
Overview of the talk
Types of verbal MWEs in PST
MWEs in Prague Dependency Treebank
Principles for good practice in annotation
VMWE extraction itself:
extraction of each type
(extraction of deverbative variants)
(resolution of overlapping annotations)
Results and conclusion
Types of VMWEs in PST [3] (1)
Light verb construction (LVC)
to make a decision, to come into bloom
Idiom (ID)
to stand firm, to come into play, to make it, to know on which side the bread is buttered
Inherently reflexive verb (IReflV)
FR: se suicider, s'apercevoir (“realize”, not “see”)
Types of VMWEs in PST [3] (2)
Verb-particle construction (VPC)
to put off, to blow up, to do in
Language-specific categories
Other verbal MWEs (OTH)
to drink and drive, to short-circuit
no VPC and LSpec categories in Czech
deverbatives
decision making, decision which he made, decision previously made
PDT and MWEs
Prague Dependency Treebank (PDT) [4]
several types of MWEs annotated in 2006, because of valency [6] annotation in PDT
light verb constructions
idioms and phrases (not only verbal)
reflexive verbs (PDT-Vallex)
all MWEs annotated in 2010, project Lexemann [5]
nominal, verbal, adverbial etc.
also multiword named entities
some of them correspond to PST categories,
but they are annotated in several diverse ways
Conversion
[Figure: one sentence shown in Prague Dependency Treebank 3.0 annotation and in PARSEME Shared Task annotation; legend: LVC, ID, OTH, IReflV]
PDT t-layer nodes: nevidomý ACT; dostat_se PRED; #QCor ACT; styk CPHR; pracovník PAT; rehabilitační RSTR; #PersPron ACT; utrpět TWHEN; zranění PAT
PST annotation cells as shown in the figure:
nsp   nsp
1:IReflV;2:LVC   1;2
2   2
3:ID   3
Nevidomý se dostane do styku s rehabilitačními pracovníky, když utrpí zranění.
Blind <REFL> gets into contact with rehabilitation workers, when sustains injury.
A blind man gets in touch with physiatrists when he sustains an injury.
Good practice for treebanks
annotation of MWEs in treebanks, Parseme
LREC'16 paper [1] resulting from TLT'15 paper [2]
Principle A: to annotate MWEs as such
Principle B: to mark MWEs in a distinctive and specific way
Principle C: to annotate even discontinuous MWEs and MWEs of varying forms
Principle D: to allow for searching MWEs by their type
And what about PDT?
Extraction – LVC
...corresponds to a CPHR functor
1. Input text: Zákon tak vstoupil v platnost.
   Gloss: Law so came into force.
   Translation: By that the law has come into force.
2. PDT t-layer: zákon ACT; tak MANN; vstoupit PRED; #QCor ACT; platnost CPHR; podepsat
3. PDT a-layer: Zákon Sb; tak Adv; vstoupil Pred; v AuxP; platnost Obj
4. Output annotation: <MWE category="LVC"> Zákon tak vstoupil v platnost. (has come into force)
Condition: CPHR node
Range: noun in CPHR node + its governing verb node + a preposition, if one occurs with the noun
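The condition and range above can be sketched in a few lines of Python. This is only an illustration on a toy data model, not the authors' extraction code: the `TNode` class, its `lex` and `aux` fields (token indices linking a t-layer node to surface words) and the function `extract_lvcs` are hypothetical names introduced here, and the generated #QCor node is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TNode:
    """Toy stand-in for a PDT t-layer node (not the real PML schema)."""
    t_lemma: str
    functor: str
    lex: Optional[int] = None  # index of the node's lexical token, if any
    aux: List[int] = field(default_factory=list)  # auxiliary tokens, e.g. prepositions
    parent: Optional["TNode"] = None


def extract_lvcs(t_nodes, afuns):
    """Return ("LVC", token_indices) for every CPHR node that has a governor."""
    found = []
    for node in t_nodes:
        # Condition: the node carries the CPHR functor.
        if node.functor != "CPHR" or node.parent is None:
            continue
        # Range: the noun, its governing verb, and any AuxP preposition of the noun.
        span = {node.lex, node.parent.lex}
        span.update(i for i in node.aux if afuns[i] == "AuxP")
        span.discard(None)
        found.append(("LVC", sorted(span)))
    return found


# Example sentence from the slide, without the final punctuation.
tokens = ["Zákon", "tak", "vstoupil", "v", "platnost"]
afuns = ["Sb", "Adv", "Pred", "AuxP", "Obj"]

verb = TNode("vstoupit", "PRED", lex=2)
t_nodes = [
    TNode("zákon", "ACT", lex=0, parent=verb),
    TNode("tak", "MANN", lex=1, parent=verb),
    verb,
    TNode("platnost", "CPHR", lex=4, aux=[3], parent=verb),
]

lvcs = extract_lvcs(t_nodes, afuns)
lvc_words = [[tokens[i] for i in span] for _, span in lvcs]
print(lvc_words)  # [['vstoupil', 'v', 'platnost']]
```

Returning token indices rather than word forms keeps discontinuous ranges representable, which matters for MWEs whose parts are not adjacent.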
Extraction – ID (1)
...corresponds to a DPHR functor (or to a MW lexeme)
1. Input text: Odezva na sebe nedala čekat.
   Gloss: Reaction on itself not-gave wait.
   Translation: The reaction didn't keep us waiting.
2. PDT t-layer: odezva ACT; #Neg RHEM; dát enunc PRED; na_sebe_čekat DPHR
3. Output annotation: <MWE category="ID"> Odezva na sebe nedala čekat. (didn't keep us waiting)
Condition: DPHR node
Range: all words mentioned in DPHR node + its governing verb node
Extraction – ID (2)
...corresponds to a DPHR functor (or to a MW lexeme)
Condition: DPHR node
Range: all words mentioned in DPHR node + its governing verb node
1. Input text: Nevěřícně kroutím hlavou nad legislativou.
   Gloss: Disbelievingly I-shake head over legislation.
   Translation: I am shaking my head in disbelief over the legislation.
2. PDT t-layer: root; #PersPron ACT; kroutit PRED; nevěřícný MANN; hlava PAT; legislativa REG (figure labels: mwe lexeme, verb)
3. Output annotation: <MWE category="ID"> Nevěřícně kroutím hlavou nad legislativou. (shaking my head)
Second t-layer tree shown on the slide: ministr ACT; však PREC; prohlásit enunc PRED; #PersPron ACT; mít EFF; svědomí PAT; čistý RSTR
Extraction – IReflV
...corresponds to a reflexive t-lemma
Lemma list shown on the slide: ... tvořit; tvořit se; tvrdit; tyčit se
1. Input text: Opatření se týká zejména domovníků.
   Translation: The measure involves chiefly housekeepers.
2. PDT t-layer: opatření ACT; týkat_se PRED; zejména RHEM; domovník PAT
3. PDT a-layer: Opatření Sb; se AuxT; týká Pred; zejména AuxZ; domovníků Obj
4. Output annotation: <MWE category="IReflV"> Opatření se týká zejména domovníků. (involves)
Condition: t-lemma with a reflexive particle + the particle's type is AuxT
Range: verb + reflexive particle
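A minimal sketch of this check, assuming each t-layer node is reduced to a hypothetical triple (t-lemma, index of the verb token, index of the reflexive token or `None`). The function name `extract_ireflvs` and the suffix list are illustrative assumptions; the slide only shows the `_se` case, and `_si` is added here as the other Czech reflexive particle.

```python
# Reflexive t-lemmas are assumed to end in one of these suffixes.
REFLEXIVE_SUFFIXES = ("_se", "_si")


def extract_ireflvs(t_nodes, afuns):
    """Return ("IReflV", token_indices) for reflexive t-lemmas with an AuxT particle."""
    found = []
    for t_lemma, verb_idx, refl_idx in t_nodes:
        # Condition, part 1: the t-lemma contains a reflexive particle.
        if not t_lemma.endswith(REFLEXIVE_SUFFIXES):
            continue
        # Condition, part 2: the particle's a-layer function is AuxT.
        if refl_idx is None or afuns[refl_idx] != "AuxT":
            continue
        # Range: verb + reflexive particle.
        found.append(("IReflV", sorted([verb_idx, refl_idx])))
    return found


# Example sentence from the slide, without the final punctuation.
tokens = ["Opatření", "se", "týká", "zejména", "domovníků"]
afuns = ["Sb", "AuxT", "Pred", "AuxZ", "Obj"]
t_nodes = [
    ("opatření", 0, None),
    ("týkat_se", 2, 1),
    ("zejména", 3, None),
    ("domovník", 4, None),
]

ireflvs = extract_ireflvs(t_nodes, afuns)
ireflv_words = [[tokens[i] for i in span] for _, span in ireflvs]
print(ireflv_words)  # [['se', 'týká']]
```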
Extraction – OTH
...corresponds to verbal MW lexemes
Condition: MWE, type 'lexeme' + neither idiom nor LVC + head is not a verb + it contains a verb or is marked as a verb in the lexicon
Range: content words are listed in the MWE + auxiliary words (to be done)
1. Input text: Doktorand je studentem, jak se sluší a patří.
   Gloss: PhD-student is student, as <REFL> suits and befits.
   Translation: A PhD student is a student, as he should be.
2. PDT t-layer: doktorand ACT; být PRED; student PAT; #Gen ACT; slušet_se RSTR; a CONJ; patřit_se RSTR
3. PDT a-layer (figure labels): verb, conjunction, verb
3b. SemLex entry:
   BASIC_FORM: jak se sluší a patří
   LEMMATIZED: jak se slušet a patřit
   MORPHO_TAGS: RB RFL VBZ CC VBZ
   PDT_FREQ: 1
   POS: verb
4. Output annotation: <MWE category="OTH"> Doktorand je studentem, jak se sluší a patří. (as he should be)
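The OTH condition is a conjunction of four tests, which a small predicate can make explicit. The sketch below is hypothetical throughout: the entry is a plain dictionary using the SemLex field names shown on the slide, while the parameters `mwe_type`, `other_categories` and `head_tag`, and the rule that a tag starting with "VB" counts as verbal, are assumptions made for illustration. The head tag "CC" in the example is likewise assumed, not taken from the data.

```python
def is_oth(entry, mwe_type, other_categories, head_tag):
    """Check the four parts of the OTH condition on one lexicon entry."""
    # MWE of type 'lexeme'
    if mwe_type != "lexeme":
        return False
    # neither idiom nor LVC
    if other_categories & {"ID", "LVC"}:
        return False
    # head is not a verb (assumption: verbal tags start with "VB")
    if head_tag.startswith("VB"):
        return False
    # it contains a verb, or the lexicon marks it as a verb
    contains_verb = any(tag.startswith("VB") for tag in entry["MORPHO_TAGS"].split())
    return contains_verb or entry["POS"] == "verb"


entry = {
    "BASIC_FORM": "jak se sluší a patří",
    "LEMMATIZED": "jak se slušet a patřit",
    "MORPHO_TAGS": "RB RFL VBZ CC VBZ",
    "PDT_FREQ": 1,
    "POS": "verb",
}

# Assumed head: the coordinating conjunction "a" (tag CC).
print(is_oth(entry, "lexeme", set(), head_tag="CC"))  # True
```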
Deverbatives
LVC: no nominal CPHR in PDT
ID: several nominal DPHR in PDT
not all of them are deverbative; picked manually
and also some of them from project Lexemann
IReflV: many deverbatives (nominal / adverbial)
We used a rule-based ID and LVC recognizer by Milena Hnátková, upgraded for deverbatives.
Results were checked manually.
Overlapping
in general:
embedding
duplicates
some word is shared between two MWEs
Overlapping – same type (1)
duplicated annotation
PDT – Lexemann agreement
⇒ remove one
PDT deep layer:
The measure can be taken for six months at most and only for selected items.
= The measure can be taken for six months at most and the measure can be taken only for selected items.
⇒ remove one
Overlapping – same type (2)
different range, same type
coordination:
The ministry provides information services and counselling activities to small businesses.
⇒ preserve both
PDT – Lexemann disagreement:
to play a role vs. to play an important role
not to turn a hair vs. not to turn even a hair
to have no option vs. to have no other option
⇒ preserve PDT range
Overlapping – different type
IReflV is compatible with all other VMWEs
⇒ preserve both
different type (LVC vs ID) and same or different range
PDT – Lexemann disagreement
⇒ preserve PDT type and range
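The resolution rules from the three overlap slides can be approximated as follows. This is a simplified sketch, not the procedure actually used: the `Mwe` class and `resolve_overlaps` are hypothetical names, any shared token is treated as a PDT–Lexemann disagreement, and cases the slides do not cover (for instance two overlapping Lexemann annotations) are simply left untouched.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Mwe:
    category: str         # "ID", "LVC", "IReflV" or "OTH"
    span: FrozenSet[int]  # token indices covered by the MWE
    source: str           # "PDT" or "Lexemann"


def resolve_overlaps(mwes):
    """Apply the overlap rules: drop duplicates, prefer PDT, keep IReflV."""
    # Duplicates (same type, same range): keep one, preferring the PDT copy.
    best = {}
    for m in sorted(mwes, key=lambda m: m.source != "PDT"):
        best.setdefault((m.category, m.span), m)
    kept = list(best.values())

    pdt = [m for m in kept if m.source == "PDT"]
    result = []
    for m in kept:
        # A Lexemann MWE that shares a token with a PDT MWE is dropped,
        # so the PDT type and range are preserved. IReflV is compatible
        # with all other VMWEs and never triggers a conflict.
        conflicts = (
            m.source == "Lexemann"
            and m.category != "IReflV"
            and any(p.category != "IReflV" and p.span & m.span for p in pdt)
        )
        if not conflicts:
            result.append(m)
    return result


# Hypothetical token indices; which source produced which range is assumed.
mwes = [
    Mwe("LVC", frozenset({0, 3}), "PDT"),          # to play ... role
    Mwe("LVC", frozenset({0, 2, 3}), "Lexemann"),  # to play ... important role
    Mwe("LVC", frozenset({0, 3}), "Lexemann"),     # exact duplicate of the first
    Mwe("IReflV", frozenset({5, 6}), "PDT"),
    Mwe("LVC", frozenset({6, 7, 8}), "PDT"),       # shares token 6 with the IReflV
]

resolved = resolve_overlaps(mwes)
print(len(resolved))  # 3
```

Two PDT annotations of the same type with different ranges, as in the coordination example, both survive, because only Lexemann annotations are ever removed in favour of PDT ones.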
Results
VMWE type    all instances    instances without overlaps
ID                   2,107                         1,611
LVC                  2,496                         2,437
IReflV              10,266                         9,982
OTH                      2                             2
Total                                             14,032
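As a quick consistency check, the reported total equals the sum of the "without overlaps" column. The snippet below recomputes it from the figures in the table and also derives how many instances were removed per type; the variable names are arbitrary.

```python
# (all instances, instances without overlaps), copied from the Results table
counts = {
    "ID": (2107, 1611),
    "LVC": (2496, 2437),
    "IReflV": (10266, 9982),
    "OTH": (2, 2),
}

total_all = sum(all_n for all_n, _ in counts.values())
total_without_overlaps = sum(kept for _, kept in counts.values())
removed = {name: all_n - kept for name, (all_n, kept) in counts.items()}

print(total_without_overlaps)  # 14032
print(total_all)               # 14871
print(removed)                 # {'ID': 496, 'LVC': 59, 'IReflV': 284, 'OTH': 0}
```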
Four principles – score
back to four principles
Principle A: to annotate MWEs as such ✓?
Principle B: to mark MWEs in a distinctive and specific way ✓
Principle C: to annotate even discontinuous MWEs and MWEs of varying forms ✓
Principle D: to allow for searching MWEs by their type ✗?
Conclusion
well-founded, rich annotation of MWEs in PDT
conforming to most of the four Parseme principles
almost fully automatic transformation
14 thousand verbal multiword expressions
Czech data – one of the largest data sets for the Parseme Shared Task
Acknowledgement
Czech Ministry of Education, Youth and Sports project PARSEME (LD14117)
European COST project PARSEME (IC1207)
LINDAT/CLARIN repository, supported by the MEYS (LM2010013, LM2015071)
We also thank our colleague Milena Hnátková who
kindly extracted deverbative variants of VMWEs and
manually checked them.
References
[1] Victoria Rosén, Koenraad De Smedt, Gyri Losnegaard, Eduard Bejček, Agata Savary, and Petya Osenova. MWEs in treebanks: From survey to guidelines. In Nicoletta Calzolari et al., editors, Proceedings of the 10th International Conference LREC 2016, pages 2323–2330, Paris, France, 2016.
[2] Victoria Rosén, Gyri Smørdal Losnegaard, Koenraad De Smedt, Eduard Bejček, Agata Savary, Adam Przepiórkowski, Petya Osenova, and Verginica Mititelu. A survey of multiword expressions in treebanks. In 14th International Workshop TLT 2015, pages 179–193, IPIPAN, Warszawa, Poland, 2015.
[3] Veronika Vincze, Agata Savary, Marie Candito, Carlos Ramisch, and Fabienne Cap. Annotation guidelines for the PARSEME shared task on automatic detection of verbal multiword expressions, version 6.0, 2016. http://typo.uni-konstanz.de/parseme/images/shared-task/guidelines/PARSEME-ST-annotation-guidelines-v6.pdf or http://parsemefr.lif.univ-mrs.fr/guidelines-hypertext
[4] Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová, Zdeněk Žabokrtský, Magda Ševčíková Razímová, and Zdeňka Urešová. Prague Dependency Treebank 2.0, 2006. LDC2006T01. Philadelphia, PA, USA.
[5] Pavel Straňák. Annotation of Multiword Expressions in The Prague Dependency Treebank. PhD thesis, Charles University in Prague, 2010.
[6] Zdeňka Urešová. Valence sloves v Pražském závislostním korpusu. Studies in Computational and Theoretical Linguistics. Ústav formální a aplikované lingvistiky, Praha, Czechia, 2011.