Improving Dependency Parsing Using Sentence Clause Charts


(1)

Linguistic Mondays, 10.10.2016 MFF UK

Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics

Charles University in Prague

Czech Republic

kriz@ufal.mff.cuni.cz

http://ufal.mff.cuni.cz/vincent-kriz

(2)

large collections of documents

efficient browsing & querying

typical approaches

– full-text search

– meta-data search

no semantics

(3)

Intelligent Library (INTLIB)

– founded by

– 2012–2015

– partners

(4)

New search approach

– semantic interpretation of documents

– suitable DB & query language

– user-friendly browsing & querying

(5)

Knowledge base

– set of entities and relations between them

(6)

RExtractor

– information extraction system

(7)

from plain-texts

server architecture

– process client's requests

– REST API

– web interface (~ demo)

http://quest.ms.mff.cuni.cz:14280

(8)

– queries over dependency trees

– domain and language independent

real use-case defined by INTLIB

– definitions, rights and obligations in Czech laws

– Czech extraction strategy

(9)

– English extraction strategy

(10)

– Accounting Act (563/1991 Coll.)

– Decree on Double-entry Accounting for undertakers (500/2002 Coll.)

– automatically parsed, then manually checked

1,133 manually annotated dependency trees

35,085 tokens

Czech Legal Text Treebank

(11)

– Kríž Vincent, Hladká Barbora, Urešová Zdeňka: Czech Legal Text Treebank 1.0. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Copyright © European Language Resources Association, Paris, France, ISBN 978-2-9517408-9-1, pp. 2387-2392, 2016

(12)

Error     # of errors   Ratio
Parser    145           59.7%
Query     93            38.3%
Entity    5             2.1%

(13)

MST parser

Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič (2005): Non-projective Dependency Parsing using Spanning Tree Algorithms. In: Proceedings of HLT/EMNLP, Vancouver, British Columbia.

– trained on newspaper texts

long sentences still problematic

(14)

– as the sentence length increases, the unlabeled attachment score (UAS) decreases

[Figure: UAS (70–95%) by sentence length (1-10, 11-20, 21-30, 31-40, 41-50, 51+) for PDT dtest, PDT etest and CAC]
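UAS is the share of tokens whose predicted head matches the gold head. The following is a minimal sketch of that computation for a single sentence. The function name `uas`, the head encoding (1-based token positions, 0 for the root) and the five-token example are assumptions made for illustration, not data from the talk.

```python
from typing import Sequence


def uas(gold_heads: Sequence[int], predicted_heads: Sequence[int]) -> float:
    """Return the fraction of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted head lists must have equal length")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Hypothetical 5-token sentence; heads are 1-based positions, 0 = root.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]
score = uas(gold, pred)  # 4 of 5 heads agree
```

Corpus-level scores are usually obtained by counting correct heads over all tokens of all sentences rather than by averaging per-sentence values.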

(15)

– Kuboň (2001), Kuboň et al. (2007)

– segments – easily detectable and linguistically motivated units

– may be combined into clauses

– provide a structure of a complex sentence with regard to the mutual relationship of individual clauses

(16)

– Lopatková and Holan (2009)

– a new module between morphological and syntactic analysis

– determine the overall sentence structure

segmentation chart

relationship among segments

especially relations of coordination, apposition and subordination

(17)

řeči rád zdůrazňoval své vzdělání.

Credits: Lopatková and Holan (2009)

(18)

řeči rád zdůrazňoval své vzdělání.

split sentence into segments

– rule-based boundaries identification

punctuation marks, coordinating conjunctions, brackets, …

Credits: Lopatková and Holan (2009)

(19)

determine mutual relations

– manually designed rules

finite verb

subordinating expression

opening bracket

Credits: Lopatková and Holan (2009)

S tím byly trochu problémy, protože starosta … vzdělání.

(20)

Credits: Lopatková and Holan (2009) S tím byly trochu problémy

protože starosta … vzdělání

(21)

segmentation chart

captures the layer of embedding for individual segments


(22)

segmentation chart principles

main segments belong to layer 0

segments that depend on segment on layer k belong to k+1

coordinated segments have the same layer

segments in parenthesis/brackets belong to k+1 layer


(23)

– Lopatková et al. (2012)

– manual clause structure annotation based on the concept of segments

– 2,699 annotated sentences

(24)

– Krůza and Kuboň (2014)

– automatic procedure for recognizing clauses and their mutual relationship from plain-texts

– Bejček et al. (2013)

– automatic procedure for recognizing clauses and their mutual relationship from dependency trees

– used for clause annotation in PDT 3.0

(25)
(26)

– Lopatková and Holan (2009)

– two differences

subordinating conjunctions at the beginning of each clause are considered as boundaries

clauses split into two parts (by an embedded clause) are considered as two different clauses

(27)

While failure is usually an orphan, the success tends to have many fathers, claiming eagerly that particularly they were present at its conception.

Boundaries (B) and clauses with their layers, in sentence order:

B   while
1   failure is usually an orphan
B   ,
0   the success tends to have many fathers
B   ,
1   claiming eagerly
B   that
2   particularly they were present at its conception

Clause chart: B1B0B1B2

Credits: Kuboň et al. (2007)
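The chart string for this sentence can be produced mechanically once each unit is labelled as a boundary or as a clause with a layer. The sketch below assumes a simple list of `(text, layer)` pairs, with `None` marking a boundary. The `Item` type and the `clause_chart` function are hypothetical names introduced here.

```python
from typing import List, Optional, Tuple

# (text, layer); layer is None for a boundary, an int for a clause.
Item = Tuple[str, Optional[int]]


def clause_chart(items: List[Item]) -> str:
    """Linearize boundaries and clause layers into a chart such as 'B1B0B1B2'."""
    return "".join("B" if layer is None else str(layer) for _, layer in items)


sentence: List[Item] = [
    ("While", None),
    ("failure is usually an orphan", 1),
    (",", None),
    ("the success tends to have many fathers", 0),
    (",", None),
    ("claiming eagerly", 1),
    ("that", None),
    ("particularly they were present at its conception", 2),
]
chart = clause_chart(sentence)
```

Writing layers as bare digits keeps the chart compact, but it would become ambiguous for ten or more layers of embedding.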

(28)

– from dependency trees with the clause annotation

– layer of embedding of a clause → number of different clauses on the path from the clause to the root in the dependency tree
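A minimal sketch of this definition follows. It assumes each token carries a head position (1-based, 0 for the root) and a clause id, with `None` for boundary tokens, and it counts only clauses other than the token's own. The function name `clause_layers`, the input encoding and the toy tree for "John said that Mary left" are all illustrative assumptions, not the annotation format used in PDT.

```python
from typing import Dict, List, Optional


def clause_layers(heads: List[int], clause_of: List[Optional[int]]) -> Dict[int, int]:
    """Map each clause id to the number of other clauses on its path to the root.

    heads[i] is the 1-based head of token i+1 (0 = root); clause_of[i] is the
    clause id of token i+1, or None for a boundary token. The tree is assumed
    to be acyclic.
    """
    layers: Dict[int, int] = {}
    for i, own in enumerate(clause_of):
        if own is None or own in layers:
            continue  # boundaries have no layer; each clause is measured once
        seen = set()
        node = heads[i]
        while node != 0:
            other = clause_of[node - 1]
            if other is not None and other != own:
                seen.add(other)
            node = heads[node - 1]
        layers[own] = len(seen)
    return layers


# Hypothetical tree: John(1) said(2) that(3) Mary(4) left(5)
heads = [2, 0, 2, 5, 3]
clause_of = [1, 1, None, 2, 2]
layers = clause_layers(heads, clause_of)
```

The sketch measures each clause from its first token, which presumes that every token of a clause passes through the same set of other clauses on the way to the root.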

(29)

se jim podařilo rozdělit molekuly dekaboranu na části, nepodařilo se zmíněným vědcům zatím určit, jaké produkty přitom vznikly.

(36)

podařilo rozdělit molekuly dekaboranu na části, nepodařilo se zmíněným vědcům zatím určit, jaké produkty přitom vznikly.

[Figure: layers of embedding assigned to the clauses of the sentence above]

Clause chart: B1B2B0B1B

(37)
(38)

[Figure: relative frequency (Rel. freq.) and UAS of the clause charts 0, 0B1, 0B0, 0B1B0, 0B1B2 in PDT train, PDT dtest, PDT etest and CAC 2.0]

(39)

– sentence with 36 clauses

– sentence with 7 layers of embedding

0B1B2B3B4B5B6

(40)

exploit an existing dependency parser

trained on complete sentences

exploit gold-standard clause charts

Kríž Vincent, Hladká Barbora: Improving Dependency Parsing Using Sentence Clause Charts. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics – Student Research Workshop

two specific strategies

parsing coordinated clauses

parsing subordinated clauses

(41)

coordinated clauses

– let's explore the simplest sentences with coordinated clauses: 0B0

– how good is the full-scale parser on individual clauses from 0B0?

[Figure: UAS of Full-scale and Clauses on PDT dtest, PDT etest and CAC 2.0]

(42)

+4% of UAS on average

(43)

C_1, C_2, …, C_n: neighboring coordinated clauses on the same layer

– parse C_i individually, obtain dependency tree T_i with root node r_i

– create a sequence of tokens S = r_1 B_1,2 r_2 B_2,3 … r_n

– parse S, obtain T_s

– build a final dependency tree using T_i and T_s
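The steps above can be sketched as follows, using the talk's example sentence "John loves Mary and Linda hates Peter". This is an illustration under stated assumptions, not the authors' implementation: the parser is any callable returning 1-based head positions with 0 for the root, each clause tree is assumed to have exactly one root, and `parse_coordinated`, `toy_parse` and the lookup table of trees are hypothetical names and data introduced here.

```python
from typing import Callable, Dict, List, Tuple

# A parser maps tokens to 1-based head positions; 0 marks the root.
Parser = Callable[[List[str]], List[int]]


def parse_coordinated(
    clauses: List[List[str]],
    boundaries: List[List[str]],
    parse: Parser,
) -> List[int]:
    """Combine per-clause trees T_i with the tree T_s over S = r_1 B_1,2 r_2 ..."""
    if len(boundaries) != len(clauses) - 1:
        raise ValueError("expected one boundary between each pair of clauses")
    heads: List[int] = []        # final heads over the whole token sequence
    s_tokens: List[str] = []     # the reduced sequence S
    s_positions: List[int] = []  # sentence position of each token in S
    offset = 0
    for i, clause in enumerate(clauses):
        local = parse(clause)                 # T_i
        root = local.index(0)                 # r_i; assumes a single root
        heads.extend(offset + h if h else 0 for h in local)
        s_tokens.append(clause[root])
        s_positions.append(offset + root + 1)
        offset += len(clause)
        if i < len(boundaries):
            for token in boundaries[i]:       # B_i,i+1
                heads.append(0)               # placeholder, set from T_s below
                s_tokens.append(token)
                s_positions.append(offset + 1)
                offset += 1
    s_heads = parse(s_tokens)                 # T_s
    for position, h in zip(s_positions, s_heads):
        heads[position - 1] = s_positions[h - 1] if h else 0
    return heads


# Hypothetical stand-in for a real parser: it looks trees up in a table.
TOY_TREES: Dict[Tuple[str, ...], List[int]] = {
    ("John", "loves", "Mary"): [2, 0, 2],
    ("Linda", "hates", "Peter"): [2, 0, 2],
    ("loves", "and", "hates"): [2, 0, 2],
}


def toy_parse(tokens: List[str]) -> List[int]:
    return TOY_TREES[tuple(tokens)]


final_heads = parse_coordinated(
    [["John", "loves", "Mary"], ["Linda", "hates", "Peter"]],
    [["and"]],
    toy_parse,
)
```

Heads inside each clause come from T_i, while the heads of the clause roots and the boundary tokens come from T_s. Sentence-final punctuation is left out of the sketch.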

(44)

John loves Mary and Linda hates Peter.

C_1 = {John loves Mary}, C_2 = {Linda hates Peter}

parse individual clauses

C_1 → T_1, r_1 = loves

C_2 → T_2, r_2 = hates

create a sequence of tokens S = {loves and hates}

parse S, obtain T_s

build a final dependency tree

(45)

[Figure: UAS of Full-scale, Clauses and CCP on PDT dtest, PDT etest and CAC 2.0]

(46)

+1.4% of UAS on average

(47)

subordinated clauses

exploring 0B1 sentences

almost no improvement when parsing individual clauses

UAS is significantly higher than the overall UAS

[Figure: UAS of Full-scale and Clauses on PDT dtest, PDT etest and CAC 2.0]

(48)

[Figure: relative frequency (Rel. freq.) and UAS of the clause charts 0, 0B1, 0B0, 0B1B0, 0B1B2 in PDT train, PDT dtest, PDT etest and CAC 2.0]

(49)

C_1, C_2, …, C_n: the longest sequence of neighboring subordinated clauses, layer(C_i+1) = layer(C_i) + 1

– create a sequence of tokens S = C_1 B_1,2 C_2 B_2,3 … C_n

– parse S, obtain T_s
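Finding the sequence C_1 … C_n and building S can be sketched as below. The sketch assumes the clause layers are given in sentence order and that `boundaries[i]` holds the tokens between clause i and clause i+1. The function names and the English example with chart 0B1B2 are hypothetical, and ties between equally long runs are resolved in favour of the leftmost one, which is an arbitrary choice.

```python
from typing import List, Tuple


def longest_subordinated_run(layers: List[int]) -> Tuple[int, int]:
    """Return (start, end) of the longest run where each layer is the previous + 1.

    The end index is exclusive; the leftmost run wins on ties.
    """
    best = (0, 0)
    start = 0
    for i in range(1, len(layers) + 1):
        if i == len(layers) or layers[i] != layers[i - 1] + 1:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i
    return best


def build_sequence(
    clauses: List[List[str]],
    boundaries: List[List[str]],
    start: int,
    end: int,
) -> List[str]:
    """Build S = C_1 B_1,2 C_2 ... C_n for the clauses in [start, end)."""
    sequence: List[str] = []
    for i in range(start, end):
        sequence.extend(clauses[i])
        if i + 1 < end:
            sequence.extend(boundaries[i])
    return sequence


# Hypothetical sentence with clause chart 0B1B2.
clauses = [["John", "said"], ["Mary", "believes"], ["Peter", "left"]]
boundaries = [["that"], ["that"]]
run = longest_subordinated_run([0, 1, 2])
s_tokens = build_sequence(clauses, boundaries, *run)
```

The resulting token sequence is what would be handed to the parser to obtain T_s.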

(50)

clauses

evaluation on 0B1B0 sentences

parse 0B1

parse 0B0

[Figure: UAS of Full-scale and CCP on PDT dtest, PDT etest and CAC 2.0]

(51)

+1.6% of UAS on average

(52)

– work in cycles

– check the deepest layer

if there are coordinated clauses → apply 0B0 strategy

otherwise identify the longest sequence of subordinated clauses → apply 0B1 strategy

– use standard full-scale parsing as a fall-back

(53)

[Figure: UAS of Full-scale and CCP on PDT dtest, PDT etest and CAC 2.0]

(54)

+1.0% of UAS on average

(55)

[Figure: UAS of Full-scale and CCP on PDT dtest, PDT etest and CAC 2.0]

(56)

+0.7% of UAS on average

(57)

– Czech Legal Text Treebank 1.0

– relation extraction in RExtractor

clause charts

– extraction from plain-text

special parsers

– train on individual clauses

(58)

parsing

1% increase of UAS on complex sentences

(59)

in the real parsing task, automatically detected clause structures must be used, not gold-standard

we can train specialized clause-parsers – for main clauses, subordinated clauses, merge clauses, …

we can find out better strategies for parsing sequences of subordinated clauses
