(1)

Syntactic annotation of a second-language learner corpus

Jirka Hana & Barbora Hladká

Charles University Prague

ICBLT 2018

(2)

CzeSL – Corpus of L2 Czech

ICBLT 2018 Hana & Hladká: Syntactic annotation of a second-language learner corpus 2

(3)

CzeSL – Czech as a Second Language

• Part of AKCES – Acquisition Corpora of Czech

• Essays written by non-native speakers of Czech

• A1 – C1 CEFR proficiency levels

(4)

CzeSL – two releases

• CzeSL-SGT

  • 8,600 essays, 1.1M tokens

  • http://hdl.handle.net/11234/1-162, CC BY-SA-3.0

• CzeSL-man (the release we work with here)

  • 645 essays, 120K tokens, 11K sentences

  • Manually corrected and annotated for errors

  • https://bitbucket.org/czesl, CC BY-SA-3.0

(5)

CzeSL – number of documents by CEFR Level

                  Level   Documents
Basic user        A1      57
                  A1+     3
                  A2      111
                  A2+     145
Independent user  B1      176
                  B2      124
Proficient user   C1      12
Unknown                   17
Total                     645
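
The counts can be checked and grouped by user band in a few lines of Python. This is only an illustrative sketch: the names `docs_by_level` and `band` are introduced here for convenience and are not part of the corpus distribution.

```python
# Document counts per CEFR level, copied from the slide.
docs_by_level = {
    "A1": 57, "A1+": 3, "A2": 111, "A2+": 145,
    "B1": 176, "B2": 124, "C1": 12, "Unknown": 17,
}

total = sum(docs_by_level.values())


def band(level: str) -> str:
    """Map a CEFR level to its user band; anything else counts as Unknown."""
    bands = {"A": "Basic user", "B": "Independent user", "C": "Proficient user"}
    return bands.get(level[0], "Unknown")


# Sum the documents within each band.
docs_by_band = {}
for level, count in docs_by_level.items():
    docs_by_band[band(level)] = docs_by_band.get(band(level), 0) + count

for name, count in docs_by_band.items():
    print(f"{name:17s} {count:4d}  {count / total:.1%}")
```

Grouped this way, almost all of the documents fall in the Basic and Independent bands, with only 12 at C1.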

(6)

Non-native and native language are different

Non-native language has:

• Errors in spelling, grammar, vocabulary, collocations

• Different distribution of vocabulary and syntactic constructions

(7)

CzeSL: Error Annotation Scheme

Tier 0 (original text): Myslim že kdy by byl se svim ditem ...

Tier 1 (words correct): Myslím že kdyby byl se svým dítětem ...

Tier 2 (contextually correct): Myslím , že kdybych byl se svým dítětem ...

Gloss (Tier 2): think.SG1 that if.SG1 was.MASC with my child ...

`I think that if I were with my child ...’

Tiers 1 and 2 are corrections of Tier 0.
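
To make the relation between the tiers concrete, the sketch below holds the three tiers as token lists and reports which spans change from one tier to the next. It is an illustration only: the names `tiers` and `changes` are hypothetical, the token-level diff from Python's `difflib` stands in for the manual links between tiers, and this is not the corpus's actual file format.

```python
from difflib import SequenceMatcher

# The example sentence on each tier, split on whitespace (trailing "..." dropped).
tiers = {
    0: "Myslim že kdy by byl se svim ditem".split(),
    1: "Myslím že kdyby byl se svým dítětem".split(),
    2: "Myslím , že kdybych byl se svým dítětem".split(),
}


def changes(lower, upper):
    """Return (operation, lower tokens, upper tokens) for each differing span."""
    matcher = SequenceMatcher(a=lower, b=upper, autojunk=False)
    return [
        (op, lower[i1:i2], upper[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


for low in (0, 1):
    print(f"Tier {low} -> Tier {low + 1}")
    for op, old, new in changes(tiers[low], tiers[low + 1]):
        print(f"  {op:8s} {old} -> {new}")
```

On this sentence the Tier 0 to Tier 1 step covers word-level fixes (diacritics, joining `kdy by`), while the Tier 1 to Tier 2 step adds the comma and changes `kdyby` to the first-person form `kdybych`, which needs sentence context.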

(8)

Sample non-native text: My Family

Jmenujese Adam. Ja jsem Mongolska. Mongolska ma 21 kraji. Moje rodina je hezka jeste velka. Mongolska je 3000 million lidi. Ma tradični píseňka,

taneční. Mongolska tradicni píseňka je hezka. Ješte ma ”Morin khuur ”. Morin Khuur to je muzika. Ten hezka tradični pohádka, píseň. Mongolska má mnoho tradiční svátík. Třiba Naadam, Tsagaarsur. Ješte mnoho Velbloud, Kůn, Kravá, Koza, Ovce. Mongolsky lidi dobrý. Mongolsko ma mnoho hory a nemam ocean.

Mongolska hlavní naměsto. Ulaanbaatar.

ADAM, 18 Let

Bydlim v Cechagh už 6 měsíc.

(9)

Task: Annotate some structure of L2 Czech

Motivation:

- better understanding of L2 Czech (including its grammar)

- better computational processing of L2 Czech

Some structure?

- the deeper, the better, ideally semantics

- dependency syntax for practical purposes

Work in progress ...

(10)

Universal Dependencies

(11)

Universal Dependencies (UD)

• 100+ corpora in 60+ languages

• syntactic annotation based on dependency syntax

• language agnostic (mostly)

(12)

Universal Dependencies (UD)

Yesterday, John invited Mary to his birthday party.

(13)

UD Annotation of Non-native Czech

(14)

Annotating original text

• Annotate the original text, not corrections

• Ideal case: use grammar of author's interlanguage

• Reality: often, not enough data

• Be conservative, assume as little as possible

(15)

Annotating original text – Example 1

• Standard – oblique (adjunct):

Vstoupit do místnosti.
enter into room
`Enter a room.’

• Non-native – direct object:

Vstoupit místnost.
enter room
Intended: `Enter a room.’
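
The contrast in Example 1 can be written out as dependency triples. The sketch below is our reading of the slide, not the authors' released annotation: the `Token` type and the `dependents` helper are hypothetical, and only the `obl` versus `obj` contrast comes from the slide itself. The remaining labels (`root`, `case`, `punct`) follow general UD conventions.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    upos: str
    head: int      # 0 marks the root
    deprel: str


# Standard Czech: the noun is introduced by a preposition and attaches as an oblique.
standard = [
    Token(1, "Vstoupit", "VERB", 0, "root"),
    Token(2, "do", "ADP", 3, "case"),
    Token(3, "místnosti", "NOUN", 1, "obl"),
    Token(4, ".", "PUNCT", 1, "punct"),
]

# Learner text, annotated as written: a bare noun attaches as a direct object.
non_native = [
    Token(1, "Vstoupit", "VERB", 0, "root"),
    Token(2, "místnost", "NOUN", 1, "obj"),
    Token(3, ".", "PUNCT", 1, "punct"),
]


def dependents(sentence, head_form):
    """Return (form, deprel) for every token attached to the given head word."""
    head_ids = {t.id for t in sentence if t.form == head_form}
    return [(t.form, t.deprel) for t in sentence if t.head in head_ids]


print(dependents(standard, "Vstoupit"))
print(dependents(non_native, "Vstoupit"))
```

Annotating the learner sentence as it stands, rather than its correction, is what yields `obj` here instead of `obl`.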

(16)

Annotating original text – Example 2

• Standard – nej `most’ is a prefix:

největší město v Kolumbii
biggest city in Colombia
`the biggest city in Colombia’

• Non-native – nej is a word:

nej větší město v Kolumbii
most bigger city in Colombia
`the biggest city in Colombia’

(17)

Annotating original text – an unclear example

Oba jsou stejné důležité.
both are equal important
`Both are equally important.’

- stejné = equal (adjective)

- stejně = equally (adverb)

What is it?

- spelling error

- adj/adv neutralization

- other error

What is the lemma, POS, syntactic function?

(18)

Sometimes UD helps ...

Jsem Mongolska.

`I am Mongolian / a Mongolian / from Mongolia’

Jsem mongolský. – adjective; not used in the standard language

Jsem Mongol. – inhabitant, noun

Jsem z Mongolska. – country, preposition + noun

All three readings have the same structure in UD.

(19)

Current status

• 2,100 sentences out of 11,000 annotated so far

• 100 sentences double-annotated; agreement measured with Cohen’s kappa:

  • Universal POS: 0.93

  • Dependency label: 0.89

  • Relation: 0.93
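
For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from each annotator's label frequencies. The sketch below is a minimal implementation of that standard formula. The function name `cohen_kappa` and the toy labels are ours and are not corpus data. The slides do not say how items were defined for each of the three measures, so this shows only the general computation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy UPOS labels for six tokens (invented for illustration).
annotator_1 = ["NOUN", "VERB", "ADJ", "NOUN", "ADV", "NOUN"]
annotator_2 = ["NOUN", "VERB", "ADV", "NOUN", "ADV", "PROPN"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

In the toy data the annotators agree on four of six tokens (0.667 raw agreement), but kappa is lower because part of that agreement is expected by chance.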

(20)

Future work

• More double annotated data

• More annotated data – annotate the whole CzeSL

• Test standard and custom trained parsers
