
Charles University in Prague Faculty of Mathematics and Physics

MASTER THESIS

Jindřich Helcl

Vícejazyčná databáze kolokací

Institute of Formal and Applied Linguistics

Supervisor of the master thesis: prof. RNDr. Jan Hajič, Dr.

Study programme: Informatics

Specialization: Mathematical Linguistics

Prague 2014


I would like to thank my supervisor, prof. RNDr. Jan Hajič, Dr., for his valuable comments and guidance throughout the creation of this thesis.

I would also like to express my gratitude to my family. Namely, to Kamila and Martin, for their time and willingness to consult the thesis with me.

I thank my parents, who have supported me throughout my studies, and who are very glad that this thesis exists, despite the fact that they do not understand it.


I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources.

I understand that my work relates to the rights and obligations under the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that the Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act.

In Prague, 29.7.2014


Název práce: Vícejazyčná databáze kolokací

Autor: Jindřich Helcl

Katedra: Ústav formální a aplikované lingvistiky

Vedoucí diplomové práce: prof. RNDr. Jan Hajič, Dr., Ústav formální a aplikované lingvistiky

Abstrakt: Kolokace jsou skupiny slov, které se v daném jazyce vyskytují častěji spolu, nežli odděleně. Patří mezi ně také spojení, která dávají několika nezávislým slovům nový význam. Tato práce se zabývá nalezením kolokací v objemných datech a vytvořením databáze sloužící k jejich vyhledávání. Pro nalezení kolokací v textu počítáme hodnotu Pointwise Mutual Information založenou na počtu výskytů jednotlivých skupin slov v korpusu. Slova s nejvyšší hodnotou PMI jsou kandidáty na vhodné kolokace. Vybrané kolokace jsou uložené do databáze ve formátu použitelném pro vyhledávání pomocí Apache Lucene. Součástí práce je k vytvořené databázi přidat webové rozhraní, které umožňuje rychlý a jednoduchý způsob pro vyhledávání kolokací. Pokud by tato služba byla dostatečně rychlá a kolokace kvalitní, mohli by ji používat překladatelé k nacházení vhodných ekvivalentů v cílovém jazyce. Také může být používána studenty cizího jazyka k rozšiřování slovní zásoby. Taková databáze bude tvořena nezávisle v několika jazycích, mezi nimiž bude minimálně čeština a angličtina.

Klíčová slova: kolokace, vyhledávání informací, Apache Lucene

Title: Multilingual Collocation Database

Author: Jindřich Helcl

Department: Institute of Formal and Applied Linguistics

Supervisor: prof. RNDr. Jan Hajič, Dr., Institute of Formal and Applied Linguistics

Abstract: Collocations are groups of words which occur together more often than separately. They also include phrases that give a new meaning to a group of unrelated words. This thesis aims to find collocations in large data and to create a database that allows their retrieval. The Pointwise Mutual Information, a value based on word frequency, is computed for finding the collocations. Words with the highest value of PMI are considered candidates for good collocations. Selected collocations are stored in a database in a format that allows searching with Apache Lucene. A part of the thesis is to create a Web user interface as a quick and easy way to search for collocations. If this service is fast enough and the collocations are good, translators will be able to use it for finding proper equivalents in the target language. Students of a foreign language will also be able to use it to extend their vocabulary. Such a database will be created independently in several languages, including Czech and English.

Keywords: collocations, information retrieval, Apache Lucene

Contents

1 Introduction
  1.1 Use Cases
  1.2 Objectives
  1.3 Existing Solutions
  1.4 Thesis Structure
2 Theoretical Background
  2.1 Information theory
    2.1.1 Noisy Channel
    2.1.2 Language model
  2.2 Collocations
  2.3 Collocation extraction
    2.3.1 Methods
    2.3.2 Evaluation
3 Data
  3.1 English
    3.1.1 Data Format
    3.1.2 Cleaning the W2C data
    3.1.3 Tokenization
    3.1.4 Tagging
    3.1.5 Statistics
  3.2 Czech
    3.2.1 Data Format
    3.2.2 Tagging
    3.2.3 Statistics
4 Architecture
  4.1 Collocation Extraction Module
  4.2 Database Module
    4.2.1 Lucene: The Indexing Engine
    4.2.2 Lucene: The Application
    4.2.3 Solr: The Database Server
    4.2.4 Solr: Usage
  4.3 Web Application Module
5 Experiments and Evaluation
  5.1 The Evaluation Method
  5.2 Reference Data
    5.2.1 English
    5.2.2 Czech
  5.3 Experiments
  5.4 Results and Discussion
    5.4.1 English
    5.4.2 Czech
    5.4.3 Discussion
6 Future Work
  6.1 Collocation Extraction
  6.2 Database
  6.3 User Interface
7 Conclusion
Appendices
A Attached DVD

Chapter 1

Introduction

Imagine you were a translator. Every once in a while you would be asked to translate texts from your mother tongue to—let’s say—English. There would be situations in which you would use a specific word either because it is common in the domain, or because you would feel it is appropriate in the context. What would you do if you wanted to add a modifier to that word, but did not know which one to use?

Suppose you were translating an article about a scientific method. You would know that the right word for the measure of the method's quality is precision. How would you find out whether to use raise, increment or improve when referring to a greater or higher value of that precision?

In these situations, your problem of translation would become a problem of finding collocations.

Roughly speaking, a collocation is a couple of words that occur together more often than they would by chance. Collocations vary in their flexibility. On one hand we have idiomatic expressions, such as lose your head or put the screws on. More flexible collocations include strong tea or broad daylight (Manning and Schütze, 1999).

Collocation extraction is one of the classic natural language processing (NLP) problems. Its task is to find collocations in a given text corpus. A secondary task is usually to filter out useless and incorrect results. Retrieved collocations are quite useful for many other NLP tasks, for example machine translation or word sense disambiguation.

This thesis aims to extract collocations from large text corpora (billions of words) in multiple languages. Extracted collocations will be stored and handled in such a way that allows the user to explore them efficiently.

To allow the language researchers and translators to fully harvest the potential of the collocation database, the collocations have to be made accessible through an easy-to-use interface, and presented to the user in a human readable format.

In this thesis, we focus on the following two topics: How to learn good collo- cations, and how to present them to the user.


1.1 Use Cases

In this section, we focus on the two main use cases, each corresponding to a specific need of a professional translator. We look at the problems of finding the best translation and identifying a collocation.

Finding the translation. Translators create texts in a foreign language. Often the domain of the translated text is outside the translator's professional expertise.

Here, the collocation database may come in handy. In the following example, we show how to find a good translation by searching the collocation database.

• A translator wants to know whether to use raise, increment, or improve when referring to the value of precision.

• The translator looks up precision in the database.

• The database search yields these results:

Collocation         Score
true precision      7.23
improve precision   6.15
raise precision     2.31

• According to the results, the translator chooses to use the phrase improve precision.

Identifying a collocation. Collocations are translated in a specific way. A translator may come across a couple of words that may or may not be a collocation. If the translator is not sure, he can use the collocation database to discover whether the word combination is a collocation or not. An example situation follows.

• A translator is not sure whether red glasses is a collocation.

• The translator searches the database for the expression.

• The search results show that red glasses has a score of 0.23.

• Based on the score, the translator can consider the pair of words a non-collocation.

Out of the many possible ways in which translators can profit from our application, we have described the two important use cases. We will focus on these use cases throughout the thesis.


1.2 Objectives

In this section, we give a detailed description of the two main goals of this thesis: first, creating a multilingual collocation database; second, creating an easy-to-use Web interface.

To make the database useful, it should contain high-quality collocations in terms of practicability for the user. It should also cover collocations found in the language to the highest possible extent.

In further parts of this thesis, we present a method for evaluating how well these two requirements are fulfilled. The experiments use a gold-standard list of collocations from a given language to measure the performance.

The database should also be reusable by other applications in the following two ways. First, the index should be easily imported into another application. Second, the database server functionality should be accessible through API calls.

To provide the user with a comfortable search tool, the Web interface should allow the user to search according to various criteria. We give a list of supported query patterns with examples:

Word. Searching for a single word is the simplest.

Example input agreement

System response

0.658 trade agreement
0.654 framework agreement
0.551 reach agreement
0.541 peace agreement
0.539 user agreement
0.523 forces agreement

The result table shows collocations that contain the word agreement along with their scores.

Word + Part of speech. Part of speech can be used to narrow down the search.

Example input agreement + verb

System response

0.551 reach agreement
0.312 binding agreement
0.327 pay agreement
0.281 nearing agreement

The results contain only instances of the word agreement collocated with a verb.


Pair of words. Search for a pair of words can reveal true collocations.

Example input 1 red glasses
Example input 2 champagne glasses
Example input 3 normal glasses

System response 1 0.106 red glasses
System response 2 0.462 champagne glasses
System response 3 0.000 not found

Only collocations that match the whole input are retrieved. The higher the score, the more likely the pair forms a collocation.

Synonym search. The synonym search retrieves collocates both for the input word and its synonyms.

Example input house

System response

0.623 White house
0.555 Chrysler building
0.452 home video
0.211 empty house
0.198 office building
0.120 old house

Search results are enriched with collocations of words that are in the same word class as the word house.

In addition to the search objectives presented above, the users should be empowered to look up the usages of the collocations found, for example in a parallel corpus of a selected pair of languages. This feature can be very useful for confirming whether or not to select the collocation for the translation.

The installation and configuration documentation should be given to enable readers to install their own instances of the database server and the web server.

We have shown the features of the user interface. The presented query patterns provide users with the ability to find good translations, or to score a pair of words in order to decide whether they form a collocation or not. We also introduced additional features that increase comfort when working with the search results.
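Since the database is served by Lucene/Solr (see chapter 4), the API access mentioned in the objectives could be exercised by assembling a search query. The following is a rough sketch only: the endpoint, core name (`collocations`), and field names (`collocate`, `pos`, `score`) are hypothetical placeholders, not the actual schema used in this thesis.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and schema -- illustrative only.
SOLR_URL = "http://localhost:8983/solr/collocations/select"

def build_query(word, pos=None, rows=10):
    """Build a Solr search URL for collocations of `word`,
    optionally restricted to collocates with part of speech `pos`."""
    q = f"collocate:{word}"
    if pos is not None:
        q += f" AND pos:{pos}"
    params = {"q": q, "sort": "score desc", "rows": rows, "wt": "json"}
    return SOLR_URL + "?" + urlencode(params)

# A "word + part of speech" query like the example above
url = build_query("agreement", pos="verb")
```

Sending such a request to a running Solr server would return the ranked collocation list as JSON, which a Web front end can render as the result tables shown above.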

1.3 Existing Solutions

For a translator, there are several methods of searching for translations. This section goes through these methods, ranging from dictionaries to translation utilities. For each method, we focus on its ability to cover the translation of collocations.

The first method that comes to mind is to use a dictionary. Apart from printed books, which are slower to work with, there are many dictionaries available online.

Dictionaries often include translations of collocations, but they are usually limited to fixed phrases or idioms. Dictionaries also have a limited size because they are created manually by humans.

However, translators have another option. The text can be translated by an automated machine translation application and then post-edited by the translator. The problem is that translation engines such as Google Translate may suffer from low accuracy when translating collocations.

For information about how a word is used in a language, translators can use a corpus. A corpus is a collection of texts in a given language. When a translator searches a corpus for a word, he gets a list of sentences in which the word is used.

This approach gives the translator the option to see the context of the word.

However, collocations are often infrequent usages of the word, so the translator may not spot the right collocation just from looking up a word in a corpus.

1.4 Thesis Structure

This work presents Multilingual Collocation Database (MCDB). The thesis is structured as follows.

The introduction lists the main use cases, sets goals of the project, discusses existing solutions and state-of-the-art approaches.

A detailed theoretical background related to the field of collocations, their extraction and evaluation is given in chapter 2. We introduce basic terms of information theory such as entropy or mutual information, and give a brief introduction to language modeling. We present several definitions of collocations and their properties. We show how collocation extraction can be performed and evaluated. At the end of the chapter, we look at the concept of word classes.

Corpora of billions of words in each language were used to extract collocations. A description of the data used for the extraction is given in chapter 3. We describe the process of preparing the data for the extraction experiments, and show a statistical analysis of each data set.

Chapter 4 describes the program implementation and the modules used. We describe the ready-made tools we used, and establish requirements for the system on which the program parts run. We show how the project is designed. For each module in the project, we describe its function and its connection to the other modules.

The project modules include the database creation, the Web server, and the Web user interface.

Chapter 5 presents the performed experiments. We describe methods that were used for the collocation extraction task. We evaluate the results and discuss the effectiveness for each method.

We discuss the possibilities of future development in chapter 6. Suggestions and ideas are presented for each part of the project, including notes on the collocation extraction approach, the database realization, and the front-end design.

Conclusion follows in chapter 7. The achieved results are presented and the fulfillment of the requirements is discussed.


Chapter 2

Theoretical Background

In this chapter, we give an overview of the theoretical background of collocation extraction. We begin with selected fields of information theory that form the ground for most machine learning techniques.

In section 2.2, we discuss various approaches to the definition of collocations and their properties. The following section 2.3 contains a survey of automatic methods for collocation extraction. At the end of that section, we go through the evaluation methods for the collocation extraction task.

2.1 Information theory

Many solutions to the collocation extraction task, as well as other fields of statistical NLP, are based on notions from information theory. In this section, a description of some fundamental terms of this theory is given. We introduce entropy, joint and conditional entropy, mutual information, and perplexity. Furthermore, we describe the noisy channel model. We also define n-gram language models and a method for their parameter estimation.

In information theory, entropy is one of the fundamental notions. The entropy of a random variable X with a sample space Ω is defined as:

H(X) = -\sum_{x \in \Omega} \Pr(x) \log_2 \Pr(x)    (2.1)

The entropy is a measure of "chaos" in the probability distribution. If the entropy is 0, we are certain about the value of the random variable. If there were an x such that Pr(x) = 1, then Pr(x') would be equal to zero for all other x'. The logarithm of 1 is also zero, and thus the entropy would be zero too.

In general, the entropy has no upper limit. For a fixed number of possible outcomes, the distribution with the highest entropy is the uniform distribution, i.e., the one that treats all the possible values of the random variable as equally probable.
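These properties can be checked directly on small distributions. The following is a minimal sketch of equation 2.1; the example distributions are illustrative, not taken from the thesis:

```python
import math

def entropy(dist):
    """H(X) = -sum_x Pr(x) * log2 Pr(x)  (eq. 2.1).
    Terms with Pr(x) = 0 are skipped, using the 0 * log 0 = 0 convention."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

fair_coin = {"heads": 0.5, "tails": 0.5}
biased_coin = {"heads": 0.99, "tails": 0.01}

# The uniform distribution has the highest entropy for a fixed outcome count
h_fair = entropy(fair_coin)      # 1 bit
h_biased = entropy(biased_coin)  # close to 0
```

As the text states, the uniform fair coin yields the maximal entropy for two outcomes (1 bit), while the nearly deterministic coin is close to zero.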

If we want to know the entropy of a combination of two random variables X and Y, we define the joint entropy:

H(X, Y) = -\sum_{x \in \Omega} \sum_{y \in \Psi} \Pr(x, y) \log_2 \Pr(x, y)    (2.2)

and similarly, the conditional entropy:

H(Y|X) = -\sum_{x \in \Omega} \sum_{y \in \Psi} \Pr(x, y) \log_2 \Pr(y|x)    (2.3)

These combined entropy formulas are quite useful for relating random variables.

There is one more quantity that is very important in both information theory and our task of collocation extraction. The mutual information gives us a measure of how much information one random variable can "give" to another:

I(X, Y) = \sum_{x \in \Omega} \sum_{y \in \Psi} \Pr(x, y) \log_2 \frac{\Pr(x, y)}{\Pr(x) \Pr(y)} = H(X) - H(X|Y)    (2.4)

If the conditional entropy H(X|Y) is low, then the value of X strongly depends on the value of Y. We can say that the value of Y tells us something about the value of X. In other words, the two random variables share some information about each other.

If, on the other hand, the random variables X and Y are independent, i.e., Pr(x, y) = Pr(x) Pr(y), the conditional entropy H(X|Y) is equal to H(X) and the mutual information is zero. Thus, independent random variables do not share any information.

In section 2.3 we present several studies that use mutual information to measure the goodness of collocations.
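The abstract notes that this thesis scores candidate word pairs by Pointwise Mutual Information (PMI), which is the contribution of a single outcome pair to the sum in equation 2.4. A minimal sketch of computing PMI for adjacent word pairs from raw corpus counts follows; the toy token list is illustrative, and a real pipeline would also apply frequency thresholds, since plain PMI over-scores rare pairs:

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI(u, v) = log2( Pr(u, v) / (Pr(u) * Pr(v)) ) for adjacent pairs,
    with probabilities estimated by maximum likelihood from counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (u, v): math.log2(
            (c / n_bi) / ((unigrams[u] / n_uni) * (unigrams[v] / n_uni))
        )
        for (u, v), c in bigrams.items()
    }

tokens = "strong tea and strong tea with weak strong arguments".split()
scores = pmi_scores(tokens)
# "strong tea" gets a positive score: it co-occurs more often than by chance
```

Pairs with the highest PMI become the collocation candidates stored in the database.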

In addition to entropy and mutual information, perplexity is often used as a substitute measure. The perplexity G(X) is computed as:

G(X) = 2^{H(X)}    (2.5)

For example, the perplexity of a random variable representing a roll of a fair die is 6, the perplexity of a fair coin toss is 2, etc.

2.1.1 Noisy Channel

The noisy channel (Shannon, 1948) is a scheme used in solutions of various tasks in information theory, and it has many applications in NLP problems. The input data is mixed with "the noise" and produces the output. The term noise can mean different things in different tasks. For example, in automatic speech recognition (ASR), the noise is the transformation of the input (words) into an acoustic signal. In machine translation, the noise is equivalent to translation into the target language.

The (observed) noisy output is then regarded as the original (given) text in the source language (figure 2.1).

In both cases, the probability distribution of the input data is called the language model. The addition of the noise is described by the channel model, which represents the conditional probability distribution of the output signal given the input. The channel model is often named differently depending on the task (the translation model in machine translation, the acoustic model in ASR).

We denote the probability of the channel input data as Pr(A) and the probability of the output as Pr(B). The statement "probability of the input/output" means that for some A out of all possible inputs/outputs (i.e., all possible texts), the probability of that text occurring at the input/output of the noisy channel is Pr(A). The probability of an output given an input (the channel model) is denoted Pr(B|A).

[Figure 2.1: Noisy channel model application in machine translation. The input "hello" (target language) passes through the noisy channel and is observed as the output "ahoj" (source language).]

The goal in the noisy channel setup is to recover the "inverted" channel model (the probability of the spoken words given the acoustic signal in ASR, or the probability of the translation of a known word into a foreign one). We can achieve this by using the two distributions mentioned above and Bayes' formula:

\Pr(A|B) = \frac{\Pr(B|A) \Pr(A)}{\Pr(B)}    (2.6)

where A is the input and B is the output.

Our task is to observe the output of a noisy channel and select the input that would generate the observed output with the highest probability. Written as a formula, we look for the A that maximizes the right-hand side of Bayes' formula:

A_{best} = \operatorname{argmax}_A \frac{\Pr(B|A) \Pr(A)}{\Pr(B)}    (2.7)

and because Pr(B) does not depend on A, we can simplify it to:

A_{best} = \operatorname{argmax}_A \Pr(B|A) \Pr(A)    (2.8)

where Pr(A) is the language model, and Pr(B|A) is the channel model.

This framework allows the NLP tasks to be divided into two independent subtasks. The language model gives a description of the language itself, regardless of a specific task. A good language model can improve the whole solution dramatically.

2.1.2 Language model

As mentioned in the previous section, a language model (LM) is a probability distribution of the input data for the noisy channel. In other words, it determines the probability Pr(w_1, ..., w_K) that a sequence of words w_1 ... w_K appears on the input. Using the chain rule, we can express the probability as:

\Pr(w_1, \ldots, w_K) = \Pr(w_1) \Pr(w_2|w_1) \cdots \Pr(w_K|w_1, \ldots, w_{K-1})    (2.9)

where the conditional probability Pr(w_K|w_1, ..., w_{K-1}) is the probability of a word (or prediction) w_K given the sequence of the first K-1 words seen since the beginning of the experiment, the history w_1, ..., w_{K-1} (Brown et al., 1992).

Naturally, we do not know the exact values of these probabilities and hence we must estimate them. The estimation is performed using a real-world data set that should represent the common usage of the language. We refer to such data as the training data.

In some applications of NLP we deal only with domain-specific language. In that case, the estimates should match usage of words within the domain. For our purposes though, we deal with the whole (general) language.

To estimate the probabilities Pr(w), we often use a technique called maximum likelihood estimation (MLE). The maximum likelihood estimate of the probability Pr(w) of a word w appearing in the language, based on training data T, is calculated using the formula:

\Pr(w) = \frac{c(w)}{\sum_{u \in T} c(u)} = \frac{c(w)}{|T|}    (2.10)

where c(w) is the number of occurrences of the word w in the data T. We call the estimated values the parameters of the language model. Similarly, we estimate the joint probability of observing two consecutive words u, v in the data T:

\Pr(u, v) = \frac{c(u, v)}{\sum_{(u,v) \in T} c(u, v)} = \frac{c(u, v)}{|T|}    (2.11)

Estimating the parameters for an LM that uses all of the conditional probabilities in equation 2.9 is of course extremely inefficient, since the number of parameters grows exponentially with the length of the history. Therefore, we must sacrifice some precision to get a fast and more realistic implementation.

Brown et al. (1992) introduce a solution to this problem, the so-called n-gram language models. In an n-gram LM, we approximate the conditional probability of a word given its history by discarding distant words in the history. We take into account only the n-1 preceding words. We call a sequence of n consecutive words an n-gram.

The estimation of n-gram LM parameters is calculated by sequential maximum likelihood estimation. To get the parameters of the n-gram model, we must first estimate the parameters of the (n-1)-gram model. For the 1-gram (or unigram) model, we use formula 2.10 above. When we have the parameters of the (n-1)-gram model, we use this formula to compute the final parameters of the language model:

\Pr(w_K | w_{K-n+1}, \ldots, w_{K-1}) = \frac{c_n(w_{K-n+1}, \ldots, w_K)}{\sum_{w \in T} c_n(w_{K-n+1}, \ldots, w_{K-1}, w)}    (2.12)

where c_n(w_1, \ldots, w_n) is the number of times the n-gram was seen in the data. The equation can then be simplified to:

\Pr(w_K | w_{K-n+1}, \ldots, w_{K-1}) = \frac{c_n(w_{K-n+1}, \ldots, w_K)}{c_{n-1}(w_{K-n+1}, \ldots, w_{K-1})}    (2.13)

The c_n and c_{n-1} are called the order-n and order-(n-1) parameters of the n-gram language model, respectively.


Often we want to measure the quality of a language model, or to compare two models and use the better one. The evaluation of a model is done by measuring the cross-entropy on a data set that was not used in the training data. This additional data set is called the test data. The cross-entropy of a language model distribution p trained on data T, measured on test data S, is the normalized log probability of the data S as estimated by the model. The value is computed using this formula (Hajič, 2000):

H_S(T) = -\frac{1}{|S|} \sum_{i=1}^{|S|} \log_2 \Pr(w_i | h_i)    (2.14)

The cross-entropy tells us how well the language model suits the test data. If the cross-entropy is low, the test data are assigned a high probability by the model. Therefore, the model with the lower cross-entropy is more likely to suit real-world data.
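Equation 2.14 can be sketched for a bigram model as follows. The model interface `model_prob(w, h)` and the uniform toy model are assumptions made for the example, not part of the thesis's implementation:

```python
import math

def cross_entropy(model_prob, tokens):
    """Eq. 2.14 for a bigram model: -1/|S| * sum_i log2 Pr(w_i | h_i),
    where each prediction is conditioned on the preceding token."""
    logs = [math.log2(model_prob(w, h)) for h, w in zip(tokens, tokens[1:])]
    return -sum(logs) / len(logs)

# A toy model that assigns every prediction probability 1/4 (vocabulary of 4)
uniform = lambda w, h: 0.25
h = cross_entropy(uniform, "the cat sat on the mat".split())
# uniform over 4 outcomes gives cross-entropy log2(4) = 2 bits
```

A model that fits the test data better than this uniform baseline would score below 2 bits, which is exactly the comparison the text describes.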

This section has described basic terms of information theory and the noisy channel model, and introduced language models. So far we have described techniques of parameter estimation for n-gram language models using the maximum likelihood estimation. We can also tell how good, or trustworthy, a language model is by measuring the cross-entropy.

2.2 Collocations

There are several definitions of collocations, but none of them seems to be widely accepted. This section aims to explain the notion of collocation, list its definitions, and describe its properties.

The first definition of collocation was introduced by Firth (1957), who said that "collocations of a given word are statements of the habitual or customary places of that word." Definitions used in other NLP papers mostly agree that a collocation is a combination of words that has a specific meaning of its own.

Each word from such a combination is called a collocate.

Manning and Schütze (1999) define a collocation as an expression formed by a couple of words corresponding to "some conventional way of saying things".

They also include a definition of Choueka (1988):

[A collocation is defined as] a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.

This definition limits collocations to groups of consecutive words, which, in some cases, can be too restrictive.

In contrast to the previous two definitions, Evert and Kermes (2003) treat collocations as "unpredictable combinations of words in a particular (morpho-)syntactic relation (adjectives modifying nouns, direct objects of verbs, or English noun-noun compounds)." Here, the words do not need to be consecutive and, unlike the former definition, a syntactic dependency is now required.

Despite the fact that we do not have a single agreed definition, many collocations can be characterized by the following properties (Manning and Schütze, 1999):


Non-compositionality of meaning. The meaning of a collocation cannot be derived directly from the meanings of its collocates. Either the meaning is completely different (fast food, at your fingertips or kiss of death¹) or there is a connotation that cannot be predicted from the collocates (e.g., white wine, where white refers to a different color).

Non-substitutability in context. The collocates of a collocation cannot be freely substituted by words with an equivalent or similar meaning in the context. The new combination (although the meaning of its collocates remained unchanged) would either make no sense or at least sound strange ([quick] food, [powerful] tea, [yellow] wine).

Non-modifiability and non-transformability. Many collocations cannot be modified or grammatically transformed without disturbing the usual form and sound of the collocation. Especially idioms and frozen phrases follow this criterion (throw [bright] light on it, (they come here) in person[s]). Some collocations cannot be modified only in a specific way, but otherwise they can (e.g., throw some light on it or even shed some light on it)².

Some papers published on this topic formulate the definition of a collocation using these properties. For example, Pearce (2001) proposed a supervised method for collocation extraction based on the assumption that if a phrase is semantically compositional, we can substitute its component words for their synonyms without a change of meaning. The definition of collocation in his paper is the following:

“A pair of words is considered a collocation if one of the words signif- icantly prefers a particular lexical realisation of the concept the other represents.”

Here, we have a definition that employs the property of non-substitutability.

In this case, the collocates are not equal with respect to the definition. The pair (emotional; baggage) is given as an example of such a pair of words. Here, the word emotional significantly prefers the word baggage over other words representing the same concept (e.g., luggage). We look at the method itself further in section 2.3.

Smadja (1993) presents a comprehensive definition of collocations as formulated by Benson (1990):

“A collocation is an arbitrary and recurrent word combination.”

Smadja argues that this definition does not involve some important properties of collocations that must be taken into account when creating NLP applications, particularly in machine translation (MT). He describes a set of four properties of collocations, seen from a more lexical point of view than the properties presented six years later by Manning and Schütze (1999):

¹ Examples taken from Manning and Schütze (1999) and http://www.usingenglish.com/reference/idioms/

² http://idioms.thefreedictionary.com/throw+light+on+something


Arbitrariness. Collocations are not generated by any principles or rules, which makes them difficult to produce by non-native speakers. This property makes it hard to translate collocations from one language to another using MT tools. The collocation extraction task is used to enhance the performance of MT. (See Orliac and Dillinger, 2003.)

Domain-dependency. Some collocations are used only in certain domains of the language. For example, many technical terms are formed by common expressions that have a completely different meaning when used outside the domain (e.g., easter egg, back button or hot spot in computer jargon).

Recurrence. This property means that the collocations are not formed by chance and they are typical and characteristic for the language or its domain.

Cohesive lexical clusters. The presence of a word often implies the presence of another word that forms a collocation with the present one. Because of this property, speakers of a language are able to fill in collocates left out of a sentence. Smadja concludes that, given this property, collocations have their own statistical distribution in a text sample, quite different from the joint probability of the collocates appearing together by chance.

The definitions and properties of collocations described above leave room for a variety of collocation types. One of the criteria for dividing them into types is the level of their semantic compositionality. Wermter and Hahn (2004) distinguish these three classes of collocations:3

Idiomatic phrases. None of the words contribute to the overall meaning of the collocation. The meaning is connected only to the expression as a whole.
Idioms such as pulling someone's leg fall into this category.

Support verb constructions / narrow collocations. Expressions in which at least one of the collocates semantically contributes to the meaning of the collocation. The core of the meaning can be derived from such a lexical unit. (e.g., walk the streets, aus eigener Tasche bezahlen – "to pay out of one's own pocket")4

Fixed phrases. All units in the expression contribute semantically to the meaning. However, the collocation is still not as compositional as free word combinations are. (For example, im Koma liegen – "to lie in a coma".)

Many typologies of collocations have been discussed. Apart from compositionality, they include division into groups according to the level of collocation modifiability or similar factors.

3 The research of Wermter and Hahn (2004) is aimed at extracting German collocations from lists of preposition–noun–verb phrases.

4 Examples borrowed from Wermter and Hahn (2004).


2.3 Collocation extraction

This section gives a detailed survey of collocation extraction methods. In the first part of this chapter, we go through the development of collocation extraction techniques, reviewing publications from the early nineties to recent ones.

Most of the approaches use a scoring function to rank collocation candidates.

Since the evaluation of this task is difficult because of the vague definition of collocation, various practices have been applied. The second part of this chapter contains a survey of those evaluation methods.

2.3.1 Methods

There has been extensive research in the field of collocation extraction. One of the earliest publications dealing with this topic is by Church and Hanks (1990). The article itself is not aimed at collocations; however, it introduces measuring word association by statistical methods based on the pointwise mutual information (PMI).5

The pointwise mutual information is a measure of two events x and y of random variables X and Y, respectively, that represents how much information the two events share:

I0(x, y) = log2 [ Pr(x, y) / (Pr(x) × Pr(y)) ]   (2.15)

If the random variables X and Y are independent, the PMI value is zero because the joint probability Pr(x, y) is equal to the product of the individual probabilities.

On the other hand, if the events are more likely to be observed together, the pointwise mutual information takes a positive value, telling us that the two events share some information. Note that the PMI value can also be negative; that means the two events are not likely to occur simultaneously, though they are not independent.

Church and Hanks (1990) use this measure as an association ratio of two words x and y, supplying the probabilities Pr(x) and Pr(y) with values estimated from data. The joint probability Pr(x, y) is an estimate of the two words occurring together in a window of 5 words.

They point out that this association ratio is slightly different from the pointwise mutual information concept, as it is not symmetric, while PMI itself should be. This is because the joint probability estimate is calculated using counts of word x followed by word y, not counting the "reverse" direction.
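Formulas of this kind are easy to estimate from raw counts. The sketch below is a toy illustration of the association ratio, assuming a made-up corpus and adjacent-word bigrams rather than the 5-word window used by Church and Hanks; the function name and data are invented for the example.

```python
import math
from collections import Counter

def association_ratio(bigram_counts, unigram_counts, x, y):
    """PMI estimated from counts (formula 2.15). Like the association
    ratio of Church and Hanks, it counts x followed by y only, so it
    is not symmetric."""
    p_xy = bigram_counts[(x, y)] / sum(bigram_counts.values())
    p_x = unigram_counts[x] / sum(unigram_counts.values())
    p_y = unigram_counts[y] / sum(unigram_counts.values())
    return math.log2(p_xy / (p_x * p_y))

# Toy corpus in which "strong tea" is a recurrent combination:
tokens = "strong tea strong tea powerful computer strong coffee".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(association_ratio(bigrams, unigrams, "strong", "tea"))
```

Note that swapping the two arguments gives a different score, which is exactly the asymmetry discussed above.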

Later on, Smadja (1993) presents a method for automatic collocation extraction from text corpora. The collocations extracted by this method can consist of any number of words.

First, for all sentences containing a given word w, a frequency table f(w, wi) of words wi co-occurring with w in the same sentence is created. The frequencies p_i^j

5 In the article, the term mutual information is used. However, to be consistent with other publications, we will use the term pointwise mutual information instead. You can find the definition of the mutual information above in section 2.1 (formula 2.4).


of the word wi at the j-th position from w (−5 ≤ j ≤ 5, j ≠ 0) are also computed.

A measure called z-score is computed for each pair (w, wi):

z = (f(w, wi) − f̂) / σ   (2.16)

where f̂ is the average of the frequencies f(w, wi) and σ is the corresponding standard deviation. The higher this score is, the more likely the two words form a candidate for a collocation. Besides the z-score, the spread factor Ui is computed:

Ui = [ Σj=1..10 (p_i^j − p̂_i)² ] / 10   (2.17)

The spread factor tells us in which way the words w and wi co-occur. If the values of p_i^j are close to each other, the spread factor is low. On the other hand, when the variance of the values is high, the spread factor grows.

In the next step, the irrelevant bigrams are filtered out. This involves infrequent combinations of words (based on a z-score threshold) and occurrences of wi that do not have any peak in the histogram of their p_i^j values, i.e., co-occurrences that do not have any particular form (based on the spread factor from formula 2.17).

Based on these bigrams, the collocations are expanded to their full length by looking at concordances of the filtered bigrams. This method also discards any parts of collocations that are not collocations themselves, such as York Stock in New York Stock Exchange.
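Smadja's two statistics can be sketched in a few lines. The frequency table below is hypothetical, and the z-score helper assumes the collocate frequencies are not all equal (otherwise σ would be zero):

```python
import statistics

def z_scores(freqs):
    """z-score (formula 2.16) for each collocate wi of w, given a dict
    mapping wi -> f(w, wi). Assumes the frequencies are not all equal."""
    mean = statistics.mean(freqs.values())
    sd = statistics.pstdev(freqs.values())
    return {wi: (f - mean) / sd for wi, f in freqs.items()}

def spread(position_counts):
    """Spread factor Ui (formula 2.17) from the ten positional
    frequencies p_i^j of a collocate (j = -5..-1, 1..5)."""
    avg = sum(position_counts) / 10
    return sum((p - avg) ** 2 for p in position_counts) / 10

# Hypothetical frequency table for w = "stock":
freqs = {"exchange": 120, "market": 90, "the": 30, "blue": 5}
z = z_scores(freqs)
assert z["exchange"] > 0 > z["blue"]     # "exchange" stands out

# A collocate always one word to the right has a peaked histogram and
# hence a high spread; a uniform histogram has spread 0:
assert spread([0, 0, 0, 0, 0, 100, 0, 0, 0, 0]) > spread([10] * 10) == 0.0
```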

The book of Manning and Schütze (1999) gives a great deal of information on collocations and their retrieval from corpora. One of the methods they propose is a simplification of the approach of Smadja (1993) described above.

The presented method is a generalization of a basic frequency-based collocation classification. Using frequencies of n-grams6 in a corpus alone ranks highest the combinations of functional words, such as of the, in the, to the, etc.

Justeson and Katz (1995)7 employ part-of-speech tag pattern filters to leave out the functional words from the extracted collocations. In contrast to this filtering, the presented method is based on the relative positions of the co-occurring pairs of words and is not limited to consecutive words.

At the beginning, frequencies of pairs of words that co-occur within a defined window (3–4 words) in a corpus are computed. The information about the relative distance of the words is stored. Each pair of words is assigned a mean and a variance characterizing its co-occurrence. The mean (d̄ in the formula below) is the average relative distance of the first word with respect to the second word. The variance is computed using formula 2.18, where di is the distance of the second word from the first word in the i-th co-occurrence within the window.

s² = [ Σi=1..n (di − d̄)² ] / (n − 1)   (2.18)

6 Recall that an n-gram is a group of n consecutive words (section 2.1.2).

7 As cited by Manning and Schütze (1999).


Collocations are often fixed combinations of words; thus, good collocation candidates are those pairs of words whose relative distance has a low variance.
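The mean and sample variance of formula 2.18 can be sketched as follows; the offset lists are hypothetical observations of two word pairs within the window:

```python
import statistics

def offset_stats(offsets):
    """Mean and sample variance (formula 2.18) of the relative
    distances d_i observed for a pair of co-occurring words."""
    d_bar = statistics.mean(offsets)
    s2 = sum((d - d_bar) ** 2 for d in offsets) / (len(offsets) - 1)
    return d_bar, s2

# A fixed combination sits at (almost) constant distance ...
fixed = [1, 1, 1, 1, 1, 2]
# ... while an incidental pair drifts around the window:
loose = [-3, 2, -1, 4, 1, -2]

assert offset_stats(fixed)[1] < offset_stats(loose)[1]
```

A low variance thus flags the fixed pairs described above; `statistics.variance` would compute the same sample variance directly.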

Apart from the lexical frequency-based methods described above, there are also techniques that make use of linguistic knowledge. Those techniques employ part-of-speech tagging, syntactic parsing and other analysis tools and use their outcomes; an example is Pearce (2001). Pearce's method is performed on top of syntactically annotated data and uses WordNet (Fellbaum, 1998), a lexical database of English containing synonyms. As in previous cases, a list of candidate collocations is compiled and scored by an association measure. The candidates with the highest score are considered collocations.

This method is based on the assumption that if a phrase is semantically compositional, we can substitute component words within the collocation with their synonyms without a change of meaning. In fact, this approach challenges the collocation property of non-substitutability.

Collocation candidates are extracted from a parsed corpus. The candidates are pairs of words in non-clausal modification. For each word w, a co-occurrence set of words is compiled:

cs_w = {wv : f(w, wv) > 0}   (2.19)

where f(w, wv) is the frequency of co-occurrence of the two words in the input list.

Additionally, for each word w, synsets (collections of words that represent the same concept) from WordNet containing at least two words co-occurring with w are selected, forming a set of Candidate Collocation Synsets of w, CCSw.

Next, for each word w and a synset S ∈ CCSw, the two words most co-occurring with w are selected:

w′ = argmax_{wv ∈ S} f(w, wv)   (2.20)

w″ = argmax_{wv ∈ S \ {w′}} f(w, wv)   (2.21)

Finally, a measure called collocation strength s is assigned to the word pair (w, w′) according to this formula:

s = (f(w, w′) − f(w, w″)) / f(w, w′)   (2.22)

The collocation strength is a value between 0 and 1 and expresses how substitutable the lexical units inside the collocation are. However, despite the interesting idea, no evaluation of the process is described in the article.
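Under the assumption of a precomputed co-occurrence frequency table and a given synset, the strength of formula 2.22 can be sketched like this (the counts are invented; a real implementation would take the synsets from WordNet):

```python
def collocation_strength(freq, w, synset):
    """Pick the best and second-best collocate of w within a synset
    (formulas 2.20-2.21) and return it with the strength s
    (formula 2.22). Assumes the best collocate has nonzero frequency."""
    ranked = sorted(synset, key=lambda wv: freq.get((w, wv), 0), reverse=True)
    w1, w2 = ranked[0], ranked[1]
    f1 = freq.get((w, w1), 0)
    f2 = freq.get((w, w2), 0)
    return w1, (f1 - f2) / f1

# Hypothetical counts: "emotional" strongly prefers "baggage" over its
# synonym "luggage", so the pair is a good collocation candidate.
freq = {("emotional", "baggage"): 40, ("emotional", "luggage"): 2}
best, s = collocation_strength(freq, "emotional", {"baggage", "luggage"})
assert best == "baggage" and s == 0.95
```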

The next work we look into is the article of Pecina (2005). The paper compares a variety of 84 association measures and evaluates them on a manually annotated data set.

Pecina (2005) also proposes a way of combining association scores using a supervised machine learning algorithm called logistic linear regression. The algorithm is trained to learn a model that estimates the probability of a given bigram being a collocation:


Pr(x is collocation) = exp(β0 + β1x1 + ··· + βnxn) / (1 + exp(β0 + β1x1 + ··· + βnxn))   (2.23)

where xi are various association scores computed from the data. The βi coefficients are parameters of the model; they are obtained from training. Here, the model is trained using the iteratively reweighted least squares learning algorithm. After feature selection, the classifier was trained on 17 selected association measures as its features.
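The model of formula 2.23 is a plain sigmoid over a weighted sum of the association scores. The sketch below only evaluates the model with invented coefficients; actually obtaining the βi would require training, e.g., by iteratively reweighted least squares as in the paper.

```python
import math

def p_collocation(scores, betas, beta0):
    """Formula 2.23: probability that a bigram with association
    scores x_i is a collocation, given coefficients beta_i."""
    z = beta0 + sum(b * x for b, x in zip(betas, scores))
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical coefficients for two features (say, PMI and a t-score):
betas, beta0 = [0.8, 0.5], -2.0

likely = p_collocation([6.0, 3.1], betas, beta0)    # high scores
unlikely = p_collocation([0.2, 0.1], betas, beta0)  # low scores
assert 0.0 < unlikely < 0.5 < likely < 1.0
```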

For the evaluation of the measures and the combined classifier, a data set of 8904 bigrams was manually annotated. It contained 2649 collocations. The data was taken from the Prague Dependency Treebank 1.0 (Hajič et al., 2001). For more information about the evaluation technique used in this case, see section 2.3.2.

After the evaluation, the best performing single association measure was the pointwise mutual information, with the logistic regression showing a significant improvement over any single association measure.

In the follow-up research, Pecina and Schlesinger (2006) explored the possibility of combining various association measures further. In addition to the logistic linear regression, they utilized several other machine learning algorithms. Moreover, the reference data set had been improved using a newer version of the Prague Dependency Treebank; it contained 2557 collocations out of 12232 candidates.

The collocations had been selected manually by two additional annotators.

As in the previous research, 17 association measures out of 82 were used as features for training. The machine learning techniques involved logistic linear regression, linear discriminant analysis, support vector machines and neural networks. The best performance was measured on a neural network with five units in the hidden layer.

Since the pointwise mutual information seems to be a well-suited association measure for the task, Bouma (2009) experiments with the property of PMI that it gives a higher score to less frequent collocations, and proposes normalized versions of mutual information and pointwise mutual information.

As an association measure of a pair of words (x, y), the mutual information I(X, Y) is computed as the MI of two indicator variables X and Y, where

X = 1 if the left collocate is x, and X = 0 otherwise   (2.24)

and similarly for Y indicating the right collocate. Then the mutual information can be computed with the MI formula 2.4, using the maximum likelihood estimates of the probabilities Pr(X = 0), Pr(X = 1), Pr(Y = 0), Pr(Y = 1) and their joint combinations.
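With maximum likelihood estimates from bigram counts, the MI of the two indicator variables can be sketched as below. The counts are invented; n_x counts bigrams with x on the left, n_y bigrams with y on the right:

```python
import math

def indicator_mi(n_xy, n_x, n_y, n):
    """MI (formula 2.4) of the indicator variables X and Y from
    formula 2.24, with probabilities estimated by MLE from counts."""
    mi = 0.0
    for a in (0, 1):        # X = 1 iff the left collocate is x
        for b in (0, 1):    # Y = 1 iff the right collocate is y
            p_a = (n_x if a else n - n_x) / n
            p_b = (n_y if b else n - n_y) / n
            if a and b:
                p_ab = n_xy / n
            elif a:
                p_ab = (n_x - n_xy) / n
            elif b:
                p_ab = (n_y - n_xy) / n
            else:
                p_ab = (n - n_x - n_y + n_xy) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# 30 of 1000 bigrams are (x, y); x and y clearly attract each other:
print(indicator_mi(n_xy=30, n_x=40, n_y=50, n=1000))
```

When n_xy equals n_x × n_y / n exactly, the two indicators are independent and the MI comes out (numerically) zero.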

The pointwise mutual information is known for the behavior that less frequent word pairs are likely to be ranked high. This follows from the observation that if a pair of words (x, y) is seen only together, the PMI value for such a pair is equal to −log2 Pr(x, y); this value grows with decreasing probability. It also means that the PMI has no upper limit in general. The lower limit of the PMI is negative infinity, reached by pairs of seen words that never occur together.

The normalized version of the PMI is designed to suppress this behavior:


In0(x, y) = log2 [ Pr(x, y) / (Pr(x) × Pr(y)) ] / (−log2 Pr(x, y))   (2.25)

This alteration grants the pointwise mutual information upper and lower bounds, 1 and −1 respectively. Low-frequency bigrams are now ranked with the same score as high-frequency ones, as long as they appear exclusively with each other.
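The effect of the normalization is easy to demonstrate numerically. In the sketch below (with invented probabilities), two pairs each occur only together, one frequent and one rare; plain PMI prefers the rare pair, while normalized PMI scores both at the maximum:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Formula 2.15."""
    return math.log2(p_xy / (p_x * p_y))

def npmi(p_xy, p_x, p_y):
    """Normalized PMI (formula 2.25): bounded by -1 and 1."""
    return pmi(p_xy, p_x, p_y) / -math.log2(p_xy)

# Both pairs occur exclusively together (p_xy == p_x == p_y):
frequent = npmi(0.01, 0.01, 0.01)
rare = npmi(0.0001, 0.0001, 0.0001)

assert pmi(0.0001, 0.0001, 0.0001) > pmi(0.01, 0.01, 0.01)  # PMI bias
assert math.isclose(frequent, 1.0) and math.isclose(rare, 1.0)
```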

The definition of the normalized mutual information is constructed in a similar manner. Given formula 2.4, we know that the mutual information is a non-negative number whose upper bound is the entropy of the distribution of one of its variables. We also know that the entropy of a conditional distribution H(X|Y) is less than or equal to the entropy of the marginal distribution H(X) (without reference to the conditioning variable). Bouma (2009) normalizes by the joint entropy, using this formula:

In(X, Y) = [ Σx,y Pr(x, y) × log2 (Pr(x, y) / (Pr(x) × Pr(y))) ] / [ −Σx,y Pr(x, y) × log2 Pr(x, y) ] = I(X, Y) / H(X, Y)   (2.26)

The normalization of the mutual information resulted in a different measure than the original MI. The normalization of the pointwise mutual information from formula 2.25 is suggested to be a good replacement for the original version of the PMI.

In one of the most recent studies on this subject, Takayama et al. (2013) describe a method for collocation extraction based on the pointwise mutual information and dependency tree patterns. The dependency tree is a form of syntactic annotation of a sentence: each word except the root syntactically depends on another word.

The root of a dependency tree is usually a verb.

The research aims to find discontinuous collocations of at least three words in length. The authors introduce a PMI-based similarity measure between dependency tree patterns and extract collocations from the ACL Anthology Corpus (Schäfer, Read, and Oepen, 2012). Again, no evaluation method is presented, only a list of some of the collocations found.

2.3.2 Evaluation

The problem with collocations is that we do not have a widely accepted definition. With the vagueness of the definition comes the difficulty of evaluating the collocation extraction task.

The notion of collocation as viewed by Smadja (1993) is that collocations are arbitrary and only native speakers can use them easily. Therefore, in his work, the skills of a lexicographer are employed in order to evaluate the retrieved collocations.

Pearce (2002), in his comparative evaluation of collocation extraction techniques, uses a list of over 17,000 two-word collocations. Before the evaluation, the list is reduced to contain only bigrams that occurred at least once in the data, yielding a golden data set of 4,152 entries.


Different methods were then used to extract a ranked list of collocations from the data. Precision and recall were computed and used to measure the performance of a collocation extraction method on the data. The formulas for precision and recall are:

precision = (# of correctly extracted items) / (# of all extracted items) = |G ∩ E| / |E|   (2.27)

recall = (# of correctly extracted items) / (# of all correct items) = |G ∩ E| / |G|   (2.28)

where G is the reference (or “golden”) set of collocations and E is the set of extracted collocations.

Since collocation extraction methods often depend on a threshold parameter, or on an arbitrary number of results, when creating the list of collocations (a so-called n-best list), precision and recall graphs may be created. Those graphs illustrate the behavior of precision or recall depending on the threshold or the n parameter.

A very similar approach to the evaluation of this task is presented by Thanopoulos, Fakotakis, and Kokkinakis (2002). Here, two golden data sets were produced and each of the tested techniques was evaluated on both of them. The evaluation measure used is called coverage and it is equivalent to our definition of recall in formula 2.28.

The approach of computing precision and recall was used as well by Pecina (2005) and Pecina and Schlesinger (2006). The collocation extraction was done on a manually annotated Czech data set, as mentioned in the previous section.

In addition, the F-measure is given along with the precision and recall scores:

F = 2 × (precision × recall) / (precision + recall)   (2.29)

The F-measure, sometimes called the F1 score, is the harmonic mean of the precision and recall values.
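The three evaluation scores can be sketched together; the golden and extracted lists below are invented examples:

```python
def evaluate(extracted, golden):
    """Precision, recall (formulas 2.27-2.28) and F-measure (2.29) of
    extracted collocations against a reference set. Assumes at least
    one item was extracted correctly."""
    E, G = set(extracted), set(golden)
    correct = len(E & G)
    precision = correct / len(E)
    recall = correct / len(G)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

golden = {("New", "York"), ("strong", "tea"),
          ("kick", "bucket"), ("crystal", "clear")}
extracted = [("New", "York"), ("strong", "tea"), ("of", "the")]

p, r, f = evaluate(extracted, golden)
assert (p, r) == (2 / 3, 0.5)
```

Sweeping the size of the n-best list and re-running `evaluate` yields the precision and recall graphs mentioned above.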

A manual annotation is employed in the research of Evert and Krenn (2005).

As in the approaches described above, this method of evaluation was used to compare several association measures on the same data. A list of collocation candidates with appropriate frequency information was prepared. Each of the evaluated association measures was used to rank the collocation candidates according to their quality (sometimes called the collocation strength).

In contrast, no golden data set is created. Instead, small random samples of the ranked candidate data set are annotated manually. The true precision is estimated from the sample precision, which is computed on the sample using the manual annotation.

In addition, confidence intervals are computed and shown in the evaluation charts along with the estimated precision depending on the number of the n-best results selected.

In the chapter on theoretical background, we have described the basics of information theory, the notion of collocation, and methods of collocation extraction.

The section on information theory has described the notions of entropy, the noisy channel, and language modeling. In particular, we have introduced the maximum likelihood estimate (MLE). The MLE is a method of assigning probability


estimates to the values of random variables whose probability distribution we do not know. We use this method when computing the association scores of the collocation candidates.

In the second section we have explored various definitions of collocations and their properties. Perhaps the most discussed property throughout the publications is a collocation's non-compositionality: the meaning of a collocation is non-compositional when it cannot be inferred from the meanings of the lexical units it contains.

The last section has presented methods for collocation extraction and their evaluation. The techniques include a variety of association-measure-based approaches, ranging from the lexical level to dependency syntax. There have also been attempts to combine several measures together using a machine learning classifier (Pecina and Schlesinger, 2006). In this thesis we use the pointwise mutual information and its normalized variant.

Although many authors argue that collocation extraction results should be evaluated by a human lexicographer, some of them have presented automated methods for the evaluation using a reference data set.


Chapter 3 Data

In this chapter, we present and analyze the data used in our experiments. The collocation database is created for two languages, English and Czech. For the English database, we use two large data sets consisting of over 4 billion words.

The Czech corpus is a little smaller – it contains about 3 billion words.

For each language, we describe the data format and explain necessary pro- cessing steps performed prior to the experiments. At the end of each section, we show statistics of the data sets to illustrate some of their properties.

3.1 English

This section gives an overview of the data used for extracting English collocations.

We present two different corpora. For each one, we briefly describe the format of the data. To prepare the data for the extraction task, we use a pipeline of automated processing tools that is described below. At the end of this section, we present statistics of the English data set in terms of counts of unique words and bigrams, and relative frequencies of part-of-speech tags.

There are many corpora available for the English language. In the task of collocation extraction, the most important feature of a good corpus is its size.

The goal in collecting the English data for the experiment is to have about 3 billion words of text.

The data does not need to be annotated in any way. For additional annotation it is sufficient to use automated NLP tools, since the large data size outweighs the fact that these tools make mistakes. In the following paragraphs, the corpora used for this task are introduced:

W2C. The W2C corpus (Majliš and Žabokrtský, 2012) is a multilingual set of corpora of over one hundred languages. For each language, there are two corpora available in W2C, the Web corpus and the Wiki corpus. In our experiment, we use both of the English corpora together.

The Wiki corpus is composed of the bodies of Wikipedia articles1 in the given language. The corpora have been built for each Wikipedia containing at least 5,000 articles. The maximum size of a Wiki corpus was set to 20,000 articles.

1http://www.wikipedia.org


Source Sentences [M] Tokens [M]

W2C-WEB 36.8 875.5

W2C-WIKI 3.1 80.2

W2C Total 39.9 955.8

Table 3.1: Counts of sentences and tokens in English W2C corpora

The Web corpus has been created using Web crawling. The initial Wiki corpora have been used to create search queries, and the Web pages matching those queries have been downloaded. The language of the results has been identified and the text has been put into the proper corpus. Duplicate content has been removed from the corpora.

The W2C corpora come in a plain text file format. Sentence boundaries are not explicitly marked and the text is not tokenized (split into the smallest atomic meaningful units, called tokens). The tokenization process is described in section 3.1.3. Details of the numbers of words and sentences in each corpus are given in table 3.1.

English Gigaword. The English Gigaword 5th Edition corpus is an archive of newspaper texts collected by the Linguistic Data Consortium (Parker et al., 2011).

The corpus is a collection of SGML-formatted files, each of which contains news articles published in a month by one of seven news agencies. In total, it contains about four billion words (see details in table 3.2). This large amount of data would be sufficient for us even without using the W2C corpora. However, we decided to include the W2C corpora, since the English Gigaword data comes exclusively from news articles. Along with the W2C corpora, we should be able to model a wider spectrum of the language usage.

Source Sentences [M] Words [M] Tokens [M]

Agence France-Presse 29.3 738.3 826.7

Associated Press Worldstream 56.4 1,187.0 1,374.2

Central News Agency of Taiwan 1.3 38.5 42.0

Los Angeles Times/Washington Post 13.2 268.1 316.7

New York Times 73.8 1,422.7 1,690.6

Washington Post/Bloomberg 0.8 17.4 20.4

Xinhua News Agency 13.8 360.7 395.9

English Gigaword Total 188.6 4,032.7 4,666.4

Table 3.2: Counts of sentences and words in the English Gigaword corpora


3.1.1 Data Format

As stated above, the W2C corpora come in a compressed plain text format. There is no annotation in the data, so we must perform some automated pre-processing steps on it. Read more on pre-processing below in sections 3.1.2 through 3.1.4.

The SGML format of the EGW files is relatively simple. A file is a collection of document elements <DOC> with id and type attributes. Each document may contain a headline or dateline element (<HEADLINE>, <DATELINE>). Then there is a <TEXT> tag, which contains a collection of paragraph elements (<P>). Each tag in the document is the only thing on its line, and no line containing textual data begins with an opening angle bracket, <.
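Given this layout (every tag alone on its line, text lines never starting with <), paragraph extraction does not need a full SGML parser. The reader below is a sketch under that assumption, not a general SGML tool; the document snippet is invented:

```python
def iter_paragraphs(lines):
    """Yield the text of each <P> element, one string per paragraph."""
    in_p, buf = False, []
    for line in lines:
        line = line.strip()
        if line == "<P>":
            in_p, buf = True, []
        elif line == "</P>":
            in_p = False
            yield " ".join(buf)
        elif in_p and not line.startswith("<"):
            buf.append(line)

doc = """<DOC id="XIN_ENG_1" type="story">
<HEADLINE>
Sales rise
</HEADLINE>
<TEXT>
<P>
Sales of motor vehicles
rose 12 percent.
</P>
</TEXT>
</DOC>""".splitlines()

assert list(iter_paragraphs(doc)) == ["Sales of motor vehicles rose 12 percent."]
```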

3.1.2 Cleaning the W2C data

Since the W2C-WEB English corpus is obtained using Web crawling, there is quite a lot of noise in the data. To improve our performance, we tried to filter this noise out of the data. The filtration steps used to prepare the data for our task were based on empirical observations.

The only filtration done prior to the experiments was to filter out all lines that contained the word "počet". This step filtered out about 600,000 sentences.

The majority of those were footers of forum posts, indicating the user, the number of views, etc.

3.1.3 Tokenization

The texts in the corpora come in a plain form. It is more convenient, however, to transform the data into a tokenized format. Tokenization is the process of dividing the text into a sequence of units, called tokens.

A token is a sequence of characters in the text that has an important semantic role (Manning, Raghavan, and Schütze, 2008). Tokens in the text often correspond to words. For example, consider this sentence from the English Gigaword corpus:

Japan’s domestic sales of motor vehicles in August rose 12 percent from a year earlier to 295,283 units, the first double-digit increase since August 1990, the Japan Automobile Dealers Association said Thursday.

In this case, tokens are bare words, parts of words, the possessive ending (’s), numbers and punctuation marks.

If we do tokenization just by splitting the sentence on whitespace, ignoring the punctuation, we end up with tokens like Japan's or double-digit, though we would rather have them separated. On the other hand, if we split the text on whitespace and punctuation characters, we get 295 and 283 instead of just 295,283.
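A naive rule-based tokenizer illustrating these cases might look as follows. This is only a sketch and not the Stanford Tokenizer used in the thesis; the regular expression is an assumption chosen to keep numbers like 295,283 together while splitting off the possessive ending and punctuation.

```python
import re

TOKEN = re.compile(r"""
      \d+(?:,\d{3})*(?:\.\d+)?   # numbers, keeping 295,283 together
    | 's                         # split off the possessive ending
    | \w+                        # bare words
    | [^\w\s]                    # punctuation, one character at a time
""", re.VERBOSE)

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Japan's sales rose 12 percent to 295,283 units, "
               "a double-digit increase."))
```

On the example sentence this yields Japan followed by 's, the number 295,283 as a single token, and double-digit split on the hyphen.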

The tokenization of our data is done by the Stanford Tokenizer, which is a part of the Stanford part-of-speech tagger (Toutanova and Manning, 2000). The counts of tokens in tables 3.1 and 3.2 are obtained from the tokenized data. The counts of sentences in those tables are computed as counts of sentence boundary characters


(.!?) in the tokenized data. The counts of words in English Gigaword corpora in table 3.2 are taken from the documentation.

3.1.4 Tagging

At the beginning of this thesis we set the requirement that users should be able to search the database by the part of speech of one or both of the collocates. In order to do so, we must include the part-of-speech information in the database.

Part-of-speech tagging is a process during which a program called a tagger assigns a tag to each of the input words. The tag carries the information about the type of the word, i.e., its part of speech.

Prior to the collocation extraction, all the English text from the selected corpora is tagged by the Stanford Tagger, the same tool that is used for the tokenization. In addition to tags, lemmas are also assigned to the words. A lemma is the base form of a word; for example, the lemma of the word cats is cat.

Each tagger maps words in a text to a set of tags, which is called a tagset.
There may be different tagsets for different languages. The Stanford tagger uses the Penn Treebank tagset (Marcus, Marcinkiewicz, and Santorini, 1993). There are 36 tags in the tagset, explained in table 3.4.

The tagset used by the tagger is enriched with supportive, mostly self-explaining tags used for non-word expressions. The relative frequency distribution of tags in the English data is given in table 3.3. The -LRB- and -RRB- tags are associated with the left and right round bracket, respectively.

Tag Frequency Tag Frequency Tag Frequency

NN 13.75 % CC 2.54 % POS 0.77 %

NNP 11.20 % PRP 2.39 % -LRB- 0.62 %

IN 10.32 % TO 2.18 % -RRB- 0.62 %

DT 8.26 % VBN 2.12 % WDT 0.43 %

JJ 6.25 % VBZ 1.89 % NNPS 0.36 %

NNS 5.40 % VBG 1.62 % WP 0.34 %

, 4.87 % VBP 1.34 % RP 0.29 %

. 3.98 % PRP$ 1.01 % WRB 0.29 %

VBD 3.47 % : 0.97 % FW 0.27 %

CD 3.37 % ” 0.90 % JJR 0.26 %

RB 2.84 % MD 0.86 % JJS 0.19 %

VB 2.66 % “ 0.85 % Other 0.54 %

Table 3.3: Relative frequencies of tag occurrences in the English data


Tag Description Example

CC conjunction, coordinating and, or, but

CD cardinal number five, three, 13%

DT determiner the, a, these

EX existential there there were six boys

FW foreign word mais

IN conjunction, subord. or preposition of, on, before, unless

JJ adjective nice, easy

JJR adjective, comparative nicer, easier

JJS adjective, superlative nicest, easiest

LS list item marker

MD verb, modal auxiliary may, should

NN noun, singular or mass tiger, chair, laughter

NNS noun, plural tigers, chairs, insects

NNP noun, proper singular Germany, God, Alice

NNPS noun, proper plural we met two Christmases ago

PDT predeterminer both his children

POS possessive ending ’s

PRP pronoun, personal me, you, it

PRP$ pronoun, possessive my, your, our

RB adverb extremely, loudly, hard

RBR adverb, comparative better

RBS adverb, superlative best

RP adverb, particle about, off, up

SYM symbol %

TO infinitival to what to do?

UH interjection oh, oops, gosh

VB verb, base form think

VBZ verb, 3rd person singular present she thinks

VBP verb, non-3rd person singular present I think

VBD verb, past tense they thought

VBN verb, past participle a sunken ship

VBG verb, gerund or present participle thinking is fun

WDT wh-determiner which, whatever, whichever

WP wh-pronoun, personal what, who, whom

WP$ wh-pronoun, possessive whose, whosever

WRB wh-adverb where, when

Table 3.4: The Penn Treebank tagset


3.1.5 Statistics

The whole English data set contains 5,622.2 million tokens in total. Table 3.5 gives the counts of unique unigrams and bigrams in the data. As you can see, the counts of unique lemmas and word forms are quite close. This is probably because English is not a very inflectional language. Another reason for this small difference may be the behavior of the lemmatizer, as it may consider some words derived rather than inflected.

Unique types Unigrams Bigrams

Form 18,559,976 210,757,826

Lemma 18,545,495 181,503,809

Form+Lemma+Tag 21,977,691 229,275,061

Table 3.5: Unique word forms, lemmas and form–lemma–tag triples in the English data
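Counts like those in table 3.5 reduce to counting unique keys over (form, lemma, tag) triples. A minimal sketch over a made-up tagged sample:

```python
from collections import Counter

# A made-up tagged sample; "chase" occurs both as a verb and a noun,
# so it yields two distinct form-lemma-tag triples.
sentence = [("Cats", "cat", "NNS"), ("chase", "chase", "VBP"),
            ("the", "the", "DT"), ("chase", "chase", "NN"),
            (".", ".", ".")]

forms = Counter(form for form, lemma, tag in sentence)
lemmas = Counter(lemma for form, lemma, tag in sentence)
triples = Counter(sentence)
bigrams = Counter(zip(sentence, sentence[1:]))

assert len(forms) == len(lemmas) == 4
assert len(triples) == 5   # more unique triples than forms, as in table 3.5
assert len(bigrams) == 4
```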

This section has presented two corpora for English collocation extraction: the W2C corpora and the English Gigaword corpus. We have described the formats of the corpora and performed some cleaning procedures. The morphological information has been added to the text by an automated tagging utility, and the annotated text has been analyzed in terms of occurrence frequencies.

3.2 Czech

In the following text, the Czech data source is described. Unlike for the English data set, we use just one corpus here. Similar procedures are used for the morphological annotation of the data; however, for this task we use a different tool, one suited for the Czech language. Further in this section, we briefly introduce the Czech positional tag system and present statistics of the annotated data.

As the source of the Czech language data in this thesis we use the Czech Web Corpus, or CWC2011 (Spoustová and Spousta, 2012). The corpus consists of three collections of texts obtained from the internet: articles, discussions, and blogs.

In total, there are 3,284 million tokens, forming a text of 218 million sentences.

Table 3.6 shows counts of tokens and sentences in each collection.

One of the features of the CWC2011 is that it comes with compiled lists of n-gram frequencies, from unigrams up to 5-grams. For collocation extraction using pointwise mutual information, this format is sufficient and the implementation of the PMI computation is quite straightforward. It would also allow us to include collocations of up to 5 words.

However, as in the task of collecting the English data, we want to extract the frequencies from a tagged corpus. This means that we cannot use the n-gram lists because they are not tagged; for tagging, we need to have the data in a sequential form.


Source        Sentences [M]   Tokens [M]

Articles           38             628
Blogs              98           1,249
Discussions        82           1,407
Total             218           3,284

Table 3.6: Counts of sentences and tokens in CWC2011

3.2.1 Data Format

The CWC2011 corpus is distributed in plain text format. The text is tokenized and sentence boundaries are annotated. The data files use a one-token-per-line format.

A sentence boundary is treated as a single token and is marked as <s>.

The corpus is divided into three data files, one for each part (articles, blogs, discussions). The files are compressed using bzip2.
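Reading this format back into sentences can be sketched as below. This is only an assumed reader for the format just described, not code from the thesis; the demonstration writes a small synthetic bzip2 file, since the corpus itself is not reproduced here.

```python
import bz2
import os
import tempfile

def read_sentences(path):
    """Yield sentences (lists of tokens) from a bzip2-compressed file
    in one-token-per-line format, where <s> marks a sentence boundary."""
    sentence = []
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            token = line.strip()
            if token == "<s>":
                if sentence:
                    yield sentence
                sentence = []
            elif token:
                sentence.append(token)
    if sentence:
        yield sentence

# Demonstration on a synthetic two-sentence file:
sample = "<s>\nAhoj\nsvěte\n.\n<s>\nDruhá\nvěta\n.\n"
tmp = tempfile.NamedTemporaryFile(suffix=".bz2", delete=False)
tmp.write(bz2.compress(sample.encode("utf-8")))
tmp.close()
sentences = list(read_sentences(tmp.name))
os.unlink(tmp.name)
print(sentences)  # [['Ahoj', 'světe', '.'], ['Druhá', 'věta', '.']]
```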

3.2.2 Tagging

As in the case of the English data, the text needs to be tagged first. Since Czech morphology differs substantially from English morphology, we use a different tagging method. We first describe the tagset used; then we look at the relative frequencies of the tags assigned by the tagger to the texts of the Czech Web Corpus.

The Czech language has rich inflection, so the Penn Treebank tagset cannot cover all the morphological information of a word. To solve this issue, the positional tag system was introduced by Hajič (2000).

A positional tag is a string of 15 characters that represents the morphology of a word. The tag encodes a set of 15 morphological properties, each represented by one alphanumeric character. Each morphological property (e.g., part of speech or degree of comparison) has a fixed position in the string according to its category; the tag is constructed by placing the character representing each property at its position. An example tag with an explanation of its positions is given in figure 3.1. Table 3.7 lists the categories of morphological properties and their positions in the tag.

AAIS2----1A----

Figure 3.1: Czech positional tag example for the word fotbalového. The tag encodes: adjective, general adjective, masculine inanimate, singular, genitive, positive degree, affirmative.
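Reading individual properties out of such a tag amounts to indexing into the 15-character string. The sketch below names only the positions needed for the example in figure 3.1 (following the standard order of the positional tagset); the full system defines all fifteen.

```python
# 0-based positions of selected morphological categories in a Czech
# positional tag (Hajič, 2000); '-' at a position means the category
# does not apply to the word.
POSITIONS = {
    "part of speech": 0,
    "detailed part of speech": 1,
    "gender": 2,
    "number": 3,
    "case": 4,
    "degree of comparison": 9,
    "negation": 10,
}

def decode(tag, category):
    """Return the one-character value of a category in a positional tag."""
    assert len(tag) == 15, "a positional tag has exactly 15 characters"
    return tag[POSITIONS[category]]

tag = "AAIS2----1A----"       # the example tag from figure 3.1
print(decode(tag, "part of speech"))  # 'A' = adjective
print(decode(tag, "case"))            # '2' = genitive
print(decode(tag, "negation"))        # 'A' = affirmative
```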

To perform the task, we use a tagger called MorphoDiTa (Straková, Straka, and Hajič, 2014). The example tag in figure 3.1 has actually been assigned by this tagger.
