Yuliya Yamalutdinova Detection of contradictions in pairs of texts in Kazakh

(1)

BACHELOR THESIS

Yuliya Yamalutdinova

Detection of contradictions in pairs of texts in Kazakh

Institute of Formal and Applied Linguistics

Supervisor of the bachelor thesis: Mgr. Rudolf Rosa, Ph.D.

Study programme: Computer Science

Study branch: General Computer Science

Prague 2019

(2)

I declare that I carried out this bachelor thesis independently, and only with the cited sources, literature and other professional sources.

I understand that my work relates to the rights and obligations under the Act No. 121/2000 Sb., the Copyright Act, as amended, in particular the fact that the Charles University has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 subsection 1 of the Copyright Act.

In ... date ... signature of the author

(3)

I would like to thank my supervisor Mgr. Rudolf Rosa, Ph.D. for his comments and great help with this thesis.

(4)

Title: Detection of contradictions in pairs of texts in Kazakh Author: Yuliya Yamalutdinova

Institute: Institute of Formal and Applied Linguistics

Supervisor: Mgr. Rudolf Rosa, Ph.D., Institute of Formal and Applied Linguistics Abstract: Nowadays we have access to massive amount of information on the internet. But at the same time, we are faced with the problem of untrue information. The solution for this problem would be a tool, which could detect contradictions in texts. The goal of this work is to find in the given texts in Kazakh the statements with similar content and classify them as contradictory or similar. In most of the previous works the authors tried to align the sentences to find the most similar ones and used the information about semantics and morphology to classify them as contradictory or not. In our research we have tried to find semantically similar sentences using word2vec, fastText and BERT embeddings, and trained several models to classify them as contradictory or not, using the information about morphology, checking sentences for antonyms and building the neural network classifiers trained on the huge datasets. Our best model has achievedF₂ better that random.

Keywords: contradiction, NLP, Natural Language Processing, Kazakh

(5)

Introduction

Nowadays getting the information has become much easier. People can easily find all the latest news on various news portals on the Internet all around the world. We have access to massive amount of information, but at the same time, we are faced with the problem of invented news and generally untrue information.

For example, when several news portals publish different contradictory versions of the same story, we do not know which one is true. Even the reliable sources sometimes publish untrue news. One way to solve this problem is to compare two texts containing suspicious information with each other and find contradictions between them. The goal of this work is to find in the given texts the statements with similar content and classify them either as contradiction or similar. In our research we have focused on detecting contradictions between news articles, since they often contain a lot of contradictions.

There has not been much research into the field of contradiction detection, however some approaches have achieved good results at detecting certain types of contradictions, such as word negation and numerical contradictions. However, the majority of them has been conducted on English texts, and there is no research conducted on the texts in Kazakh or other Turkic languages.

Solving the task of contradiction detection has a number of applications, for example, it can be used to automatically filter untrue news on web pages. Since there has been little research into rare Turkic languages processing, the results of our work can also be used in similar research in Kazakh or other Turkic languages.

Thesis Structure

In Chapter 1 we introduce the problem of contradiction detection and present related work. Chapter 2 describes some concepts used in this work such as logistic regression, support vector machine, neural networks and word embeddings. In Chapter 3 we introduce Kazakh language and explain its morphology, syntax and basic grammar rules including making negation. Chapter 4, where we present our work, is divided into two parts. The first part describes finding of semantically similar sentences in two given texts. The second part describes different ways to classify these sentences as contradictory or similar. In the Chapter 5 we describe the implementation of the program and how to use it. In the last Chapter 6 we describe the used data, summarize the experiments and present the results.

(8)

1. Detection of Contradictions

1.1 Problem Formulation

To solve the problem of contradiction detection we should first define the concept of contradiction. We have decided to follow one of the possible definitions stated by the Cambridge Dictionary [dic]: the fact of something being the complete opposite of something else or very different from something else, so that one of them must be wrong. It means that two sentences are in contradiction when they are opposite or very different and one of them is wrong.

We should also decide what does “complete opposite” or “very different” mean for sentences. We suppose that both “opposite” and “very different” sentences should contain antonyms or phrases with opposite meanings. According to this definition the following sentences are not in contradiction:

• Tom has a black cat.

• Ann has a white cat.

• John has a black dog.

because they are different and contain antonyms, so fulfill the first condition, but they all can be true at the same time. On the other hand these two sentences are in contradiction, because they are opposite and only one of them is true, so they fulfill all conditions:

• The official language of the Republic of Kazakhstan is Kazakh.

• The official language of the Republic of Kazakhstan is Russian.

We assume, in general, that the sentences that contain more than one pair of antonyms are not in contradiction, because they don’t describe the same thing anymore, therefore, they both can be true. But at the same time even if the sentences contain the same words and differ at only one word it does not mean that they are in contradiction. For example, following two sentences contain antonyms, but if we do not know how many animals Tom has, we can not say that this is a contradiction.

• Tom has a black cat.

• Tom has a black dog.

So in this example the result depends on the context.

Another problem is a definition of antonyms or opposite words. There are pairs of words that are context dependent meaning that in some sentences they serve as antonyms, and in others not. For example, given two sentences

• There was a big festival in Paris.

• There was a big festival in Moscow.

(9)

type example

numbers I will arrive in California on July 20th.

I will arrive in California on July 16th.

facts Paris is the capital of France.

Moscow is the capital of France.

antonyms The cat that is sitting on the sofa is black.

The cat that is sitting on the sofa is white.

negation I do not want to go for a walk.

I want to go for a walk.

structure Tom asked Katie to do out.

Katie asked Tom to go out.

context dependent There was a big show in Paris.

The was a big show in Moscow.

Table 1.1: Types of contradiction between sentences

In this case again without given the context we are not able to say whether

“Paris” and “Moscow” are opposites or not. And if we do not have any other sentences, that would clarify the meaning, we can assume that these sentences are not in contradiction. On the other hand, the words “Paris” and “Moscow”

in the following two sentences are definitely opposites, because these sentences describe the same fact and we know that they can not be both true:

• Paris is the capital of France.

• Moscow is the capital of France.

As we can see, it is not an easy task to define precisely which sentences are in contradiction and which are not. In some cases we need to know that the sentences describe the well-known fact, so there is only one true sentence. In other cases we need the context to decide whether two sentences are in contradiction or they both can be true at the same time. It is also a problem that there are a lot of types of contradictions between the sentences and we can create the model, which will correctly detect the contradiction of one type, but will not be able to detect the contradiction of another type. We have shown different types of contradictions in the Table 1.1.

1.2 Previous Work

There has not been much research into the field of contradiction detection, however, some approaches have shown good results in detecting certain types of contradictions. And yet majority of them has been conducted on English texts.

There is no available research into the field of contradiction detection in Kazakh or other Turkish languages. However, there has been the research into other NLP(Natural Language Processing) tasks in Kazakh, such as morphology analyzing [Kessikbayeva and Cicekli, 2014], spelling correction [Gelbukh, 2014], speech recognition [Khomitsevich et al., 2015] and sentiment analysis [Sakenovich and Zharmagambetov, 2016].

(10)

There has been research into related task “recognizing textual entailment” [Gi- ampiccolo et al., 2007]. Based on these works one of the first approaches to detect all the types of contradictions in English was made by De Marneffe et al. [2008].

The authors of this work define the contradiction between sentences as following “contradiction occurs when two sentences are extremely unlikely to be true simultaneously”[De Marneffe et al., 2008]. They have also distinguished between different types of contradictions, such as numeric, factive, usage of antonyms, structural and lexical. They have considered contradiction made by using negation, antonyms and numerical “easy”, because they can be detected without full comprehension of sentences, which is also true for our work. These types of contradictions are easily detected in Kazakh. The other types of contradiction were marked as “difficult”, because information about sentence meaning is needed for detecting them. Their model has 4 stages: linguistic analysis, alignment between graphs, filtering non-coreferent events and extraction of contradiction features.

In the first two stages they have computed linguistic representation of the text and made an alignment between graphs. In the third stage they have tried to remove the sentences, which do not describe the same event and can be marked as contradiction by error. In the last stage they have extracted contradiction features, such as polarity, numbers, antonyms, structural, factivity and modal- ity, and used a logistic regression to classify each pair as contradictory or not.

They have achieved 63% accuracy at detecting negations and 78.9% precision at detecting the single word antonymy, however, other types of contradiction were not detected so well. The authors have also suggested that “generalizing for contradiction is more difficult than for entailment”[De Marneffe et al., 2008].

Other approach to detect contradictions in English has been made by [Dragos, 2017]. They have tried to detect contradictions by using the factual information of the sentences and its uncertainty, which have allowed to detect better contradictions in reported facts or contradiction in the points of view on some fact.

Each sentence has been divided into several tuples, each of them holds factual information and uncertainty, and the model analyses the sequence of tuples instead of analyzing sentences directly. The model analyzes inconsistencies between facts and certainty and using this information estimates contradictions.

One of the tasks related to the contradiction detection is a detection of synonyms and antonyms. Nguyen et al. [2016] have proposed new word embedding model, which outperforms state-of-the-art models on distinguishing antonyms from synonyms. They have done it by improving the weights of feature vectors, such as the most salient features in the vectors have bigger weight, and features of minor importance have less weight. The authors have noticed, that

“the strongest features of a word also tend to represent strong features of its synonyms, but weaker features of its antonyms”[Nguyen et al., 2016], and proved, that using the new embeddings can help to distinguish antonyms from synonyms better. They have also integrated lexical contrast into a skip-gram model, by changing its objective function.

1.3 Our Approach

We have decided that there is a contradiction between sentences when they are semantically similar and they contain exactly one pair of opposite words or phrases.

(11)

So, according to this definition, the following two sentences are in contradiction:

• A young boy is playing in the grass.

• The boy is in the sand.

But these two sentences are not in contradiction:

• A young boy is playing in the grass.

• The man is in the sand.

The first part of our work is about finding semantically similar sentences in a given text. To compute sentence similarity we have needed a way to represent sentences, which would be based on word representations. Given the representations of two words we can compute their similarity and one of the popular ways of doing this is using word embeddings, which are described in the next chapter.

We have considered different pre-trained word embeddings for Kazakh, such as word2vec, fastText and BERT, based on their ability to recognize semantically similar words. Next we have considered several ways to create sentence embeddings from the embedding for each word in the sentence and compared their performance.

After semantically similar sentences have been found, they should be classified as contradictory or similar, which is the second part of our work. We have implemented several classifiers based on fully-connected neural networks, analysis of the morphology of each word, classification of each pair of words as antonyms or synonyms, and compared their performance on three different datasets.

(12)

2. Theory

In this work we have implemented several models. Some of them use classical ML(Machine Learning) algorithms such as logistic regression and support vector machine, other use neural networks. Another concept that we have used is word embeddings, which are often used for word and sentence representation. In this chapter we will describe these concepts in more detail.

2.1 Logistic Regression

Logistic regression is a classification algorithm, which means it attempts to predict discrete or categorical value. For example, it can be used to predict whether two words are antonyms or synonyms. It is easy and fast algorithm, however, it can be applied only on linear function. In this section we will introduce the basics of this algorithm and describe how it works.

Imagine that we have a set of n features X which we want to use to predict some valuey. The logistic regression algorithm tries to find the weight θ_i for each feature i, so that

y=f(X)_Θ =θ₀+θ₁x₁+θ₂x₂+...+θ_nx_n

It takes the output of this linear function and maps it to the interval [0,1] by using the sigmoid function defined as

sig(y) = 1 1 +e^−y

The distribution of this function is shown in the Figure 2.1. The output of the sigmoid function is a probability of output value to be 1 given the valuesX.

p=P(y= 1|X)

This probability is used to predict the result binary value, so that prediction(p) =

⎧

⎨

⎩

0 p <0.5 1 p≥0.5

We want as many data points as possible to be classified correctly, so we need to estimate a vector of weightsΘ. Logistic regression is estimated using Maximum Likelihood Estimation. Given some random data point, it tries to maximize a probability that this point will be classified correctly. This likelihood is given by the formula

L(Θ) =

n−1

∏

i=0

P(y_i|x_i,Θ)

where x₀...x_n−1 are n data points and y₀...y_n−1 are the values that should be predicted for each data point x_i, y_i ∈ {0,1}. Due to the fact that we want to

(13)

Figure 2.1: Distribution of sigmoid function

maximize the likelihood, it is easier to compute the logarithm of the likelihood given by the formula

L(Θ) =

n−1

∑

i=0

log(P(y_i|x_i,Θ))

To find weights that maximize the likelihood function are used iterative methods, such as Newton method¹, which repeatedly revises the solution to see if it can be improved until no improvement can be made.

2.2 Support Vector Machine

Support vector machine is an algorithm, which can be used for both classification and regression. Unlike logistic regression it can be applied on both linear and non-linear functions.

LetX₁ andX₂ be two sets of points inn-dimensional space. Then X₁ and X₂ are linearly separable² if there is a hyperplane, which splits the space into two spaces S₁ and S₂, such that all points from X₁ are in S₁ and all points from X₂ are in S₂.

∃ α₀, α₁...αn−1, k such as α_i, k ∈R

∀x∈X1 α0x0+α1x1+...+αn−1xn−1 < k

∀x∈X₂ α₀x₀+α₁x₁+...+αn−1xn−1 > k

⇒X₁ and X₂ are linearly separable

An example of linearly separable data in two-dimensional space is shown in

1https://en.wikipedia.org/wiki/Newton%27s_method

2https://en.wikipedia.org/wiki/Linear_separability

(14)

Figure 2.2: Linearly separable points

the Figure 2.2. For using support vector machine with linear kernel the data points should be linearly separable. The objective of the algorithm is to find the hyperplane, which maximizes the margin, where margin is a distance between hyperplane and the data points on each side which are nearest to it. The examples of maximum and not maximum margins are shown in Figure 2.3. The boundary data points are called support vectors. They are shown for the maximum margin in the Figure 2.3. For linearly separable data is possible to find the hyperplane, which maximizes the margin. This margin is derived based on the relative position of the support vectors.

Let us have a hyperplane defined by equation w^Tx+b = 0

where w is a vector normal to the hyperplane, x is a data point and b is a parameter that represents the movement of the hyperplane in the direction ofw.

The support vector machine algorithm uses function for classification defined as f(x) = sign(w^Tx+b).

The pointsxfor whichf(x) = 1are classified as the first class, and the points for whichf(x) =−1are classified as the second class. We want to findwand b such as the distance from the hyperplane w^Tx+b = 0 to each class is the largest. It can be shown that this distance is equal _∥w∥¹ (Figure 2.4). The problem of finding the maximum _∥w∥¹ is equivalent to the problem of finding the minimum ∥w∥². So

(15)

support vectors

Figure 2.3: Examples of two margins defined by hyperplanes a and b. Black margin is a maximum whereas the shadow one is not.

1

‖ ‖

1

‖ ‖

Figure 2.4: Distance from the hyperplane to the classes

(16)

Figure 2.5: Not linearly separable points

to find the hyperplane we need to solve the following linear program minimize ∥w∥²

subject toy_i(w^Tx_i+b)≥1,

∀i= 0...m−1

wherex0...xm−1 are data points with labelsy0...ym−1. This program can be solved using the method of Lagrange multipliers³. The function for classification can now be equivalently expressed as

f(x) =sign(

n−1

∑

i=0

α_iy_ix^T_i x+b)

whereαi is a Lagrange multiplier.

However, not every dataset is linearly separable. For example, the points of the dataset shown in the Figure 2.5 can be separated, but not by a straight line.

In this case all data points are projected into the space of higher dimension using some mapping

φ:Rⁿ→X

3https://en.wikipedia.org/wiki/Lagrange_multiplier

(17)

The function for classification is now changed to f(x) =sign(w^Tφ(x) +b) =sign(

n−1

∑

i=0

α_iy_iφ(x^T_i )φ(x) +b)

Due to the high dimension of the data finding the appropriate mapping can be expensive. Therefore a kernel function is used to get a hyperplane in the new space. The kernel function is defined by the formula

K(x_i, x_j) =φ(x^T_i )φ(x_j)

Having the kernel function,we do not need to know the mapping, since the kernel function defines inner product in the new space. The function used for classification can now be changed to

f(x) = sign(

n−1

∑

i=0

αiyiK(x_i, x) +b)

The following kernels are the most common in practice Linear: K(x, y) =x^Ty

Polynomial: K(x, y) = (x^Ty+c)^d Gaussian: K(x, y) = exp

(

−∥x−y∥² 2σ²

)

Sigmoid: K(x, y) = tanh(αx^Ty+c)

2.3 Neural Networks

Nowadays neural networks are used for a lot of applications. They are effective for solving many difficult problems, such as speech recognition, machine translation or image processing. The structure of neural network has come to us from the biology. It is a sequence of neurons connected by synapses, where neuron is some unit for computation, which receives an information from other neuron, performs simple calculations on it and passes it on to other neuron, and synapse defines a weight of a neuron when it passes its information on. Neurons are combined together into layers, which can be input, output and hidden. Input layers get information, hidden layers process it and the output layers displays the result.

The neural network differ in the number of hidden layers. They can be “shallow”, which usually have only 1 or 2 hidden layers, and “deep” with more hidden layers.

Every neuron X, except those in input layer, gets total information from all neurons from previous layer, which is normalized by some activation function:

input(X) =

n−1

∑

i=0

w_iY_i output(X) = f(input(X))

where Y₀...Yn−1 are neurons from the previous layer with corresponding weights w₀...wn−1 and f is an activation function. Usually a bias neuron is also added

(18)

when the input of some neuron is calculated. It is used then we need to shift a decision boundary to some direction. So the final formula is

input(X) =

n−1

∑

i=0

w_iY_i+bias

Based on its input each neuron calculates and returns a value representing its

“decision”. However this value can be any number in range (−∞,+∞) and an activation function, which maps any value to the needed interval, is used to solve this problem. There are a lot of activation functions, which differ on the range of returned values. Here we will show only the most popular:

Sigmoid: f(x) = 1

1 +e^−x with output in range (0,1) Tanh: f(x) = 2

1 +e^−2x −1 with output in range (−1,1) ReLu: f(x) =max(0, x) with output in range (0,+∞)

At the beginning of the training process the weights of the neurons are set to the random values and the aim is to find the “right” values such as the network makes the right predictions. The training data are usually divided into batches and processed by parts. Each time all training data is processed is called an epoch. Usually the more epochs we have the better network is trained. However the network can be over-trained, which means it will give good results on train data, but wrong on the unseen data. For this purpose is usually used a validation set, which is a part of unseen data on which the network is tested every epoch.

Using the validation data we can stop the training process at the moment then the network has achieved the best accuracy on the unseen data. At the end of each epoch an error is calculated. Error should decrease at each epoch, and if it does not something wrong has happened.

There are different methods to train a network. One of them is a back- propagation algorithm, which is well described in a “Deep learning” book by MIT [Goodfellow et al., 2016]. The main idea of this algorithm is that the network makes a reverse transmission of information about error from the previous epoch or iteration to adjust the wrong parameters. The value of weights will change in the direction that give us the best result.

There are different types of architectures of neural networks, such as Multi- layer Perceptron(MLP), Convolutional Neural Networks(CNN), Recurrent Neural Networks(RNN) and Long-Short Term Memory(LSTM) networks. In our work we have used MLP, which is a “feed forward” network, with at least 3 fully-connected layers: input, output and hidden. It can be used for supervised classification and regression problems. To teach MLP network is usually used a back-propagation algorithm. An example of network with MLP architecture is shown in the Figure 2.6.

2.4 Word Embeddings

Imagine that we are given two texts in the same language and we want to find in them semantically similar sentences. For doing this we need somehow represent

(19)

Input layer

Hidden layer

Output layer

Figure 2.6: Multi-layer perceptron

the sentences so the computer could understand and compare them. Computers are unable to understand the words, sentences and text in the way we are used to represent them. It requires a numerical representation. One of the traditional ways to represent the words is using one-hot⁴vector. It is a vector, which contains

“1” at some position and at other positions has “0”. Usually a dictionary of all existing words is used, where words are in the alphabet order and each word has a unique representation depending on the order. So the length of the vector and number of vectors should be equal to the size of the vocabulary. Since most of languages has a huge number of unique words, especially languages with cases, where each word has a lot of different forms, this representation would waste memory and would be difficult to use. It is also not possible to infer a relationship between two words from their representation. For example, two semantically similar words can be far away from each other in the dictionary and otherwise some not similar words can be close to each other.

2.4.1 word2vec

The approach described above is good for processing small numbers of texts with limited dictionary. However in our case we need to process a lot of different sentences with different words and we need some representation, which would allow us to compare the similarity of sentences. A new approach to word representation was proposed by Mikolov et al. [2013], which was called “word2vec”. This approach is based on simple hypothesis: words that occur surrounded by the same words have similar meanings. Similarity in this case is understood as the fact that only matching words can stand nearby. Based on this hypothesis for the new approach large amounts of information was not an obstacle, but rather an advantage.

There were two architectures of word2vec that were proposed: continuous bag of words(CBOW) and skip-gram. They are shown in the Figure 2.7. The first model tries to predict a probability of the word based on the words around it. The training process is organized as follows: (2n+ 1) word are taken sequentially and

4https://en.wikipedia.org/wiki/One-hot

(20)

w(t) w(t) SUM

INPUT PROJECTION OUTPUT INPUT PROJECTION OUTPUT

w(t+2) w(t+1) w(t-1) w(t-2)

CBOW Skip-gram

Figure 2.7: The architectures of word2vec models proposed by Mikolov et al.

[2013]

the word in the center is the word that should be predicted. The surrounding words are the context of length n on each side. Each word in the model is associated with a unique vector, which is changing in the process of learning. This approach is called continuous, because the model gets consistently sets of words from text, and bag of words, because word order in context is not important.

Another proposed model, opposite to CBOW, is called skip-gram. Here we have given a word and the model tries to guess its context. The rest of the model architecture is similar. Both models have one hidden layer, which has a dimension equal to the size of embedding. Using the word2vec model the dimension of the representation changes from the size of the vocabulary to the size of hidden layer. However it is not the only advantage of word2vec embeddings. The vectors from word2vec model are much better in the sense of finding the similarity of words. Although no knowledge about semantics is used in the model, the result vectors contain an information about semantics and can be used to compare word similarity.

The authors show in the work that some simple algebraic operations can be performed with the word vectors. For example, “the word “big” is similar to

“bigger” in the same sense that “small” is similar to “smaller” ” [Mikolov et al., 2013]. To find a word that is similar to “small” in the same sense that “biggest”

similar to “big”, we need to compute the vector

vec(X) =vec(“biggest”)−vec(“big”) +vec(“small”)

These features of word2vec embeddings make them appropriate for our purposes.

2.4.2 fastText

Word2vec word embeddings are useful for many tasks, however they have a limi- tation for some morphologically rich languages such as Kazakh. Words in Kazakh have a lot of different forms and even if we train word2vec model on the large corpora we can not be sure that there is a vector for each form of each word.

(21)

It happens because in word2vec model the internal structure of the words is ignored. An approach to improve the embeddings was proposed by Bojanowski et al. [2017]. Instead of studying the whole word, fastText model breaks the word on the n-grams of characters and studies them. For example, the 3-grams for the word “capital” are “cap”, “api”, “pit”, “ita” and “tal”. An embedding for each word is calculated as the sum of all its n-grams. Now we can find an embedding for an unknown word if it is made of known n-grams, so the rare forms of words can be represented too. It also helps to get an embedding for the word with some spelling mistake. An example is shown on the fastText project page⁵. The 10 nearest words to the misspelled word “enviroment” with similarity:

Query word ? e n v i r o m e n t e n v i r o m e n t a l 0 . 9 0 7 9 5 1 e n v i r o n 0 . 8 7 1 4 6

e n v i r o 0 . 8 5 5 3 8 1 e n v i r o n s 0 . 8 0 3 3 4 9 e n v i r o n n e m e n t 0 . 7 7 2 6 8 2 e n v i r o m i s s i o n 0 . 7 6 1 1 6 8 r e a l c l i m a t e 0 . 7 1 6 7 4 6 e n v i r o n m e n t 0 . 7 0 2 7 0 6 a c c l i m a t a t i o n 0 . 6 9 7 1 9 6 e c o t o u r i s m 0 . 6 9 7 0 8 1

The authors have evaluated the model on the nine different languages on the analogy and word similarity tasks and compared it with other proposed representations. They have shown that the fastText embeddings achieve state-of-the-art performance on these tasks.

2.4.3 BERT

BERT or Bidirectional Encoder Representations from Transformers is a language representation model, which has been proposed by Devlin et al. [2018] and has obtained state-of-the-start results on the lots of Natural Language Processing tasks. BERT is an unsupervised method used to get pre-training language representation. It was shown before by Radford et al. [2016] that if we pre-train the model for language representation on the large amount of data and then use this model for the task that the need, it would give much better results on many NLP tasks.

BERT uses a new network architecture called “Transformer”, which was proposed by Vaswani et al. [2017]. This architecture is based entirely on attention and it replaces the recurrent layers with multi-headed self-attention, which allows the model to be trained much faster. The authors showed that they achieved a new state-of-the-art on machine translation task. There are several existing context-sensitive models for pre-training textual representation, which use the

“Transformer” architecture, such as semi-supervised sequence learning presented by Dai and Le [2015], generative pre-training⁶ and ELMo presented by Peters et al. [2018], however all of them are only shallowly bidirectional or unidirectional. BERT is an improvement of the previous models. It is a bidirectional context-sensitive model, which means it takes into account other words in the

5https://fasttext.cc/docs/en/unsupervised-tutorial.html

6https://openai.com/blog/language-unsupervised/

(22)

sentence to get an embedding for the word and uses both words before and after given word, whereas unidirectional model takes into account only upcoming words. To do this BERT replaces 15% of the words in the input text by “[MASK]”

and the network learns to predict which word can be on the place of “[MASK]”.

A simple example of input with masked words and expected output is shown on the official BERT page⁷:

I n p u t : t h e man went t o t h e [MASK1] . he bought a [MASK2] o f m i l k L a b e l s : [MASK1] = s t o r e ; [MASK2] = g a l l o n

Another feature of the BERT model is its ability to capture logical connection between sentences. For example, we are given two sentences. The model is able to understand whether the second sentence should follow the first one or they are just random sentences and have nothing in common. Some examples are shown on the official BERT page:

S e n t e n c e A: t h e man went t o t h e s t o r e . S e n t e n c e B : he bought a g a l l o n o f m i l k . L a b e l : I s N e x t S e n t e n c e

S e n t e n c e A: t h e man went t o t h e s t o r e . S e n t e n c e B : p e n g u i n s a r e f l i g h t l e s s . L a b e l : N ot Ne xt Se n te nc e

BERT provides pre-trained multilingual model, which also supports Kazakh language. It can be found on the official BERT page.

7https://github.com/google-research/bert

(23)

3. Kazakh Language

3.1 Language family and Alphabet

Kazakh is the official language of the Republic of Kazakhstan with 13 million native speakers living in Kazakhstan, Russia, China, Mongolia and Uzbekistan¹. It belongs to the Turkic languages, and was also influenced by Persian and Arabic during its formation. Some of the Turkic languages are very similar, for example, Kazakh have similar vocabulary and grammar rules to Uzbek, Kirghiz and Tatar, so it is not hard to understand these languages for a Kazakh speaker.

Modern Kazakh uses the Cyrillic script, however a transition to the Latin alphabet is planned by 2025². The current Kazakh alphabet has 42 letters: 33 from the Russian alphabet and 9 letters that are not in the Russian alphabet.

These letters are: Ә, Ғ, Қ, Ң, Ө, Ұ, Ү, Һ, I. Since all NLP and MT(Machine Translation) tools for Kazakh that we have used still work only with Cyrillic alphabet, we have trained our models and performed experiments on texts written in Cyrillic.

3.2 Morphology and Syntax

Kazakh is an agglutinative language with seven cases and three tenses. In Kazakh to create new forms of words and new words suffixes are used. There are two types of suffixes in Kazakh: inflectional and derivational. Inflectional suffixes are used for creation of new words using already existing ones, usually with different meaning. For example, a new word “оқушы” /oqushy/ (pupil) can be formed by adding a lexical suffix “шы” /shy/ to the word “оқу” /oqu/ (to study). Other examples:

• “сөз” /s’oz/ (word) - “сөздiк” /s’ozdik/ (dictionary)

• “ғарыш” /garysh/ (space) - “ғарышкер” /garyshk’er/ (cosmonaut)

• “кiтап” /kitap/ (book) - “кiтапхана” /kitaphana/ (library)

Derivational suffixes are generally used for changing the grammatical categories of the words, such as number, gender, tense, mood and case. For example, the phrase “in your letters” is a one word “хаттарыңызда” /hattarynyzda/ in Kazakh, which is created by adding suffixes “тар” /tar/, “ыңыз” /ynyz/ and “да” /da/ to the word “хат” /hat/ (letter). Other concepts such as negation, plurality and possession can also be expressed by using the derivational suffixes.

As well as other Turkic languages, Kazakh is a language with vowel harmony.

It means that vowels in Kazakh are divided into several classes and there are constraints on which vowels from different classes can not be used together in one word. There is also a rule called “consonant assimilation” in Kazakh, which means that consonants are divided into several classes by the way they sound and the first consonant in every added suffix should be from the same class as the last consonant in a stem.

1https://www.ethnologue.com/language/kaz

2https://astanatimes.com/2017/10/kazakhstan-to-switch-to-latin-alphabet-by-2025/

(24)

3.3 Negation in Kazakh

As we have mentioned earlier, negation in Kazakh can be expressed by adding the derivational suffixes to the words. Some adjectives and almost all verbs can be negated this way. Suffixes “ма” /ma/, “ме” /m’e/, “ба” /ba/, “бе” /b’e/, “па”

/pa/, “пе” /p’e/ are used to negate verbs. For example,

• “айт” /ayt/ (tell) – “айтпа” /aytpa/ (do not tell)

• “бiлемiн” /bil’emyn/ (I know) - “бiлмеймiн” /bilm’eymyn/ (I do not know) Another way to negate verbs is using the words “жоқ” /zhok/ (no) and “емес”

/yem’es/ (not): “болған” /bolgan/ (I was) - “болған жоқпын” /bolgan zhoqpyn/

(I was not).

Suffixes “сыз” /syz/ and “сiз” /siz/ are used to negate nouns and adjectives.

For example,

• “қауiптi” /qauypty/ (dangerous) - “қауiпсiз” /qauypsiz/ (safe)

• “мұғалiм” /mugalim/ (teacher) - “мұғалiмсiз” /mugalimsiz/

(without teacher)

Another way to negate words in Kazakh is using the word “еш” /yesh/. Is is usually used with nouns and pronouns. For example,

• “жер” /zh’er/(land) - “еш жерде” /yesh zh’erd’e/ (nowhere)

• “адам” /adam/ (person) - “еш адам” /yesh adam/ (no one)

• “қашан” /qashan/ (when) - “ешқашан” /yeshqashan/ (never)

(25)

4. Our work

Our work consists of two parts. The first one describes finding similar sentences in the given texts. And the second one describes a classification of these sentences as contradictory or similar. In this chapter we will describe our approaches in detail.

4.1 Sentence Similarity

We have considered word2vec, fastText and BERT models in our work, used them to create sentence embeddings and compared the quality of these embeddings.

4.1.1 word2vec

Word2vec model has been trained on the articles from Kazakh Wikipedia¹ and the articles from the web page with news in Kazakh using the skip-gram algorithm and with dimension of vectors equal 300. This model has been used to get the embedding for each word in the sentence. To estimate the performance of the model we have tested how well it can predict which words are similar to some given word. The model seems to work well in this task, for example, the 5 most similar words to the word “жақсы” /zhaqsy/ (good) returned by the model are:

• “нашар” (bad) with similarity = 0.646

• “жаксы” (good) the same word, but with the spelling mistake, with similarity = 0.599

• “тәуiр” (good) with similarity = 0.548

• “жаман” (bad) with similarity = 0.509

• “керемет” (great) with similarity = 0.506

The similarity of two words has been computed as a cosine similarity between their embeddings

cos(emb_w₁, emb_w₂) = embw1 ·embw2

||emb_w₁|| · ||emb_w₂||.

Other examples of finding the most similar words to the given word using word2- vec model are shown in the Table 4.1. As we have mentioned before, an embedding does not exist for each word in the word2vec model, therefore, we have decided to set zero vector as default embedding for these words.

Using these word vectors we have created sentence embeddings by computing the average vector of all word vectors

emb_s =

∑

w∈Wsemb_w

|W_s|

1https://kk.wikipedia.org/wiki/

(26)

word the 5 most similar similarity

‘ит‘(dog)

“тазы”(greyhound) 0.679

“Ит”(dog) 0.636

“қасқыр”(wolf) 0.621

“аң”(animal) 0.615

“мысық”(cat) 0.600

“астана”(capital)

“Ассирия”(Assyrian) 0.525

“елорда”(capital) 0.516

“Вавилон”(Babylon) 0.516

“Суяб”(Suyab) 0.506

“Мерв”(Merv) 0.497

“қауiптi”(dangerous)

“кауiптi”(dangerous) 0.605

“қауiпты”(dangerous) 0.554

“зиянды”(harmful) 0.542

“қолайсыз”(unfavorable) 0.527

“қауiп”(danger) 0.504

“жаңбыр”(rain)

“нөсер”(showers) 0.734

“Жаңбыр”(rain) 0.690

“қар”(snow) 0.658

“жауын”(rain) 0.653

“жауыншашын”(rainfall) 0.653

“батыл”(bold)

“жiгерлi”(energetic) 0.615

“байсалды”(seriously) 0.591

“қайсар”(persistent) 0.551

“өжет”(bold) 0.527

“қатал”(harsh) 0.524

Table 4.1: Finding the most similar words to the given word using word2vec model

(27)

where W_s is a sequence of embeddings of all words in sentence s including zero vectors for unknown words . The similarity of sentences has been computed as cosine similarity between their embeddings

sim(s₁, s₂) = emb_s₁ ·emb_s₂

||emb_s₁|| · ||emb_s₂||.

However, this way to compute sentence embeddings can be easily improved.

We know that some words appear in text more frequently than others, which makes them less important, so they should not influence much the resulting sentence embedding. This can be achieved by adding a weight function for words based on the word frequency. Since we have already had a list of word vectors sorted by the word frequency, we have defined word weight as

weight(w) = 1

2 +W −pos(w)

where W is number of words, for which we have computed vectors, pos(w) is a position of the word in the list of word vectors, sorted by frequency. Weight of unknown words has been set to the maximal value 0.5. The sentence embedding with usage of weight function can be computed as

emb_s = ^∑

w∈Ws

weight(w)∗emb_w

Our next approach has been to improve calculation of sentence similarity.

We have considered other metrics of similarity such as the length of sentences, number of words with the same lemmas, similarity of subjects and verbs. To get the word lemmas we have trained UDPipe [Straka and Strakov´a, 2017] model using data from Universal Dependencies² for Kazakh. Combining these metrics and cosine similarity together has given us

sim(s₁, s₂) = cos(s₁, s₂) + 0.1∗(len(s₁, s₂) +word(s₁, s₂) +subject(s₁, s₂) +verb(s₁, s₂)) wherecos(s₁, s2) is a cosine similarity of sentence embeddings,

len(s₁, s₂) = min(length(s₁), length(s₂)) max(length(s₁), length(s₂)),

word(s₁, s₂) = |word_lemmas(s₁)∩word_lemmas(s₂)|

max(length(s₁), length(s₂)) ,

subject(s₁, s₂) =

⎧

⎨

⎩

0 |subjects_lemmas(s1)∩subjects_lemmas(s₂)|= 0 1 |subjects_lemmas(s1)∩subjects_lemmas(s₂)| ̸= 0,

verb(s₁, s₂) =

⎧

⎨

⎩

0 |verbs_lemmas(s₁)∩verbs_lemmas(s₂)|= 0 1 |verbs_lemmas(s₁)∩verbs_lemmas(s₂)| ̸= 0 .

2https://universaldependencies.org/

(28)

4.1.2 FastText

Another way to get the word embeddings is to use fastText. We have used already existing pre-trained word vectors for Kazakh from the github repository of fastText³, trained on Wikipedia and Common Crawl⁴. Again we have tested how close to each other are semantically similar words and compared the results with the results obtained by using word2vec vectors. Several examples of finding the 5 most similar words to the given word are shown in the Table 4.2. Both word2vec and fastText embeddings give good and quite similar results, however, due to usage of n-grams fastText embeddings can also be found for unknown words, whereas word2vec embeddings do not provide such an opportunity and we have to assign a default value to unknown words. Therefore, it allows rare words to be represented appropriately, which makes fastText embeddings more useful for our purposes. As well as in the case of word2vec embeddings we have had a list of words sorted by frequency, so we have done the same improvement again and used the same weight function. The original “.vec” file with embeddings was too big and contained word embeddings for almost 2 million words. Due to available memory capacity limits we have decided to use embeddings only for first 200000 words, assuming that most of the frequent words would occur there. Similarity of two sentences has been computed the same way, as it was done using the word2vec model, by combining several metrics together. Comparing similarity of different sentences has given us a good boundary for classification. The pair of sentences with

sim(s₁, s₂)≥0.8 should be classified as similar, while the pair with

sim(s₁, s₂)<0.8 should be classified as not similar.

4.1.3 BERT

BERT can be used directly for getting sentence embeddings, so we do not have to create them from smaller parts such as word embeddings. However, BERT is a new technology and there is no model for Kazakh language at the moment, so we have used the newest⁶ multilingual model from the BERT github repository⁷. To compute the embeddings we have used the BERT wrapper⁸. Each sentence embedding obtained from the BERT wrapper is a vector in 768-dimensional space.

Similarity of the sentences has been computed as a cosine similarity of these embeddings with the boundary equal 0.85, which means that the pair of sentences with

sim(s₁, s₂)≥0.85

3https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md

4http://commoncrawl.org/the-data/get-started/

5the Republic of Kazakhstan

6November 23rd, 2018 version

7https://github.com/google-research/bert

8https://github.com/ptakopysk/bert

(29)

word word2vec fastText

‘жаңбыр‘(rain)

“нөсер”(showers) “варша”(various)

“Жаңбыр”(rain) “Толтектер”(folders)

“қар”(snow) “нөсерлеп”(pouring)

“жауын”(rain) “нөсер”(showers)

“жауыншашын”(rainfall) “жауғызып”(wound up)

“мысық”(cat)

“қабылан”(leopard) “Мысық”(cat)

“бiрқазан”(pelican) “4мысық”(4 cats)

“қабан”(boar) “мысықтар”(cats)

“жанат”(raccoon) “мысықты”(cat)

“егеуқұйрық”(rat) “мияу”(meow)

“бала”(child)

“баланы”(child) “баланың”(of child)

“сәби”(baby) “Бала”(child)

“нәресте”(baby) “баланы”(child)

“балаларды”(children) “қыз”(girl)

“қыз”(girl) “сәби”(baby)

“Республика”

(republic)

“республика”(republic) “республика”(republic)

“Республиканың” (of

republic) “HeadTtl”

“ҚРның”(of RK⁵) “ҚазақстанРеспубликасы”

(RK)

“ҚР”(RK) “Республиканың”(of

republic)

“Президент”(president) “050060”

“жемiстер”(fruits)

“көкөнiстер”(vegetables) “жемiстерге”(fruits)

“апельсин”(orange) “көкөнiстер”(vegetables)

“соя”(soy) “көкенiстер”(vegetables)

“алхоры”(plum) “жемiс-жидектер”(berries)

“жидектер”(berries) “Жемiстер”(fruits)

“мектеп”(school)

“мектептiң”(of school) “мектептiң”(of school)

“Мектеп”(school) “Мектеп(school)”

“интернат”(boarding

school) “сегiзжылдық”(8 years)

“лицей”(lyceum) “мектептер”(schools)

“балабақша”(kinder-

garden) “Бургдорфта”(Burgdorft) Table 4.2: Finding the most similar words to the given word. Comparison of word2vec and fastText vectors

(30)

should be classified as similar, while the pair with sim(s₁, s₂)<0.85 should be classified as not similar.

4.2 Classification

Let us assume that we have two semantically similar sentences. We need to classify them either as contradiction or as similar. To solve this task we have considered several ways to detect contradictions between two sentences such as checking the words of the same part of speech for the negative suffixes, finding the pairs of the most similar words and classifying them either as antonyms or as synonyms, and using neural networks.

4.2.1 Checking for Negation in Words

Our first approach has been to find the corresponding pairs of morphologically similar words and check if one of them is a negation of another. We assume that words are morphologically similar if they have the same part of speech, same lemma and same tense (for verbs). To obtain this information about words we have used UDPipe model for Kazakh again and parsed lemmas, parts of speech and morphological features from the result in CoNLL-U format. After corresponding words have been found we have checked if one of them contains a negative suffix. For some words this information can be found right from the output of UDPipe model. For example, the model seems to work well on finding the negative suffixes in verbs. However, it can not recognize the negation in adjectives and pronouns, so we have checked them for all possible suffixes by ourselves.

By our definition of contradiction between sentences there must be only one pair of words, where one word is a negation of another. So we have found all morphologically similar pairs of words in sentences, checked them for contradiction, and if more than one contradictory pair has been found, the sentences have been marked as not contradictory. As a result we have a tool which is able to detect numeric contradictions, opposite verbs, adjectives and pronouns created by adding the negative suffixes to the root.

4.2.2 Synonyms and antonyms detection

Our next approach has been quite similar. Again we have been trying to do word alignment in the sentences to find the corresponding pairs of words and to check if words in pair are opposites. This time the corresponding pairs of words have been found by using fastText word embeddings. We have got an embedding for each word in the sentences and calculated the cosine similarity between each pair of words. The pair for the word from the first sentence will be the most similar

(31)

word from the second sentence.

∀w_i ∈words(s₁) : pair_to(w_i) = w_j,

such as cos_sim(w_i, w_j)≤cos_sim(w_i, w_k)

∀wk∈words(s₂)

Due to the fact that word embeddings are similar both for antonyms and synonyms we can assume that the words in each result pair are either synonyms or antonyms. Even if some two words marked as the most similar have nothing in common, we can still assume that they are antonyms and it will not change the result.

We needed some model to classify words either as synonyms or as antonyms.

For this purpose we have created a list of pairs of synonyms, antonyms and neutral words in Kazakh, got embeddings for each pair and used them as a train data for a classifier. Opposite and neutral words have been labeled as 0, synonyms have been labeled as 1. We have trained different classifiers using the scikit-learn Python library. Several ML algorithms have been used, such as logistic regression, SVM and multi-layer perceptron. Given a pair of words with corresponding embeddings, we have used different training features:

• difference of embeddings (of size 300, we will refer to this as “300 features”)

• just the embeddings (of size 600, we will refer to this as “600 features”)

• the embeddings and their difference (of size 900, we will refer to this as “900 features”)

We have trained models with different parameters and features and have chosen the “best” parameters for each algorithm:

L o g i s t i c R e g r e s s i o n (C=1e5 , s o l v e r= ’ l i b l i n e a r ’ , max_iter =4000) LinearSVC ( random_state =123)

NuSVC( d e c i s i o n _ f u n c t i o n _ s h a p e= ’ o v r ’ , gamma= ’ s c a l e ’ , d e g r e e =3 , k e r n e l= ’ r b f ’ , random_state =123 , s h r i n k i n g=True )

M L P C l a s s i f i e r ( h i d d e n _ l a y e r _ s i z e s = ( 5 0 , ) , s o l v e r= ’ adam ’ , a c t i v a t i o n= ’ l o g i s t i c ’ , random_state =123 , max_iter =200 , l e a r n i n g _ r a t e _ i n i t = 0 . 5 )

where LinearSVC is a support vector classification with linear kernel, NuSVC is a support vector classification with a parameter to control the number of support vectors, MLPClassifier is a multi-layer perceptron classification. We have evaluated the accuracy and recall of each model using the cross validation. The results are shown in the Table 4.3. NuSVC model with 600 features has given the best results, therefore, it has been chosen as a default model. The overall algorithm for checking whether two sentences are contradictory is:

1. For each word from the first sentence find the most similar word from the second sentence.

(32)

algorithm features No. features accuracy LogReg

difference of embeddings 300 0.54

only embeddings 600 0.61

difference and embeddings 900 0.61 LinearSVC

difference and embeddings 900 0.63 NuSVC

difference and embeddings 900 0.70 MLP

difference and embeddings 900 0.61

Table 4.3: Classification of the pairs of words as synonyms or antonyms/neutral 2. For each pair of the most similar words check if they are synonyms or

opposites.

3. If exactly one pair is a pair of opposites, there is a contradiction between sentences.

4. Do the same for the second sentence.

4.2.3 Scikit model

The previous model can be used to find almost all of the types of contradictions.

However, sometimes it is not enough to compare single words. For example, the opposite of some word can be also a phrase. For these cases we need more general model, which will be able to detect all the types of contradictions.

For training the new model we have used SNLI [Bowman et al., 2015] and MultiNLI [Williams et al., 2018] corpuses with 1 million pairs of human-written English sentences manually labeled with the labels “entailment”, “contradiction”

and “neutral”, used for the task of recognizing textual entailment. We have parsed from the corpus the pairs labeled as contradiction and entailment. To translate the sentences to Kazakh we have used Transformer NMT[Vaswani et al., 2017]

implemented within Neural Monkey framework⁹ submitted to WMT 2019 shared task¹⁰, and the Google Translate¹¹. Since the results obtained from Google Trans- late were more accurate, we have used them in our work. After that we have created a BERT embedding for each translated sentence. Each training example has been a pair of embeddings of two sentences and the corresponding label, which gives together 1536 features plus 1 label. All data have been shuffled and saved as a “.csv” table. This large dataset has allowed us to use the neural networks. The first model we have created has used simple multi-layer perceptron

9https://ufal.mff.cuni.cz/neuralmonkey

10http://www.statmt.org/wmt19/translation-task.html

11https://translate.google.com/

(33)

from scikit-learn Python library. We have used 20000 training examples as test data for evaluation of the model in the end of training.

We have tried to change number and size of layers, learning rate and regu- larization parameter alpha to find the best parameters for the model. We have tried all reasonable values and have found the two “best” one:

M L P C l a s s i f i e r ( h i d d e n _ l a y e r _ s i z e s =(3000 , 1 0 2 4 ) , s o l v e r= ’ adam ’ , a c t i v a t i o n= ’ r e l u ’ ,

random_state =123 , a l p h a = 0 . 0 0 5 , v e r b o s e=True ) . M L P C l a s s i f i e r ( h i d d e n _ l a y e r _ s i z e s =(3000 , 1 0 2 4 , 5 1 2 , 2 5 6 ) ,

s o l v e r= ’ adam ’ , a c t i v a t i o n= ’ r e l u ’ , random_state

=123 ,

a l p h a = 0 . 0 0 5 , l e a r n i n g _ r a t e _ i n i t = 0 . 0 0 5 , v e r b o s e=True )

We have also used early stopping, which allows to stop training when validation loss does not improve anymore. When early stopping is used a part of training data is kept aside from the training and used as validation data. At each iteration validation loss is computed on the validation data and if it does not improve by some set value the training process stops. Using the best parameters we have trained the model on the data of sizes 20000, 60000, 80000 and 100000 to see how accuracy and recall change depending on the size of the training data. The results are shown in the Figures 4.1 and 4.2. The model with the first parameters has given better results, however in both cases accuracy and recall have been decreasing with increasing size of training data. The best results have been obtained from the model trained on the data of size 60000 with early stopping.

We have decided not to continue training of the model on the bigger data and instead use acquired knowledge and train the model using more powerful Python library tensorflow [Abadi et al., 2016].

4.2.4 Tensorflow model

To train the next model we have used the same BERT embeddings. Just like we did when we have tried to classify words as antonyms or synonyms, we have used different training features. Given a pair of sentences with corresponding embeddings, we have two types of features:

• just the embeddings (of size 1536)

• difference of the embeddings (of size 768)

The models have been built using the Keras¹² API from tensorflow Python library. We have used 20000 examples as a validation dataset for validation at each epoch, and another 20000 examples as test dataset to evaluate the model in the end of training. We have also used early stopping, which in tensorflow allows to detect the epoch, where some monitored quantity has stopped improving, and to stop training there. The quantity which we have monitored has been validation loss, however, in some cases we have monitored validation accuracy to obtain better results. Due to restrictions on computational capacity, we have trained the models with different parameters on smaller parts of data(20000,

12https://keras.io/

(34)

Figure 4.1: Scikit model: change of accuracy and recall of the model with 2 hidden layers with increasing size of training data

Figure 4.2: Scikit model: change of accuracy and recall of the model with 4 hidden layers with increasing size of training data

(35)

No. features data size dense layers sizes dropout l. rate No. epochs

768

100000 512,256 0.10 0.0001 3

160000 256 0.20 0.0010 3

200000 256,128 0.20 0.0010 3

300000 1024,512,256 0.10 0.0005 3

400000 1024,512,256 0.10 0.0005 3

500000 1024,512,256 0.10 0.0005 3

613000 1024,512,256,128 0.05 0.0001 4

1536

100000 512,256 0.10 0.0050 4

160000 2048,1024,512,256 0.10 0.0010 3

200000 3000,2048,1024 0.10 0.0010 3

300000 2048,1024,512 0.05 0.0005 5

400000 3000,2048,1024 0.20 0.0010 7

500000 3000,2048,1024 0.20 0.0001 4

613000 3000,2048,1024 0.20 0.0001 6

Table 4.4: Tensorflow model: change of the best parameters with increasing size of training data

60000, 80000 training examples) to find the “best” parameters. This parameters have been used later to train the models on the bigger data.

The parameters that we have changed were number and size of layers, rate of dropout and learning rate. Number of epochs has been changed too and has been influenced by early stopping. We have noticed that using the dropout layers in the network increases accuracy and recall, and that using smaller learning rate gives us better results than with the modeltrained on the large training set. In all models 1-4 dense layers have been used depending on the size of training data.

Change of the best parameters with increasing size of training data are shown in the Table 4.4. We have evaluated the models trained on the data of different size. Change of the best accuracy, recall and F₂ with increasing train data size are shown in Table 4.5.

We assume that better results for the models trained on large parts of data can be obtained by trying different parameters. In almost all cases changing the learning rate has influenced the results the most. The first part of data has been from MultiNLI dataset and starting from the size 300000 the data from SNLI dataset has been added. The authors of the datasets have claimed that the MultiNLI dataset “includes a more diverse range of text”¹³ and the sentences are longer than in the SNLI dataset. This explains the decrease in accuracy and recall of the models trained on the data of size 300000 comparing to the models trained on the data of size 200000. However the authors of the datasets advise to use them together as one big dataset for training.

13https://nlp.stanford.edu/projects/snli/

Yuliya Yamalutdinova Detection of contradictions in pairs of texts in Kazakh

BACHELOR THESIS