
DSMs learn contextual patterns from huge amounts of textual data. They typically represent the meaning as a vector which reflects the contextual (distributional) information across the texts [Turney and Pantel, 2010]. The words w ∈ W are associated with a vector of real numbers w ∈ R^k. Represented geometrically, the meaning is a point in a k-dimensional space. Words that are closely related in meaning tend to be closer in this space. This architecture is sometimes referred to as the Semantic Space. The vector representation allows us to measure similarity between meanings, most often by the cosine of the angle between the corresponding vectors. Approaches that extract such vectors are often called Word Embedding methods.
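As an illustration of the semantic-space idea, the following minimal sketch computes the cosine similarity between two word vectors; the vocabulary and the three-dimensional vectors are made-up toy values for illustration only, not taken from any model discussed in this thesis.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings" (hypothetical values for illustration only).
embeddings = {
    "dog": np.array([0.8, 0.3, 0.1]),
    "cat": np.array([0.7, 0.4, 0.2]),
    "hockey": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(embeddings["dog"], embeddings["cat"]))     # high similarity
print(cosine_similarity(embeddings["dog"], embeddings["hockey"]))  # lower similarity
```

Words whose vectors point in similar directions receive a cosine close to 1, which is why the cosine is the most frequently used similarity measure over semantic spaces.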

In recent years, the extraction of meaning from text has become a fundamental research area in NLP. Word-based semantic spaces provide impressive performance in a variety of NLP tasks, such as language modeling [Brychcín and Konopík, 2015], NER [Konkol et al., 2015a], sentiment analysis [Hercig et al., 2016a], and many others (see Section 3.8).

In this thesis we focus on Czech, which belongs to the West Slavic language family, and on Croatian, which belongs to the South Slavic family. Both languages have seven cases and three genders.

Many properties of both languages are very similar because of historical similarities and mutual interaction. Both languages have a relatively free word order (from the purely syntactic point of view): the words in a sentence can usually be ordered in several ways, each carrying a slightly different meaning.

These properties of Czech and Croatian complicate distributional semantic modeling. The high number of word forms and the larger number of possible word sequences lead to a higher number of n-grams.

Free word order, in our opinion, also complicates the straightforward application of the Distributional Hypothesis.

2.2.1 Context Types

Different types of context induce different kinds of semantic space models.

[Riordan and Jones, 2011] and [McNamara, 2011] distinguish context-word and context-region approaches to meaning extraction. In this thesis we use the notions local context and global context, respectively, because we believe they describe the principle of the meaning extraction better.


Global context

The models that use the global context are usually based upon the bag-of-words hypothesis, assuming that words are semantically similar if they occur in similar documents and that the word order carries no meaning. A document can be a sentence, a paragraph, or an entire text. These models are able to capture long-range dependencies among words. For example, if a document is about hockey, it is likely to contain words like hockey stick or skates, and these words are found to be related in meaning.
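As a minimal sketch of the bag-of-words view, the snippet below builds a simple term-document count matrix with scikit-learn; the toy documents and the use of CountVectorizer are illustrative assumptions, not part of the models evaluated in this thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy "documents"; in a global-context model each row represents one document.
docs = [
    "the hockey player lost his hockey stick",
    "new skates and a stick for the hockey season",
    "the dog is an animal",
]

vectorizer = CountVectorizer()
term_document = vectorizer.fit_transform(docs)   # shape: (n_documents, n_terms)

print(vectorizer.get_feature_names_out())
print(term_document.toarray())
```

Words that tend to appear in the same rows (documents), such as hockey and skates, end up with similar column profiles, which is exactly the signal the global-context models exploit.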

Local context

The local context models are those that collect short contexts around a word using a moving window to model its semantics. These methods do not require text that is naturally divided into documents or pieces of text. Thanks to the short context, these models can take the word order into account, and thus they usually model semantic as well as syntactic relations among words.

In contrast to the global semantics models, these models are able to find mutually substitutable words in a given context. Given the sentence The dog is an animal, the word dog can, for example, be replaced by cat.
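The following sketch illustrates how a local-context model collects its training signal: it slides a symmetric window over a tokenized sentence and records (target word, context word) pairs. The window size and the whitespace tokenization are illustrative choices, not a description of any particular model used later in this thesis.

```python
def window_contexts(tokens, window=2):
    """Yield (target, context) pairs from a symmetric moving window."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

sentence = "the dog is an animal".split()
for target, context in window_contexts(sentence, window=2):
    print(target, "->", context)
```

Words that share many such context words across a corpus (dog and cat, for instance) end up being mutually substitutable in the learned space.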

2.2.2 Model Architectures

There are several architectures that have been successfully used to extract meaning from raw text. In our opinion, the following four architectures are the most important for our work (see other architectures in [Svoboda, 2016]):

Co-occurrence Matrix

The frequencies of co-occurring words (often transformed by some weighting function, e.g. term frequency – inverse document frequency (TF-IDF) [Ramos et al., 2003], mutual information [E. Shannon, 1948], etc.) are recorded into a matrix. The dimension of such a matrix is often very large, and thus singular value decomposition (SVD) or another algorithm can be used for dimensionality reduction.
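For concreteness, two weighting functions commonly applied to the raw counts can be written as follows; the exact variants (logarithm base, smoothing, positive-only PMI) differ between implementations, so this is only one standard formulation:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N_d}{\mathrm{df}(t)},
\qquad
\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\, P(w_j)},
```

where tf(t, d) is the frequency of term t in document d, N_d is the number of documents, df(t) is the number of documents containing t, and the probabilities are estimated from corpus counts.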


Formally, the co-occurrence matrix of a textual corpus is a square matrix over the unique words, with dimensions N × N. A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context. The context can be either a natural unit such as a sentence or a window of m words (where m is an application-dependent parameter). The upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.
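A minimal sketch of building such a symmetric co-occurrence matrix from a tokenized corpus with a fixed window follows; the two-sentence corpus and the window size are toy choices for illustration only.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Build a symmetric N x N co-occurrence count matrix."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    M[index[target], index[tokens[j]]] += 1
    return vocab, M

corpus = [
    "the dog is an animal".split(),
    "the cat is an animal".split(),
]
vocab, M = cooccurrence_matrix(corpus, window=2)
print(vocab)
print(M)  # symmetric: M[i, j] == M[j, i]
```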

A representative of this architecture is the GloVe (Global Vectors) model [Pennington et al., 2014], which focuses more on the global statistics of the training data. This approach is based on a log-bilinear regression model that effectively captures global statistics and also captures word analogies. The authors propose a weighted least squares regression model that trains on global word-word co-occurrence counts. The main concept of this model is the observation that ratios of word-word co-occurrence probabilities have the potential for encoding the meaning of words.
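For reference, the weighted least squares objective of GloVe can, up to notation, be written as in [Pennington et al., 2014]:

```latex
J = \sum_{i,j=1}^{N} f(X_{ij})
    \left( \mathbf{w}_i^{\top} \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
```

where X_ij is the co-occurrence count of words i and j, w_i and w̃_j are the word and context vectors, b_i and b̃_j are bias terms, and f is a weighting function that down-weights very rare and very frequent co-occurrences.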

Topic Model

The group of methods based upon the bag-of-words hypothesis that try to discover latent (hidden) topics in a text are called topic models. They usually represent the meaning of a text as a vector of topics, but it is also possible to use them for representing the meaning of a word. The number of topics is usually set in advance.
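As an illustrative sketch only (this thesis does not prescribe a particular toolkit), a topic model such as LDA can be trained on a bag-of-words corpus with the gensim library roughly as follows; the toy documents and the number of topics are assumptions made purely for the example.

```python
from gensim import corpora, models

# Toy tokenized documents.
texts = [
    ["hockey", "stick", "skates", "ice"],
    ["dog", "cat", "animal", "pet"],
    ["ice", "hockey", "player", "goal"],
]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Train an LDA model with a fixed number of topics (set in advance).
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

# Each document is represented as a probability distribution over topics.
print(lda.get_document_topics(bow_corpus[0]))
```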

It is assumed that documents may vary in domain, topic, and style, which means that they also differ in the probability distribution of n-grams. This assumption is used for adapting language models to the long context (domain, topic, and style of particular documents). LSA (or similar methods) [Choi et al., 2001] aim to partition a document into blocks such that each segment is coherent and consecutive segments are about different topics. This long-context information is added to standard n-gram models to improve their performance. A very effective group of models (sometimes called topic-based language models) works with this idea for the benefit of language modeling.

In [Bellegarda, 2000] a significant reduction in perplexity (a measure of how well a probability model predicts a sample; down to 33%) and in word error rate (WER; down to 16%) on the WSJ corpus was shown. Many other authors have obtained good results with PLSA [Gildea and Hofmann, 1999, Wang et al., 2003] and LDA [Tam and Schultz, 2005, 2006] approaches.
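For completeness, the perplexity mentioned above can, in one standard formulation, be written for a test sequence w_1, ..., w_N as

```latex
\mathrm{PP}(w_1, \dots, w_N)
  = P(w_1, \dots, w_N)^{-\frac{1}{N}}
  = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}},
```

so a lower perplexity means the language model assigns a higher probability to the observed text.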

Neural Network

In recent years, these models have become very popular. It is the human brain that defines semantics, so it is natural to use a neural network for the meaning extraction. The principles of the meaning extraction differ with the architecture of the neural network. Much work on improving the learning of word representations with neural networks has been done, from feed-forward networks [Bengio et al., 2003] to hierarchical models [Morin and Bengio, 2005, Mnih and Hinton, 2009] and, more recently, recurrent neural networks [Mikolov et al., 2010].
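As a hedged sketch (the particular library, corpus, and hyper-parameters are illustrative assumptions, not the experimental setup of this thesis), neural word embeddings of the word2vec family can be trained with the gensim library roughly as follows:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real model needs a much larger corpus.
sentences = [
    "the dog is an animal".split(),
    "the cat is an animal".split(),
    "the hockey player has a hockey stick and skates".split(),
]

# Skip-gram model (sg=1) with a small vector size for illustration.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["dog"]                 # the learned embedding of "dog"
print(model.wv.similarity("dog", "cat")) # cosine similarity between two words
```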

In [Mikolov et al., 2013a,c] Mikolov examined existing word embeddings and showed that these representations already capture meaningful syntactic and semantic regularities, such as the singular/plural relation between vectors, e.g. orange − oranges ≈ plane − planes. Read more in Section 3.8.
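Written as vector offsets, this regularity means that the difference between the plural and singular forms is approximately constant across word pairs, and the same offset arithmetic recovers semantic analogies, as in the well-known king/queen example from [Mikolov et al., 2013a]:

```latex
\mathrm{vec}(\text{oranges}) - \mathrm{vec}(\text{orange})
  \approx \mathrm{vec}(\text{planes}) - \mathrm{vec}(\text{plane}),
\qquad
\mathrm{vec}(\text{queen})
  \approx \mathrm{vec}(\text{king}) - \mathrm{vec}(\text{man}) + \mathrm{vec}(\text{woman}).
```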