Recurrent Neural Networks (RNN) are popular in NLP due to their capability for processing sequences of arbitrary length. The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other, but for many tasks, especially in NLP, that is not ideal: if we want to predict the next word in a sentence, we had better know which words came before it. An RNN operates by presenting each element of the sequence to its input nodes in turn. It is called recurrent because the values computed for each element are carried over to the computation for the next element. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps, because learning long-term dependencies by stochastic gradient descent has been shown to be difficult [Bengio et al., 1994].

For language modeling [Mikolov et al., 2010], a so-called simple recurrent neural network (see figure 3.4), also known as an Elman network [Elman, 1990], is used.

3.6.1 RNNs with Long Short-Term Memory

Long Short-Term Memory (LSTM) units [Hochreiter and Schmidhuber, 1997] have re-emerged as a popular architecture due to their representational power and effectiveness at capturing long-term dependencies. LSTMs do not have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state. There are many LSTM architectures; an evaluation of several of them has been done in [Jozefowicz et al., 2015].

The memory units in an LSTM are called cells. A cell takes as input the previous state hs_{t−1} and the current input x_t. Internally the cell decides what to keep in (and what to erase from) memory, and it then combines the previous state, the current memory, and the input.
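
One common way to write these per-cell operations down is the standard LSTM formulation sketched below (biases omitted; the gate symbols f_t, i_t, o_t, the cell state c_t and the weight matrices W_*, U_* are notation introduced here for illustration, not taken from this thesis):

\begin{align*}
  f_t &= \sigma(W_f x_t + U_f\, hs_{t-1}) && \text{forget gate: what to erase from the cell state} \\
  i_t &= \sigma(W_i x_t + U_i\, hs_{t-1}) && \text{input gate: which values to write} \\
  \tilde{c}_t &= \tanh(W_c x_t + U_c\, hs_{t-1}) && \text{candidate memory content} \\
  c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{combine old memory with new information} \\
  o_t &= \sigma(W_o x_t + U_o\, hs_{t-1}) && \text{output gate} \\
  hs_t &= o_t \odot \tanh(c_t) && \text{new hidden state}
\end{align*}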

In a traditional recurrent neural network, during the backward pass of back-propagation the gradient signal can end up being multiplied a large number of times (as many as the number of time steps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that the magnitude of the weights in the transition matrix can have a strong impact on the learning process.

When the weights in this matrix are small (more formally, if the leading eigenvalue of the weight matrix is smaller than 1), it can lead to a situation called vanishing gradients [Bengio et al., 1994], where the gradient signal gets so small that learning either becomes very slow or stops working altogether. It also makes learning long-term dependencies in the data more difficult.

Conversely, if the weights in this matrix are large (or, again, more formally, if the leading eigenvalue of the weight matrix is larger than 1), it can lead to a situation where the gradient signal is so large that it can cause learning to diverge. This is often referred to as exploding gradients.
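
The effect can be illustrated numerically with the following self-contained NumPy sketch (not taken from the thesis): a vector is multiplied repeatedly by a recurrent weight matrix rescaled to a chosen leading eigenvalue, mimicking what happens to the gradient signal during back-propagation through time.

import numpy as np

# Repeated multiplication by the recurrent weight matrix W shrinks or blows up
# the signal depending on W's leading eigenvalue (spectral radius).
rng = np.random.default_rng(0)
n, timesteps = 16, 50
W = rng.normal(size=(n, n))
W /= np.max(np.abs(np.linalg.eigvals(W)))  # rescale so the leading eigenvalue is 1

for rho in (0.9, 1.0, 1.1):                # leading eigenvalue < 1, = 1, > 1
    signal = np.ones(n)
    for _ in range(timesteps):
        signal = (rho * W).T @ signal      # one back-propagation step through the recurrence
    print(f"leading eigenvalue {rho}: norm after {timesteps} steps = {np.linalg.norm(signal):.2e}")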

These issues are the main motivation behind the LSTM model, which introduces a new structure called a memory cell (fig. 3.5) that works as described above: it decides what to keep in (and what to erase from) memory and combines the previous state hs_{t−1}, the current memory, and the input x_t.

Figure 3.4: The picture shows an RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out copies of the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-stage neural network, one stage for each word. In the picture we see:

• x_t is the input at time step t. For example, for language modeling, x_1 could be a vector corresponding to the second word of a sentence (indexing starts at t = 0).

• hs_t is the hidden state at time step t. It is the network's “memory” (it captures information about what happened in all the previous time steps) and it is calculated based on the previous hidden state and the input at the current step: hs_t = f(U x_t + W hs_{t−1}), where f is usually a well-known nonlinearity such as tanh (see the sketch after this list). hs_{−1}, which is required to calculate the first hidden state, is typically initialized to all zeroes.

• y_t is the output at step t. For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities across our vocabulary, y_t = softmax(V hs_t). The output is calculated based on the memory at time t, but in practice it is more complicated, because hs_t cannot capture information from too many time steps back (explained in Section 3.6.1). Softmax regression is a probabilistic method similar to logistic regression; we use the softmax function to map inputs to predictions (which can be multinomial).

• U, V and W are parameters of the RNN that are shared across the whole network; they are not different at each layer, as they are, for example, in a feed-forward neural network.
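
The equations from the list above translate directly into a short forward pass. The following NumPy sketch is purely illustrative (the toy dimensions, random weights and one-hot word vectors are assumptions made for the example, not part of the thesis):

import numpy as np

def softmax(z):
    z = z - np.max(z)                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Unrolled forward pass of the simple (Elman) RNN from figure 3.4:
# hs_t = f(U x_t + W hs_{t-1}),  y_t = softmax(V hs_t)
rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10, 6             # toy sizes for illustration
U = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))   # input  -> hidden
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # hidden -> hidden
V = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))   # hidden -> output

sentence = [3, 1, 4, 1, 5]                 # word indices of a 5-word sentence
hs = np.zeros(hidden_dim)                  # hs_{-1}, initialized to all zeroes
for t, word in enumerate(sentence):
    x_t = np.zeros(vocab_size)
    x_t[word] = 1.0                        # one-hot encoding of the current word
    hs = np.tanh(U @ x_t + W @ hs)         # hs_t, the network's "memory"
    y_t = softmax(V @ hs)                  # probabilities over the vocabulary
    print(t, y_t.argmax())                 # index of the most probable next word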

Figure 3.5: LSTM memory cell. Green boxes represent learned neural network layers, while circles inside a cell represent pointwise operations.

The forget gate is one of the most important features of the LSTM network [Greff et al., 2015]. It decides what information we are going to throw away from the cell state. The input gate layer decides which values we will update (which information we keep). These types of units have turned out to be very efficient at capturing long-term dependencies.
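
As a rough illustration of how these gates interact, here is a minimal NumPy sketch of a single LSTM step. The weight names Wf, Wi, Wc, Wo and the concatenated-input formulation are illustrative choices (biases omitted); details differ between the variants evaluated in [Jozefowicz et al., 2015].

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, hs_prev, c_prev, params):
    """One LSTM cell step: gates decide what to erase, what to write and what to expose."""
    Wf, Wi, Wc, Wo = params["Wf"], params["Wi"], params["Wc"], params["Wo"]
    z = np.concatenate([x_t, hs_prev])       # gates look at the input and the previous state
    f = sigmoid(Wf @ z)                      # forget gate: what to throw away from the cell state
    i = sigmoid(Wi @ z)                      # input gate: which values to update
    c_tilde = np.tanh(Wc @ z)                # candidate memory content
    c = f * c_prev + i * c_tilde             # combine old memory with new information
    o = sigmoid(Wo @ z)                      # output gate
    hs = o * np.tanh(c)                      # new hidden state
    return hs, c

# Toy usage with random weights.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
params = {k: rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
          for k in ("Wf", "Wi", "Wc", "Wo")}
hs, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 input vectors
    hs, c = lstm_step(x_t, hs, c, params)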

The mathematical background of LSTMs and further information is presented in our technical report [Svoboda, 2016].