
University of Economics in Prague

Faculty of Finance and Accounting

Department of Monetary Theory and Policy

Study program: Financial Engineering

Master Thesis

Reinforcement learning in trading

Author of Master Thesis: Bc. Dmytro Ilienko

Supervisor of Master Thesis: Ing. Milan Fičura, Ph.D.

Year of Defense: 2021


Čestné prohlášení

Prohlašuji, že jsem diplomovou práci na téma „Reinforcement Learning in Trading“ vypracoval samostatně a veškerou použitou literaturu a další prameny jsem řádně označil a uvedl v přiloženém seznamu.

V Praze dne 30.04.2021 ...

Dmytro Ilienko

Declaration of honor

I hereby declare that I prepared this master thesis on the topic "Reinforcement Learning in Trading" independently and that all the literature and other sources used are identified and listed in the attachment.

In Prague on 30.04.2021 ...

Dmytro Ilienko


Abstrakt

V této diplomové práci zkoumáme aplikaci zpětnovazebního učení při obchodování s kryptoměnami. Tato metoda je vybrána, protože její rámec dokonale zapadá do daného problému. Místo predikce cen se model snaží předpovědět nejlepší akci mezi možnostmi nákupu, držení a prodeje, která by měla vést k nejvyššímu očekávanému výnosu. Hlavním cílem této práce je zjistit, zda model zpětnovazebního učení může překonat jednoduchou strategii „Kup a drž“ na rostoucím trhu. Práce je zaměřena na praktickou aplikaci prezentovaných metod, a proto fungující implementace testovaných modelů bude podrobně popsána.

Klíčová slova

Strojové učení, zpětnovazební učení, neuronové sítě, finanční obchodování, kryptoměny, bitcoin

Abstract

In this master thesis we examine the application of reinforcement learning to the cryptocurrency trading task. This method is chosen because its framework fits the problem at hand perfectly. Instead of predicting prices, the model tries to predict the best action among the buy, hold and sell options, which should lead to the highest expected return. The primary goal of this work is to investigate whether, on a rising market, a reinforcement learning model can outperform the simple "Buy & Hold" strategy. The work is oriented towards the practical application of the presented methods, and a working implementation of the tested models is therefore described in detail.

Keywords

Machine learning, reinforcement learning, neural networks, financial trading, cryptocurrency, bitcoin


Acronyms

A2C Advantage Actor Critic
AI Artificial Intelligence
BTC Bitcoin
DL Deep Learning
EMH Efficient Market Hypothesis
GPU Graphics Processing Unit
FFNN Feedforward Neural Network
LSTM Long Short-Term Memory
MDP Markov Decision Process
ML Machine Learning
MP Markov Process
NN Neural Network
OHLC Open, High, Low, Close trading price
ReLU Rectified Linear Unit
RL Reinforcement Learning
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
TD Temporal Difference


Table of Contents

INTRODUCTION... 1

1 FINANCIAL MARKETS ... 3

1.1 EFFICIENT MARKET HYPOTHESIS ... 3

1.2 CRYPTOCURRENCY MARKET ... 3

2 AI AND ML IN FINANCE ... 5

2.1 ARTIFICIAL INTELLIGENCE ... 5

2.2 MACHINE LEARNING ... 6

2.3 TYPES OF MACHINE LEARNING ... 7

2.4 DEEP LEARNING ... 7

2.5 APPLICATIONS IN FINANCE ... 8

3 NEURAL NETWORKS ... 9

3.1 ARCHITECTURE ... 9

3.2 ACTIVATION FUNCTION ... 10

3.3 LOSS FUNCTION ... 12

3.4 GRADIENT DESCENT ... 14

3.5 BATCH, MINI BATCH & STOCHASTIC GRADIENT DESCENT ... 15

3.6 BACKPROPAGATION... 16

3.7 REGULARIZATION ... 20

3.8 BATCH NORMALIZATION ... 20

3.9 RECURRENT NEURAL NETWORKS ... 21

4 REINFORCEMENT LEARNING ... 26

4.1 MARKOV DECISION PROCESSES ... 26

4.1.1 Markov Process ... 27

4.1.2 Markov Reward Process ... 27

4.1.3 Markov Decision Process... 28

4.2 POLICY AND VALUE FUNCTIONS ... 29

4.2.1 Overview ... 29

4.2.2 Optimal Policy ... 30

4.2.3 Temporal-Difference Learning ... 32

4.2.4 Exploration vs Exploitation Problem ... 33

4.2.5 Off-policy vs On-policy methods... 33

4.3 ACTOR CRITIC METHODS ... 34

4.3.1 General Setting ... 34

4.3.2 Training Flow ... 35

4.3.3 Entropy bonus ... 36

5 IMPLEMENTATION AND PRACTICAL APPLICATION ... 38

5.1 DATA ... 38

5.2 MODEL COMPONENTS ... 40

5.3 TRAINING ... 41

5.4 RESULTS ... 46

CONCLUSION ... 50

LIST OF FIGURES ... 52

BIBLIOGRAPHY ... 53


Introduction

There are various instruments traded on financial markets. For every instrument there is a price that changes over time. Trading is the activity of buying or selling financial instruments, and it may serve different goals, such as making a profit, gaining protection from future price movements (hedging), or simply acquiring the needed product.

Since the first financial markets were established, people have been trying to predict future price movements to achieve profits. Nowadays, there is a whole industry including financial consultants, investment funds, banks, and individual investors exploring this problem from different angles, trying to predict market movements and find the best moments to trade in order to maximize profit. All these entities employ different techniques to predict prices, among which are technical and fundamental analysis, quantitative models and even market psychology. This problem is known to be extremely complex, because prices are influenced by different macroeconomic factors, which can include government decisions, general economic conditions, commodity price indices, interest rates or investor expectations reflecting market psychology and investor sentiment. As such, prices are considered to be highly dynamic, non-linear and chaotic (Abu-Mostafa & Atiya, 1996).

The question is whether it is possible in principle to predict future price movements and, if yes, what methods are best suited for the problem of financial trading. There are many theories investigating the first problem. Among the most popular are the efficient market hypothesis and the random walk theory, which suggest that it is impossible to predict price movements with an accuracy better than a random guess. On the other hand, there are a number of studies (Malkiel, 2003; Bollen, Mao, & Zeng, 2011) that disprove both the EMH and the random walk theory. These studies show that financial markets can be predicted to some extent.

This work does not aim to investigate this problem from a theoretical point of view but rather from a practical one, using the field of machine learning and specifically its subfield called reinforcement learning. This approach is chosen because its core idea relates tightly to the problem of financial trading. In general, the reinforcement learning entity called an agent tries to learn for itself a successful strategy that leads to the greatest long-term benefit, which clearly corresponds to the financial trading setting. Thus, the primary goal of this work is to test whether reinforcement learning might be applicable to the trading problem and to measure its success against a classic long strategy. The proposed method will be employed on the cryptocurrency market, specifically on bitcoin, which was chosen primarily because of its high volatility and high liquidity. During the evaluation period, from September 2019 to April 2021, the BTC price rose significantly; therefore, it might be a real challenge to outperform the long strategy.


The first chapter presents the fundamental characteristics of the cryptocurrency market along with the already mentioned efficient market hypothesis. The second chapter describes the notions of artificial intelligence, machine learning and deep learning and their applications within finance. This chapter serves as an introduction for further chapters, which extend the discussed concepts. The third chapter gives an overview of neural networks and how they are trained. We will use a neural network for the reinforcement learning task; therefore, it is important to understand how it works. In the fourth chapter, we introduce the reinforcement learning formalism and define the conventional notation. Additionally, the actor-critic method, one of the most popular methods in the reinforcement learning field, which will be applied later, is presented here. The fifth chapter presents the performed experiments and provides an evaluation of the RL models compared to the long strategy called "Buy & Hold".

All the data used in this work are downloaded from the Internet from publicly available sources using Python language tools. Python libraries PyTorch, scikit-learn, matplotlib, pandas and others will be used to perform calculations, design a model and generate the outputs.


1 Financial Markets

Financial markets refer to any marketplace where the trading of securities and financial instruments occurs, including stock and bond markets, Forex, derivatives and cryptocurrency markets. Financial markets may include assets or securities that are either listed on regulated exchanges or traded over the counter. These markets allow transferring funds from surplus to deficit entities by entering a contract between the counterparties (Veselá, 2007).

1.1 Efficient Market Hypothesis

When talking about financial markets, it is important to mention a hypothesis about their efficiency in terms of asset pricing. The efficient market hypothesis, developed by economist Eugene Fama, states that share prices reflect all existing information; thus, it is not possible to achieve an excess return consistently. According to this theory, stocks and other financial instruments always trade at their fair values, making it impossible to generate positive alpha by purchasing undervalued assets or selling assets at inflated prices. Therefore, nobody should be able to outperform the overall market, and the only way to obtain higher returns is by investing in assets with higher risk (Fama, 1965).

However, this hypothesis does not apply to every single market, because each market is different. That’s why it is possible to assume that some markets are less efficient than others.

An inefficient market can be defined as one where asset prices do not accurately reflect their fair values, which may be the result of different reasons. Market inefficiencies may exist due to information asymmetries, a lack of buyers and sellers (i.e. low liquidity), high transaction costs or delays, market psychology, and human emotion, among other reasons (Downey, 2021). In reality, it seems that most markets have some level of inefficiency.

Usually, markets are divided into three categories according to their degree of efficiency:

- Weak efficiency – it states that all historical information is already incorporated into asset prices; therefore, it is impossible to use technical analysis to achieve superior gains.

- Semi-strong efficiency – this type of efficiency claims that all publicly available information is reflected in today’s asset prices. Neither fundamental nor technical analysis can be used to predict and beat the market.

- Strong efficiency – all public and private information is incorporated into the asset's current price; hence it is impossible to generate positive alpha by any means, because prices on such markets perform a random walk.

1.2 Cryptocurrency Market


Cryptocurrency or crypto is a digital currency that can be exchanged online for goods and services. Cryptocurrencies work in a fully decentralized peer-to-peer system called blockchain. This technology is spread across many computers that manage and record transactions. Many large companies have issued their own cryptocurrencies, often called tokens, and these can be traded specifically for the good or service that the company provides (Antonopoulos, 2017). Currently, cryptocurrencies are considered an investment asset rather than a form of payment. Some cryptocurrencies may even generate a cash flow similar to stock dividends (the result of staking). However, when investing in crypto, investors mainly seek capital gains, because this kind of asset is extremely volatile. That is the key reason why crypto is so appealing: investors can double or even triple their investment in a relatively short time period. However, this abnormal return comes with extensive risk. Additionally, compared to other financial instruments such as stocks or bonds, investing in crypto is much simpler, as there are no initial capital requirements (it is possible to start with any amount); thus anyone can open an account on a crypto exchange without any broker.

The market itself is highly segmented with hundreds of exchanges around the world. An exchange can be a market maker that takes the bid–ask spread as a commission, or a matching platform that simply charges fees as a percentage of the transaction amount. The biggest crypto exchanges based on volume are: Binance, Coinbase, Kraken and Bitfinex. Some exchanges do not trade cryptocurrency directly, but rather the derivatives on crypto, which allows creating highly leveraged positions. In this case crypto cannot be withdrawn, because technically the investor does not own a crypto but a contract. BitMEX and Bybit along with Binance are the most popular crypto derivative exchanges.

It is not obvious how to frame the cryptocurrency market within the efficient market hypothesis. Someone could argue that because it is a totally free market, almost without any restrictions, it should be highly efficient in terms of the EMH. Looking from another perspective, enormous volatility and massive price changes in short time periods rather point to inefficiencies still present in this market. The lack of any valuation model that price-setting entities converge on only reinforces this point of view. Hence, the application of AI techniques could be appropriate to this market.


2 AI and ML in Finance

In the past few years, artificial intelligence has been a subject of intense discussion in the media and among the broad public. Machine learning, deep learning, and AI come up in countless articles, sometimes even outside of technology-minded publications; many of them relate to the financial area and especially to the topic of investments. A frequent question is: "Is it possible to use AI technologies to make a profit in financial markets?". Before answering this question, it is important to understand the concepts of artificial intelligence, machine learning and deep learning and how they relate.

2.1 Artificial Intelligence

Artificial intelligence is the science and engineering of making computers behave in ways that, until recently, it was thought required human intelligence. Another definition of the field would be as follows: the effort to automate intellectual tasks normally performed by humans (Chollet, 2017). Those are very concise definitions, but they show how broad and vague the field is. As such, AI is a general field that encompasses machine learning, deep learning, and many more approaches that don’t involve any learning.

Figure 2.1 Artificial Intelligence, Machine Learning, Deep Learning. Source: Chollet, 2017

The idea of artificial intelligence was born in the 1950s. John McCarthy coined the term Artificial Intelligence, which he proposed at the famous Dartmouth conference in 1956. He was one of the first to explore the ways in which machines can learn and reason like humans (Bansal, 2020). For a long time, many experts believed that human-level artificial intelligence could be achieved by handcrafting a sufficiently large set of explicit rules which incorporate human knowledge. This approach is known as symbolic AI, and it was mainstream in the AI field until the late 1980s. Although symbolic AI was successful at solving well-defined, structured problems, such as playing chess, it turned out to be useless for solving more complex unstructured problems, such as weather prediction, image recognition and natural language processing. Then a new, more advanced approach called machine learning arose (Chollet, 2017).


2.2 Machine Learning

Machine learning started to gain popularity in the 1990s, and it quickly became the most successful subfield of AI, driven by the rapid development of powerful computer technologies, the availability of huge amounts of data and advances in complex algorithms (Wladawsky-Berger, 2018).

It seems that machine learning arose from the question: rather than being driven by handcrafted strict rules, could a computer learn on its own, just by looking at data, what the rules are and how to apply them in order to solve a specific problem? In symbolic AI, the program is provided with rules, which are applied to input data, and then answers come out as a result. In contrast, a machine learning algorithm is provided only with inputs and the expected outputs (answers), and the algorithm is expected to derive the rules or the mapping between input and output, which can then be applied to new data to produce correct answers.

Figure 2.2 Comparison of ML vs classical programming. Source: Chollet, 2017

This process is called training, because the algorithm tries to learn the relationship between inputs and outputs. As it is fed with more samples, at each step it identifies a stronger statistical structure that eventually allows it to solve the task without any given explicit rules.

Many machine learning algorithms were invented by statisticians; as such, this field is tightly related to mathematical statistics, but it differs from statistics in several areas. Unlike statistics, machine learning assumes the availability of more than enough processing power, but more importantly the availability of large datasets (Linoff & Berry, 2011). While in statistics we usually know in advance what dependence we want to analyze and confirm in the data, in the case of machine learning methods we try to find dependencies in the data without having to formulate their analytical form in advance.

Machine learning is prediction-oriented: it uses large datasets to give the best answers possible. Sometimes, to achieve this goal, it uses rules which may not make much sense, and interpretability is often partially or fully sacrificed for better predictive power. Machine learning exhibits comparatively less mathematical theory and is more engineering oriented, i.e. empirical ideas prevail over the theory.

2.3 Types of Machine Learning

There are 3 main types of machine learning:

- supervised learning
- unsupervised learning
- reinforcement learning

In supervised machine learning the algorithm is required to create a mapping between inputs and known targets, i.e. it is known in advance what the output values for the samples provided should be. The process of learning described in the previous section is an exact definition of supervised learning. Supervised learning is typically used for classification and regression tasks (Geron, 2017).

Unsupervised learning, on the other hand, does not have ground truth provided, hence the name "unsupervised". It only gets inputs, and the goal is to find a natural structure present within the data. The most well-known categories of unsupervised learning are clustering and dimensionality reduction. These types of algorithms are widely used for data analysis and visualization (Geron, 2017).

Finally, reinforcement learning is a very different approach to learning. It lies in between the previous types of ML in the sense that there are no predefined labels as in supervised learning, yet the algorithm is not completely blind as in the case of unsupervised learning. There is a reward system which drives the direction of learning. This type of algorithm will be described in more detail in the following chapters.

2.4 Deep Learning

Taken from another perspective, machine learning algorithms can be divided into shallow (conventional) and deep (layered) methods. Most machine learning methods are considered shallow; however, the expression "shallow learning" is not standardized in the industry. Typically, such learning only involves transforming the input data into one successive representation space, basically into the output. Usually this is done via simple transformations such as high-dimensional non-linear projections or decision trees (Chollet, 2017).

On the other hand, deep learning methods, usually neural networks, consist of multiple transformation layers, at least three. The input data flows through the sequence of layers, and with every successive layer new representations of the data are extracted. This process of constructing representations of the data is known as feature extraction. On the contrary, when applying shallow ML methods, the process of feature engineering must be done manually in advance.

2.5 Applications in Finance

Over the last decade, machine learning algorithms have been adopted by a wide range of businesses, because they have proved to solve real-world problems more efficiently and more successfully than classical, old-fashioned methods. When it comes to the financial area, it turned out that machine learning algorithms fit perfectly into the domains of fraud detection, underwriting and credit scoring. Nowadays, within many banks and insurance companies these processes are fully automated and involve some kind of AI techniques.

Moreover, great attention is paid to the application of quantitative methods, including ML, to the areas of investing and trading on financial markets. It is no secret that for many years most transactions on financial markets have been executed by algorithms rather than by human traders, as was the case earlier. These algorithms trade financial instruments according to certain pre-specified rules (Chan, 2013). And now, in the era of machine learning, these algorithms are moving one step further: instead of using rules manually hard-coded by humans, they can use rules inferred by themselves from the data. These algorithms may be based on any information, from historical prices, which basically means technical analysis, to news feeds and the sentiment analysis derived from them. It is tempting to use AI in this area, because even when using historical data only, there may be profitable patterns which are just too complicated to be grasped by the human eye, and that is where AI could come to the rescue. Additionally, these algorithms can be used in economics to test the efficiency of the markets.


3 Neural Networks

Neural networks are a class of powerful, flexible, general-purpose techniques readily applied to prediction, estimation, and classification problems. The first neural networks were conscious attempts to simulate the workings of biological neural networks using digital computers. In addition to biologists interested in the workings of the nervous system, early AI researchers saw neural networks as a way to endow computers with the ability to learn (Linoff & Berry, 2011).

This chapter starts with a brief review of the architecture of one of the simplest types of neural networks, the feedforward neural network, and the building blocks required to construct it. Then it shows how a neural network is trained and how it is possible to improve the training process; this part applies not only to FFNNs but to neural networks in general. At the end, a special type of neural network dealing with sequences is presented.

3.1 Architecture

As was mentioned previously, neural networks consist of layers of data representations. At their core, neural networks are composed of an input layer, which receives data from outside sources, one or more hidden layers that process the data, and an output layer that produces the outputs.

Simple feedforward neural networks with three individual layers were the first and simplest type of neural network. "Feedforward" means that connections between the nodes are not cyclical. More sophisticated, innovative neural networks may have more than one of any type of layer – and each type of layer may be built differently.

Figure 3.1 Neural network architecture. Source: Nielsen, 2019

The number of nodes in each layer depends on different factors. The design of the input and output layers in an FFNN is often straightforward. No computation is performed in any of the input nodes – they just pass on the information to the hidden nodes. The output layer architecture depends on the problem the FFNN is designed for. The design of the hidden layers must be explicitly specified by the developer. Neural network researchers have developed many design heuristics for the hidden layers, which help people get the behavior they want out of their models. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the model.

A single node in an FFNN takes several inputs, x_1, x_2, ..., performs linear and non-linear transformations using the parameters of the layer and an activation function, and produces a single output. To compute the output it uses weights, w_1, w_2, ..., real numbers expressing the effectiveness and importance of a particular input to the output. The larger the weight of an input, the greater impact it will have on the network (if all inputs are on the same scale). The linear transformation is done as follows:

Y = \sum (\mathrm{weight} \cdot \mathrm{input}) + \mathrm{bias} \qquad (3.1)

The weighted sum of the inputs is computed first. Subsequently, a bias b (a constant) is added to the weighted sum to offset the result. The bias is used to shift the result of the activation function towards the positive or negative side. The addition of the bias reduces the variance and hence introduces flexibility and better generalization to the neural network. The next step is to feed the computed value into the activation function, which converts the input of a node to an output. That output is then used as an input in the next layer of the stack. The whole process for a single node, from input to output, is shown in the picture below (Nielsen, 2019).

Figure 3.2 Information flow in a single node. Source: Nielsen, 2019
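As an illustration, the following is a minimal NumPy sketch (with made-up numbers) of a single node: the linear transformation of equation 3.1 followed by a sigmoid activation.

```python
import numpy as np

def node_output(inputs, weights, bias):
    # linear transformation: weighted sum of the inputs plus a bias (eq. 3.1)
    z = np.dot(weights, inputs) + bias
    # sigmoid activation converts the weighted input into the node's output
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # illustrative inputs x1, x2, x3
w = np.array([0.4, 0.1, -0.2])   # weights expressing the importance of each input
b = 0.05                         # bias shifting the activation
print(node_output(x, w, b))
```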

3.2 Activation Function

An activation function is used for limiting the amplitude of the output of a node. Typically, the normalized amplitude range of the output is written as the closed unit interval [0,1], or alternatively [-1,1] (Haykin, 1999). If we did not apply an activation function, the output signal would simply be a linear function, as in Equation 3.1. A linear equation is easy to solve, but it is limited in its complexity and has less power to learn complex non-linear mappings from data.

The activation function must be differentiable, because when running the parameter optimization algorithm we need to know in which direction and by how much to update the parameters. It is also desirable for an activation function and its derivative to be monotonic, i.e. either entirely non-increasing or non-decreasing (Goodfellow, Bengio, & Courville, 2016). The most commonly used activation functions are:

- Sigmoid function – its output lies between 0 and 1. It is widely used for models where a probability needs to be predicted as the output.

f(x) = \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (3.2)

- Tanh or hyperbolic tangent activation function – the range of the tanh function is between -1 and 1, and this function is sigmoidal (s-shaped) as well. The advantage is that negative inputs will be mapped strongly negative and zero inputs will be mapped near zero.

f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (3.3)

However, both previous functions suffer from the vanishing gradient problem, which makes training almost impossible; therefore, their use is discouraged nowadays.

- Rectified Linear Unit activation function – a half-rectified function (zero below zero). This function avoids the vanishing gradient problem.

f(x) = \begin{cases} 0 & \text{for } x \le 0 \\ x & \text{for } x > 0 \end{cases} \qquad (3.4)

It has a range between zero and infinity. The problem here is that any negative input given to the ReLU turns the value into zero immediately, which in turn means that the weights of some nodes will never be adjusted, because their derivative will always be zero, resulting in "dead" nodes.

- Leaky ReLU is an attempt to solve the dying ReLU problem by introducing a small slope for negative values.

f(x) = \begin{cases} 0.01x & \text{for } x \le 0 \\ x & \text{for } x > 0 \end{cases} \qquad (3.5)

The leak helps to increase the range of the ReLU function values. The limitation of ReLU functions is that they should only be used within the hidden layers of an FFNN. Hence, for output layers it is recommended to use a softmax function for a classification problem to compute the probabilities of the classes, and for a regression problem we should simply use a linear function.
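For illustration, the four activation functions above can be written in a few lines of NumPy; this is a minimal sketch, not part of the thesis implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # eq. 3.2, output in (0, 1)

def tanh(x):
    return np.tanh(x)                      # eq. 3.3, output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # eq. 3.4

def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)    # eq. 3.5

z = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(z), 3))
```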

3.3 Loss Function

The main goal of a neural network, and of any supervised machine learning algorithm in general, is to learn how to map inputs to their associated targets. In this context, "learning" means finding a set of values for the parameters of an algorithm to achieve this goal. But the problem is that a deep neural network can contain millions of parameters. Finding the correct value for all of them is not an obvious task, especially given that adjusting the value of one parameter will affect the behavior of the others. To control the performance of a neural network, we need to be able to measure how far the output is from the expected value. For this we use a loss function, C (sometimes referred to as a cost function). The loss function takes the prediction of the NN and the true target and computes a distance score, capturing how well the NN has performed on this specific example. Then it averages the quadratic distance scores over all the training examples; this method is known as the mean squared error.

C(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (o_i - y_i)^2 \qquad (3.6)

where

w – set of weights,
b – set of biases,
m – number of examples,
o_i – output of the network,
y_i – target.

The fundamental idea is to use this score as a feedback signal to adjust the value of the parameters a little, in a direction that will lower the loss score for the current example and the value of a loss function for the entire set of examples.
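As a quick illustration, the mean squared error of equation 3.6 can be computed as follows (a minimal NumPy sketch with made-up outputs and targets):

```python
import numpy as np

def mse(outputs, targets):
    # mean squared error with the 1/(2m) factor of eq. 3.6
    return 0.5 * np.mean((outputs - targets) ** 2)

print(mse(np.array([0.9, 0.2, 0.4]), np.array([1.0, 0.0, 0.5])))
```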


Figure 3.3 The flow of a neural network training. Source: Chollet, 2017

Initially, the parameters of the network are assigned small random values, so the network merely implements a series of random transformations. Obviously, its output is far from the target, and the loss score is accordingly very high. Next we need to gradually adjust these parameters in the correct direction, based on a feedback signal received. This gradual adjustment of parameters, also called training, is basically the “learning” that machine learning is all about. This happens within what’s called a training loop, which works as follows:

1. Draw a batch of training samples and corresponding targets.

2. Run the network on training samples (a step called the forward pass) to obtain predictions.

3. Compute the loss of the network on the batch measuring the mismatch between predictions and targets.

4. Update all weights of the network in a way that slightly reduces the loss on this batch.

By repeating this training loop a sufficient number of times, we will get parameter values that minimize the loss of the NN. While the first three steps look pretty trivial, step 4 is challenging.

Given an individual weight coefficient in the network, it is not obvious how to decide whether the coefficient should be increased or decreased, and by how much. One naive solution would be to freeze all parameters in the network except the one scalar coefficient being considered and try changing values for this coefficient. But such an approach would be horribly inefficient, because we’d need to compute two forward passes (increasing and decreasing the coefficient) for every individual coefficient (of which there are many, usually thousands and sometimes up to millions). A much better approach is to take advantage of the fact that all operations used in the NN are differentiable and compute the derivative of the loss with regard to the network’s coefficients. We can then move the coefficients in the opposite direction from the derivative, thus decreasing the cost (Nielsen, 2019).
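To make the four steps concrete, the following is a minimal PyTorch sketch of such a training loop; the data, network sizes and hyperparameters are illustrative placeholders, not the models used later in this work.

```python
import torch
from torch import nn

X = torch.randn(256, 10)   # illustrative training samples
y = torch.randn(256, 1)    # illustrative targets
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(20):
    for i in range(0, len(X), 32):
        xb, yb = X[i:i + 32], y[i:i + 32]   # 1. draw a batch of samples and targets
        pred = model(xb)                    # 2. forward pass
        loss = loss_fn(pred, yb)            # 3. loss on this batch
        optimizer.zero_grad()
        loss.backward()                     # gradients needed for step 4
        optimizer.step()                    # 4. update the weights to reduce the loss
```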


3.4 Gradient Descent

To find a set of weights and biases which make the loss score as small as possible, we need to use an optimization algorithm. One of the most popular and most frequently used is a gradient descent algorithm, which was designed to solve such problems. A gradient is a generalization of the concept of derivatives to functions of multidimensional inputs, which are called tensors.

A derivative is a term that comes from calculus and is calculated as the slope of the graph at a particular point. The slope is described by drawing a tangent line to the graph at that point (Jordan & Smith, 2002). Likewise, the gradient can be interpreted as the tensor describing the slope of a function around a particular point. So, if we are able to compute the slope of the function with respect to each parameter, i.e. the gradient of the function, we might be able to compute the desired direction to reach the minimum. The main idea of this approach is to take advantage of the fact that all operations used in the network are differentiable and compute the gradient of the loss function with regard to the network's coefficients. We can then move the coefficients in the opposite direction from the gradient, thus decreasing the cost.

In general, we can approximate an infinitely small change (differential) in a function of one variable x by multiplying its derivative with respect to x by an infinitely small change in x:

df(x) = \frac{df(x)}{dx} \, dx \qquad (3.7)

In the multidimensional case of n variables f(x_1, x_2, \dots, x_n), we use the notion of partial derivatives:

df = \frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2 + \dots + \frac{\partial f}{\partial x_n} dx_n \qquad (3.8)

For neural network training we will use equation 3.8 in the context of gradually applying small changes to the cost function in order to reach its minimum through small adjustments of the trainable parameters. Also, for convenience, we will denote inputs in tensor form (vectors and matrices). Then, the goal is to update the loss function by calculating its gradient with respect to the model parameters:

C(\theta') = C(\theta) + dC(\theta) \qquad (3.9)

Therefore, the differential of the loss function can be written as follows:

dC(\theta) = \nabla C \cdot d\theta \qquad (3.10)

\nabla C = \left( \frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \dots, \frac{\partial C}{\partial \theta_n} \right)^{T} \qquad (3.11)

where

\theta – set of parameters including weights and biases,
\nabla C – gradient tensor, the vector of partial derivatives of the loss function with respect to the model parameters \theta.

Equation 3.10 tells us that the differential of the loss function can be approximated by the product of its gradient and the differential of the trainable parameters. This equation also lets us see how to choose d\theta so as to make dC(\theta) negative. In particular, suppose we choose:

d\theta = -\eta \cdot \nabla C \qquad (3.12)

then

\theta' = \theta - \eta \frac{\partial C}{\partial \theta} \qquad (3.13)

where \eta is a small, positive parameter known as the learning rate. Essentially, equation 3.12 gives us the answers to two questions: which way to go in order to decrease the loss score, and how big the steps should be. The learning rate defines the size of the steps to take in order to reach the minimum of the cost function. We can cover more area with larger steps and a higher learning rate, but at the risk of overshooting the minimum. On the other hand, small steps and a smaller learning rate will take a lot of time to reach the lowest point (Pandey, 2019).

dC(\theta) = -\eta \cdot \nabla C \cdot \nabla C = -\eta \cdot \|\nabla C\|^{2} \qquad (3.14)

Because \|\nabla C\|^{2} \ge 0, this guarantees that dC(\theta) \le 0, i.e. C will always decrease if we change the parameters according to the prescription in equation 3.15:

C(\theta') = C(\theta) - \eta \cdot \|\nabla C\|^{2} \qquad (3.15)

Summing up, the gradient descent algorithm works as follows:

- compute the gradient \nabla C,
- adjust the parameters in the opposite direction,
- thereby decrease the loss function.

This algorithm is applied repeatedly in step 4 of the training loop described in the previous section.
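As a toy illustration of the update rule in equation 3.13, the following NumPy sketch minimizes a simple quadratic function; the function and the learning rate are chosen arbitrarily.

```python
import numpy as np

def grad(theta):
    # gradient of C(theta) = (theta1 - 3)^2 + (theta2 + 1)^2
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

theta = np.array([0.0, 0.0])
eta = 0.1                                # learning rate
for step in range(100):
    theta = theta - eta * grad(theta)    # eq. 3.13: move against the gradient
print(theta)                             # converges towards (3, -1)
```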

3.5 Batch, Mini Batch & Stochastic Gradient Descent


There are several ways in which the training can be performed. However, before continuing, it is necessary to explain some terminology related to the training process. One epoch is when the entire dataset is passed forward and backward through the neural network. Usually one epoch is not enough; therefore, the whole dataset is processed multiple times, i.e. we run several epochs.

Sometimes, it is not feasible to process the entire dataset at once. Therefore, the data is divided into smaller parts, which are called batches, and processed one by one. Iterations or steps are the number of batches needed to complete one epoch.

When all the training data is taken into consideration in a single training loop, it is called batch gradient descent. In practice this approach is not an efficient way to train a NN. In order to compute the gradient of the loss function, we need to compute the gradients separately for each training input, then average them and use that mean gradient to update the parameters. So that is just one step of gradient descent in one epoch. Unfortunately, when the number of training inputs is very large, this can take a long time and learning thus occurs very slowly.

To tackle this problem, stochastic gradient descent can be used. In stochastic gradient descent, we consider just one example at a time to take one step. The term stochastic refers to the fact that each batch of data is drawn at random. Since we are considering a batch of just one example, the cost will fluctuate over training examples and might never reach the global minimum of the cost function, but will keep fluctuating.

Batch gradient descent can be used for smoother loss curves, as it converges directly towards a minimum; each update is more accurate, but far more expensive. Additionally, a mixture of batch gradient descent and SGD can be used, which is called mini batch gradient descent. We use a batch of a fixed number of training examples, smaller than the full dataset, and call it a mini batch. By taking this approach we can achieve the advantages of both of the former algorithms (Nielsen, 2019).
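The following sketch illustrates how an epoch can be split into shuffled mini batches; the dataset, batch size and names are illustrative, and the gradient computation itself is omitted.

```python
import numpy as np

X = np.random.randn(1000, 8)    # illustrative dataset
y = np.random.randn(1000)
batch_size = 32

for epoch in range(3):                              # one epoch = one full pass over the data
    idx = np.random.permutation(len(X))             # "stochastic": shuffle before each epoch
    for start in range(0, len(X), batch_size):      # iterations (steps) within the epoch
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y[batch]                  # one mini batch
        # ... compute the gradient on (xb, yb) and update the parameters here ...
```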

3.6 Backpropagation

We have described how to use the gradient descent algorithm to adjust the parameters of a neural network. However, it wasn’t explained how to compute the gradient of the cost function. In this section we'll explain a fast algorithm for computing such gradients known as backpropagation.

We casually assumed that because a function is differentiable, we can explicitly compute its derivative. In practice, a neural network function consists of many tensor operations chained together, each of which has a simple, known derivative. The central expression in the backpropagation algorithm is the partial derivative of the cost function with respect to any parameter in the network, as in equation 3.11, which tells us how quickly the loss score changes when the parameters are adjusted.


The following notation will be used in the discussion of backpropagation:

l – a layer under consideration,
L – the last layer in the FFNN,
w_{j,k}^{l} – weight from the j-th node in the (l-1)-th layer to the k-th node in the l-th layer,
b_{k}^{l} – bias of the k-th node in the l-th layer,
a_{k}^{l} – output from the k-th node in the l-th layer,
z_{k}^{l} – weighted input to the activation function of the k-th node in the l-th layer,
g_{k}^{l} – activation function of the k-th node in the l-th layer,
\delta_{k}^{l} – error in the k-th node in the l-th layer.

The following diagram shows examples of these notations in use (layer superscripts are not shown):

Figure 3.4 The representation of a node's output. Source: Nielsen, 2019

With these notations, the output a_{k}^{L} of the k-th node in the output layer is related to the activations in the (L-1)-th layer by the equation:

a_{k}^{L} = g_{k}^{L} \left( \sum_{j} w_{j,k}^{L} \cdot a_{j}^{L-1} + b_{k}^{L} \right) \qquad (3.16)

We can rewrite equation 3.16 in the compact vectorized form:

a^{L} = g\left( (w^{L})^{T} a^{L-1} + b^{L} \right) \qquad (3.17)

If we train the network shown in Figure 3.4, then we need to compute a gradient consisting of 13 components: partial derivatives of the cost function with respect to 9 weight parameters and 4 bias parameters. Each individual component of the gradient can be computed by the chain rule; however, doing this separately for each parameter is inefficient. Backpropagation efficiently computes the gradient by avoiding duplicate calculations and not computing unnecessary intermediate values. It computes the gradient of each layer – specifically, the gradient of the weighted input of each layer, referred to as an "error" – from back to front (Goodfellow, Bengio, & Courville, 2016). To calculate the partial derivative of the loss function C with respect to the weights in the layer L, which is essentially the gradient of the weights in layer L, we have to use the chain rule:

\frac{\partial C}{\partial w^{L}} = \frac{\partial C}{\partial a^{L}} \cdot \frac{\partial a^{L}}{\partial z^{L}} \cdot \frac{\partial z^{L}}{\partial w^{L}} \qquad (3.18)

The first term on the right measures how fast the cost is changing as a function of the output in the last layer; in vector form it is usually denoted as \nabla_{a^{L}} C. Recalling that a^{L} = g(z^{L}), the second term on the right can be written as g'(z^{L}). The last term on the right is easily solved in the following equation:

\frac{\partial z^{L}}{\partial w^{L}} = \frac{\partial \left[ (w^{L})^{T} a^{L-1} + b^{L} \right]}{\partial w^{L}} = a^{L-1} \qquad (3.19)

The gradient \nabla is the transpose of the derivative of an output in terms of an input, so the matrices are transposed and the order of multiplication is reversed, but the entries are the same:

\frac{\partial C}{\partial w^{L}} = a^{L-1} \cdot \left( g'(z^{L}) \odot \nabla_{a^{L}} C \right)^{T} \qquad (3.20)

The symbol \odot denotes the Hadamard product, i.e. the elementwise product of two vectors. The last two terms in equation 3.20 (the derivative of the loss function and the derivative of the activation function) give us the gradient of the weighted input at the output level, \frac{\partial C}{\partial z^{L}}, which is referred to as the output error and denoted by \delta^{L}. It can be expressed in matrix-based form:

\delta^{L} = g'(z^{L}) \cdot \nabla_{a^{L}} C \qquad (3.21)

where the elements on the right side are diagonal matrices of the derivatives at each node, or in vector-based form:

\delta^{L} = g'(z^{L}) \odot \nabla_{a^{L}} C \qquad (3.22)

The general formula for the error at level l is different from the one used concretely for the output layer. The derivation of the general formula for the error using the chain rule is shown below:

\delta^{l} = \frac{\partial C}{\partial z^{l}} = \frac{\partial C}{\partial z^{l+1}} \cdot \frac{\partial z^{l+1}}{\partial z^{l}} \qquad (3.23)

The first term on the right is equal to \delta^{l+1} and the second term can be rewritten in the following form:

\frac{\partial z^{l+1}}{\partial z^{l}} = \frac{\partial \left[ (w^{l+1})^{T} g(z^{l}) + b^{l+1} \right]}{\partial z^{l}} = (w^{l+1})^{T} g'(z^{l}) \qquad (3.24)

Again, we need to reverse the order of multiplication. Therefore, the general formula for the error at level l has the following form:

\delta^{l} = g'(z^{l}) \odot w^{l+1} \cdot \delta^{l+1} \qquad (3.25)

or we can substitute the error term and get the chain of derivatives of the activation functions, matrices of weights and the derivative of the loss function:

\delta^{l} = g'(z^{l}) \odot w^{l+1} \cdots g'(z^{L-1}) \odot w^{L} \cdot g'(z^{L}) \odot \nabla_{a^{L}} C \qquad (3.26)

Note that \delta^{l} is a vector of length equal to the number of nodes in level l; each term is interpreted as the cost attributable to that node. Finally, we obtain the following expression for the gradient of the weights in layer l:

\nabla_{w^{l}} C = a^{l-1} \cdot (\delta^{l})^{T} \qquad (3.27)

Similarly, we can obtain the gradient of the bias vector, which is equal to the error vector:

\nabla_{b^{l}} C = \delta^{l} \qquad (3.28)

For backpropagation, the activations a^{l} as well as the derivatives g'(z^{l}) must be saved for later use during the backward pass.

To summarize, backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter had to the loss value. We can explicitly define the algorithm for computing the gradient of the cost function:

1. Feedforward: for each l = 2, 3, \dots, L compute z^{l} = (w^{l})^{T} a^{l-1} + b^{l} and a^{l} = g(z^{l}).

2. Output error: compute the vector \delta^{L} = g'(z^{L}) \odot \nabla_{a^{L}} C.

3. Backpropagate the error: for each l = (L-1), \dots, 2 compute \delta^{l} = g'(z^{l}) \odot w^{l+1} \cdot \delta^{l+1}.

4. Output: the gradient of the cost function is given by \nabla_{w^{l}} C = a^{l-1} \cdot (\delta^{l})^{T} and \nabla_{b^{l}} C = \delta^{l}.

Examining the algorithm, we can see why it's called backpropagation. We compute the error vectors backward, starting from the final layer. The key advantage of the backpropagation algorithm is that, since the only way a trainable parameter affects the loss is through its effect on the next layer, and it does so linearly, the errors are the only data we need to compute the gradients of the parameters at a specific layer; we can then compute the previous layer's errors and repeat recursively. This avoids inefficiency in two ways. Firstly, it avoids duplication, because when computing the gradient at one layer we do not need to recompute all the derivatives of later layers each time. Secondly, it avoids unnecessary intermediate calculations, because at each stage it directly computes the gradient of the weights and biases with respect to the ultimate output (the loss), rather than unnecessarily computing the derivatives of the values of hidden layers with respect to changes in weights.
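For illustration, the four steps above can be written out for a tiny two-layer network with sigmoid activations and the MSE loss of equation 3.6. This is a minimal NumPy sketch with made-up data and sizes, not the implementation used later in the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                    # 8 samples, 3 features (illustrative)
Y = rng.normal(size=(8, 2))                    # 2 targets per sample
W1, b1 = 0.1 * rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = 0.1 * rng.normal(size=(4, 2)), np.zeros(2)
eta = 0.1

for step in range(100):
    # 1. feedforward, caching z and a for the backward pass
    z1 = X @ W1 + b1; a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2; a2 = sigmoid(z2)
    # 2. output error: gradient of the MSE loss times g'(z) at the output layer
    delta2 = (a2 - Y) / X.shape[0] * sigmoid_prime(z2)
    # 3. backpropagate the error to the hidden layer (eq. 3.25)
    delta1 = (delta2 @ W2.T) * sigmoid_prime(z1)
    # 4. gradients (eqs. 3.27-3.28) and gradient descent updates (eq. 3.13)
    W2 -= eta * (a1.T @ delta2); b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * (X.T @ delta1);  b1 -= eta * delta1.sum(axis=0)
```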

It is important to understand that during model evaluation the parameters are fixed while the inputs vary (the target output may be unknown), and the network ends with the output layer; it does not include the loss function. During model training, the input–output pair is fixed while the parameters vary, and the network ends with the loss function.

Nowadays, people implement networks in computational frameworks that are capable of automatic differentiation, such as TensorFlow or PyTorch. This means that, given a chain of operations with a known derivative, they can compute a gradient function for the chain (by applying the chain rule) that maps network parameter values to gradient values (Chollet, 2017).
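A minimal PyTorch sketch of this idea, with illustrative data: the framework records the chain of operations and computes all gradients with a single call.

```python
import torch

x = torch.randn(5, 3)                      # batch of 5 inputs
y = torch.randn(5, 1)                      # targets
w = torch.randn(3, 1, requires_grad=True)  # trainable parameters
b = torch.zeros(1, requires_grad=True)

pred = x @ w + b                           # forward pass
loss = ((pred - y) ** 2).mean()            # MSE loss
loss.backward()                            # backpropagation via automatic differentiation
print(w.grad, b.grad)                      # dC/dw and dC/db, ready for a gradient descent step
```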

3.7 Regularization

When training neural networks, it is important not only to find parameters that minimize the loss on the training data, but also parameters that generalize well to previously unseen data; otherwise we face overfitting. One approach to solving overfitting is to limit the network capacity using parameter norms. L2 parameter regularization is one such method; it introduces weight decay. To the original cost function C we add a term representing the magnitude of the weights (without biases), which should be minimized as well. The idea is that smaller weights generalize better. The new loss function is then:

C_R(\theta) = C(\theta) + \frac{1}{2} \|W\|^{2} \qquad (3.29)
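In PyTorch, for example, weight decay can either be requested from the optimizer or the penalty of equation 3.29 can be added to the loss explicitly. This is a hedged sketch: the coefficient lam is an illustrative regularization strength, and the built-in weight_decay option also decays biases unless parameter groups are used.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)

# Option 1: let the optimizer apply L2 weight decay directly
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Option 2: add the penalty term explicitly, skipping the biases as in eq. 3.29
def regularized_loss(loss, model, lam=1e-4):
    l2 = sum((p ** 2).sum() for name, p in model.named_parameters() if "bias" not in name)
    return loss + 0.5 * lam * l2
```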

3.8 Batch Normalization


Different dimensions of the input vector may have different scales or be shifted, which unnecessarily slows down the learning process. Generally, it is a good practice to normalize the inputs to have zero mean and unit variance across the whole dataset. However, this only helps in training the first hidden layer of a neural network. Input distributions of layers that are deeper in the network may still change uncontrollably as the network is trained.

Batch normalization is a technique for accelerating the training of neural networks. It normalizes the inputs of all layers. During training, inputs are normalized by the mean and variance computed from the mini batch. Moving averages of the mean and variance of the inputs are maintained and used for normalization during testing. Batch normalization is differentiable and backpropagates errors through itself.
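A minimal PyTorch sketch of batch normalization inserted between layers; the sizes are illustrative.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.BatchNorm1d(32),   # normalizes layer inputs per mini batch; keeps moving averages for testing
    nn.ReLU(),
    nn.Linear(32, 1),
)

model.train()                       # training mode: use batch statistics
out = model(torch.randn(64, 10))
model.eval()                        # evaluation mode: use the stored moving averages
out = model(torch.randn(5, 10))
```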

3.9 Recurrent Neural Networks

One of the major shortcomings of FFNN application within finance is the inability to deal with sequential data. Whenever the inputs are time series or any other kind of sequence data, the classical neural network cannot properly handle them. It is possible to feed the entire sequence of data to a NN at once but then the notion of time will be lost. For this kind of data there is a more suitable solution called recurrent neural network.

A recurrent neural network is a generalization of the FFNN that has an internal state, called memory. In simple terms, an RNN remembers the past, and its decisions are affected by what it has learned from the past. An RNN is recurrent in nature, as it performs the same function in a loop for every input, while the output for every input depends on the past computation. After producing the output, it is copied and sent back into the network. To make a decision, it considers the current input and the output that it has learned from the previous input. In an RNN the output is influenced not just by the weights applied to the inputs, as in a regular NN, but also by a "hidden" state vector representing the context based on prior inputs and outputs.

Figure 3.5 displays how the RNN works. First the input is sent to the network and the output is produced. When the next input arrives (the next time step in financial time series), it is fed to RNN together with the previous output.


Figure 3.5 The information flow in RNN. Source: Chollet, 2017

Unlike feedforward neural networks, RNNs can use their internal state to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. In other neural networks, all the inputs are independent of each other but in RNN, they are related to each other (Chollet, 2017).

Each node in an RNN has two sets of weights: in addition to the input weights as in an FFNN, we use weights for the outputs from the previous time step. The output y from a single-layer RNN can be computed as follows:

y_t = g(x_t \cdot w_x + y_{t-1} \cdot w_y + b) \qquad (3.30)

The part of an RNN that preserves the state is known as a memory cell. In general, the cell state at time t is a function of the inputs at that time step and its state at the previous time step:

h_t = f(h_{t-1}, x_t) \qquad (3.31)

An output at time step t, denoted y_t, is also a function of the previous state and the current inputs. In the case of the basic cell, the output is simply equal to the state, but in more complex cells, as in LSTM networks, this is not always the case.
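A minimal NumPy sketch of equations 3.30 and 3.31 for a basic cell whose output equals its state; the weights and the toy sequence are illustrative.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    # basic recurrent cell: current input plus previous state, passed through tanh
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

rng = np.random.default_rng(0)
Wx, Wh, b = rng.normal(size=(1, 4)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                               # initial hidden state (the "memory")
for x_t in np.array([[0.1], [0.3], [-0.2]]):  # a toy sequence, e.g. three returns
    h = rnn_step(x_t, h, Wx, Wh, b)           # each step sees the input and the previous state
print(h)
```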

For problems that require learning long-term temporal dependencies, it can be difficult to train standard RNNs. This is because the gradient of the loss function decays exponentially with time, the so-called vanishing gradient problem. The long-term dependency problem can be solved by using an improved version of plain RNNs that focuses on long-term dependencies, called Long Short-Term Memory. All RNNs have the form of a chain of repeating modules, where the repeating module has a very simple structure, such as a single tanh layer.


Figure 3.6 The chain of RNN modules. Source: Olah, 2015

LSTMs also have this chain-like structure, but the repeating module is more complicated. Instead of having a single layer, there are four, interacting in a very special way.

Figure 3.7 The chain of LSTM modules. Source: Olah, 2015

The key building block of the LSTM is the cell state, which is the horizontal line running along the top of the chain of cells. The cell state has only basic linear interactions, allowing it to preserve information as much as possible. Structures called gates control the information flow into the cell state. There are three such gates, each composed of a sigmoid layer and a pointwise multiplication operation.

The first gate, known as the forget gate, decides what information should be thrown away from the cell state. It receives the input and the output from the last cell and applies a sigmoid function, which returns a number between 0 and 1; the higher the number, the more information will be kept (Olah, 2015).


Figure 3.8 The forget gate in LSTM. Source: Olah, 2015

The second gate controls what new information will be stored in the cell state, and it has two parts. First, the sigmoid function is applied again to the input and previous output (but with different weights and bias) to decide which values will be updated. Next, a tanh function is applied to create a vector of new candidate values, 𝐶̃𝑡, that could be added to the state. In the next step, these two are combined to create an update to the cell state (Olah, 2015).

Figure 3.9 The transformation of new inputs to be added. Source: Olah, 2015

To update the cell state, the old state is multiplied by the output from the forget gate, throwing away the things we decided to forget. Then, new information is embedded by adding new candidate values, scaled by how much we decided to update each state value (Olah, 2015).


Figure 3.10 Update of a cell state. Source: Olah, 2015

Now, the final output can be calculated using the updated cell state. Again, the sigmoid function is applied (with new weights and biases) to decide what parts of the cell will be used. Then, the tanh function is applied to the cell state and multiplied by the output from the sigmoid gate, so that we only output the parts we decided to (Olah, 2015).

Figure 3.11 The output from an LSTM cell. Source: Olah, 2015
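A minimal PyTorch sketch of an LSTM reading a window of features and emitting one set of scores per sequence; the feature count, window length and the three outputs are illustrative placeholders, not the configuration used later in this work.

```python
import torch
from torch import nn

seq_len, n_features, hidden_size = 30, 4, 16
lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, batch_first=True)
head = nn.Linear(hidden_size, 3)          # e.g. scores for buy / hold / sell (illustrative)

x = torch.randn(8, seq_len, n_features)   # batch of 8 sequences
output, (h_n, c_n) = lstm(x)              # h_n: last hidden state, c_n: last cell state
scores = head(h_n[-1])                    # use the final hidden state as a summary of the sequence
print(scores.shape)                       # torch.Size([8, 3])
```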


4 Reinforcement Learning

This chapter is fully devoted to the concept of reinforcement learning, and the theoretical ideas described here will be applied to real data in the next chapter of this thesis. The chapter begins with an overview of the main building blocks of reinforcement learning, along with a theoretical framework widely used to formulate RL problems, called the Markov decision process. The chapter then looks at value functions and the policy, which are the two most important entities in the whole reinforcement learning field. Finally, one of the most popular RL methods, called actor-critic, will be described in more detail.

As was mentioned, reinforcement learning is a distinct category of machine learning algorithms. Reinforcement learning is learning what to do in specific situations to maximize a numerical objective, which is represented by a reward signal. The learner is not provided with the best actions explicitly, but instead must discover them on its own through the rewards received by trying different actions. In complex problems, each action affects not only the immediate reward but also future rewards and future actions. These two characteristics - trial-and-error search and delayed reward - are the two most important distinguishing features of reinforcement learning (Lapan, 2020). The term reinforcement comes from the fact that the reward obtained by an agent should reinforce its behavior in a positive or negative way (Sutton & Barto, 2012).

There are two major RL entities – the agent and the environment – and the communication channels between them are rewards, actions and observations; a minimal sketch of this interaction loop follows the list below.

- An agent is a piece of software that interacts with the environment by making observations about the environment’s state, executing certain actions, and receiving eventual rewards for this. The agent's goal is to maximize the cumulative reward it receives in the long run (Sutton & Barto, 2012).

- The environment is everything outside an agent, it has a certain state at a particular time.

- Observations are pieces of information that the environment provides the agent to tell what is going on around the agent.

- Reward is a scalar value the agent obtains periodically from the environment, and it is the main force that drives the agent's learning process. The purpose of reward is to tell our agent how well it has behaved. Reward is local, meaning that it reflects the success of the agent's recent activity and not all the successes achieved by the agent so far (Lapan, 2020).

- Actions are things that an agent is able to do in the environment.
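To make the interaction concrete, the following is a toy sketch of the agent-environment loop; the environment, its reward and the random policy are purely illustrative and are not the trading environment implemented later in this thesis.

```python
import random

class RandomWalkEnv:
    """Toy environment: the state is a price, the actions are 0 = stay flat and 1 = be long,
    and the reward is the position times the price change."""

    def reset(self):
        self.price = 100.0
        return self.price

    def step(self, action):
        change = random.gauss(0.0, 1.0)
        self.price += change
        reward = change if action == 1 else 0.0
        return self.price, reward            # next observation and reward

env = RandomWalkEnv()
obs = env.reset()
total_reward = 0.0
for t in range(1000):
    action = random.choice([0, 1])           # placeholder policy; an RL agent would learn this mapping
    obs, reward = env.step(action)
    total_reward += reward
print(total_reward)
```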

4.1 Markov Decision Processes

Markov Decision Processes are an intuitive and fundamental formal framing for reinforcement learning problems. We will start with the simpler concept of a Markov Process and then gradually reach the definition of an MDP.


4.1.1 Markov Process

A Markov process, which is also known as a Markov chain, is the simplest model in the Markov family. A Markov process is a system which can switch between states according to some laws of dynamics of that system. This system cannot be influenced externally; therefore, we can only observe it. All possible states of this system form a state space, which must have a finite number of possible states. A sequence of observations forms a chain of states observed over time. For such a system to be an MP, it needs to fulfill the Markov property, which means that the future system dynamics from any state can be predicted based solely on its present state, and such predictions are just as good as the ones that could be made knowing the full history of observations. Each state must be self-contained and fully observable in order to describe the future of the system.

In the language of conditional probability, a Markov chain is a sequence S_0, S_1, \dots, S_n of random variables satisfying the rule of conditional independence (Ethier & Kurtz, 2009):

P(S_n = s_n \mid S_{n-1} = s_{n-1}) = P(S_n = s_n \mid S_0 = s_0, S_1 = s_1, \dots, S_{n-1} = s_{n-1}) \qquad (4.1)

where s_0, s_1, \dots, s_n are possible states of the random variables.

In other words, the Markov property requires the states of the environment to be distinguishable from each other. If the Markov property is satisfied, it is possible to capture the probabilities of transitions between states with a transition matrix of size N×N, where N is the size of the state space of the system (Lapan, 2020). Therefore, the formal definition of an MP is as follows:

- a state space that a system can be in,

- a transition matrix with transition probabilities or a transition function defining a system dynamics.

It's worth noting that the Markov property implies stationarity, which means the underlying transition distribution for any state is stable in time. Non-stationarity means that there is some hidden factor that influences the system dynamics, and this factor is not included in observations, i.e. the environment is not fully observable (Lapan, 2020).
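As a small illustration, a two-state Markov process with an assumed transition matrix can be simulated as follows; the states and probabilities are made up.

```python
import numpy as np

# states: 0 = "calm market", 1 = "volatile market" (illustrative)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])      # P[i, j] = probability of moving from state i to state j

rng = np.random.default_rng(0)
state, chain = 0, [0]
for _ in range(20):
    state = rng.choice(2, p=P[state])   # the next state depends only on the current one
    chain.append(int(state))
print(chain)
```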

4.1.2 Markov Reward Process

Moving further, we can extend an MP by adding a reward function and arrive at a Markov reward process. In the simple case, a reward can be expressed as another square matrix, similar to the transition matrix, with the reward given for transitioning from the current state to the target state.

Generally, an agent's primary goal is to maximize the cumulative reward. In order to capture this goal formally, we need to introduce the return as a cumulative sum of all rewards received by the agent.
