UNIVERSITY OF ECONOMICS, PRAGUE
Faculty of Finance and Accounting

Department of Banking and Insurance

MASTER THESIS

2021 Bc. Andrea Mikulovská

UNIVERSITY OF ECONOMICS, PRAGUE
Faculty of Finance and Accounting

Department of Banking and Insurance
Financial Engineering

Application of artificial intelligence in predicting the volatility of financial asset prices

Author: Bc. Andrea Mikulovská Supervisor: Ing. Milan Fičura, Ph.D.

Year: 2021

Declaration of authorship

I hereby declare that I have prepared the master thesis on the topic "Application of artificial intelligence in predicting the volatility of financial asset prices" independently and that I have properly marked and listed all the used literature and other sources in the attached list.

In Prague, 29.12.2021

...

Bc. Andrea Mikulovská

Acknowledgements

I would like to express many thanks to my supervisor Ing. Milan Fičura, Ph.D. for his valuable advice and the time he devoted to helping me with the preparation of this master thesis.

Abstrakt

V posledních desetiletích se dostává do popředí vysokofrekvenční obchodování a obecněji vysokofrekvenční datové modelování, a proto je tato práce zaměřena na takové modelování. U vysokofrekvenčních dat vzniká problém s vnitrodenní sezónností, který není tak snadné vyřešit jako mezidenní nebo nižší sezónnost. Cílem této diplomové práce je modelování, predikce a vypořádání se se sezónností 1-minutových vnitrodenních dat časové řady Tesla za použití benchmark modelu MC-GARCH a jeho porovnání s modely neuronových sítí, přesněji s autoregresním modelem feedforwardové neuronové sítě s jednou skrytou vrstvou a sítí LSTM se dvěma skrytými vrstvami; použit byl také algoritmus k-nejbližšího sousedství. Pro každý model jsou vypočítány tři různé rolling předpovědi s n-periodou dopředu.

Prediktivní schopnost modelů je následně porovnána pomocí RMSE metriky.

Klíčová slova

Volatilita, MC-GARCH, Neuronové sítě, K-nejbližší sousedství

Abstract

As high-frequency trading and, more generally, high-frequency data modelling have come to the fore in the last decades, this thesis focuses on such modelling. High-frequency data give rise to a problem with intraday seasonality, which is not as easy to handle as interday or lower-frequency seasonality. The aim of this master thesis is modelling, predicting and coping with the seasonality of 1-minute intraday data of the Tesla time series, using the benchmark MC-GARCH model and comparing it with artificial neural network models, more precisely a feedforward neural network autoregression model with a single hidden layer and an LSTM network with two hidden layers; the k-nearest neighbourhood algorithm was also used. For each model, three different n-period-ahead rolling predictions are made. The predictive ability of the models is then compared using the RMSE metric.

Key words

Volatility, MC-GARCH, Neural Network, K-nearest neighbourhood

Contents

Introduction

1 Brief history of volatility modeling

2 ARIMA-based models

2.1 ARCH model

2.2 GARCH model

2.3 Modified GARCH models

2.4 Intraday GARCH models

3 Machine learning in volatility modeling

3.1 Artificial neural networks

3.1.1 Simple perceptron

3.1.2 Sigmoid neuron

3.1.3 The architecture of ANN and Activation function

3.1.4 Feedforward neural networks

3.1.5 Recurrent neural networks (RNN)

3.1.6 LSTM networks

3.1.7 Training of the model

3.2 K-nearest neighbourhood

3.3 Validation of models

4 Literature review

5 Data and procedures

5.1 MC-GARCH modeling procedure

5.2 Artificial neural network modeling procedure

5.3 K-nearest neighborhood modeling procedure

6 Results

6.1 MC-GARCH results

6.2 Artificial neural network results

6.3 K-nearest neighborhood results

6.4 Comparison of models

Conclusion

List of literature

List of figures and tables


Introduction

What is volatility? By itself, volatility is not observable; one has to find a way to express it. In finance, this is mostly done by calculating squared returns of financial asset prices. But what does it tell us? Volatility is basically an expression of the size of the risk. When trading stocks, investors want to know not only what value the stock price will reach, but also the risk that the price will not reach a certain desired value. An investor can obtain this information from volatility.

Basically, it is conditional heteroskedasticity, i.e. conditional variance or variability, which is time-dependent, or in other words dependent on the past values of the random component or residuals of the model. Volatility is mostly stationary; its changes vary within a certain range. It is also characterized by the formation of clusters, which means that periods of high volatility (high risk) alternate with periods of low volatility (low risk).

Volatility modeling using high-frequency intraday data has both advantages and disadvantages. The advantage is that we have a large amount of data, with which tests and models can be more credible. However, it is also a disadvantage, because such a large amount of data brings noise and higher variability to the time series, which can cause complications in modeling. In particular, high-frequency data give rise to a problem with intraday seasonality, which is not as easy to handle as interday or lower-frequency seasonality.

This work focuses on such modeling and prediction of intraday data of the Tesla time series, using the benchmark MC-GARCH model and comparing it with models of artificial intelligence, concretely machine learning algorithms belonging to supervised learning, such as artificial neural networks, more precisely a feedforward neural network autoregression model with a single hidden layer and an LSTM network with two hidden layers. Another machine learning algorithm, k-nearest neighbourhood, was also used. For all these models, n-period-ahead rolling 1-minute predictions are made using three different periods ahead, and the RMSE of each is calculated to compare the predictive ability of the models.


The thesis is organized into six chapters. The first chapter briefly reviews the history of volatility modeling. Chapters two and three are devoted to the theoretical background and detailed definitions of the models with corresponding formulas. Studies dedicated to similar topics are reviewed in chapter four. Next, in chapter five, the data structure, graphs of the time series and data adjustments are described. The procedures followed in the construction of each model are also mentioned in this chapter. In the last chapter, chapter six, the results of all models are presented, predictions compared to real data are shown in graphs, and root mean squared errors are calculated for each model for the purposes of model comparison.


1 Brief history of volatility modeling

Volatility is very useful in different financial areas, for example in asset pricing, option pricing, portfolio optimization, VaR estimation, etc. It follows that volatility is needed to make financial decisions, and thus it is valuable to forecast it. There are many econometric methods that are widely used to model and forecast the volatility of financial assets. Engle (1982) was the first to come up with the idea of modeling conditional heteroskedasticity and introduced a model called autoregressive conditional heteroskedasticity (ARCH). He applied the ARCH(1) model to inflation in the UK. It is a method based on the Box-Jenkins methodology (1970) and its ARMA model. The model is estimated by maximizing the likelihood, and the best model can then be used to predict future volatility. An extension of the ARCH model is the generalized autoregressive conditional heteroscedasticity (GARCH) model, whose authors are Bollerslev (1986) and Taylor (1986). The essence of this model is that the conditional variance is expressed as a function of previous errors and previous variances. There are also modified GARCH models, such as the GARCH-M model by Engle, Lilien and Robins (1987), which assumes a relationship between log returns and risk. A further extension of the GARCH model is the exponential generalized autoregressive conditional heteroscedasticity (EGARCH) model by Nelson (1991).

Another, recently more popular way to forecast volatility is to use machine learning algorithms. In comparison with econometric models, machine learning methods are more data-driven. Probably the most popular technique for forecasting volatility is the neural network.

The first people who started developing this technique were Warren McCulloch and Walter Pitts (1943), who created a computational model for neural networks. In 1945, the model of recurrent neural networks was informally introduced, and it was later formalized by Kleene in 1956. Another big step in this field was made by Rosenblatt (1958), who created the simple perceptron. A few years later, the first networks with many layers were published by Ivakhnenko and Lapa in 1965. They introduced the first general working learning algorithm for supervised feedforward multilayer perceptrons with many layers. They used nonlinear activation functions based on


additions and multiplications. Thereafter, Minsky and Papert (1969) pointed out some drawbacks of basic perceptrons. They believed that perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks. They believed that the only way to know whether a perceptron works well is to prove it mathematically. Their rigorous work did not make the perceptron look very good.

A general method for the reverse mode of automatic differentiation of discrete connected networks of nested differentiable functions was published by Seppo Linnainmaa (1970). It is also known as efficient error backpropagation. Werbos (1975) enabled practical training of multi-layer networks by using the backpropagation algorithm, and in 1982 he applied Linnainmaa's automatic differentiation method to neural networks.

In 1997, Hochreiter and Schmidhuber came up with long short-term memory networks as a solution to a drawback of recurrent neural networks: the vanishing gradient.

Dean, Corrado, Monga, Chen, Devin, Le & Ng (2012) developed algorithms for large-scale distributed training which increased the scale and speed of network training. They created a network that learned to recognize higher-level concepts only from watching unlabeled images.

In 2009, artificial neural networks started winning prizes, being able to approach human-level performance on various tasks, mostly in pattern recognition and machine learning. As an example, the bi-directional and multi-dimensional long short-term memory network introduced by Graves et al. (2009) won a few competitions in handwriting recognition without any prior knowledge about the three languages to be learned.


2 ARIMA-based models

In this section, benchmark models for volatility modeling and forecasting are described. These methods are based on ARIMA models, which build on the Box-Jenkins methodology (1970). This methodology is used for stochastic modeling of time series. It is based on the definition of a stochastic process, a series of random variables arranged over time. The time series is thus understood as a realization of the stochastic process.

The concept of stationarity is very important in time series analysis. It can be characterized as stability over time, or the exclusion of the influence of time. A stationary time series therefore has a constant mean and a constant variance, and its covariance and correlation functions depend only on the distance between the random variables, not on time (Arlt, 2003). Stationarity is most often verified using the Augmented Dickey-Fuller unit root test (ADF).

There are two main processes: AR and MA. AR(1) is an autoregressive process of order one. It can be expressed as follows:

$X_t = \phi_1 X_{t-1} + a_t \qquad (1.1)$

where $X_{t-1}$ is the lagged time series and $a_t$ is the random component. If $|\phi_1| < 1$, the AR(1) process is stationary. The autocorrelations decrease exponentially towards the past, and thus it is a process with a short memory. There are also special cases of the AR(1) process. When $|\phi_1| = 1$, it is a random walk. It is an integrated process of order 1, so after the first difference the process is stationary. When $|\phi_1| = 0$, it is white noise, i.e. a series of uncorrelated random variables of one probability distribution with zero mean, finite and constant variance, and zero autocovariance and autocorrelation function. The white noise process is unpredictable. The second process is the MA(1) moving average. It is a shortened linear process which is always stationary. It looks like this:

$X_t = a_t - \theta_1 a_{t-1} \qquad (1.2)$

By combining the AR and MA processes, we obtain mixed ARMA and ARIMA processes, where I in the ARIMA process is the order of integration. By extending them with a seasonal component, we get the SARMA and SARIMA models.
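To make the AR(1) and MA(1) definitions concrete, the following minimal Python sketch simulates both processes from standard normal white noise; the coefficient values and series length are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # reproducible white noise a_t

def simulate_ar1(phi, n=1000):
    """Simulate X_t = phi * X_{t-1} + a_t with standard normal noise."""
    a = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + a[t]
    return x

def simulate_ma1(theta, n=1000):
    """Simulate X_t = a_t - theta * a_{t-1}."""
    a = rng.standard_normal(n)
    return a[1:] - theta * a[:-1]

x_stationary = simulate_ar1(phi=0.7)   # |phi| < 1: stationary, short memory
x_random_walk = simulate_ar1(phi=1.0)  # phi = 1: random walk (integrated of order 1)
x_ma = simulate_ma1(theta=0.4)         # MA(1): always stationary
print(x_stationary.std(), x_random_walk.std(), x_ma.std())
```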


2.1 ARCH model

If we can express the dependence of the time series variance on the past by autoregression, we obtain the ARCH model (autoregressive conditional heteroskedasticity).

The variance of the residuals is thought to be a function of the squared random components from the past, so it is time-dependent. The ARCH model is expressed by the mean equation and the variance equation of the time series.

$Y_t = \beta_1 + x_t' \beta + u_t \qquad (2.1)$

From this simple linear regression model, where $x_t$ is a vector of regressors and $\beta$ is a vector of parameters, we can express the dependence of the variance of the random component $u_t$ on the past as follows:

$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 \qquad (2.2)$

Equation (2.2) expresses the basic ARCH(1) model, because the conditional variance depends only on one lagged value of the random component $u_{t-1}$. Now we substitute $h_t$ for $\sigma_t^2$ to simplify the notation.

$h_t = \alpha_0 + \alpha_1 u_{t-1}^2 \qquad (2.3)$

It is clear that if a large shock (large $u_t$) occurs in the past period, then the variance of this shock (the conditional variance $\sigma_t^2$) will most likely be large too. However, the conditional variance may also depend on more than one lagged random component, and because of this the ARCH model can be extended. In general, this is the ARCH(q) model.

The disadvantage of the ARCH model is that it assumes that positive and negative shocks have the same effect on volatility, but in practice it looks different. In practice, negative shocks have a much higher impact on volatility than positive shocks (the leverage effect). This fact generally holds in equity markets. The model describes the behavior of volatility mechanically and in practice it is used less and less (Hušek, 2009).

2.2 GARCH model

The shortcomings of the ARCH model are fixed by the generalized ARCH model (generalized autoregressive conditional heteroskedasticity). This model can describe the volatility clustering, heavy tails and the significant kurtosis in the distribution of


financial time series. The model was developed independently by Bollerslev (1986) and Taylor (1986).

The basic GARCH(1,1) model extends the ARCH(1) model by a lagged value of the conditional variance $h_{t-1}$.

$h_t = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta_1 h_{t-1} \qquad (2.4)$

By substituting for the lagged conditional variance, adding $u_t^2$ to both sides and moving $h_t$ to the right-hand side, we obtain a notation in the form of an ARMA(1,1) model for $u_t^2$:

$u_t^2 = \alpha_0 + (\alpha_1 + \beta_1) u_{t-1}^2 + v_t - \beta_1 v_{t-1} \qquad (2.5)$

where $v_t = u_t^2 - h_t$, the AR component is $u_{t-1}^2$ and the MA components are $v_t$ and $v_{t-1}$. The stationarity condition in this model is $\alpha_1 + \beta_1 < 1$. If this value is close to 1, it indicates a significant persistence of volatility.

The GARCH model can be easily explained as the weighted average of the three components of the variance. The first component is a constant variance that corresponds to the long-term average. The second part is a variance from the previous period and the third part is new information that was not yet available in the previous period. (Engle, 2003)

A simple GARCH(1,1) can be extended to the general form GARCH(p, q), where q is the number of lags of the value $u_t^2$ and p is the number of lags of the value $h_t$. The model has the following notation:

$h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i u_{t-i}^2 + \sum_{j=1}^{p} \beta_j h_{t-j} \qquad (2.6)$

In most cases, a simple GARCH(1,1) is sufficient to describe volatility clustering, and thus higher-order models are used very little (Hušek, 2009).

Diagnosis of the presence of conditional heteroskedasticity in the time series is performed using the LM test, where the null hypothesis H0 states that the conditional variance ht is constant, and thus that the coefficients α1, β1 are zero.

The MLE (maximum likelihood estimation) method is used to estimate GARCH models. An alternative is the QMLE (quasi-maximum likelihood estimation) method, which is robust even if the residuals are not normally distributed.
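As an illustration of GARCH(1,1) estimation by (quasi-)maximum likelihood, the sketch below uses the Python arch package on a placeholder return series; the simulated data and settings are assumptions for demonstration and do not reproduce the thesis computations (which were done in R).

```python
import numpy as np
from arch import arch_model  # pip install arch

# Placeholder return series; in practice use real (percentage) log returns.
rng = np.random.default_rng(0)
returns = rng.standard_normal(2000)

# GARCH(1,1) with a constant mean, estimated by (quasi-)maximum likelihood.
model = arch_model(returns, mean="Constant", vol="GARCH", p=1, q=1)
result = model.fit(disp="off")
print(result.params)            # omega (alpha_0), alpha[1], beta[1]

# One-step-ahead conditional variance forecast h_{t+1}.
forecast = result.forecast(horizon=1)
print(forecast.variance.iloc[-1])
```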

Modified GARCH models can fix some of the shortcomings of classical GARCH models, for example that some estimated coefficients may not meet the non-negativity condition, or that the model cannot describe the leverage effect of financial time series.


2.3 Modified GARCH models

Over time, various modifications of the standard GARCH model have been developed. In this part of the work they will be briefly introduced.

If the classical GARCH (1,1) model contains a unit root, it is called integrated or the IGARCH (1,1) model. The responses of past shocks in this model are permanent.

If there is a relationship between the logarithmic returns and risk (measured by the conditional variance), it can be described with the GARCH-M model, first used by Engle, Lilien and Robins (1987). We obtain the model by extending the mean equation by a term representing the risk. We can write the model with the following notation:

$r_t = c h_t + u_t \qquad (2.7)$

where c is the risk premium coefficient, $h_t$ can be expressed using equation (2.4) and $u_t$ is a random component.

Other modifications work with the fact that positive and negative shocks do not have the same effect on volatility. This phenomenon is called the leverage effect.

Figure 1 Leverage effect of volatility Source: VÝROST, V. G. T. (2003).

Modified models try to describe and incorporate this asymmetry into the model. The first of these is the exponential GARCH, or EGARCH, first introduced by Nelson (1991). It tries to capture the asymmetry of shocks using weighted shocks.

The model is based on the logarithmic expression of the conditional variance. It is expressed by the following equation:

$\ln(h_t) = a + \alpha \dfrac{u_{t-1}}{\sqrt{h_{t-1}}} + \gamma \left( \dfrac{|u_{t-1}|}{\sqrt{h_{t-1}}} - \eta \right) \qquad (2.8)$

where $a = \alpha_0 + \beta_1 \ln(h_{t-1})$ and $\eta = \sqrt{2/\pi}$.

There are many other modifications that work with nonlinearities, asymmetry and long volatility memory; they also address the fact that returns need not meet the normality condition and may have different parametric or non-parametric distributions. These include models such as APARCH, FIGARCH, SWARCH, GJR-GARCH, TARCH, SPARCH, GED-ARCH and many more, but we do not go into them further in this work.

2.4 Intraday GARCH models

Recently, there has been a growing interest in high-frequency trading and thus in the associated analysis of high-frequency data. What is more, the estimation of volatility with intraday returns can be more accurate than volatility estimates based on daily data, because squared daily returns, used as a proxy of the true variance, are considered an unbiased but noisy estimator of volatility. Thus, intraday volatility modeling has come to the fore.

Going deeper, modeling intraday volatility can be more difficult than modeling less frequent data, mainly because a problem with intraday seasonality arises in the process and has to be incorporated in the model. Standard ARMA/GARCH models can cope with seasonality by adding lags corresponding to the seasonal period into the model. But in high-frequency volatility modeling we would have to include too many lags in the model, which can lead to overfitting.

There are other alternative GARCH models for intraday volatility modeling, such as PGARCH, MC-GARCH and more. In this thesis we focus on the MC-GARCH model.

A relatively newer model for predicting intraday volatility, MC-GARCH (Multiplicative component GARCH), was introduced by Engle and Sokalska (2012).

The authors divided the volatility of high-frequency returns into three components, namely the daily, deterministic/seasonal (diurnal) and stochastic (intraday) component. Daily volatility is determined exogenously. This model is expressed as follows:

$r_{t,i} = \mu_{t,i} + e_{t,i}, \qquad e_{t,i} = (q_{t,i}\,\sigma_t\, s_i)\, z_{t,i} \qquad (2.9)$

where $r_{t,i}$ is the return on a financial asset, $q_{t,i}$ is the stochastic intraday volatility, $\sigma_t$ is the daily, exogenously determined volatility forecast, $s_i$ is the seasonal/deterministic volatility in each regular interval $i$, and $z_{t,i}$ is the error term. The diurnal component of volatility is defined as:

$s_i = \dfrac{1}{T} \sum_{t=1}^{T} \dfrac{e_{t,i}^2}{\sigma_t^2} \qquad (2.10)$

When we divide the residuals by the seasonal and daily volatility, we get normalized residuals, which are then used to form the stochastic volatility component $q_{t,i}$. The MC-GARCH model should provide more accurate estimates because it works with the three volatility components separately. It works best if seasonality is present in the data. The daily exogenously determined volatility may be estimated with a simple GARCH model. Intraday volatility is then estimated with volatility data normalized by the seasonal and exogenously determined daily volatility components.
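The diurnal component (2.10) and the normalization of residuals can be illustrated with a short Python sketch; the synthetic 1-minute series and the stand-in daily variance forecast are assumptions for demonstration only, not the thesis data or its R implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: 'e' are de-meaned 1-minute returns indexed by timestamp,
# and 'daily_sigma2' maps each date to an exogenous daily variance forecast sigma_t^2.
rng = np.random.default_rng(0)
idx = pd.date_range("2021-09-13 09:30", periods=389 * 5, freq="min")
e = pd.Series(rng.standard_normal(len(idx)) * 0.001, index=idx)
daily_sigma2 = e.groupby(e.index.date).var()  # stand-in for a GARCH(1,1) daily forecast

dates = pd.Index(e.index.date)
sigma2_t = pd.Series(daily_sigma2.loc[dates].values, index=e.index)

# Diurnal component (2.10): average normalized squared residual per intraday minute i.
s_i = (e**2 / sigma2_t).groupby(e.index.time).mean()

# Residuals normalized by daily and seasonal volatility feed the intraday component q_{t,i}.
seasonal = pd.Series(s_i.loc[pd.Index(e.index.time)].values, index=e.index)
z = e / np.sqrt(sigma2_t * seasonal)
print(s_i.head())
```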


3 Machine learning in volatility modeling

In this chapter, newer methods for volatility modeling are presented. To be more concrete, machine learning as a core part of artificial intelligence and data science is briefly presented here.

What is machine learning? It can be said that machine learning is a set of computational methods which use experience to make predictions or to improve performance. Machine learning is mostly about generalization. It combines computer science with some ideas from statistics. Machine learning finds use in various spheres of life such as technology, science, healthcare, marketing and also financial modeling. Machine learning methods can deal with problems such as text classification (e.g. spam detection), speech processing, object recognition, face detection, medical diagnosis, unassisted control of vehicles, fraud detection and many more. (Mohri, Rostamizadeh & Talwalkar, 2018)

The most widely used machine learning tasks are, for example, classification, which assigns a category to each item in a dataset, and clustering, which partitions a set of items into subsets; the latter task is often used with large datasets such as social network data. Ranking is used to order items with respect to some criterion, which is very useful in web search. Dimensionality reduction transforms an initial structure of items into a lower-dimensional structure while keeping some properties of the initial structure. And there is also regression, which is used when the goal is to predict some continuous variable.

We can distinguish different types of learning scenarios, because training data can be received by various methods, and there are many types of training data as well as many types of test data needed to evaluate a learning method. The three most used learning scenarios are described here.

Supervised learning is the most widely used. This learning system is supposed to predict the labels of patterns. A set of labeled data is provided as training data, and predictions are made for unseen points. The supervised learning scenario forms predictions through a mapping function f(x), which generates an output y for each x. It is commonly used in classification or ranking; examples include face recognition, medical diagnosis, spam classifiers and many more. In the last decade, deep networks have come to the fore. A deep network is basically a multilayer network, and each layer computes a parametrized function of its inputs (Jordan & Mitchell, 2015). Methods belonging to supervised learning include decision trees, logistic regression, support vector machines, neural networks, Bayesian classifiers, and more.

Unsupervised learning methods analyze unlabeled data and, based on these unlabeled data, make predictions for unseen points. Typical unsupervised learning problems are dimensionality reduction, with methods such as principal component analysis, factor analysis, autoencoders and random projections, and clustering, which aims to find a partition of the dataset without knowing the labels of the desired partition. (Anthony & Bartlett, 2009)

Reinforcement learning is the third most used learning method. Here the training and testing phases are intermixed. In order to collect information, the learner actively interacts with the environment and receives a prompt reward for each action.

The information from training data is intermediate between supervised and unsupervised learning (Jordan & Mitchell, 2015). The training data here provide an indication of whether the action is correct or not. If it is not, the process proceeds until the correct action is found. Reinforcement learning has helped in some work in psychology and neuroscience.

These learning methods are the best known, but there are also other methods which are usually a mixture of the methods above, for example semi-supervised learning, where the dataset contains both labeled and unlabeled data, active learning, where we can choose and collect the training data, and many more.

In volatility modeling, machine learning methods use historical data as an input to forecast volatility for each day, and using statistical methods they estimate realized volatility for the same day. After that, the distance between them is measured to train the model. This is usually done by calculating the mean squared error or mean absolute error.

The aim is to minimize these indicators in order to obtain the most effective machine learning model for forecasting volatility.


3.1 Artificial neural networks

Among supervised learning methods, probably the best known is the neural network.

This method can be used for solving many tasks, from face recognition to predicting future share prices in finance. In the last few decades, a great amount of literature in finance has implemented artificial neural networks as a forecasting method. The great advantage of neural networks is the ability to approximate linear and nonlinear behaviors without knowing the structure of the available data. This fact makes them suitable for forecasting time series with long memory and nonlinear dependencies, which is typical for conditional volatility.

Artificial neural networks (ANN) imitate the structure of neurons in the human brain. In the human brain, neurons are connected through synapses and form a biological neural network. The information is transmitted between them through the network by electrochemical signals. In an ANN this process is imitated: neurons are also connected to each other. The networks can acquire a problem-solving ability by adjusting their connection weights through learning.

Many types of ANN exist, but all of them have common characteristics. They consist of connected units, which represent neurons. All neurons receive inputs which are processed, and then they generate an output. The units create architectures with specific patterns of connections between them, which represent the synapses. The ANN is trained to respond to the data presented rather than programmed; it means that an ANN learns to perform a task through training, not by programming as is done in standard computation. (Abdi, 1994)

This is what an artificial neural network looks like:

Figure 2 ANN

Source: NIELSEN, Michael A. (2015). Neural Networks and Deep Learning.


3.1.1 Simple perceptron

It is a type of artificial neuron developed by Frank Rosenblatt (1958). The perceptron works like this: it takes binary inputs $x_j$ and generates a single binary output, 0 or 1.

Very important weights $w_j$ are assigned to each input; these weights express the importance of the inputs to the output. The value of the output (0 or 1) then depends on whether the weighted sum of inputs $\sum_j w_j x_j$ is greater than or less than a threshold value.

$\mathrm{Output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \qquad (3.1)$

If we change the notation a bit, using $w \cdot x$ instead of $\sum_j w_j x_j$, and move the threshold to the other side of the inequality and replace it by a bias $b$, we get this notation:

$\mathrm{Output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}$

The bias can be defined as a measure of how easy it is to get the perceptron to output a 1. When the bias is very large, it is easy for the perceptron to output a 1; if the value of the bias is very negative, it is difficult to get an output of 1. (Nielsen, 2015)

Figure 3 Simple perceptron

Source: NIELSEN, Michael A. (2015). Neural Networks and Deep Learning.

This is a basic mathematical model. By assigning different weights and setting various thresholds, we can get different models. But the perceptron is a very simple model and it is not very useful in practice.
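A minimal Python sketch of the perceptron rule above, written in its bias form; the example weights, bias and inputs are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def perceptron_output(x, w, b):
    """Perceptron rule in its bias form: output 1 if w.x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

# Example: two binary inputs with weights 0.6 and 0.4 and bias -0.5
print(perceptron_output(x=np.array([1, 0]), w=np.array([0.6, 0.4]), b=-0.5))  # -> 1
print(perceptron_output(x=np.array([0, 1]), w=np.array([0.6, 0.4]), b=-0.5))  # -> 0
```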

3.1.2 Sigmoid neuron

The sigmoid neuron is similar to the simple perceptron. The difference between them is that a small change in the weights and bias of a sigmoid neuron causes only a small change in the output. This is not guaranteed with the simple perceptron. That is a significant fact, due to which sigmoid neurons are able to learn. (Nielsen, 2015)

A sigmoid neuron also has inputs, but the values of the inputs are not limited to being only 0 or 1; they can also take any value between 0 and 1. It also has weights for all inputs and an overall bias. But instead of an output of 0 or 1, there is an output in the form of the sigmoid function $\sigma(z)$, or $\sigma(w \cdot x + b)$, and it is defined as follows:

$\sigma(z) = \dfrac{1}{1 + e^{-z}} \qquad (3.2)$

Sigma is basically the logistic function, and this class of neurons is called logistic neurons. The explicit formula for the output of a logistic/sigmoid neuron is:

$\dfrac{1}{1 + \exp\left(-\sum_j w_j x_j - b\right)} \qquad (3.3)$

Sigmoid neurons are used in practice more often than perceptrons.

3.1.3 The architecture of ANN and Activation function

The artificial neural network is composed of different layers. In a feedforward neural network, the leftmost layer is the input layer and the rightmost layer is the output layer. In between them is space for so-called hidden layers. In recurrent neural networks it is a bit more complicated, because we can move in both directions between the layers.

Figure 4 The architecture of ANN Source: Author’s own work

In figure 4, on the left side there is a neural network with just a single hidden layer, while on the right side there is a neural network with multiple hidden layers. Concretely, there are input and output layers (the leftmost and rightmost layers), and in between them there are two hidden layers.

While the way the input and output layers work is pretty straightforward, it is not that clear for the hidden layers. Many design heuristics for hidden layers have been developed to help get the behaviour of the ANN exactly as we want. (Nielsen, 2015)

Activation functions are a crucial part of neural networks. They define the way the weighted sum of inputs is transformed into an output in the layers of the network. Different activation functions can be used in different parts of the model. The activation function for the hidden layers determines how the network learns the training dataset. The activation function for the output layer defines the type of prediction the model makes. Hidden layers usually use the same activation function. The output layer often uses an activation function different from the one used in the hidden layers. These functions are typically differentiable. (Brownlee, 2021) There exist many types of activation functions; the ones most used in practice are described below.

For hidden layers, the three most used activation functions are Rectified linear activation, Logistic activation and Hyperbolic tangent activation.

The logistic activation function, or sigmoid function, is the same function as the one used in logistic regression. This function takes real-valued inputs and the output is only within the range 0 to 1. The equation of the logistic activation function is the same as equation (3.2).

Figure 5 Sigmoid unit Source: MITCHELL (1997)

The hyperbolic tangent activation function, or Tanh for short, is very similar to the logistic activation function. The function takes a real value as input and the output is within the range -1 to 1. The Tanh activation function is defined as follows:

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (3.4)$


The rectified linear activation function, or ReLU for short, is easy to implement and it also overcomes limitations of the logistic and hyperbolic tangent activation functions, such as the vanishing gradient. ReLU is defined as follows:

$\max(0, x) \qquad (3.5)$

If the value of x is negative, the function returns the value 0; otherwise, the value x is returned.

Figure 6 Activation functions for hidden layers

The output-layer activation functions most used in practice are the linear, logistic and softmax functions.

The linear output activation function is the easiest. It does not change the weighted sum of inputs and returns the value directly. It is also called the identity function.

The logistic or sigmoid output activation function was defined before. The softmax output activation function generates an output as a vector of values with the same length as the input vector, which sum to 1. It can be interpreted as probabilities of class membership. (Brownlee, 2021) The function can be calculated as follows:

$\dfrac{e^{x_i}}{\sum_j e^{x_j}} \qquad (3.6)$

Choosing an appropriate output activation function is based on the type of prediction problem being solved.

In neural networks, light is also shed on the problem of the vanishing gradient. Basically, when more layers are added to a neural network, the gradients of the loss function go to zero, making the network hard to train. If n hidden layers use, for example, the sigmoid activation function, n small derivatives are multiplied together, and thus the gradient decreases exponentially as we go down to the initial layers. As a solution we may use another activation function or residual networks. (Wang, 2019)
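The activation functions described in this section can be written compactly in Python; the sketch below is illustrative, with the corresponding equation numbers noted in the comments.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation (3.2): maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(x):
    """Hyperbolic tangent activation (3.4): output in (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Rectified linear activation (3.5): max(0, x)."""
    return np.maximum(0.0, x)

def softmax(x):
    """Softmax output activation (3.6): non-negative values summing to 1."""
    e = np.exp(x - np.max(x))  # subtract the maximum for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), softmax(z))
```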


3.1.4 Feedforward neural networks

The feedforward neural network is an essential neural network model. Its main representative is the multilayer perceptron (MLP). The logic of this neural network is that it approximates a function f, and the information moves only in one direction through the layers. It is fed forward, never fed back. There are no feedback connections between the layers and no loops in the network. Feedforward neural networks usually comprise two or more hidden layers. The number of layers in the neural network gives us the depth of the model, while the number of units in the hidden layers gives us the width of the model. (Goodfellow, Bengio & Courville, 2016)
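For a concrete picture of a single-hidden-layer feedforward network, here is a minimal Python/Keras sketch on random placeholder data; the number of lags, layer sizes, activation choices and training settings are illustrative assumptions only, not the thesis configuration (which was built in R).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy autoregressive setup: predict y_t from its last 10 lags (sizes are illustrative).
n_lags = 10
x_train = np.random.rand(500, n_lags)
y_train = np.random.rand(500)

model = keras.Sequential([
    keras.Input(shape=(n_lags,)),
    layers.Dense(8, activation="sigmoid"),  # single hidden layer
    layers.Dense(1, activation="linear"),   # linear output for regression
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)
```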

3.1.5 Recurrent neural networks (RNN)

A recurrent neural network is a model in which feedback loops are possible. This type of neural network is specialized for processing sequential data or time series data.

It is typical for this type of network that parameters are shared across different parts of the model. (Goodfellow, Bengio & Courville, 2016)

Recurrent neural networks are usually used for problems like speech recognition or language translation and they are incorporated in some of the popular applications e.g. Siri or Google Translate.

RNNs have been less influential than feedforward neural networks, but they are really interesting in that they are much closer to how our brains work than feedforward neural networks. It is possible that recurrent neural networks can solve some problems which would be solved only with difficulty using feedforward neural networks. (Nielsen, 2015)

RNNs use their internal memory (state); they take information from prior inputs in order to influence the current input and output. While traditional neural networks assume independence of prior inputs and current outputs, the output of an RNN depends on the prior elements within the sequence. The formula for the current state is defined as follows:

$h_t = f(h_{t-1}, x_t) \qquad (3.7)$

where h is a single hidden vector. When we apply some activation function we get:

$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t) \qquad (3.8)$

Here $W_{hh}$ is the weight at the previous hidden state, $W_{xh}$ is the weight at the current input state and tanh is the activation function. Putting this all together, we get an output:

$y_t = W_{hy} h_t \qquad (3.9)$

$y_t$ is the output and $W_{hy}$ is the weight at the output state. Future events would also be helpful in generating the output of a given sequence, but unidirectional RNNs are not able to account for these events in their predictions. Fortunately, there are also bidirectional recurrent neural networks, which are able to process the future data to improve the accuracy of predictions. (IBM, 2020)

Unfortunately, there are also some drawbacks of RNNs, such as a very difficult training process, exploding gradients and vanishing gradients. Gradient issues depend on the size of the gradient, that is, the slope of the loss function along the error curve. (Goodfellow, Bengio & Courville, 2016) One more disadvantage is that if we are using Tanh or ReLU as an activation function, the RNN cannot process very long sequences and thus it cannot capture the long-memory properties of the data.
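The recurrence in equations (3.7)-(3.9) can be sketched directly in a few lines of Python; the dimensions and random weights below are arbitrary illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One recurrent step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden, inputs))
W_hh = rng.normal(size=(hidden, hidden))
W_hy = rng.normal(size=(1, hidden))

h = np.zeros(hidden)
for x in rng.normal(size=(5, inputs)):  # unroll over a short input sequence
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy)
print(y)
```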

3.1.6 LSTM networks

The long short-term memory network is a modified version of the RNN architecture. It was invented by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradient problem.

As a type of recurrent neural network, it contains feedback loops. A crucial step forward here has been to make the weight on this self-loop conditioned on the context, rather than fixed. By making the weight of this self-loop gated, controlled by another hidden unit, the time scale of integration can be changed dynamically. Thus, even for an LSTM with fixed parameters, the time scale of integration can change based on the input sequence, because the time constants are output by the model itself. (Goodfellow, Bengio & Courville, 2016)

This type of network specializes in the problem of long-term dependencies. If the previous state that is influencing the current prediction is not in the recent past, the RNN may not be able to correctly predict the current state. As a solution, LSTMs have "cells" in the hidden layers, with three gates: an input gate, an output gate, and a forget gate. These gates control the information flow which is needed to predict the output. (IBM, 2020)

Figure 7 Architecture of LTSM network

Source: https://aditi-mittal.medium.com/understanding-rnn-and-lstm-f7cdf6dfc14e

The input gate determines which values from the input will be used to modify the memory. In this gate, the sigmoid and tanh functions are used:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad c_t = f_t * c_{t-1} + i_t * \tilde{C}_t \qquad (3.10)$

The forget gate finds out which information should be discarded. This is done using the sigmoid function:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (3.11)$

In the output gate, the input and the memory of the block generate the output:

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \qquad h_t = \tanh(c_t) * o_t \qquad (3.12)$

The sigmoid function chooses which values to let through, and the tanh function generates a weighting of the values that passed, which tells us the level of their importance. In the equations, $W$ and $U$ are matrices containing the weights of the input and recurrent connections.
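A minimal Python/Keras sketch of an LSTM network with two hidden LSTM layers, mirroring the architecture described in the thesis only at a high level; the input shapes, layer sizes and training settings are illustrative assumptions, not the thesis configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy sequences: 200 samples, 60 time steps, 1 feature (shapes are illustrative only).
x_train = np.random.rand(200, 60, 1)
y_train = np.random.rand(200)

model = keras.Sequential([
    keras.Input(shape=(60, 1)),
    layers.LSTM(32, return_sequences=True),  # first hidden LSTM layer, passes the full sequence on
    layers.LSTM(32),                         # second hidden LSTM layer, returns the last state only
    layers.Dense(1),                         # one-step-ahead forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=5, batch_size=16, verbose=0)
```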


3.1.7 Training of the model

Artificial neural networks are trained by processing examples, each of which contains a known input and result, forming probability-weighted associations between the two, which are stored within the data structure of the net. The training of an artificial neural network is conducted by determining the difference between the processed output of the network and a target output; in other words, it is driven by the error. In the learning process the network then adjusts its weights according to a learning rule, using the error value, to improve the accuracy of the result. If this is successful, it will produce output which is increasingly similar to the target output. After a sufficient number of these adjustments the training can be terminated based upon certain criteria, by which is usually meant minimizing the observed errors. Learning is complete when additional observations do not reduce the error rate. Even after learning, the error rate typically does not reach zero, although Du, Lee, Li, Wang & Zhai (2019) prove in their paper that gradient descent achieves zero training loss in polynomial time for a deep overparameterized neural network with residual connections (ResNet). If, after learning, the error rate is too high, the network should be redesigned. This is done by defining a loss function that is evaluated periodically during learning; as long as its output continues to decline, learning continues. (Zell, 2003)

The loss function is a function used in the training algorithm, which finds weights and biases so that the output from the network approximates the function y(x) for all training inputs x. (Nielsen, 2019)

One example of a loss function can be written as follows:

$C(w, b) \equiv \dfrac{1}{2n} \sum_x \| y(x) - a \|^2 \qquad (3.13)$

where $w$ denotes all the weights in the network, $b$ all the biases, $n$ is the number of training inputs, $a$ is the vector of outputs from the network when $x$ is the input, and the sum is over all training inputs $x$. Formula (3.13) is called the quadratic loss function; it is also known as the mean squared error. If a smooth loss function like the quadratic loss is used, it is easy to figure out how to make small changes in the weights and biases so as to obtain an improvement in the loss. So one first focuses on minimizing the quadratic loss, and only after that examines the classification


accuracy. The main goal in training an artificial neural network is to find weights and biases that minimize the quadratic loss function C(w, b). For this problem we can imagine that a function of many variables is given and we want to minimize that function. This minimization problem can be solved with a technique called gradient descent. (Nielsen, 2019)

Gradient descent is an optimization technique which is used to improve neural network-based models by minimizing the loss function. It is a process that occurs in the backpropagation phase, where the parameters are repeatedly updated in the direction opposite to the gradient with respect to the weights w, until we reach the minimum of the function J(w). The process of gradient descent is shown in figure 8.

Figure 8 Gradient descent process

Source: https://towardsdatascience.com/gradient-descent-3a7db7520711

There are a number of gradient descent algorithms, for example batch gradient descent, stochastic gradient descent, mini-batch gradient descent and more. In the total (batch) gradient descent algorithm, the weights are updated after each sweep over the training set. But the stochastic gradient descent algorithm has been shown to be faster and more reliable. In this algorithm, the weights are updated after the presentation of each example, according to the gradient of the loss function. (Bottou, 1991)

Stochastic gradient descent randomly picks out a small number $m$ of training inputs, called a mini-batch. If the sample size $m$ is large enough, it is expected that the average value of the gradient $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is:

$\nabla C \approx \dfrac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j} \qquad (3.14)$

In practice, stochastic gradient descent is a widely used and powerful technique for learning in artificial neural networks.
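A minimal Python sketch of mini-batch stochastic gradient descent for a quadratic loss, illustrating the weight update and the mini-batch gradient estimate (3.14); the data, learning rate and batch size are illustrative assumptions.

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.1):
    """One stochastic gradient descent update for the quadratic loss mean((Xw - y)^2)/2."""
    grad = X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)  # mini-batch gradient estimate
    return w - lr * grad

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(1000, 3)), np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
for step in range(50):
    batch = rng.choice(len(y), size=32, replace=False)  # random mini-batch of m = 32
    w = sgd_step(w, X[batch], y[batch])
print(w)  # moves toward true_w as updates accumulate
```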

Backpropagation is a method used to adjust the connection weights to compensate for each error found during learning. The error amount is effectively divided among the connections. In other words, backpropagation calculates the gradient of the cost function associated with a given state with respect to the weights. The weight updates can then be done through stochastic gradient descent or other methods, such as extreme learning machines.

The backpropagation algorithm was first discovered in the 1970s, but it became widely known after a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams was published.

Backpropagation computes the partial derivatives of the cost function with respect to the weights in the network. It tells us how quickly the cost function changes when the weights and biases are changed. Backpropagation is a fast algorithm for learning which gives detailed insights into how changing the weights and biases changes the overall behaviour of the network. The goal of backpropagation is to compute the partial derivatives of the cost function with respect to any weight or bias in the network. For backpropagation to work, two assumptions need to be made about the form of the cost function. The first assumption is that the cost function can be written as an average over the cost functions of individual training examples. The reason for this assumption is that what backpropagation actually lets us do is compute the partial derivatives for a single training example, and we then recover the full derivatives by averaging over training examples.

The second assumption is that the cost function can be written as a function of the outputs from the neural network. (Goodfellow, Bengio & Courville, 2016)

Neural networks are typically trained using the backpropagation of error algorithm.


3.2 K-nearest neighbourhood

The k-nearest neighbors algorithm (KNN) is a non-parametric classification method first developed by Evelyn Fix and Joseph Hodges (1951), and later expanded by Thomas Cover and Peter Hart (1967). The KNN algorithm assumes that similar things exist in close proximity, i.e. that similar things are near to each other. It is used for classification and regression, and in both cases the input consists of the k closest training examples in the data set. The KNN algorithm can be applied as a fundamental prediction technique when there is little or no prior knowledge about the distribution of the data.

K-nearest neighbours is a supervised machine learning algorithm that stores all the available cases and classifies new data based on a similarity measure. In the classification setting, the algorithm essentially forms a majority vote between the k instances most similar to a given unseen observation. Similarity is defined according to a distance metric between two data points. Very popular is the Euclidean distance, but the Manhattan, Minkowski, and Hamming distance methods are also used. In the regression setting, predictions are constructed as the average of the targets of the k nearest neighbours.

The KNN method is simple to implement and with enough data it can do a good job.

But it is necessary to have a meaningful distance function. We also have to keep in mind that the main disadvantage of KNN is a problem called the "curse of dimensionality": there is probably not enough data for the number of dimensions. With an increasing number of dimensions, the size of the data space increases exponentially, and the amount of data needed to maintain density also increases. Without significant increases in the size of the data, KNN loses predictive power. (Grant, 2019)

There is also the crucial problem of determining the value of k. Finding a relevant value of k can be quite a challenge, because there is no structured method for finding it.

Choosing smaller values of k can be noisy and will have a higher influence on the result. On the other hand, choosing larger values of k will give smoother decision boundaries and thus lower variance, but increased bias. One way to determine k is to try various values by trial and error, treating the training data as unknown. Another way to choose k is through cross-validation. The procedure is that we take a small portion of the training dataset and call it a validation dataset, and then use it to evaluate different possible values of k. This way the label for every instance in the validation set is predicted using different values of k, and we then choose the value of k that gives us the best performance on the validation set. Then we take that value and use it as the final setting of the algorithm, so that we are minimizing the validation error. In general practice, the value of k can be chosen as follows:

$k = \sqrt{N} \qquad (3.15)$

where N equals the number of samples in the training dataset.

A confusion matrix or "matching matrix" is often used as a tool to validate the accuracy of k-NN classification.

KNN’s main disadvantage is that the algorithm becomes significantly slower as the volume of data increases. This makes it an impractical choice in environments where predictions need to be made rapidly. Moreover, there are faster algorithms that can produce more accurate classification and regression results.

But if there are sufficient computing resources to speedily handle the data that are used to make predictions, KNN can be useful in solving problems that have solutions that depend on identifying similar objects.
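As an illustration, the following Python sketch fits a KNN regression with scikit-learn and applies the rule-of-thumb choice of k from (3.15); the toy data are an assumption for demonstration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy regression data; in the thesis setting the features would be lagged volatility values.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=400)

k = int(np.sqrt(len(X)))                      # rule-of-thumb choice k = sqrt(N) from (3.15)
knn = KNeighborsRegressor(n_neighbors=k, metric="euclidean")
knn.fit(X, y)
print(knn.predict([[2.5]]))                   # average of the targets of the k nearest neighbours
```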


3.3 Validation of models

To assess the reliability of the predictions and therefore to choose the best model, forecasting accuracy needs to be evaluated. Here it is represented by the root mean squared error.

The root mean squared error is calculated as follows:

$RMSE = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2} \qquad (3.16)$

where n is the number of observations, f the forecasts and o the observed values.

Compared to the simpler MAE metric (mean absolute error), which treats all errors equally, RMSE penalizes large errors due to the squared term. In the simplest setting, the data are divided into two splits, training data and testing data, in some split ratio, typically 70:30 in favour of the training data. All the modeling is performed on the training data; then, based on these models, predictions are made and compared with the testing data. Based on this process, the root mean squared error of all models is calculated and compared, and the best model with the minimal RMSE is selected.
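A minimal Python sketch of the chronological train/test split and the RMSE computation (3.16) described above; the toy series and the naive placeholder "model" are assumptions for illustration.

```python
import numpy as np

def rmse(forecasts, observed):
    """Root mean squared error (3.16)."""
    forecasts, observed = np.asarray(forecasts), np.asarray(observed)
    return np.sqrt(np.mean((forecasts - observed) ** 2))

# Simple 70:30 chronological split of a toy series (no shuffling for time series).
y = np.random.default_rng(0).normal(size=1000)
split = int(0.7 * len(y))
train, test = y[:split], y[split:]

naive_forecast = np.full(len(test), train.mean())  # placeholder model: predict the training mean
print(rmse(naive_forecast, test))
```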

Unfortunately, searching for the best model with this method by minimizing the RMSE may lead to overfitted models. When looking for the optimal error among models, a balance between variance and bias needs to be found.

Figure 9 Variance vs bias balance Source: Formánek, 2020


This problem can be solved by so-called k-fold cross-validation, which is commonly used in practice to select the best model. K-fold cross-validation is also used to validate the models in this master thesis.

K-fold cross-validation uses re-sampling. The data are randomly partitioned into k equal subsets. One of the k parts is used as the test sample, and the remaining k-1 parts are the training data. The model is then repeatedly fit k times, and each of the k parts of the data is used as the test sample exactly once. Finally, the k results from the folds are averaged to make a single estimate of, in our case, the root mean squared error. (Formánek, 2020) As we want to get the best model, we look for the lowest cross-validation result.
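A minimal Python sketch of k-fold cross-validation with the fold RMSEs averaged into a single score; the toy data, the choice of KNN as the model being validated and the use of 5 folds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=300)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsRegressor(n_neighbors=7).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_rmse.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))

print(np.mean(fold_rmse))  # averaged RMSE over the k folds; lower is better
```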


4 Literature review

Volatility modeling is very widespread. Many authors test and compare different types of models to achieve the best results. From GARCH models, for example, Zhang, Haghani and Zeng (2014) tested the Component GARCH model to predict travel time.

They found that the standard GARCH models were not able to describe the seasonality and trend in the data, and therefore preferred the Component GARCH model.

Narsoo (2016) used 1-minute EUR/USD exchange rate data with the presence of seasonality to model volatility. Due to the presence of seasonality, the author used MC-GARCH to model volatility and found that this model is suitable for modeling intraday volatility data and also for modeling intraday VaR. Next, the volatility of the Nasdaq-100 non-financial stock index was modeled by Aliyev, Ajayi & Gasim (2020) using asymmetric EGARCH and GJR-GARCH models. The authors revealed the presence of a leverage effect and volatility clusters, with negative shocks having a much greater impact than positive shocks. Volatility modeling has also been used for cryptocurrencies. Naimy, Haddad, Fernandez-Avilés & Khoury (2021) modeled six major cryptocurrencies using modified GARCH models. Among them, the GJR-GARCH model, which was able to describe the asymmetries in the volatility of cryptocurrencies, had the best results. GARCH models are also used, for example, in modeling dynamic covariance matrices. De Nard, Engle, Ledoit & Wolf (2020) have worked on this topic, found that this model did not work very well, and therefore used the new DCC-NL model.

Abounoori & Zabol (2020) modeled intraday gold data, and their best model was RGARCH (realized GARCH), which is a simultaneous model of realized volatility and conditional variability.

In all the mentioned papers, the authors used some of the modifications of the standard GARCH model, because they can better capture asymmetries in volatility.

A large number of authors have used a variety of machine learning models to forecast volatility. For example, Bucci (2020) compared the predictive ability of feedforward and recurrent neural networks and found that recurrent neural networks can outperform traditional econometric models. She also used LSTM and NARX models and showed improved forecasting accuracy in highly volatile times. Carr, Wu and Zhang (2020) used feedforward neural networks, random forests and ridge regression in their paper. They showed that a new weighting method achieves better predictability of future return variance, and they also add that methods combining traditional and machine learning techniques perform the best. In the paper by Faldzinski, Fiszeder & Orzeszko (2021), the performance of another machine learning model, support vector machines, is compared with GARCH models. They found that, if the squared daily return is used as a proxy for volatility, support vector machines can produce forecasts with lower errors than GARCH models. However, they showed that if the Parkinson estimator is used as a proxy of volatility, GARCH models perform better. Other authors, Alonso-Monsalve, Suárez-Cetrulo, Cervantes & Quintana (2020), compare in their paper multilayer perceptrons with convolutional neural networks on high-frequency data of cryptocurrency exchange rates. They concluded that convolutional neural networks significantly outperformed the other models. The volatility of gold was predicted by Vidal & Kristjanpoller (2020) using a hybrid model. They added an LSTM network model to convolutional neural networks and observed a significant improvement in the predictions of gold volatility.

A lot of papers combined neural networks and the GARCH model, creating a hybrid model, and used it to forecast volatility. These papers used this type of method and studied different kinds of assets such as stock indices, metals, oil, or bitcoin. All of them concluded that hybrid neural networks can forecast volatility well.

For example, Verma (2021) forecasts the volatility of crude oil using a hybrid model combining a GARCH model with LSTM networks. The result is that the hybrid model improves forecasting accuracy and performs better than an ordinary GARCH model.

Mademlis & Dritsakis (2021) found a significant leverage effect in the Italian stock market and also examined whether hybrid models can improve volatility predictions. They proved their hypothesis and concluded that a hybrid model consisting of a neural network and an EGARCH model provides the best results. Liu & So (2020) also dedicated their paper to forecasting volatility with a hybrid model incorporating a GARCH model into an artificial neural network. The result is, again, that the hybrid model outperforms the standard GARCH model.

Many authors have also dedicated their studies to another machine learning method, k-nearest neighborhood. For example, Cortez, Rodríguez-García & Mongrut (2021) used a k-nearest neighbor approach to predict exchange market liquidity. They compared predictions of the short-term market liquidity of crypto and fiat currencies using time-series models such as ARMA and GARCH and a nonparametric machine learning algorithm, the KNN approach. They found that KNN is a better predictor of the log rate of the bid-ask spreads of crypto and fiat currencies than the ARMA and GARCH models because of the nonlinearity of market liquidity and the complexity of its market microstructure. The result thus is that the KNN approach is better at capturing the short-term liquidity of cryptocurrencies than the ARMA and GARCH models.

Lahmiri & Bekiros (2020) also incorporated KNN in their study on forecasting the intraday Bitcoin market. They employed three different types of models: machine learning approaches including support vector regressions and Gaussian Poisson regressions, algorithmic models such as regression trees and k-nearest neighbours, and also artificial neural network topologies such as feedforward, Bayesian regularization and radial basis function networks. Their results show that the radial basis function networks achieve outstanding accuracy in forecasting. The overall advantage of artificial neural networks is due to parallel processing features that efficiently simulate human decision-making in the presence of underlying nonlinear input-output relationships in noisy signal environments.

This work uses a modification of the standard GARCH model, MC-GARCH, as the benchmark model, and it is compared with machine learning methods: artificial neural networks and k-nearest neighbourhood. More precisely, the predictive ability of these models is compared using intraday volatility rolling n-period-ahead forecasts.


5 Data and procedures

This chapter first contains a description of the data and the work with them, such as transformation, scaling, testing and diagnostic tests. Then the applications of the individual models are shown and all procedures of modeling and prediction are described. The aim of this work is to compare the predictive ability of four models: MC-GARCH, a feedforward neural network autoregression model, an LSTM network and k-nearest neighborhood. For each model, three different n-period-ahead rolling predictions were compared, more precisely 1-minute, 60-minute and 389-minute period-ahead rolling predictions.

For the purposes of this work, intraday minute data of the financial time series of the company Tesla, Inc. were used. The company's segments include automotive, and energy generation and storage. The data come from Finam, for the period from 13.9.2021 to 12.11.2021, and we have 16901 observations, which is approximately two months of data, excluding weekends. Within a day, the time range of the minute data is from 9:30 to 16:00, so for one day there are 389 observations. This time range corresponds to the official US stock exchange trading hours. The process of Tesla's intraday prices is shown in figure 10 below.

Figure 10 Tesla time series

Source: Author’s own work, data from Finam

An additional dataset was needed for calculating the daily exogenously determined volatility forecast for the MC-GARCH model. These data were downloaded from Yahoo Finance for the period from 14.10.2019 to 13.10.2021.


All calculations were performed in R 4.1.2 using various implemented packages. When running the LSTM algorithm, the Anaconda environment was needed in order to use the Keras package.

The first important step is to transform the data from the original close prices to logarithmic returns and then to volatility. The reason for transforming the data into logarithmic returns is the general assumption that stock prices are log-normally distributed; log returns are also usually already stationary and therefore no further adjustments are needed. Logarithmic returns are calculated using the following formula:

$r_t = \ln\left(\dfrac{x_t}{x_{t-1}}\right) \qquad (5.1)$

where $x_t$ is the close price at time t and $x_{t-1}$ is its lagged value. The logarithmic returns of Tesla, Inc. are shown in figure 11.

Figure 11 Logarithmic returns of Tesla, Inc.

Source: Author’s own work

In figure 11 we can clearly see that there are some clusters, and so we observe that periods of greater variability alternate with periods of less variability. Next, volatility is calculated from the logarithmic returns by taking the $r_t$ from formula (5.1) to the second power.
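A minimal Python sketch of formula (5.1) and of squaring the returns to obtain a volatility proxy; the few price values are made up for illustration and are not the Tesla data.

```python
import numpy as np
import pandas as pd

# Hypothetical close-price series; in the thesis these are 1-minute Tesla close prices.
close = pd.Series([1040.0, 1042.5, 1041.0, 1045.2, 1044.1])

log_returns = np.log(close / close.shift(1)).dropna()  # formula (5.1)
volatility_proxy = log_returns ** 2                    # squared returns as a volatility proxy
print(log_returns, volatility_proxy, sep="\n")
```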
