Predicting volatility with neural networks

Predicting realized volatility is critical for trading signals and position calibration. Econometric models, such as GARCH and HAR, forecast future volatility based on past returns in a fairly intuitive and transparent way. However, recurrent neural networks have become a serious competitor. Neural networks are adaptive machine learning methods that use interconnected layers of neurons. Activations in one layer determine the activations in the next layer. Neural networks learn by fitting the weights and biases that feed their activation functions to training data. Recurrent neural networks are a class of neural networks designed for modeling sequences of data, such as time series. Specialized recurrent neural networks have been developed to retain longer memory, particularly LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). The advantage of neural networks is their flexibility to include complex interactions of features, non-linear effects, and various types of non-price information.

The post below is based on various papers and posts that are linked next to the quotes. Headings, italic text, and text in brackets have been added. Also, a range of orthographic and grammatical errors has been corrected and mathematical expressions have been put into plain language.

This post ties in with this site’s summary of statistical methods.

Some formal basics of volatility

“Volatility changes over time. High volatility means high risk and sharp price fluctuations, while low volatility refers to smooth price changes…Simulations of [asset prices] are often modeled using stochastic differential equations…[that include a] drift coefficient or mean of returns over some time period, a diffusion coefficient or the standard deviation of the same returns, [and a stochastic process as] Wiener process or Brownian Motion…Usually…volatility changes stochastically over time…The volatility’s randomness is often described by a different equation driven by a different Wiener process. [That] model is called a stochastic volatility model…Stochastic volatility models are expressed as a stochastic process, which means that the volatility value at time t is latent and unobservable.” [Antulov-Fantulin and Rodikov]
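As a rough illustration of the quoted setup, the sketch below simulates a price whose volatility follows its own random process, driven by a second Wiener process. The specific form (a mean-reverting log-variance), the function name `simulate_stochastic_vol`, and all parameter values are assumptions chosen for illustration and are not taken from the cited paper.

```python
import numpy as np

def simulate_stochastic_vol(n_steps=252, dt=1 / 252, mu=0.05, kappa=3.0,
                            long_run_log_var=np.log(0.04), vol_of_vol=0.8, seed=0):
    """Euler-type simulation of a price with latent, mean-reverting log-variance.

    All parameters are hypothetical: mu is the drift, kappa the speed of mean
    reversion of log-variance, vol_of_vol the diffusion of log-variance.
    """
    rng = np.random.default_rng(seed)
    prices = np.empty(n_steps + 1)
    vols = np.empty(n_steps + 1)
    prices[0] = 100.0
    log_var = long_run_log_var
    vols[0] = np.exp(0.5 * log_var)
    for t in range(n_steps):
        # Two independent Wiener increments: one for the price, one for volatility
        dw_price, dw_vol = rng.normal(0.0, np.sqrt(dt), size=2)
        # Price step: drift term plus diffusion term scaled by current volatility
        prices[t + 1] = prices[t] * np.exp(
            (mu - 0.5 * vols[t] ** 2) * dt + vols[t] * dw_price
        )
        # Latent log-variance step: pulled toward its long-run level, plus noise
        log_var += kappa * (long_run_log_var - log_var) * dt + vol_of_vol * dw_vol
        vols[t + 1] = np.exp(0.5 * log_var)
    return prices, vols

prices, vols = simulate_stochastic_vol()
```

In the simulation the volatility path `vols` is known, but in real data only prices are observed, which is why volatility is called latent.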

“Daily realized volatility is defined as the square root of the sum of intra-day squared returns…Realized volatility (RV) is a consistent estimator of the square root of the integrated variance (IV). There is even a more robust result stating that realized volatility is a consistent estimator of quadratic variation if the underlying process is a semimartingale.” [Antulov-Fantulin and Rodikov]
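The definition quoted above translates almost directly into code. The following is a minimal sketch; the function name `realized_volatility` and the simulated five-minute data are illustrative assumptions.

```python
import numpy as np

def realized_volatility(intraday_prices):
    """Square root of the sum of squared intraday log returns for one day."""
    prices = np.asarray(intraday_prices, dtype=float)
    log_returns = np.diff(np.log(prices))
    return float(np.sqrt(np.sum(log_returns ** 2)))

# Illustrative data (assumed): 78 five-minute returns in one trading session
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.001, size=78)
prices = 100.0 * np.exp(np.concatenate([[0.0], np.cumsum(returns)]))
rv = realized_volatility(prices)
```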

Traditional volatility prediction models

ARCH/GARCH

“Autoregressive Conditional Heteroskedasticity, or ARCH, is a method that explicitly models the change in variance over time in a time series. Specifically, an ARCH method models the variance at a time step as a function of the residual errors from a mean process (e.g. a zero mean)…Generalized Autoregressive Conditional Heteroskedasticity, or GARCH, is an extension of the ARCH model that incorporates a moving average component together with the autoregressive component. Specifically, the model includes lag variance terms (e.g. the observations if modeling the white noise residual errors of another process), together with lag residual errors from a mean process.” [Brownlee]

“The generalized ARCH model [estimates] variance as future volatility [based on] long-run variance and recent variance. Thus, the clustering effect is a sharp increase of volatility…not followed by a sharp drop…Various extensions have been introduced [such as] exponential GARCH, GJR-GARCH, and threshold GARCH [motivated by] stylized facts about volatility.” [Antulov-Fantulin and Rodikov]
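A minimal sketch of the standard GARCH(1,1) variance recursion may help to see how long-run variance and recent variance combine. The parameter values below are assumed for illustration rather than estimated from data, and `garch11_variance` is a hypothetical helper name.

```python
import numpy as np

def garch11_variance(residuals, omega, alpha, beta):
    """Conditional variance path of a GARCH(1,1) model.

    sigma2[t + 1] = omega + alpha * residuals[t]**2 + beta * sigma2[t]
    The recursion starts at the long-run variance omega / (1 - alpha - beta).
    """
    residuals = np.asarray(residuals, dtype=float)
    sigma2 = np.empty(len(residuals) + 1)
    sigma2[0] = omega / (1.0 - alpha - beta)
    for t in range(len(residuals)):
        sigma2[t + 1] = omega + alpha * residuals[t] ** 2 + beta * sigma2[t]
    return sigma2

# Assumed parameters; one large shock (10%) in an otherwise quiet series
shocks = [0.0, 0.10, 0.0, 0.0, 0.0]
sigma2 = garch11_variance(shocks, omega=1e-5, alpha=0.10, beta=0.85)
```

After the shock, the conditional variance jumps and then decays only gradually toward its long-run level, which is the clustering effect described in the quote.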

HAR

“The HAR [heterogeneous autoregression] model essentially claims that the conditional variance of … returns is a linear function of the lagged squared return over the identical return horizon in combination with the squared returns over longer and/or shorter return horizons…Inspired by the success of HAR-type models, most work…has extended the HAR model in the direction of generalizing with jumps, leverage effects, and other nonlinear behaviors…The HAR model has an intuitive interpretation that agents with daily, weekly, and monthly trading frequencies perceive and respond to, altering the corresponding components of volatility.” [Qiu et al.]

“The heterogeneous Autoregression Realized Volatility (HAR-RV) model…is based on the assumption that agents’…perception of volatility depends on their investment horizons and [can be] divided into short-term, medium-term and long-term…Different agents…have different investment periods and participate in trading on the exchange with different frequencies…and respond to different types of volatility…A short-term agent may react differently to fluctuations in volatility compared to a medium- or long-term investor. The HAR-RV model is an additive cascade of partial volatilities generated at different time horizons…[for example] daily, weekly, and monthly observed realized volatilities…that follows an autoregressive process…The HAR-RV approach is a more stable and accurate estimate for realized volatility.” [Antulov-Fantulin and Rodikov]
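The additive cascade can be sketched as an ordinary least squares regression of next-day realized volatility on daily, weekly, and monthly averages of past realized volatility. The window lengths of 5 and 22 trading days are a common convention assumed here, and the function names and simulated data are illustrative.

```python
import numpy as np

def har_design(rv, weekly=5, monthly=22):
    """Build HAR-RV regressors (constant, daily, weekly, monthly) and next-day targets."""
    rv = np.asarray(rv, dtype=float)
    rows, targets = [], []
    for t in range(monthly - 1, len(rv) - 1):
        daily = rv[t]
        week = rv[t - weekly + 1 : t + 1].mean()
        month = rv[t - monthly + 1 : t + 1].mean()
        rows.append([1.0, daily, week, month])
        targets.append(rv[t + 1])
    return np.array(rows), np.array(targets)

def fit_har(rv):
    """Estimate HAR-RV coefficients by ordinary least squares."""
    X, y = har_design(rv)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Simulated realized volatilities (assumed data, for illustration only)
rng = np.random.default_rng(0)
rv = np.abs(rng.normal(0.01, 0.003, size=300))
coef = fit_har(rv)
# One-step-ahead forecast from the latest daily, weekly and monthly components
forecast = coef @ np.array([1.0, rv[-1], rv[-5:].mean(), rv[-22:].mean()])
```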

Neural networks: basics and key types for financial markets

The very basics

“A neural network is an adaptive system that learns by using interconnected nodes or neurons in a layered structure that resembles a human brain. A neural network can learn from data—so it can be trained to recognize patterns, classify data, and forecast future events.” [MathWorks]

“Neural Networks consist of artificial neurons that are similar to the biological model of neurons. It receives data input and then combines the input with its internal activation state as well as with an optional threshold activation function. Then by using an output function, it produces the output.” [hackr.io]

Neural networks consist of layers, i.e. sets of nodes or neurons. There is typically an input layer, an output layer, and a number of hidden layers in between. A neuron is, loosely speaking, a function that returns a number, often scaled to lie between 0 and 1. The number returned by the neuron is called its activation. For example, the neurons of an input layer could be the pixels of an image and the numbers could denote their brightness.
Within a network, activations in one layer determine the activations in the next layer. The activation of a neuron is governed by an activation function that takes as its argument a weighted sum of all the activations of the previous layer. A sigmoid function squashes that sum into the range between 0 and 1, whereas a rectified linear unit (ReLU) returns zero for negative sums and the sum itself otherwise, so its output is not bounded above. The function also uses a bias parameter, whose value determines a threshold that the weighted sum must exceed for the neuron to activate meaningfully.
Learning in neural networks means finding weights and biases that are appropriate for solving the problem at hand, using training data. The main method by which neural networks learn is gradient descent: parameters are set to minimize the average cost of errors, typically the squared differences between the estimated values in the output layer and the actual labels. The learning algorithm searches for that minimum by starting with a random parameter set and then repeatedly changing parameters in the direction that reduces the cost most.
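The mechanics described above can be sketched in a few lines: a network with one hidden layer, sigmoid activations, and plain gradient descent on the mean squared error. The layer size, learning rate, number of steps, and toy data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Activations of the hidden and output layers for inputs X."""
    W1, b1, W2, b2 = params
    hidden = sigmoid(X @ W1 + b1)       # weighted sum plus bias, then activation
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output

def train(X, y, hidden_size=4, lr=0.3, n_steps=2000, seed=0):
    """Gradient descent on mean squared error, starting from random weights."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden_size))
    b1 = np.zeros(hidden_size)
    W2 = rng.normal(0.0, 0.5, (hidden_size, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(n_steps):
        hidden, output = forward(X, (W1, b1, W2, b2))
        error = output - y
        losses.append(float(np.mean(error ** 2)))
        # Backpropagation: gradient of the cost with respect to each parameter
        d_out = 2.0 * error * output * (1.0 - output) / len(X)
        d_hidden = (d_out @ W2.T) * hidden * (1.0 - hidden)
        # Step against the gradient, i.e. in the direction that lowers the cost
        W2 -= lr * (hidden.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hidden)
        b1 -= lr * d_hidden.sum(axis=0)
    return (W1, b1, W2, b2), losses

# Toy data (assumed): label is 1 when the two inputs sum to a positive number
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)
params, losses = train(X, y)
```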

Types of neural networks for financial markets

“Recurrent neural networks (RNN) are a class of neural networks that is powerful for modeling sequence data such as time series…Schematically, a RNN layer uses a for loop to iterate over the timesteps of a sequence, while maintaining an internal state that encodes information about the timesteps it has seen so far.” [TensorFlow]

RNNs are designed to model sequential data. A sequence is an ordered series of states. Examples are text, audio, or time series. RNNs fulfill this function through sequential memory, which makes it easier to recognize sequential patterns. They use a looping mechanism (a simple ‘for’ loop in code) that allows information to flow from one hidden state to the next. Only after all the sequential information has been passed through the hidden layer is the final hidden state passed on and the output layer activated.
RNNs have a short-term memory issue. This means that as steps are added to the loop, the RNN struggles to retain the information of previous steps. This is caused by the “vanishing gradient” problem of backpropagation. Adjustments of parameters based on errors of the output layer decrease with each layer backward. Gradients shrink exponentially as the algorithm backpropagates, for example when moving backward through time steps. Put simply, the earliest layers or time steps barely learn, and long-range dependencies are neglected.
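The loop and the shrinking gradients can both be made concrete with a small sketch of a plain tanh RNN. The hidden size, sequence length, and the rescaling of the recurrent weight matrix to a spectral norm of 0.9 are assumptions chosen so that the decay is easy to see.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over a sequence and return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:  # the 'for' loop over time steps
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return states

def gradient_norms_back_in_time(states, W_h):
    """Spectral norm of d h_T / d h_(T-k) for k = 1, 2, ..., T."""
    jac = np.eye(W_h.shape[0])
    norms = []
    for h_t in reversed(states):
        # One step back: d h_t / d h_(t-1) = diag(1 - h_t^2) @ W_h
        jac = jac @ (np.diag(1.0 - h_t ** 2) @ W_h)
        norms.append(float(np.linalg.norm(jac, 2)))
    return norms

rng = np.random.default_rng(0)
hidden_size, n_steps = 8, 30
W_x = rng.normal(0.0, 0.5, (hidden_size, 1))
W_h = rng.normal(size=(hidden_size, hidden_size))
W_h *= 0.9 / np.linalg.norm(W_h, 2)  # assumed: recurrent weights with spectral norm 0.9
b = np.zeros(hidden_size)
inputs = rng.normal(size=(n_steps, 1))
states = rnn_forward(inputs, W_x, W_h, b)
norms = gradient_norms_back_in_time(states, W_h)
```

Under this assumption, the sensitivity of the final hidden state to a hidden state k steps earlier is bounded by 0.9 to the power k, so early time steps have almost no influence on the training signal.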

Two specialized recurrent neural networks have been developed to mitigate the short-term memory issue: LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). They work like RNNs but are capable of learning long-term dependencies by using “gates”. The gates are tensor operations that learn which information should be added to or removed from the hidden state.

“An LSTM network is a type of recurrent neural network (RNN) that can learn long-term dependencies between time steps of sequence data.” [MathWorks]

“An LSTM has a similar control flow as a recurrent neural network. It processes data passing on information as it propagates forward. The differences are the operations within the LSTM’s cells…
The core concept of LSTMs is the cell state, and its various gates. The cell state acts as a transport highway that transfers relative information all the way down the sequence chain. You can think of it as the ‘memory’ of the network. The cell state, in theory, can carry relevant information throughout the processing of the sequence. So even information from the earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added or removed to the cell state via gates. The gates are different neural networks that decide which information is allowed on the cell state. The gates can learn what information is relevant to keep or forget during training…
We have three different gates that regulate information flow in an LSTM cell. A forget gate, input gate, and output gate…The forget gate…decides what information should be thrown away or kept…The input gate…updates the cell state… The output gate decides what the next hidden state should be.” [Michael Phi]
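A single time step of a standard LSTM cell can be sketched as follows. The parameter names, the dictionary layout, and the initialization are hypothetical; libraries store and initialize these weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(input_size, hidden_size, seed=0):
    """Random weights for the three gates and the candidate cell update (illustrative)."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params["W_" + name] = rng.normal(0.0, 0.1, (hidden_size, hidden_size + input_size))
        params["b_" + name] = np.zeros(hidden_size)
    return params

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and the new cell state."""
    z = np.concatenate([h_prev, x_t])
    forget_gate = sigmoid(params["W_f"] @ z + params["b_f"])  # what to keep from the old cell state
    input_gate = sigmoid(params["W_i"] @ z + params["b_i"])   # how much new information to write
    candidate = np.tanh(params["W_c"] @ z + params["b_c"])    # proposed new cell content
    output_gate = sigmoid(params["W_o"] @ z + params["b_o"])  # what to expose as hidden state
    c_t = forget_gate * c_prev + input_gate * candidate
    h_t = output_gate * np.tanh(c_t)
    return h_t, c_t

params = init_lstm_params(input_size=3, hidden_size=5)
h, c = np.zeros(5), np.zeros(5)
for x_t in np.random.default_rng(1).normal(size=(10, 3)):
    h, c = lstm_step(x_t, h, c, params)
```

The cell state `c_t` is changed only by element-wise scaling and addition, which is what allows information from early time steps to survive over long sequences.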

“The Gated Recurrent Unit (GRU) is the younger sibling of the more popular Long Short-Term Memory (LSTM) network, and also a type of Recurrent Neural Network (RNN). Just like its sibling, GRUs are able to effectively retain long-term dependencies in sequential data.” [Loye]

“To solve the vanishing gradient problem of a standard RNN, GRU uses, so-called, update gate and reset gate. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago, without washing it through time or remove information which is irrelevant to the prediction.” [Kostadinov]

GRU is a lightweight version of LSTM that combines long-term and short-term memory in its hidden state. Thus, while LSTM has cell states and hidden states, GRU has only hidden states. Accordingly, GRU has only two gates: an update gate (that decides how much of past memory to retain) and a reset gate (that decides how much of past memory to forget). Retaining and forgetting are different actions, i.e. different modes of manipulating past information.
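For comparison, a single GRU time step can be sketched as below. The parameter names are hypothetical, and the sign convention for the update gate differs between papers and libraries; this sketch assumes that a value near 1 retains the past hidden state, in line with the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru_params(input_size, hidden_size, seed=0):
    """Random weights for the update gate, reset gate and candidate state (illustrative)."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("z", "r", "h"):
        params["W_" + name] = rng.normal(0.0, 0.1, (hidden_size, hidden_size + input_size))
        params["b_" + name] = np.zeros(hidden_size)
    return params

def gru_step(x_t, h_prev, params):
    """One GRU time step: returns the new hidden state (there is no separate cell state)."""
    z_in = np.concatenate([h_prev, x_t])
    update_gate = sigmoid(params["W_z"] @ z_in + params["b_z"])  # share of past state to retain
    reset_gate = sigmoid(params["W_r"] @ z_in + params["b_r"])   # share of past state used in the candidate
    candidate = np.tanh(
        params["W_h"] @ np.concatenate([reset_gate * h_prev, x_t]) + params["b_h"]
    )
    return update_gate * h_prev + (1.0 - update_gate) * candidate

params = init_gru_params(input_size=3, hidden_size=5)
h = np.zeros(5)
for x_t in np.random.default_rng(1).normal(size=(10, 3)):
    h = gru_step(x_t, h, params)
```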

Application of neural networks for volatility forecasting

“We study and analyze various non-parametric machine learning models for forecasting multi-asset intraday and daily volatilities by using high-frequency data from the U.S. equity market. We demonstrate that, by taking advantage of commonality in intraday volatility, the model’s forecasting performance can significantly be improved…A measure for evaluating the commonality in intraday volatility is proposed, that is the adjusted R-squared value from linear regressions of a given stock’s realized volatility against the market realized volatility…Commonality over the daily horizon is turbulent over time, although commonality in intraday realized volatilities is strong and stable…For most models, the incorporation of commonality leads to better out-of-sample performance through pooling data together and adding market volatility as additional features.” [Zhang et al.]
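The commonality measure described in the quote, an adjusted R-squared from regressing a stock's realized volatility on market realized volatility, can be sketched as follows. The function name and the simulated series are assumptions; the paper's exact regression setup may differ.

```python
import numpy as np

def adjusted_r_squared(stock_rv, market_rv):
    """Adjusted R-squared of a linear regression of stock RV on market RV."""
    y = np.asarray(stock_rv, dtype=float)
    x = np.asarray(market_rv, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept plus market volatility
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n, k = len(y), 1  # k = number of regressors excluding the intercept
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - k - 1))

# Simulated series (assumed): stock volatility loads on market volatility plus noise
rng = np.random.default_rng(0)
market_rv = np.abs(rng.normal(0.010, 0.003, size=250))
stock_rv = 0.002 + 1.5 * market_rv + rng.normal(0.0, 0.002, size=250)
commonality = adjusted_r_squared(stock_rv, market_rv)
```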

“Neural networks are in general, superior to other techniques [reflecting] the capability of neural networks for handling complex interactions among predictors… The high-dimensional nature of ML methods allows for better approximations to unknown and potentially complex data-generating processes, in contrast with traditional economic models…Furthermore, to alleviate the concerns of overfitting, we conduct a stringent out-of-sample test, using the existent trained models to forecast the volatility of completely new stocks that are not included in the training sample. Our results reveal that neural networks still outperform other approaches (including the OLS models trained for each new stock).” [Zhang et al.]

“We investigate whether a totally nonparametric model is able to outperform econometric methods in forecasting realized volatility. In particular, the analysis …compares the forecasting accuracy of time series models with several neural networks architectures…The data set employed in this study comprises…observations from February 1950 to December 2017 of the Standard & Poor’s (S&P) index…The latent volatility is estimated through the ex-post measurement of volatility based on high-frequency data, namely realized volatility…computed as the sum of squared daily returns…Recurrent neural networks are able to outperform all the traditional econometric methods. Additionally, capturing long-range dependence through LSTM seems to improve the forecasting accuracy also in a highly volatile period.” [Bucci]

“We have applied a Long Short-Term Memory neural network to model S&P 500 volatility, incorporating Google domestic trends as indicators of the public mood and macroeconomic factors…This work shows the potential of deep learning financial time series in the presence of strong noise [and holds] strong promise for better predicting stock behavior via deep learning and neural network models.” [Xiong, Nichols and Shen]

“This study investigates the strengths and weaknesses of machine learning models for realised volatility forecasting of 23 NASDAQ stocks over the period from 2007 to 2016. Three types of daily data are used, variables used in the HAR-family of models, limit order book variables and news sentiment variables…Using a Long-Short-Term-Memory (LSTM) model combined with…four sets of variables each with 21 lags are trained with the loss function of minimising mean squared errors. These experiments provide strong evidence for the stronger forecasting power of machine learning models than all HAR-family of models.” [Rahimikia and Poon]
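Feeding a recurrent network with 21 lags means turning a daily feature table into overlapping windows. The following sketch shows one way to build such windows; the helper name `make_lagged_windows` and the toy data are assumptions, and the study's actual preprocessing is not described in the quote.

```python
import numpy as np

def make_lagged_windows(features, target, n_lags=21):
    """Stack the previous n_lags rows of features as the input for each target day.

    Returns X with shape (samples, n_lags, n_features) and y with shape (samples,).
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    X = np.stack([features[t - n_lags : t] for t in range(n_lags, len(features))])
    y = target[n_lags:]
    return X, y

# Toy table (assumed): 30 days, 2 features per day; the target is one value per day
features = np.arange(60, dtype=float).reshape(30, 2)
target = np.arange(30, dtype=float)
X, y = make_lagged_windows(features, target)
```

Each window ends on the day before its target, so no information from the forecast day leaks into the inputs.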

“The volatility prediction task is of non-trivial complexity due to noise, market microstructure, heteroscedasticity, exogenous and asymmetric effect of news, and the presence of different time scales, among others…We studied and analyzed how neural networks can learn to capture the temporal structure of realized volatility. We…implement Long Short Term Memory (LSTM) and…Gated Recurrent Unit (GRU). Machine learning can approximate any linear and non-linear behavior and…learn data structure…We investigated the approach with LSTM and GRU types for realized volatility forecasting tasks and compared the predictive ability of neural networks with widely used EWMA, HAR, GARCH-family models… LSTM outperformed well-known models in this field, such as HAR-RV. Out-of-sample accuracy tests have shown that LSTM offers significant advantages in both types of markets.” [Antulov-Fantulin and Rodikov]
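Of the benchmarks named in the quote, EWMA is the simplest. A minimal sketch of an exponentially weighted variance forecast, together with a mean squared error helper for comparing forecasts, could look as follows. The decay of 0.94 is the conventional RiskMetrics value for daily data, the initialization with the first squared return is an assumption, and none of this reproduces the paper's exact setup.

```python
import numpy as np

def ewma_variance(returns, decay=0.94):
    """One-step-ahead EWMA variance forecasts.

    sigma2[t + 1] = decay * sigma2[t] + (1 - decay) * returns[t]**2
    """
    returns = np.asarray(returns, dtype=float)
    sigma2 = np.empty(len(returns) + 1)
    sigma2[0] = returns[0] ** 2  # assumed initialization
    for t in range(len(returns)):
        sigma2[t + 1] = decay * sigma2[t] + (1.0 - decay) * returns[t] ** 2
    return sigma2

def mean_squared_error(forecast, realized):
    """Average squared forecast error, a common out-of-sample accuracy measure."""
    forecast = np.asarray(forecast, dtype=float)
    realized = np.asarray(realized, dtype=float)
    return float(np.mean((forecast - realized) ** 2))

# Simulated daily returns (assumed data)
rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0, 0.01, size=500)
ewma_vol = np.sqrt(ewma_variance(daily_returns))
```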