Backtesting with modern statistical tools
Backtesting refers to calculations of theoretical profits and losses that would have arisen from applying an algorithmic trading strategy in the past. Its function is to assess the quality of a trading strategy in the future. Statistical programming has made backtesting easy. However, its computational power and convenience can also be corrosive to the investment process due to its tendency to sniff out temporary patterns, while data samples for cross-validation are limited. Moreover, the business of algorithmic trading strategies, unfortunately, provides strong incentives for overfitting models and embellishing backtests (view post here). Similarly, academic researchers in the field of trading factors often feel compelled to resort to data mining in order to produce publishable ‘significant’ empirical findings (view post here).
Good backtests require sound principles and integrity (view post here). Sound principles should include [1] formulating a logical economic theory upfront, [2] choosing sample data upfront, [3] keeping the model simple and intuitive, and [4] limiting try-outs when testing ideas. Realistic performance expectations of trading strategies should be based on a range of plausible versions of a strategy, not an optimized one. Bayesian inference works well for that approach, as it estimates both the performance parameters and their uncertainty. The most important principle of all is integrity: aiming to produce good research rather than good backtests and to communicate statistical findings honestly rather than selling them.
One of the greatest ills of classical market prediction models is exaggerated performance metrics that arise from choosing the model structure with hindsight. Even if backtests estimate model parameters sequentially and apply them strictly out of sample, the choice of hyperparameters is often made with full knowledge of the history of markets and economies. For example, the type of estimation, the functional form, and – most importantly – the set of considered features are often chosen with hindsight. This hindsight bias can be reduced by sequential hyperparameter tuning or ensemble methods.
- A data-driven process for tuning hyperparameters can partly endogenize model choice. In its simplest form, it involves three steps: model training, model validation, and method testing. This process [1] optimizes the parameters of a range of plausible candidate models (hyperparameters) based on a training data set, [2] chooses the best model according to some numerical criterion (such as accuracy or coefficient of determination) based on a separate validation data set, and [3] evaluates the success of the learning method, i.e. the combination of parameter estimation and model selection, by its ability to predict the targets of a further unrelated test set.
- An alternative is ensemble learning. Rather than choosing a single model, ensemble methods combine the decisions of multiple models to improve prediction performance. This combination is governed by a “meta-model”. For macro trading this means that the influence of base models is endogenized and data-dependent and -hence – the overall learning method can be simulated based on the data alone, reducing the hindsight bias from model choice.
Ensemble learning is particularly useful if one uses flexible models, whose estimates vary a lot with the training set because they mitigate these models’ tendency to memorize noise. There are two types of ensemble learning methods:
- Heterogeneous ensemble learning methods train different types of models on the same data set. First, each model makes its prediction. Then a meta-model aggregates the predictions of the individual models. Preferably the different models should have different “skills” or strengths. Examples of this approach include the voting classifier, averaging ensembles, and stacking.
- Homogeneous ensemble learning methods use the same type of model but are trained on different data. The methods include bootstrap aggregation (bagging), random forests, and popular boosting methods (Adaboost and gradient boosting). Homogeneous ensemble methods have been shown to produce predictive power for credit spread forecasts (view post here), switches between risk parity strategies (paper here), stock returns(paper here), and equity reward-risk timing (view post here).
The evaluation of a trading strategy typically relies on statistical metrics. Alas, many measures are incomplete and can be outrightly misleading. An interesting concept is the discriminant ratio (‘D-ratio’), which measures an algorithm’s success in improving risk-adjusted returns versus a related buy-and-hold portfolio (view post here).
For the development of algorithmic trading strategies, it can be highly beneficial to integrate transaction costs into the development process. For example, one can use a “portfolio machine learning method” to that end (forthcoming post).