### Backtesting with modern statistical tools

Backtesting refers to the calculation of theoretical profits and losses that would have arisen from applying an algorithmic trading strategy in the past. Its purpose is to assess the quality of a trading strategy going forward. Statistical programming has made backtesting easy. However, its __computational power and convenience can also be corrosive to the investment process due to its tendency to sniff out temporary patterns__, while data samples for cross-validation are limited. Moreover, the business of algorithmic trading strategies unfortunately provides strong incentives for overfitting models and embellishing backtests (view post here). Similarly, academic researchers in the field of trading factors often feel compelled to resort to data mining in order to produce publishable ‘significant’ empirical findings (view post here).

__Good backtests require sound principles and integrity__ (view post here). Sound principles should include [1] formulating a logical economic theory upfront, [2] choosing sample data upfront, [3] keeping the model simple and intuitive, and [4] limiting try-outs when testing ideas. Realistic performance expectations of trading strategies should be based on a range of plausible versions of a strategy, not an optimized one. Bayesian inference works well for that approach, as it estimates both the performance parameters and their uncertainty. The most important principle of all is integrity: aiming to produce good research rather than good backtests and to communicate statistical findings honestly rather than selling them.
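The Bayesian approach above can be illustrated with a minimal sketch. Assuming simulated daily returns and a conjugate normal-normal model for the mean return (with the volatility fixed at its sample estimate), the posterior translates into a distribution of plausible Sharpe ratios rather than a single optimized number; all figures and parameter values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily returns of one plausible version of a strategy (simulated)
returns = rng.normal(loc=0.0004, scale=0.01, size=1000)

# Conjugate normal-normal model for the mean daily return:
# prior mu ~ N(0, tau^2); likelihood r_t ~ N(mu, sigma^2), sigma fixed at sample std
sigma = returns.std(ddof=1)
tau = 0.001                      # weakly informative prior on the daily mean
n = len(returns)

post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
post_mean = post_var * (returns.sum() / sigma**2)

# Posterior draws of the mean imply a distribution of annualized Sharpe ratios
draws = rng.normal(post_mean, np.sqrt(post_var), size=10_000)
sharpe_draws = draws / sigma * np.sqrt(252)

lo, hi = np.percentile(sharpe_draws, [5, 95])
print(f"posterior mean Sharpe: {sharpe_draws.mean():.2f}")
print(f"90% credible interval: [{lo:.2f}, {hi:.2f}]")
```

Reporting the credible interval rather than the point estimate sets performance expectations across a range of plausible strategy versions, in line with the principle stated above.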

__One of the greatest ills of classical market prediction models is exaggerated performance metrics that arise from choosing the model structure with hindsight__. Even if backtests estimate model parameters sequentially and apply them strictly out of sample, the choice of hyperparameters is often made with full knowledge of the history of markets and economies. For example, the type of estimation, the functional form, and – most importantly – the set of considered features are often chosen with hindsight. This hindsight bias can be reduced by sequential hyperparameter tuning or ensemble methods.

- A **data-driven process for tuning hyperparameters** can partly endogenize model choice. In its simplest form, it __involves three steps: model training, model validation, and method testing__. This process [1] optimizes the parameters of a range of plausible candidate models (hyperparameters) based on a training data set, [2] chooses the best model according to some numerical criterion (such as accuracy or coefficient of determination) based on a separate validation data set, and [3] evaluates the success of the learning method, i.e. the combination of parameter estimation and model selection, by its ability to predict the targets of a further unrelated test set.
- An alternative is **ensemble learning**. Rather than choosing a single model, __ensemble methods combine the decisions of multiple models to improve prediction performance. This combination is governed by a “meta-model”__. For macro trading, this means that the influence of base models is endogenized and data-dependent and – hence – the overall learning method can be simulated based on the data alone, reducing the hindsight bias of model choice.

Ensemble learning is particularly useful for flexible models, whose estimates vary a lot with the training set, because it mitigates these models’ tendency to memorize noise. There are two types of ensemble learning methods:
**Heterogeneous ensemble learning** methods train different types of models on the same data set. First, each model makes its prediction. Then a meta-model aggregates the predictions of the individual models. Preferably the different models should have different “skills” or strengths. Examples of this approach include the voting classifier, averaging ensembles, and stacking.
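Two of the examples named above, voting and stacking, can be sketched with scikit-learn on simulated classification data; the base models and data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier, StackingClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Heterogeneous base models with different "skills"
base = [
    ("logit", LogisticRegression()),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("nb", GaussianNB()),
]

# Soft voting averages the predicted probabilities of the base models
voter = VotingClassifier(estimators=base, voting="soft").fit(X[:300], y[:300])

# Stacking trains a meta-model on the base models' cross-validated predictions
stacker = StackingClassifier(
    estimators=base, final_estimator=LogisticRegression()
).fit(X[:300], y[:300])

print("voting accuracy: ", voter.score(X[300:], y[300:]))
print("stacking accuracy:", stacker.score(X[300:], y[300:]))
```

The stacking meta-model here is the “meta-model” mentioned above: it learns how much weight each base model deserves from the data itself.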
**Homogeneous ensemble learning** methods use the same type of model but are trained on different data. The methods include bootstrap aggregation (bagging), random forests, and popular boosting methods (Adaboost and gradient boosting). Homogeneous ensemble methods have been shown to produce predictive power for credit spread forecasts (view post here), switches between risk parity strategies (paper here), stock returns (paper here), and equity reward-risk timing (view post here).
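The three homogeneous methods named above can be compared side by side in a short scikit-learn sketch; the regression data are simulated and the hyperparameters are arbitrary illustrations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.3, size=500)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

# All three combine the same base learner type, each copy trained on different data
models = {
    # bagging: trees fit on bootstrap resamples of the training set
    "bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                random_state=0),
    # random forest: bagging plus random feature subsets at each split
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    # gradient boosting: trees fit sequentially on predecessors' residuals
    "boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    score = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test R^2 = {score:.2f}")
```

Bagging and forests reduce variance by averaging over resamples, while boosting reduces bias by sequential error correction, which is why they behave differently on noisy versus structured data.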

The evaluation of a trading strategy typically relies on statistical metrics. Alas, many measures are incomplete and can be outright misleading. An interesting concept is the discriminant ratio (‘D-ratio’), which measures an algorithm’s success in improving risk-adjusted returns versus a related buy-and-hold portfolio (view post here).

For the development of algorithmic trading strategies, it can be highly beneficial to integrate transaction costs into the development process. For example, one can use a “portfolio machine learning method” to that end (forthcoming post).