The quality of a trading signal depends on its ability to predict future target returns and to generate material economic value when applied to positioning. Statistical metrics of these two properties are related but not identical. Empirical evidence must support both. Moreover, there are alternative criteria for predictive power and economic trading value, which are summarized in this post. The right choice depends on the characteristics of the trading signal and the objective of the strategy. Each strategy calls for a bespoke appropriate criterion function. This is particularly important for statistical learning that seeks to optimize hyperparameters of trading models and derive meaningful backtests.
This post is based on proprietary research of Macrosynergy and on a number of posts and papers that are linked next to the quotes below.
This post ties in with this site’s summary of quantitative methods for macro information efficiency.
The importance of measuring predictive power
An essential quality criterion of a trading signal is a reliable predictive relation to subsequent target returns. A trading signal here is a metric that principally governs the positions of a trading or investment strategy. Reliability here means reason and evidence that the relation will continue in the future. Reason requires a plausible theory, such as information advantage or payment of simplicity subsidies in markets. Evidence requires statistics that testify to the significance of the hypothesized relation in the past.
In the age of machine learning, measurement of reliable predictive power is particularly important, because it guides automated decisions based on statistical learning. Key choices that are guided by statistics are:
- whether a signal or a “family” of signals should be adopted for trading,
- which of several competing signals should be chosen, and
- which model should be employed to optimize the exact form of the signal.
Simply put, successful use of statistical learning for trading strategies requires an appropriate criterion for predictive power, which depends on the purpose of the trading strategy and the nature of the available data. For example, a strategy that takes medium-term positions of varying size, maybe based on risk premium estimates, calls for a signal that is linearly correlated with subsequent returns. By contrast, a strategy that takes short-term positions at constant volatility targets requires a signal that predicts subsequent return directions with sufficiently high accuracy.
The importance of measuring value creation
The value of a trading signal ultimately depends on its expected contribution to the PnL of an investor. There is typically a positive relation between predictive power and value generation, but it is not a straightforward one, and predictive power does not always guarantee a “good” PnL. In particular, value generation has to consider additional properties of the signal, such as its correlation with standard risk premia or its bias towards being simply “long” the market.
Hence, evidence of positive PnL contribution is an important complementary criterion for adopting a trading signal and the optimization of related models. Often it is useful to consider evidence of reliable predictive power as a necessary condition for using a trading signal and evidence of material value generation as a sufficient one. A good backtested PnL with poor predictive power points to accidental value generation, due maybe to a directional bias or a single large positive moment of the simulated PnL. Conversely, strong predictive power with a poor PnL often points to construction faults in the signal, such as unintended directional biases or extreme outliers.
The choice of the value criterion is particularly closely linked to the purpose of a strategy. For example, for strategies that are meant to diversify existing risk, it is important to adjust for systematic risk related to a benchmark. Indeed, some short-biased strategies with strong positive payoffs in market crises may be valuable in a broader context even if they do not produce positive PnLs by themselves.
Quality metrics for predictive power
In financial market research, data scientists typically distinguish between two basic types of quality criteria of predictions.
- The first focuses on the feature’s ability to predict the exact value of the target, typically relying, explicitly or implicitly, on the residuals of regression models.
- The second type of criteria focuses on the feature’s ability to distinguish periods of positive and negative returns, relying on confusion matrix values of binary classification.
Below we discuss these two, always from the angle of evaluation of the predictive power of a single signal. If that signal has been created by model estimation in the first place, then the evaluation is assumed to use a hold-out “test” sample distinct from the “training” data used in model estimation or hyperparameter optimization.
Residuals are simply differences between predicted values and realized values (or “ground truth” values) of the target returns. This means that this class of statistics depends critically on the size of the prediction errors. This has two major implications:
- Financial returns depend on many influences and are notoriously hard to predict exactly. Hence, the goodness of fit of trading signals is generally low compared to other fields of statistical analysis. Thresholds for accepting signals are usually in the low single-digit percentages of predicted variation, and the focus is on the significance of a relation rather than on ambitious targets for the share of explained return variation. One does not need to predict a great share of return variation to earn a great deal of money in trading.
- A few large unpredicted outliers in the target returns can have a great impact on the statistics. Residuals-based statistics heavily penalize a trading signal for failing to predict large, outsized target returns. This means that these statistics are more meaningful criteria if they are applied to signals and target positions that have been properly “risk-managed” in the first place, as opposed to raw returns. If the features and targets of the analysis are just rough proxies of the signals and positions that are actually traded, residuals-based statistics may be outright misleading.
Various residuals-based statistics accentuate or mitigate these traits and, hence, their usefulness depends on context:
- The most common residuals-based statistical criterion is the coefficient of determination or R-squared. It is simply the ratio of the explained variance of the target returns to total variance. Since variance here squares the variation of returns, outliers matter more than normal returns. For example, a “flash crash”, i.e., a big market drawdown with swift recovery the next day, outweighs many normal trading days in the assessment of the share of explained variance. Also, because the R-squared disregards other features of a relation, such as sample size and correlation with benchmarks, it is a relative metric for comparing signals, or the models that generate them, based on the same dataset. It is not designed to compare the success of trading signals across types of strategy or across different markets. Also, it does not come with a natural threshold based on which we can decide whether to trade a signal or not.
- The mean absolute error and related statistics mitigate the influence of outliers relative to the R-squared:
“Absolute Error, also known as L1 loss, is the absolute difference between a predicted value and the actual value. The aggregation of all loss values is called the cost function, where the cost function for Absolute Error is known as Mean Absolute Error… [It] is less sensitive towards outliers… and provides an even measure of how well the model performs.” [Jadon, Jadon, and Patil]
However, like the R-squared, the mean absolute error only allows comparing similar signals based on common data sets. There is no clear guidance on what error is acceptable for a trading signal.
- The lack of a clear decision threshold and of comparability for the R-squared argues for a focus on a valid metric of the significance of the feature-target relation. For single-market tests, the criterion can be t-statistics and related probabilities. For a strategy evaluated across multiple markets, the panel structure of the data must be respected, for example in the form of the Macrosynergy panel test [view related post here].
- Since the main significance tests of correlation or regression assume normality of distributions, one can consider as an alternative or second opinion the significance of non-parametric correlation, which does not depend on a specific distribution of target returns:
“For non-normal distributions (for data with extreme values, outliers), correlation coefficients should be calculated from the ranks of the data, not from their actual values. The coefficients designed for this purpose are Spearman’s rho and Kendall’s Tau. In fact, normality is essential for the calculation of the significance and confidence intervals, not the correlation coefficient itself. Kendall’s tau is an extension of Spearman’s rho. It should be used when the same rank is repeated too many times in a small dataset.” [Akoglu]
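To make the residuals-based statistics concrete, the sketch below computes the R-squared, mean absolute error, and Spearman rank correlation for a simulated signal with a deliberately weak predictive relation. The function name and the simulated data are purely illustrative:

```python
import numpy as np

def _ranks(x):
    """Ranks of the values of x (0-based, assuming no ties)."""
    return np.argsort(np.argsort(x))

def residual_metrics(signal, returns):
    """Fit metrics of a signal versus subsequent returns (illustrative sketch)."""
    # OLS fit of returns on the signal: polyfit returns [slope, intercept]
    beta, alpha = np.polyfit(signal, returns, 1)
    resid = returns - (alpha + beta * signal)
    r2 = 1 - np.sum(resid ** 2) / np.sum((returns - returns.mean()) ** 2)
    mae = np.mean(np.abs(resid))            # mean absolute error of residuals
    # Spearman rank correlation: Pearson correlation of the ranks
    rho = np.corrcoef(_ranks(signal), _ranks(returns))[0, 1]
    return r2, mae, rho

rng = np.random.default_rng(0)
sig = rng.normal(size=500)
ret = 0.05 * sig + rng.normal(size=500)     # weak predictive relation by design
r2, mae, rho = residual_metrics(sig, ret)
```

With a weak relation like this one, the R-squared typically lands in the low single digits of percent, consistent with the acceptance thresholds discussed above.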
Confusion matrix-based statistics
If one considers a trading signal mainly to predict positive versus negative expected returns, one can apply performance criteria of binary classification. These are based on a confusion matrix, i.e., the tabulated counts of true positive, false positive, true negative, and false negative classifications. For the evaluation of trading signals, several related statistics are useful:
- Accuracy gives the ratio of correctly classified return directions to all classified returns. Thus, it is the sum of true positives and true negatives divided by the total sum of classifications. This metric is intuitive and principally gives equal importance to positives and negatives. Despite its popularity, this metric can be very misleading for assessing trading signals: if the sample is unbalanced, it effectively gives greater weight to the classification of the majority class. For example, if target returns in a test sample have been mainly positive, any feature with a positive bias can produce accuracy above 50%, even if it has no predictive power.
- Balanced accuracy is the average ratio of correct positive and negative classifications. This metric always gives equal weight to the success of positive and negative predictions. It is more appropriate than accuracy if we require the signal to do equally well for long and short positions and do not consider it essential that it replicates the historic class bias of our training data set.
- If the correct prediction of either positive or negative returns is particularly important, precision or negative predictive value are valid performance criteria. Precision measures the accuracy of positive classifications alone. It is the ratio of true positives to all positive classifications and, thus, represents the confidence in a positive prediction when we predict a positive return. Similarly, negative predictive value is the accuracy of negative signals alone. One-sided accuracy metrics can be useful if the strategy’s purpose implies a low tolerance for false classifications on one side and less concern about missing out on classifying in this direction. For example, if an occasional negative signal would lead to expensive position liquidations of a long-biased strategy, a high negative predictive value is desirable for containing transaction costs.
- If it is important not to miss out on either positive or negative return periods, sensitivity (or recall) and specificity are important criteria. Sensitivity of a binary trading signal measures its success in predicting positive returns. It gives the ratio of true positives to all periods of positive returns. Analogously, specificity gives the ratio of true negatives to all periods of negative returns. For example, if a strategy aims to outperform the market in drawdown periods, high specificity is a criterion for its success.
- The F1 score is a “harmonic” metric of precision and recall. To be exact, it is two times the product of precision and recall, divided by the sum of precision and recall. Its value is between zero and one and penalizes both an excessively wide and an excessively narrow net for positive classification. Like balanced accuracy, it is a suitable criterion for unbalanced samples. For example, if there is an overwhelming majority of positive returns, it is easy to get high precision but hard to get high sensitivity. Moreover, and different from balanced accuracy, the F1 score also balances between getting positive predictions right and not missing out on positive returns. Therefore, the F1 score provides a fairly broad assessment of signal quality that suits many purposes. This score can be generalised to the Fβ score, where β can be chosen to weight precision over recall and vice versa.
- To obtain a quality assessment that is independent of zero or neutral value of the signal, one can use the Area Under the Curve (AUC) score. It is often referred to as AUC-ROC and represents the area under the Receiver Operating Characteristic (ROC) Curve. The ROC Curve is a plot of the performance of a binary classifier at different classification thresholds. It shows the true positive rate or sensitivity (the proportion of positive cases that are correctly classified) on the y-axis and the false positive rate (the proportion of negative cases that are mistakenly identified as positive) on the x-axis. Each point on the ROC Curve corresponds to a specific threshold applied to the classifier’s predicted probabilities, and the curve itself is generated by connecting these points. The AUC-ROC score is calculated by measuring the full area under the ROC Curve. It ranges from 0 to 1. An AUC-ROC of 1 is a perfect classifier, meaning it can perfectly separate the positive and negative instances, achieving a true positive rate of 1 and a false positive rate of 0. An AUC-ROC of 0.5 is a random classifier with no power of discrimination for the sample.
The AUC-ROC score is suitable for trading signals that apply different thresholds for setting the direction and size of positions. It is also suitable for assessing the predictive power of a signal whose neutral level is uncertain.
The AUC-ROC score is not always suitable for unbalanced data sets. For example, if the training data contain many positive return periods and only a few negatives, those few negatives play a disproportionately important role, since the false positive rate depends entirely on how they are classified. In the unbalanced scenario, the underlying ROC curve is pushed towards the top-left corner of the graph, meaning that the AUC-ROC remains useful for model comparison but is relatively uninformative as a standalone metric when data imbalance is present.
- The AUC-PR metric returns the area underneath a precision-sensitivity plot. This can be seen as an extension of the F1 score, since the curve plots precision and recall pairs for a range of classification thresholds. The area under that curve is a good summary of model performance irrespective of the threshold used. Typically, this is produced for the minority class in a classification problem, but the curve (and subsequently the metric) can be computed for either of the two classes. By focusing on the positive or negative samples directly, the issue of imbalance that arises in the use of an ROC curve is alleviated.
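The confusion-matrix statistics above can be sketched directly from counts of true and false classifications. The helper names and simulated data below are illustrative; the AUC-ROC is computed via the equivalent Mann-Whitney rank statistic, which assumes no tied scores:

```python
import numpy as np

def direction_metrics(signal, returns):
    """Confusion-matrix metrics for a binary directional signal (illustrative sketch)."""
    pred = signal > 0                       # predicted return direction
    true = returns > 0                      # realized return direction
    tp = np.sum(pred & true)                # true positives
    fp = np.sum(pred & ~true)               # false positives
    tn = np.sum(~pred & ~true)              # true negatives
    fn = np.sum(~pred & true)               # false negatives
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)            # recall: share of positives caught
    specificity = tn / (tn + fp)            # share of negatives caught
    balanced_accuracy = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)              # confidence in a positive prediction
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, balanced_accuracy, precision, f1

def auc_roc(score, positive):
    """AUC-ROC via the Mann-Whitney rank statistic: the probability that a
    randomly drawn positive case scores higher than a random negative case."""
    ranks = np.argsort(np.argsort(score)) + 1         # 1-based ranks, no ties
    n_pos, n_neg = positive.sum(), (~positive).sum()
    return (ranks[positive].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
sig = rng.normal(size=500)
ret = 0.2 * sig + rng.normal(size=500)      # signal with modest predictive power
acc, bacc, prec, f1 = direction_metrics(sig, ret)
auc = auc_roc(sig, ret > 0)
```

Note that the AUC here uses the continuous signal itself as the score, so no neutral threshold needs to be specified, in line with the discussion above.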
Quality metrics of PnL value creation
The basis of criteria for value generation is generic naïve profit and loss series (PnLs). These are time series of positive and negative payoffs arising from the simple application of the evaluated trading signal to positioning. In its simplest form, the trading factor values are multiplied by subsequent returns, subject to a rebalancing period, i.e., a period for which applied signals are not changed in order to avoid overtrading.
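A minimal sketch of such a naive PnL, assuming positions proportional to the signal and held fixed between rebalancing dates, might look as follows (the function name and rebalancing convention are illustrative):

```python
import numpy as np

def naive_pnl(signal, returns, rebal_freq=5):
    """Naive PnL: positions proportional to the signal, held fixed over a
    rebalancing period of rebal_freq periods to avoid overtrading (sketch)."""
    positions = np.array(signal, dtype=float)
    for t in range(len(positions)):
        if t % rebal_freq != 0:
            positions[t] = positions[t - 1]  # hold position between rebalancings
    # the position of period t earns the return of period t + 1
    return positions[:-1] * returns[1:]

# tiny demo: with a 3-period rebalancing, positions become [1, 1, 1, 4, 4, 4]
pnl = naive_pnl(np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]), np.ones(6), rebal_freq=3)
```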
As portfolio optimization often boils down to some form of mean-variance maximization, standard performance ratios of PnLs represent return-risk tradeoffs. Different versions of these ratios arise from different perspectives on risk and from differences in the desired profile of a strategy PnL.
The Sharpe ratio is the annualized excess return of a strategy, i.e., total return minus funding costs or risk-free rate, divided by its annual standard deviation.
“A general and key tenet in optimal asset allocation is to strike a balance between expected utility maximization and risk minimization. Traditionally, risk is measured by the variance of the returns… Sharpe ratio basically corresponds to the case where the utility function [of an investor] is linear [in expected return], and it penalises risk through the use of standard deviation…” [Ephrem and Nassar]
While the Sharpe ratio is a dominant convention in strategy evaluation, it does not optimize wealth accumulation. It can be misleading as a criterion for allocation if return fluctuations are not symmetrical or if uncertainty about returns is very low. The former overstates the relative value of strategies with outsized drawdowns, such as carry strategies, and the latter overstates the value of cash-like returns with tiny spreads to a risk-free rate, such as short-term high-grade bills:
“Standard deviation captures the fluctuations of a random variable around its mean regardless of whether these fluctuations are above or below the mean. In risk terms, the standard deviation penalises both the good returns and the bad! When the distribution of the returns is not symmetrical about the mean (whether that distribution is Gaussian or not doesn’t matter), the Sharpe ratio will again be misleading as a performance gauge of different strategies…
[Also] if [the] standard deviation [of returns] goes to zero, the returns on the proposed investment strategy become certain…. Although the return becomes certain as the standard deviation goes to zero, that return may be quite low. Since dividing any number by zero gives infinity, the use of the Sharpe ratio in the case of a very small standard deviation can be very misleading when comparing different strategies.” [Ephrem and Nassar]
A Sharpe ratio based on annualized daily returns can be biased to the low side if it is applied to a relative-value portfolio, or a portfolio with balanced long-short positions, whose components are traded in different time zones. In this case, the impact of global market shocks is measured for different positions on different days, adding to recorded return volatility. Other risk-adjusted performance ratios based on daily returns suffer from the same potential distortion.
The Sortino ratio is an alternative risk-adjusted performance ratio that divides excess returns by the annualized standard downside deviation alone. The latter considers only negative deviations from a minimum acceptable return and sets all positive return deviations to zero before calculating the root mean squared deviation.
Sortino ratios are particularly suitable if large upside deviations are an expected feature of a trading strategy. For example, strategies that trade escalating market shocks are expected to display occasional large returns, and the associated variation should not reduce performance ratios, as would happen in the case of the Sharpe ratio.
Moreover, unlike the Sharpe ratio, the Sortino ratio does not assume that returns are normally distributed. There is a broad range of strategies that violate normality by design, such as tail risk-hedging strategies that take occasional outsized risk or risk premium strategies whose risk taking varies greatly over time.
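The two ratios can be sketched side by side. The definitions below assume per-period returns and simple annualization by the square root of the number of periods; function names and defaults are illustrative:

```python
import numpy as np

def sharpe_ratio(returns, rf=0.0, periods=252):
    """Annualized Sharpe ratio from per-period excess returns (sketch; rf is
    an annual risk-free rate spread evenly across periods)."""
    excess = returns - rf / periods
    return np.sqrt(periods) * excess.mean() / excess.std()

def sortino_ratio(returns, mar=0.0, periods=252):
    """Annualized Sortino ratio: deviations above the minimum acceptable
    return (mar) are set to zero before the root mean square is taken."""
    excess = returns - mar / periods
    downside = np.minimum(excess, 0.0)      # keep only negative deviations
    return np.sqrt(periods) * excess.mean() / np.sqrt(np.mean(downside ** 2))

pnl = np.array([0.010, -0.010, 0.020, -0.005, 0.015])
```

For a return series with positive drift, the downside deviation is smaller than the full standard deviation, so the Sortino ratio exceeds the Sharpe ratio.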
The Calmar ratio divides the compounded annual growth rate of a portfolio by its maximum drawdown. Maximum drawdown here denotes the largest peak-to-trough percent decline in the portfolio’s mark-to-market value over the sample period. This ratio helps judge if returns compensate appropriately for drawdown risk.
“The maximum cumulative loss from a peak to a following bottom, commonly denoted the maximum drawdown, is a measure of how sustained one’s losses can be. Large drawdowns usually lead to fund redemptions, and so the maximum drawdown is the risk measure of choice for many money management professionals.” [Atiya and Malik]
Since drawdowns often govern risk management and capital allocation to trading strategies, this ratio helps assess the survival probability of a trading signal in an institutional trading environment.
An important drawback of the Calmar ratio is that, unlike the Sharpe ratio, it does not scale with the length of time series. Maximum drawdowns of long data series tend to be larger than for short data series. Hence, the Calmar ratio can only support decisions for strategies of equal sample lengths. Some academic work has been done on “normalized Calmar ratios” of portfolios over different time horizons, but these metrics are not (yet) widely accepted.
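A sketch of the maximum drawdown and a simplified Calmar ratio follows. For brevity it uses additive annualized returns rather than the compounded annual growth rate of the textbook definition:

```python
import numpy as np

def max_drawdown(cum_pnl):
    """Largest peak-to-trough decline of a cumulative PnL series."""
    running_peak = np.maximum.accumulate(cum_pnl)
    return np.max(running_peak - cum_pnl)

def calmar_ratio(returns, periods=252):
    """Annualized return over maximum drawdown (sketch; additive returns are
    used instead of a compounded annual growth rate for simplicity)."""
    cum = np.cumsum(returns)
    return periods * returns.mean() / max_drawdown(cum)
```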
The Omega ratio is calculated as the probability-weighted gains above a threshold divided by the probability-weighted losses below that threshold. As such, the ratio considers the entire distribution of past returns. The threshold is typically the risk-free rate or a minimum acceptable return. If the ratio is above 1, history suggests that the odds of a positive performance are skewed favourably.
“The Sharpe ratio, assumes that the standard deviation of the return distribution provides the full description of risk. However, risk averse investors tend to strongly dislike negative returns and large drawdowns. They would even prefer to partly sacrifice positive returns in order to avoid negative ones… The Sortino ratio has been advocated in order to capture the asymmetry of the return distribution. It replaces the standard deviation in the Sharpe ratio by the downside deviation which captures only the downside risk. However, higher moments are incorporated only implicitly.
The Omega measure… incorporates all the moments of the distribution as it is a direct transformation of it. This measure splits the return universe into two sub-parts according to a threshold. The ‘good’ returns are above this threshold and the ‘bad’ returns below. Very simply put, the Omega measure is defined as the ratio of the gain with respect to the threshold and the loss with respect to the same threshold.
The Omega function is defined by varying the threshold… The evaluation of an investment with the Omega function should be considered for thresholds between 0% and the risk free rate. Intuitively, this type of threshold corresponds to the notion of capital protection already advocated.
Besides incorporating all the moments, the Omega function has two interesting properties. Firstly, when the threshold is set to the mean of the distribution, the Omega measure is equal to one. Secondly, whatever the threshold is, all investments may be ranked. In the context of the Sharpe ratio, the ranking is almost impossible for negative ratios.” [Bacmann and Scholz]
Like the Sortino ratio, and unlike the Sharpe ratio, the Omega ratio does not rely on a normal distribution of the strategy returns. Unlike the Sortino ratio, it considers higher-order moments of the return distribution. Hence, it is most suitable for signals that lead to unusual but characteristic distributions, maybe as a consequence of concentrated risk-taking.
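A minimal sketch of the Omega ratio from an empirical return sample is shown below. It also illustrates the property noted in the quote above: setting the threshold to the sample mean yields an Omega of exactly one:

```python
import numpy as np

def omega_ratio(returns, threshold=0.0):
    """Omega ratio: sum of gains above the threshold over the sum of losses
    below it, taken directly from the empirical return distribution (sketch)."""
    excess = returns - threshold
    gains = excess[excess > 0].sum()
    losses = -excess[excess < 0].sum()
    return gains / losses

pnl = np.array([0.010, -0.010, 0.020, -0.005, 0.015])
```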
Kappa refers to a class of performance ratios that divide excess returns by higher order deviations, considering skewness and kurtosis. These ratios may be helpful if strategy returns are evidently non-normal and negatively skewed or fat tails are a concern.
“Kappa [is] a generalized risk-adjusted performance measure… The Omega and the Sortino ratio each represent a single case of [this] more generalized… measure… In certain circumstances, other Kappa variants may be more appropriate or provide more powerful insights.
The lower partial moment required to calculate Kappa can be estimated from a sample of actual returns by treating the sample observations as points in a discrete return distribution… Values for the first four moments of a return distribution are sufficient in many cases to enable a robust estimation of Kappa: it is not necessary to know the individual data points in the distribution
The ranking of a given investment alternative can change according to the Kappa variant chosen, due in part to differences among the variants in their sensitivity to skewness and kurtosis. The choice of one Kappa variant over another will therefore materially affect the user’s evaluation of competing investment alternatives, as well as the composition of any portfolio optimized to maximize the value of Kappa at some return threshold.” [Kaplan and Knowles]
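A sketch of the Kappa family follows, assuming the generalized form based on an n-th lower partial moment: n=1 relates to the Omega ratio (Omega equals Kappa_1 plus one) and n=2 recovers the un-annualized Sortino ratio:

```python
import numpy as np

def kappa(returns, threshold=0.0, n=2):
    """Kappa_n in the sense of Kaplan and Knowles: excess return over the
    n-th root of the n-th lower partial moment (illustrative sketch)."""
    lpm = np.mean(np.maximum(threshold - returns, 0.0) ** n)
    return (returns.mean() - threshold) / lpm ** (1.0 / n)

pnl = np.array([0.010, -0.010, 0.020, -0.005, 0.015])
```

Raising n increases the weight of large shortfalls, which is why different Kappa variants can rank the same strategies differently, as the quote notes.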
The tail ratio seeks to estimate the probability of extreme positive returns relative to extreme negative returns. It is a useful performance criterion for trading factors that take concentrated risk, for example, based on rare economic or market events.
Mathematically, the tail ratio compares the return level that is exceeded in a high quantile, such as the 95th percentile, with the absolute value of a return that is exceeded in a commensurate low quantile, such as the 5th percentile. A tail ratio above 1 indicates a historical tendency for extreme positive returns exceeding extreme negative returns.
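This calculation is essentially a one-liner; the 95/5 percentile pair below is a common but illustrative choice:

```python
import numpy as np

def tail_ratio(returns, q=95):
    """Tail ratio: the q-th percentile of returns over the absolute value of
    the (100 - q)-th percentile (sketch)."""
    return np.percentile(returns, q) / abs(np.percentile(returns, 100 - q))
```

For a symmetric return distribution the ratio is one; values above one indicate fatter right tails than left tails in the sample.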
The Treynor ratio adjusts excess returns only for systematic market risk. Thus, it is defined as the ratio of the excess return of a strategy, over and above a benchmark rate, divided by the “beta” of the strategy with respect to a benchmark index, such as the global equity index. The beta is the sensitivity of the strategy to the specified index. As such, the Treynor ratio measures the excess return per unit of benchmark risk.
“Treynor’s original interpretation of the ratio [is] abnormal excess return (Jensen’s alpha) to systematic risk exposure (the beta)… The Treynor ratio provides additional information [over and above] Jensen’s alpha: two securities with different risk levels that provide the same excess returns over the same period will have the same alpha but will differ with respect to the Treynor ratio. The improvement comes from the fact that the Treynor ratio provides the performance of the portfolio per unit of systematic risk…
The generalized Treynor ratio [using multiple benchmarks] is defined as the abnormal return of a portfolio per unit of weighted average systematic risk, the weight of each risk loading being the value of the corresponding risk premium.” [Huebner]
This ratio is most suitable for strategies that aim at reducing systematic market risk or focus on pure alpha.
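A single-benchmark sketch of the Treynor ratio, with the beta estimated from the return covariance, might look as follows (the function name and zero risk-free rate default are illustrative):

```python
import numpy as np

def treynor_ratio(strategy, benchmark, rf=0.0, periods=252):
    """Annualized excess return per unit of benchmark beta (sketch with a
    single benchmark; the generalized version would use several)."""
    excess = strategy - rf / periods
    bench_excess = benchmark - rf / periods
    # beta: covariance of strategy and benchmark over benchmark variance
    beta = np.cov(excess, bench_excess, ddof=0)[0, 1] / np.var(bench_excess)
    return periods * excess.mean() / beta

bench = np.array([0.010, -0.005, 0.020, 0.000, 0.010])
```

A strategy that is simply twice the benchmark has a beta of two, so its Treynor ratio collapses to the benchmark's own annualized return, illustrating the "per unit of systematic risk" interpretation.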
None of the above standard ratios directly considers the seasonality of strategy returns (albeit the Calmar ratio considers it indirectly). High seasonality here means that positive PnL generation is concentrated on a small number of periods. In some cases, high seasonality may be acceptable and expected, as for crisis escalation strategies, but often that feature is undesirable and undermines the sustainability of a trading factor on its own.
The concept of consistency-weighted returns penalizes aggregate returns by the deviation of their path from a steady upward drift:
“The consistency-weighted return offers an analytical approach to gauge the proximity of a trading strategy to [a steady linear value generation]… We aim to understand the linearity of [strategy returns through] linear regression. [We] fit a line [and calculate] the R-squared [of a linear function in time]. An R-squared value of 1 denotes a perfect linear relationship… A value near 0 suggests a lack of linearity, indicating that the data points are dispersed widely. Consistency-weighted returns are R-squared times annualized returns.” [Barredo Lago]
The above concept encompasses the measurement of “returns stability”, which is often measured as the inverse of the standard deviation of the difference between cumulative returns and expected cumulative returns from a linear trend.
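The quoted concept can be sketched by fitting a linear time trend to the cumulative PnL and scaling the annualized return by the trend's R-squared (the function name is illustrative):

```python
import numpy as np

def consistency_weighted_return(returns, periods=252):
    """Annualized return scaled by the R-squared of a linear time trend
    fitted to the cumulative PnL (sketch of the quoted concept)."""
    cum = np.cumsum(returns)
    t = np.arange(len(cum))
    slope, intercept = np.polyfit(t, cum, 1)    # linear trend of the PnL path
    resid = cum - (intercept + slope * t)
    r2 = 1 - np.sum(resid ** 2) / np.sum((cum - cum.mean()) ** 2)
    return r2 * periods * returns.mean()
```

A perfectly steady PnL path has an R-squared of one and keeps its full annualized return, while a choppy or seasonal path is penalized in proportion.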
How to combine metrics
If several performance criteria matter for signal evaluation, one needs to devise a criterion function to combine them. All the above criteria can be seen as potential arguments for such a function. The appropriate criterion function will typically be bespoke for the type of strategy and the set of traded contracts considered:
- A simple approach would be to scale and average several criteria. Averaging is crude, however, and scaling is a challenge: performance metrics for a strategy type do not come in large samples and have no natural standard deviations. Hence, this approach would normally use logical and plausible ranges for normalization, which are subject to judgment.
- Another approach is to rank signals by various criteria and then average the ranks. This is more straightforward than parametric averaging but ignores distances in performance. Moreover, relative ranking is not generally suitable for setting minimum thresholds for a signal quality, unless one has a minimum-performance strategy to rank against.
- Often there is a logical form of criterion function, given the specific purpose of a strategy. One approach is to principally optimize a single performance criterion, but subject to other criteria passing threshold values. This approach distinguishes between necessary and sufficient conditions of a good rating. For example, the criterion to optimize may be a Sortino ratio, but its qualification may be subject to predictive power indicated by a significance test and above-50% balanced accuracy. The latter two conditions can be transformed into a dummy that is set to one only if both are met and that is multiplied by the Sortino ratio.
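Such a criterion function can be sketched in a few lines; the threshold values below are purely illustrative:

```python
def constrained_criterion(sortino, significance, balanced_accuracy,
                          sig_threshold=0.95, bacc_threshold=0.5):
    """Optimize one performance ratio subject to qualifying conditions: the
    Sortino ratio counts only if the predictive relation is significant and
    balanced accuracy is above 50% (sketch with illustrative thresholds)."""
    qualifies = significance >= sig_threshold and balanced_accuracy > bacc_threshold
    return sortino * float(qualifies)    # dummy of one or zero times the ratio
```

A signal that fails either necessary condition receives a criterion value of zero, regardless of how attractive its Sortino ratio looks in isolation.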