
Regression-based macro trading signals


Regression is one method for combining macro indicators into a single trading signal. Specifically, statistical learning based on regression can optimize model parameters and hyperparameters sequentially and produce signals based on whatever model has predicted returns best up to a point in time. This method learns from growing datasets and produces valid point-in-time signals for backtesting. However, whether regression delivers good signals depends on managing the bias-variance trade-off of machine learning. This post provides guidance on pre-selecting the right regression models and hyperparameter grids based on theory and empirical evidence. It considers the advantages and disadvantages of various regression methods, including non-negative least squares, elastic net, weighted least squares, least absolute deviations, and nearest neighbors.

A Jupyter notebook for audit and replication of the research results can be downloaded here. The notebook operation requires access to J.P. Morgan DataQuery to download data from JPMaQS, a premium service of quantamental indicators. J.P. Morgan offers free trials for institutional clients.
Also, there is an academic research support program that sponsors data sets for relevant projects.

The post below is based on proprietary research of Macrosynergy Ltd. It ties in with this site’s summary of quantitative methods for macro information efficiency.

The basics of regression-based trading signals

A regression-based trading signal is a modified point-in-time regression forecast of returns. A regression model can jointly consider several explanatory variables (henceforth called “features”) and assign effective weights based on their past relations to financial returns (“targets”). The construction of point-in-time regression-based forecasts relies on a statistical learning process that generally involves three operations:

  1. the sequential choice of an optimal regression model based on past predictive performance,
  2. a point-in-time estimation of its coefficients and
  3. the prediction of future returns based on that model.

Regression allows learning from the data under clear theoretical restrictions, such as linearity or non-negativity of coefficients. Regression-based statistical learning is implemented in the mainstream scikit-learn package. The sequential application of historical data to regression-based learning allows for generating trading signals that are a realistic basis for backtests.

However, regression-based signal generation has its limitations, particularly in the field of macro trading. Financial returns are notoriously hard to explain, and data on the relationship between critical macroeconomic developments and modern financial market returns are scant, constrained by the limited number of business cycles, policy changes, and financial crises on record. This means that the bias-variance trade-off for statistical learning is a critical consideration and that regression-based signals are not always superior to theory-guided or equally weighted combinations of features.

A practical way to use regression for trading signals

First, regression-based signal generation needs point-in-time information. Therefore, we take all macroeconomic indicators from the J.P. Morgan Macrosynergy Quantamental System (“JPMaQS”). Its quantamental indicators are real-time information states of the market and the public with respect to an economic concept and, hence, are suitable for testing relations with subsequent returns and backtesting related trading strategies. JPMaQS also provides a broad range of daily generic returns across asset classes and types of contracts.

Based on point-in-time data, the generation of regression-based signals is implemented in Python and scikit-learn or related wrapper functions (as shown in the linked Jupyter notebook) according to the following steps:

  1. Preparation of data panels of plausible features and targets: For macro strategies, panels are two-dimensional datasets that contain one type of indicator (category) across time and the relevant countries or currency areas. All transformations of original JPMaQS data categories can be done on a panel basis using the Macrosynergy package’s helper functions (view documentation here). These operate on a standard working daily format of panels of economic and market indicators.
  2. Transformation of standard daily panels into a scikit-learn-friendly format: In the example notebook, the panels are transformed from a long format into one double-indexed pandas data frame for all features (X) and a double-indexed pandas series of targets (y). The index dimensions are cross-section (currency area) and time. Also, the data typically need to be downsampled to a lower frequency suitable for macro indicator analysis, for example monthly, using the latest value of the features and the sum of the targets within each period. Finally, the features are lagged by one period. A simplified code sketch of this and the subsequent steps follows this list.
    Note that the use of double-indexed (panel) data frames necessitates the use of special wrapper functions to apply the panel data to standard scikit-learn machine-learning pipelines.
  3. Specification of the optimization criterion and cross-validation method for the machine learning process: The linked Jupyter notebook uses the standard R2 score for evaluating the performance of regression models. Cross-validation methods for panels have been summarized in a previous post (Optimizing macro trading signals – A practical introduction). Here, we use the training data splitting of the RollingKFoldPanelSplit class, which instantiates splitters in which temporally adjacent panel training sets of a fixed joint maximum time span can border the test set from the past and the future. Thus, most folds do not respect chronological order but allow training with both past and future information. This is a non-standard approach for time series but perfectly valid in the context of financial markets. For example, it effectively asks whether a model trained with data from the 2010s and 2020s would have predicted relations in the 2000s.
  4. Definition of candidate models for the machine learning pipeline: These are collected in two Python dictionaries of regression models and their hyperparameter grids. These can then be passed on to the appropriate scikit-learn classes and methods or their Macrosynergy package wrappers. Standard regression options and their implications are discussed below.
  5. Sequential regression model optimization and model-based predictions: In order to generate a time series of optimal regression model versions and related predictions, we use the SignalOptimizer class of the Macrosynergy package, which allows us to apply standard scikit-learn principles to quantamental data panels. At each point in time, the learning process selects optimal hyperparameters based on the specified cross-validation and uses the resulting model to derive predictions. This process implies three sources of regression signal variation: changes in feature values, changes in model parameters, and changes in model version (hyperparameters).
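
For concreteness, below is a minimal sketch of steps 2 to 5 using only pandas and scikit-learn. It deliberately simplifies the actual notebook: a plain KFold stands in for the panel-aware splitters, an expanding-window loop replaces the SignalOptimizer class, and the long-format frame `panel` as well as the category names FEATURE_1, FEATURE_2 and TARGET_RETURN are hypothetical placeholders.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

# Step 2: long-format panel -> double-indexed features (X) and targets (y).
# `panel` is a hypothetical long-format frame with columns:
# cid (cross-section), real_date, xcat (category name), value.
wide = panel.pivot_table(index=["cid", "real_date"], columns="xcat", values="value")

# Downsample to month-end: last value for features, sum for the target return.
monthly = wide.groupby(
    [pd.Grouper(level="cid"), pd.Grouper(level="real_date", freq="M")]
).agg({"FEATURE_1": "last", "FEATURE_2": "last", "TARGET_RETURN": "sum"})

# Lag features by one period within each cross-section.
X = monthly[["FEATURE_1", "FEATURE_2"]].groupby(level="cid").shift(1).dropna()
y = monthly.loc[X.index, "TARGET_RETURN"]

# Step 4: candidate models and hyperparameter grids.
models = {"ols": LinearRegression()}
grids = {"ols": {"fit_intercept": [True, False]}}

# Steps 3 and 5: sequential re-optimization and prediction on an expanding window.
dates = sorted(y.index.get_level_values("real_date").unique())
signals = {}
for t in dates[36:]:  # require roughly three years of monthly history
    past = y.index.get_level_values("real_date") < t
    search = GridSearchCV(models["ols"], grids["ols"], scoring="r2", cv=KFold(5))
    search.fit(X[past], y[past])
    current = y.index.get_level_values("real_date") == t
    signals[t] = pd.Series(search.predict(X[current]), index=X[current].index)
```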

The resultant regression signals can then be evaluated as to their predictive power for subsequent returns (effective test sets), their accuracy for predicting the direction of return, and their implied PnL generation for a simple trading rule.

The bias-variance trade-off

There are many different types of regression and related hyperparameters. A critical decision is which model versions to offer to the learning pipeline. In principle, one could optimize model selection over a vast array of options. However, doing so without prior theory-based pre-selection bears the risk of high model instability. This reflects that the well-known bias-variance trade-off of machine learning models is particularly important for macro trading strategies.

  • In machine learning, bias refers to the inability of a machine learning method to accommodate a “true” relation fully. The more restrictive a model, the greater the expected bias. For example, the imposition of a linear relation between features and targets increases estimation errors if the true relation is non-linear.
  • Variance in machine learning refers to differences in model predictions that arise from training on different subsets of the data. A model with low variance generalizes well to unseen data. Alas, the more flexible an estimation method, the more it “hugs the data” and the more the model parameters rely on the specifics of the training set. Consequently, flexibility tends to increase variance. Excessive model flexibility is called “overfitting”.

Given the scarcity and seasonality of macro events and regimes, statistical learning for macro trading signals faces a “steep” bias-variance trade-off: this means that model flexibility comes at a high price of enhanced variance. Overfitting does not just produce poor forecasts but also enhances the seasonality of strategy performance.

Conversely, good theoretical priors that allow restrictions reduce variance at limited cost in terms of bias. And reduced variance increases the accuracy and consistency of predictions. Hence, the main body of the research below provides some theoretical guidance for choosing a specific regression model and tests empirically how the choice of candidate models has affected resultant trading signal performance.

Three example strategies for evaluation

The empirical analysis below uses regression signals to apply candidate features in three different asset classes and related macro trading strategies. The basic ideas behind these strategies have been published previously, and all features used have a natural neutral level at zero. However, unlike in the original versions of these strategies, we have added “speculative” features for the below analyses. These features have weak theoretical backing, and their inclusion simulates the usage of inferior predictors in the signal-generating process.

  • Rates strategy: This strategy seeks to trade 5-year interest rate swaps across 22 countries in accordance with excess GDP growth trends, excess CPI inflation, and excess private credit growth, all of which are presumed to predict long-duration returns negatively. The original version of this strategy has been described in a previous post. Here, we added PPI inflation, in excess of the effective inflation target, and industrial production growth, in excess of the 5-year mean, as speculative signal candidates, with presumed negative effects.
  • Equity strategy: This strategy takes positions in equity index futures in 8 countries based on labor market tightness (presumed negative effect), excess inflation (presumed negative effect), and index return momentum (presumed positive effect). This is loosely based on an original strategy described in a previous post. Again, we added excess PPI inflation and industrial production growth as speculative signal candidates with presumed negative effects.
  • FX strategy: This strategy trades FX forwards across x exchange rates based on macro features of the base currency. The features include changes in external balance ratios, relative GDP growth trends, manufacturing survey score increases, and terms-of-trade improvements, all of which are presumed to have positive predictive power with respect to returns on long positions in the base currency. The original version of this strategy has been described in a previous post. We added excess PPI inflation and industrial production growth as speculative signal candidates with presumed positive effects.

The initial benchmark for applying regression-based learning to all strategies is “conceptual parity,” a simple unweighted composite score of the normalized candidate categories, each multiplied by the sign of its presumed directional impact. This is a high hurdle. If feature candidates are based on sound reason and logic, diversified “conceptual parity” often produces strong out-of-sample signals.
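
For reference, the following is a minimal sketch of how such a conceptual parity score could be computed, assuming a double-indexed feature frame as in the earlier pipeline sketch; the sign dictionary is hypothetical, and full-sample normalization is used here instead of the point-in-time z-scoring that a strict backtest would require.

```python
import pandas as pd

def conceptual_parity(X: pd.DataFrame, signs: dict) -> pd.Series:
    """Unweighted composite of normalized features, each multiplied by the
    sign of its presumed directional impact.

    X     : double-indexed (cid, real_date) frame of feature categories
    signs : presumed direction of impact per category, e.g. {"FEATURE_1": -1}
    """
    zscores = (X - X.mean()) / X.std()      # normalize each category
    signed = zscores * pd.Series(signs)     # flip presumed-negative features
    return signed.mean(axis=1)              # simple unweighted average

# Example: parity = conceptual_parity(X, {"FEATURE_1": -1, "FEATURE_2": 1})
```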

In the sections below, we compare regression-based signals with conceptual parity and across regression methods from 2000 to 2024. Since regression signals need a few years of data to start, the actual comparison period is 2003/04 to 2024. Three main criteria for signal comparison are the following:

  • Correlation coefficients of the Pearson and Kendall (non-parametric) type of the relation between month-end signals and next month’s target returns across the data panel. The panel analysis considers both intertemporal and cross-section covariances.
  • Accuracy and balanced accuracy of month-end signal-based predictions of the direction of next month’s returns. Accuracy measures the ratio of correctly predicted directions to all predictions, and balanced accuracy measures the average of the ratios of correctly detected positive returns and correctly detected negative returns.
  • Sharpe and Sortino ratios of naïve PnLs. A naïve PnL is a trading strategy that, at the end of the first trading day of each month, takes a position in each contract of the strategy in accordance with the signal at the end of the previous month. The signal is the regression prediction (or conceptual parity score), normalized and winsorized (capped or floored) at 2 standard deviations. The naïve PnL does not consider transaction costs or risk management rules other than the signal winsorization. A minimal code sketch of this calculation follows below.
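
The sketch below illustrates the naïve PnL rule under simplifying assumptions: signals and subsequent returns are aligned, double-indexed pandas series, and the signal is normalized on the full sample rather than point-in-time.

```python
import pandas as pd

def naive_pnl(signal: pd.Series, next_month_return: pd.Series, cap: float = 2.0) -> pd.Series:
    """Positions proportional to the previous month-end signal, winsorized at 2 sd.

    signal            : month-end signals, double-indexed by (cid, real_date)
    next_month_return : return of the following month for the same index
    cap               : winsorization threshold in standard deviations
    """
    z = (signal - signal.mean()) / signal.std()   # normalize the signal
    z = z.clip(lower=-cap, upper=cap)             # winsorize at +/- 2 sd
    pnl = z * next_month_return                   # position times subsequent return
    return pnl.groupby(level="real_date").sum()   # aggregate across contracts

# Annualized Sharpe ratio of the monthly naive PnL (no transaction costs):
# pnl = naive_pnl(signal, next_month_return); sharpe = pnl.mean() / pnl.std() * 12 ** 0.5
```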

We also display the market correlation of strategies based on the various signals, whereby the market benchmark for the duration strategy is the 10-year U.S. treasury bond return, and the benchmark for the FX and equity strategy is the S&P 500 index future return.

OLS-based signal versus conceptual parity

We test the consequences of using a learning process with standard ordinary least squares (OLS) regression for condensing the information of multiple candidate features, compared to conceptual parity. The only important hyperparameter to optimize over is the inclusion of an intercept in the regression. Although all features have a theoretical neutral level at zero, an intercept would correct for any errors in that underlying assumption. Yet the price of this correction is that past long-term seasons of positive or negative target returns translate into sizable intercepts and a future directional bias of the regression signal.

If we employ regression estimates to predict targets, features are essentially assigned weights based on their empirical regression coefficients. These coefficients measure the expected change in the target for a one-unit change in the feature, holding all other features constant. Therefore, OLS effectively assigns weights to features based on their historical individual explanatory power.
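
In terms of the earlier pipeline sketch, the regression signal at a given rebalancing date is exactly such a coefficient-weighted sum of the latest (lagged) feature values; the names below (`search`, `X`, `current`) are taken from that hypothetical sketch.

```python
# The prediction of the fitted OLS model is a coefficient-weighted sum of the
# lagged features, plus the optional intercept:
#   signal = intercept + coef_1 * feature_1 + coef_2 * feature_2 + ...
model = search.best_estimator_                        # fitted LinearRegression
signal = model.intercept_ + X[current].to_numpy() @ model.coef_
# ...which is equivalent to model.predict(X[current])
```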

OLS failed to outperform conceptual parity on average for the three types of macro strategies. Whilst the accuracy of OLS signals was higher, balanced accuracy, forward correlation coefficients, and PnL performance ratios were all lower. Also, the market benchmark correlation of OLS-based strategies was, on average, higher. The underperformance of OLS mainly arose in the FX space. It reflected the learning method’s preference for regression models with intercepts from 2008 to 2014, translating the strong season for FX returns of the earlier 2000s into a positive bias for signals during and after the great financial crisis.

The empirical analysis provided two important lessons:

  • Only allow constants if there is a good reason. The main motivation for allowing regression constants is that we do not know the neutral threshold of a feature. However, if features have approximate neutral levels near their zero values, the intercept of an OLS regression will mainly pick up the historic long-term performance of the traded asset class. If this historic return average reflects structural risk premia, this may be acceptable. However, if the regression intercept picks up longer performance seasons, it will simply extrapolate past return averages into the future.
    This extrapolation risk is particularly acute with shorter data samples. Interestingly, in all strategy types, the statistical learning process ultimately preferred a regression model without intercept towards the end of the sample period (2024).
  • Don’t compare regression signals and fixed-weight signals by correlation metrics. Generally, regression-based signals are less suitable than parity scores for producing high linear correlation. That is because the calculation method changes. Regression-based signal variation does not arise merely from feature variation but from changes in model parameters and hyperparameters. And the latter sources of variation have no plausible relation to target return. For example, in the empirical analyses of the duration strategy, the OLS signals produce lower predictive correlation but higher accuracy and balanced accuracy and almost the same strategy performance ratios.

Non-negative least squares (NNLS) versus OLS-based signal

We test the consequences of using non-negative least squares (NNLS) versus ordinary least squares. On both sides, the only important hyperparameter is the inclusion of an intercept.

NNLS is a regression technique used to approximate the solution of an overdetermined system of linear equations with the additional constraint that the coefficients must be non-negative. This is like placing independent half-flat priors on the feature weights, using the usual Gaussian likelihood, in a Bayesian linear regression context. The main advantage of NNLS is that it allows consideration of theoretical priors on the direction of impact, reducing dependence on scarce data. If all regressors are formulated such that theoretical priors stipulate a positive relation with the targets, the NNLS will pre-select regressors by theoretical plausibility. Compared to OLS, NNLS tends to reduce variance and increase bias.
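
In scikit-learn, the non-negativity restriction is available as a flag on the standard linear regression estimator, so offering the learning process a choice between OLS and NNLS amounts to extending the hyperparameter grid, as in the sketch below; a plain KFold again stands in for the panel-aware cross-validation, and `X_train`/`y_train` denote a hypothetical training slice of the panel.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

grid = {
    "fit_intercept": [True, False],  # intercept choice, as in the OLS section
    "positive": [True, False],       # True imposes non-negative coefficients (NNLS)
}
search = GridSearchCV(LinearRegression(), grid, scoring="r2", cv=KFold(5))
# search.fit(X_train, y_train); signal = search.predict(X_latest)
```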

NNLS-based learning outperforms OLS-based learning on all averages of performance metrics. PnL outperformance is modest but consistent across time and types of strategies. Also, learning processes that are offered both OLS and NNLS predominantly prefer NNLS, i.e., the restriction of coefficient non-negativity.

The empirical analysis provided two important lessons:

  • NNLS produces greater model stability. This is mainly because NNLS excludes all theoretically implausible contributors to the signals and thus reduces the model construction options of the learning process. This is particularly beneficial when we have many correlated feature candidates and limited data history.
  • The benefits of NNLS may only show very gradually. In our data example, NNLS is not a game changer compared to OLS. Signals are broadly similar, which is not surprising, given that we only used a small set of features, most of which are conceptually different. However, long-term correlations and performance ratios were higher for all strategies over the 20-year periods.

Elastic net-based signal versus least-squares-based signal

We test the characteristics and performance of elastic net-based regression signals relative to standard least squares (OLS and NNLS) signals. The elastic net-based learning process can choose from a larger selection of hyperparameters, governing the strength of regularization and the type of penalty imposed.

An elastic net is a flexible form of regularized regression. Generally, regularization is any technique employed with the aim of reducing generalization error. Here, regularization adds penalties to a model’s objective function in accordance with the size of coefficients in order to prevent overfitting. In the case of regression, the Lasso and Ridge models are employed to that end. Lasso penalizes the absolute size of coefficients (L1 penalty), shrinking coefficients linearly and possibly all the way to zero. Ridge penalizes the squared size of coefficients (L2 penalty), which shrinks the value of coefficients without setting them to zero. Elastic net combines both L1 and L2 penalties.
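
In scikit-learn terms, an elastic net grid might look as follows; the alpha and l1_ratio values are purely illustrative and kept small in the spirit of the lesson on over-regularization below, while `positive=True` adds the non-negativity restriction discussed earlier.

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

grid = {
    "alpha": [0.001, 0.01, 0.1],     # overall penalty strength, kept modest
    "l1_ratio": [0.1, 0.5, 0.9],     # share of L1 (Lasso) versus L2 (Ridge) penalty
    "positive": [True, False],       # optional non-negativity restriction
    "fit_intercept": [True, False],
}
search = GridSearchCV(ElasticNet(), grid, scoring="r2", cv=KFold(5))
```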

In our data examples, elastic net, on average, produced signals with higher accuracy but lower correlation and PnL performance ratios. For the duration and equity strategies, the elastic net produced PnL profiles very similar to those of OLS-based signals. However, the elastic net-based learning process “overregularized” features for the FX space and failed to produce non-zero signals prior to 2008 and after 2018. For all strategies, the elastic net-based learning process preferred models with non-negative coefficient restrictions.

The empirical analysis provides two important lessons:

  • Elastic net may make excessive demands on the quality of financial return predictors. Regularized regressions that include a heavy L1 penalty can easily remove all features if these are few in number and of limited quality. And high predictive quality does not come easily for financial returns: even the best features explain only a small share of return variance at higher frequencies. Hence, in practice, it may be beneficial to allow only small “alphas,” i.e., small penalties for coefficient size, and to disallow a dominant weight of L1 regularization.
  • Elastic net has a penchant for sporadic instability of signals. This arises from the greater number of hyperparameters that the statistical learning process can choose from. Hyperparameter instability is consequential for transaction costs and recorded signal-return correlation.

Time-weighted regression signal versus unweighted regression signal

We test the consequences of using time-weighted least squares regression (TWLS) in the learning process, with or without non-negative coefficient restrictions, versus standard least squares (OLS and NNLS).

Weighted least squares (WLS) is a form of generalized least squares that increases the importance of some samples relative to others. Time-weighted least squares (TWLS) allows prioritizing more recent information in the model fit by defining a half-life of exponential decay in units of the native dataset frequency. The half-life of the decay is one of the hyperparameters that the learning process determines over time.
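
A hand-rolled approximation of TWLS can be built with scikit-learn by constructing exponential-decay weights from a half-life and passing them via the sample_weight argument, as sketched below; the function name twls_fit and the default half-life of 24 months are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def twls_fit(X_train, y_train, halflife: int = 24, positive: bool = True):
    """Time-weighted least squares on a (cid, real_date) panel: the most recent
    period gets weight 1, a period `halflife` steps older gets weight 0.5."""
    date_level = y_train.index.get_level_values("real_date")
    unique_dates = date_level.unique().sort_values()
    age = (len(unique_dates) - 1) - np.arange(len(unique_dates))  # 0 for latest period
    weights = pd.Series(0.5 ** (age / halflife), index=unique_dates)
    sample_weight = date_level.map(weights).to_numpy()
    return LinearRegression(positive=positive).fit(X_train, y_train, sample_weight=sample_weight)
```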

On average, using time-weighted least squares in the learning process has produced modestly higher predictive accuracy, correlation and PnL performance ratios.

There are a few important empirical lessons (and warnings) with respect to this method:

  • The TWLS-based learning process tends to produce greater signal instability. This happens for two reasons. First, there is instability due to the greater choice of hyperparameters, i.e., the half-life of the lookback windows. Second, there is greater parameter instability due to the concentration of the estimation window on more recent data.
  • Beware of TWLS models that simulate trend following. Generally, the learning process with TWLS models uses constants more than the OLS/NNLS-based process. This seems to be another consequence of the focus on more recent history. Recent seasonality of returns or omitted explanatory variables leads to better cross-validation results for models with a constant. However, in this way, the TWLS constants become estimates of recent return trends, particularly if shorter half-lives are chosen. The only way to cleanse the regression-based signal of such trend following is to disallow the use of a constant.
  • TWLS methods like non-negativity restrictions. For all asset class strategies, time-weighted least squares almost exclusively use non-negative least squares, i.e., impose a restriction to coefficients in line with theory. Here, the behavior of hyperparameter optimization is in line with the theory: shorter effective lookback periods call for more restrictions as the bias-variance trade-off is quite poor.

Sign-weighted regression signal versus unweighted regression signal

We test the consequences of using sign-weighted least squares regression (SWLS) in the learning process, with or without non-negative coefficient restrictions, versus standard least squares (OLS and NNLS).

Sign-weighted least squares (SWLS) equalizes the contribution of positive and negative samples to the model fit. If, for example, returns are predominantly positive, historic observations with negative target returns are assigned higher weights than those with positive returns. This mitigates directional bias in general and largely removes any bias that manifests through the regression constant.
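
A sign-weighted fit can be approximated in the same way via sample weights; in the sketch below, the function name swls_fit is an illustrative assumption, and zero returns are grouped with negatives purely for simplicity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def swls_fit(X_train, y_train, positive: bool = True):
    """Sign-weighted least squares: positive- and negative-return samples each
    contribute half of the total weight, regardless of how frequent they are."""
    pos = y_train.to_numpy() > 0                      # zero returns grouped with negatives
    sample_weight = np.where(pos, 0.5 / pos.sum(), 0.5 / (~pos).sum())
    return LinearRegression(positive=positive).fit(X_train, y_train, sample_weight=sample_weight)
```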

On average, across the strategies, statistical learning with sign-weighted least squares produces slightly higher correlation and PnL performance ratios than least squares. Importantly, the average benchmark correlation of strategies has been very low (around 5%) versus 25% for the least squares-based signal.

The main empirical lessons reflect the purpose of SWLS:

  • SWLS-based learning reduces directional bias and benchmark correlation. Since the method weighs positive and negative return experiences equally, all directional bias that arises from the seasonality of returns (such as an equity market boom) or an omitted variable (like a long-term premium) is removed. This is echoed by the removal of the long bias across all our sample strategies. Such a complete removal is desirable if the experiences of (rarer) negative return periods are truly more valuable than those of positive return periods, for example, because we would expect their weight to increase or normalize in the future.
  • SWLS likes to work with non-negativity restrictions. In our examples, SWLS learning would have mostly chosen models with non-negative coefficient restrictions. This may be a sign of suitability for implementing theoretical priors across different asset return seasons.

Least absolute deviation regression signal versus least squares regression signal

We test if a learning process with least absolute deviation (LAD) regression (using unweighted, time-weighted, and sign-weighted versions) produces better regression-based signals than a process that uses the corresponding least-squares versions.

LAD regression is median regression, i.e., a special case of quantile regression. It is a robust regression method that is less sensitive to outliers than standard least squares regression. Least squares can compromise the message of the many for the sake of a few, specifically extreme values. LAD mitigates this issue by using absolute values of errors rather than their squares.
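
In scikit-learn, LAD regression can be run as median regression with the QuantileRegressor, as sketched below; setting alpha to zero switches off the additional L1 penalty that this estimator applies by default, and time- or sign-weighted variants can be obtained by passing the sample weights from the previous sections to its fit method.

```python
from sklearn.linear_model import QuantileRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Median (0.5-quantile) regression minimizes absolute deviations rather than squares.
lad = QuantileRegressor(quantile=0.5, alpha=0.0)
grid = {"fit_intercept": [True, False]}
search = GridSearchCV(lad, grid, scoring="r2", cv=KFold(5))
```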

LAD regression does not generally improve signal quality. Average accuracy and balanced accuracy for our strategy types have been higher than for least squares, but correlation and portfolio performance ratios have been smaller.

  • LAD is not generally a game-changer for macro signals. Even though economic data and financial returns are prone to outliers, these are often not large enough to bring out the full benefits of the LAD approach. This may reflect that with the low average explanatory power of features with respect to future financial returns, regressions rarely produce large coefficients in the first place, and the main job of the regression is really selecting features and weighing them relative to each other.
  • LAD also likes to work with non-negativity restrictions. For all strategies, the most frequently chosen LAD and LS versions use the non-negative coefficient restrictions. This is a reminder of the benefits of theoretical priors, at least regarding the direction of feature impact.

KNN regression signal versus least squares regression signal

We test how KNN regression-based signals perform compared with signals derived from least squares (OLS/NNLS)-based statistical learning.

All the above models are linear and parametric. The KNN class of models makes predictions by averaging the k nearest training samples, possibly taking a weighted average based on sample distance. In this context, this leads to return predictions that are based on the most similar feature constellations of the past. For macro signals, this reduces reliance on theoretical priors (and probably enhances model variance) for the sake of less model bias.
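
A KNN regression pipeline in scikit-learn terms might look as follows; the neighbor counts are illustrative, and a scaling step is included because distance-based methods need features in comparable units.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, KFold

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])
grid = {
    "knn__n_neighbors": [16, 32, 64, 128],       # illustrative neighbor counts
    "knn__weights": ["uniform", "distance"],     # equal or distance-based averaging
}
search = GridSearchCV(pipe, grid, scoring="r2", cv=KFold(5))
```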

Signals of KNN-based learning are very different from the least squares-based signals. Average performance metrics are worse than for least squares-based signals.

There have been lessons from the application of KNN regression as well:

  • KNN is for problems with few theoretical clues. KNN-based learning operates with little theory and restrictions. Moreover, key hyperparameters, such as the number of neighbors, lack clear theoretical guidance. This explains why its regression signals are more at the mercy of past experiences and why the optimal model changes often.
  • Pre-selection matters most. KNN may be very different from linear regression, but the signals of these two learning methods are still highly correlated, and their PnL profiles are similar. This hammers home the truth that the selection of a good, plausible set of predictors is often far more important than the applied learning method and emphasizes the paramount importance of data quality.
Ralph Sueppel is managing director for research and trading strategies at Macrosynergy. He has worked in economics and finance since the early 1990s for investment banks, the European Central Bank, and leading hedge funds.