Optimizing macro trading signals – A practical introduction


Based on theory and empirical evidence, point-in-time indicators of macroeconomic trends and states are strong candidates for trading signals. A key challenge is to select and condense them into a single signal. The simplest (and often successful) approach is conceptual risk parity, i.e., an equally weighted average of normalized scores. However, there is scope for optimization. Statistical learning offers methods for sequentially choosing the best model class and other hyperparameters for signal generation, thus supporting realistic backtests and automated operation of strategies.
This post and an attached Jupyter Notebook show implementations of sequential signal optimization with the scikit-learn package and some specialized extensions. In particular, the post applies statistical learning to sequential optimization of three important tasks: feature selection, return prediction, and market regime classification.

The post below is based on the proprietary research and development work of Macrosynergy.

The post should be viewed in conjunction with the code of the related Jupyter Notebook available here.
For free data access, use an equivalent Jupyter Notebook on Kaggle here.

Benefits and challenges of statistical learning for macro trading signals

Statistical learning provides methods that extract insights from datasets. Its power increases with the availability of quantitative information. Not only does statistical learning support the estimation of relations across variables (parameters), but it also governs the choice of models for such estimates (hyperparameters). Moreover, for macro trading, statistical learning has another major benefit: it allows realistic backtesting. Rather than choosing models and features arbitrarily and potentially with hindsight, statistical learning can simulate a rational rules-based choice of method in the past.

Compared with other research fields, data on the relation between macroeconomic developments and modern financial market returns are scant. This reflects the limited history of modern derivatives markets and the rarity of critical macroeconomic events, such as business cycles, policy changes, or financial crises. While many data series and points are available, occurrences of major shocks and trends are limited. For example, since 1990, the United States has experienced only four business cycles, according to the NBER, and only one major financial crisis. Hence, data on macroeconomic and related financial market experiences are rare and precious.

This is equally true for point-in-time economic information states, such as the data of the J.P. Morgan Macrosynergy Quantamental System (JPMaQS), which is designed for backtesting trading strategies. Although the data typically spans three decades with daily-frequency information, each country’s underlying dominant macro developments are infrequent.

This has two major implications for the optimization of macro trading signals through statistical learning:

Statistical learning must often use data panels, i.e., draw on the experience of multiple and diverse currency areas or countries over time. Using such two-dimensional datasets calls for special methods of cross-validation and hyperparameter optimization. These are discussed below.
Statistical learning for macro trading signals has a less favorable bias-variance trade-off than for other areas of quantitative research. This means that as we move from restrictive to flexible models, the benefits of reduced bias typically come at a high price of enhanced variance. This reflects the scarcity and seasonality of macro events and regimes. Hence, one typically needs to be parsimonious in delegating decisions to statistical learning and must emphasize reasonable and logical priors.

A simple, practical approach

We show how statistical learning can practically support trading signal generation through sequential optimization based on panel cross-validation. In particular, we focus on three principal tasks:

Statistical learning can support selecting constituent data series for a composite signal from a set of candidates.
Statistical learning can specify and apply a regression method for translating a set of factors into a single return prediction.
Statistical learning can help choose and apply a classification model to detect favorable or unfavorable regimes for the target returns of a strategy.

The code in the associated Jupyter Notebook relies mainly on the scikit-learn machine learning package for Python, a popular standard for many statistical learning tasks. The notebook also uses the Macrosynergy package, which includes specialized functions for extending scikit-learn’s application to panels and for downloading and analyzing macro-quantamental data.

The proposed general method for selection, prediction, or classification uses an inner evaluation to sequentially optimize the signal method and an outer evaluation to assess the performance of the overall optimization process. Its implementation in scikit-learn can be summarized in six steps:

Specify pandas data frames of features and targets at the appropriate frequency. In the features data frame, the columns are indicator categories, and the rows are double indices of currency areas and time periods. The targets are a double-index pandas series of (lagged) target returns.
Define hyperparameter grids according to standard scikit-learn conventions. These mark the eligible set of model classes and other hyperparameters over which the statistical learning process optimizes based on inner cross-validation.
Choose an optimization criterion for the inner cross-validation, such as accuracy, R-squared, or a Sharpe ratio of stylized PnLs. For macro panel data some of these criteria require special scoring functions of the Macrosynergy package.
Specify time series panel splits for inner cross-validation using a specialized function of the Macrosynergy package.
Perform sequential optimization with the Macrosynergy package’s specialized class that applies scikit-learn cross-validation to the specific format of macro-quantamental panel data. Based on this optimization, one can extract optimal signals in a standard format and run diagnostics on the stability of the choice of optimal models over time.
Evaluate the sequentially optimized signals in terms of predictive power, accuracy, and naïve PnL generation.

This general process is used for the three principal optimization tasks in the sections below. Its practical application with full example code can be found in the Jupyter notebook for this post.

The example data set

For this present post, we optimize a signal for trading 5-year interest rate swaps (IRS) based on daily macroeconomic information from 2000 to December 2023. Target and feature data have been taken from the J.P. Morgan Macrosynergy Quantamental System (“JPMaQS”).

Both features and targets have been taken for 22 developed and emerging economies with liquid interest rate swap markets, which are Australia (AUD), Canada (CAD), Switzerland (CHF), Chile (CLP), Colombia (COP), Czech Republic (CZK), the euro area (EUR), the UK (GBP), Hungary (HUF), Israel (ILS), India (INR), Japan (JPY), South Korea (KRW), Mexico (MXN), Norway (NOK), Poland (PLN), Sweden (SEK), Thailand (THB), Turkey (TRY), Taiwan (TWD), the U.S. (USD), and South Africa (ZAR).

Macro quantamental indicators are distinct from regular economic time series insofar as they are based solely on information that was available at the end of the day for which they are recorded. Hence, these data can be compared to market price data and are well-suited for backtesting trading ideas. While JPMaQS is generally a premium data service for professional and institutional investors, the data set used here is freely available and can be downloaded, for example, from Kaggle [access here]. The four quantamental indicator categories used as features for this post are:

Excess GDP growth: This is the latest estimated GDP growth trend, % over a year ago, 3-month moving average minus a median of that country’s actual GDP growth rate over the past five years. The latest GDP growth trend is estimated based on actual national accounts and monthly activity data based on sets of regressions that replicate conventional charting methods in markets (view documentation). The link between excess GDP growth and subsequent IRS fixed receiver returns is presumed to be negative.
Excess inflation: This is the difference between information states of consumer price inflation (view documentation) and a currency area’s estimated effective inflation target (view documentation). Consumer price inflation is approximated as the average of a headline and a core CPI growth measure. The link between excess inflation and subsequent IRS fixed receiver returns is presumed to be negative.
Excess private credit growth: This is the difference between annual growth rates of private credit that are statistically adjusted for jumps (view documentation) and the sum of a currency area’s 5-year median GDP growth and effective inflation target. The link between excess private credit growth and subsequent IRS fixed receiver returns is presumed to be negative.
Real 5-year yield: This real yield is calculated as the 5-year swap yield (view documentation) minus 5-year ahead estimated inflation expectation according to a Macrosynergy methodology (view documentation). The real swap yields are presumed to have positive predictive power with respect to swap returns.

The targets are returns on 5-year IRS fixed receiver positions as % of risk capital on position scaled to a 10% (annualized) return volatility target, assuming monthly roll (view documentation). The volatility targeting makes statistical risk comparable across the diverse set of developed and emerging markets.

The dataset also includes “tradability dummies” for the various currency areas. These are based on FX forward markets, not IRS, but here, they serve as a proxy to exclude periods of illiquidity, capital controls, and currency pegs that compromise the validity of the data in certain countries and time periods.

Signal optimization is based on the above indicator categories, i.e., panels of indicators across all 22 currency areas, sequentially normalized around their zero values based on panel standard deviations up to their reference date, “winsorized” (capped or floored) at three standard deviations and multiplied by their presumed sign of directional impact.
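The normalization described above can be sketched in a few lines of pandas. This is a minimal illustration and not the Macrosynergy implementation: the function name `sequential_zn_score`, the `min_obs` threshold, and the random demo data are assumptions, and the "panel standard deviation around zero" is interpreted here as the root mean square of all panel values observed up to each date.

```python
import numpy as np
import pandas as pd

def sequential_zn_score(panel: pd.DataFrame, sign: int = 1,
                        cap: float = 3.0, min_obs: int = 20) -> pd.DataFrame:
    """Zero-neutral panel scores. Rows are dates, columns are cross-sections."""
    # Cumulative sum of squares and observation counts across the whole panel,
    # so each date only uses information available up to that date.
    cum_sq = panel.pow(2).sum(axis=1).cumsum()
    cum_n = panel.notna().sum(axis=1).cumsum()
    # Deviation is measured around zero (the presumed neutral level), not the mean.
    panel_sd = np.sqrt(cum_sq / cum_n).where(cum_n >= min_obs)
    # Divide by the point-in-time panel deviation and winsorize at +/- cap.
    scores = panel.div(panel_sd, axis=0).clip(lower=-cap, upper=cap)
    # Multiply by the presumed sign of directional impact.
    return sign * scores

# Hypothetical demo data: three cross-sections, 300 business days
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=300)
demo = pd.DataFrame(rng.normal(0, 2, size=(300, 3)),
                    index=dates, columns=["AUD", "EUR", "USD"])
scores = sequential_zn_score(demo, sign=-1)
print(scores.describe())
```

Scores are left missing until the panel has accumulated `min_obs` observations, which avoids dividing by a deviation estimated from a handful of points.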

The benchmark for the success of all optimization exercises below is a simple unweighted composite score of the four normalized indicator categories, each multiplied by the sign of its presumed directional impact. This means we check if sequential optimization improves the predictive power and stylized value generation relative to an unweighted composite. This is a high hurdle. If feature candidates are based on good reason and logic, such a diversified “conceptual parity” benchmark for a wide set of countries is not easy to beat (see previous post on double diversification benefits).

Cross-validation for global macro signal models

Cross-validation is an assessment of the predictive quality of a model based on multiple splits of the data into training and test sets, where each pair is called a “fold.” It typically serves the purpose of hyperparameter tuning. Validation is not the same as testing. Testing evaluates tuned models and assesses the predictive quality of a tuning method.

Here, cross-validation is used for forming folds of training and test (validation) sets at each step of the expansion of the feature and target panels over time. This simulates the experience of an investor who uses optimization at each re-allocation date to select the best model and obtain a signal from this optimal model.

In scikit-learn, cross-validation can be implemented through various functions of the model_selection module, such as cross_validate or cross_val_score. These require passing as an argument a cross-validation splitter that governs the formation of training and test sets, typically in multiple folds. For time series data, there are two popular types of splitting principles:

Time-series splitting constructs folds by rolling forward in time with an expanding training set and a forward-sliding test set. The TimeSeriesSplit class of scikit-learn produces train/test indices to split samples at fixed time intervals with ascending time indices.
K-fold splitting builds its folds based on adjacent sections of the sample. Applied to time series, it would not only predict future data with the past data but also (unseen) past data with the future data. Scikit-learn’s KFold class produces train/test indices in the form of k consecutive folds (provided `shuffle` is set to False, as per default).

However, none of these standard methods can be directly applied to panel data of global macro models. Cross-validation splitters for panel data must ensure the logical cohesion of the training and test sets based on a double index of cross-sections and time periods, ensuring that all sets are sub-panels over common time spans and respecting missing or blacklisted time periods for individual cross-sections.

For this purpose, the Macrosynergy package implements three related classes for specifying panel cross-validation splits. They are designed to work on double-indexed feature matrices and (lagged) target vectors in the same way as standard scikit-learn splitters work on single-index data structures.
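The difference between the two splitting principles is easiest to see on a toy example. The snippet below is a minimal sketch using 12 time-ordered observations (an arbitrary choice) and prints the index sets that each splitter produces.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations in chronological order

ts_folds = list(TimeSeriesSplit(n_splits=3).split(X))
kf_folds = list(KFold(n_splits=3).split(X))  # shuffle=False by default

for name, folds in [("TimeSeriesSplit", ts_folds), ("KFold", kf_folds)]:
    print(name)
    for train_idx, test_idx in folds:
        print("  train:", train_idx.tolist(), "test:", test_idx.tolist())
```

With TimeSeriesSplit every training index precedes every test index. With KFold, all but the last fold train on observations that come after the test block.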

The ExpandingIncrementPanelSplit class initiates splitters of temporally expanding training panels at fixed intervals followed by subsequent test sets of (typically short) fixed time spans. This simulates sequential learning with growing information sets in fixed intervals.

The ExpandingKFoldPanelSplit class allows instantiating panel splitters where a fixed number of splits is implemented, but temporally adjacent panel training sets always precede test sets chronologically and where the time span of the training sets increases with the implied date of the train-test split. It is equivalent to scikit-learn’s TimeSeriesSplit but adapted for panels.

The RollingKFoldPanelSplit class instantiates splitters where temporally adjacent panel training sets of fixed joint maximum time spans can border the test set from both the past and future. Thus, most folds do not respect the chronological order but allow training with past and future information. While this does not simulate the evolution of information, it makes better use of the available data and is often acceptable for macro data as economic regimes come in cycles. It is equivalent to scikit-learn’s KFold class but adapted for panels.

For the subsequent macro signal optimization examples, we use the RollingKFoldPanelSplit class with four folds since the data are scarce, and most indicators are related to longer and shorter economic cycles.
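The idea behind a K-fold split on a panel can be illustrated without the Macrosynergy package. The function below is a simplified, hypothetical stand-in and not the package's implementation: it cuts the unique dates into contiguous blocks and assigns all cross-sections observed in a block to the same test set, so that folds remain sub-panels over common time spans. The index level names `cid` and `real_date` and the unbalanced demo panel are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rolling_kfold_panel_split(index: pd.MultiIndex, n_splits: int = 4):
    """Yield (train_positions, test_positions) with contiguous date blocks as test sets."""
    dates = index.get_level_values("real_date")
    unique_dates = dates.unique().sort_values().to_numpy()
    for block in np.array_split(unique_dates, n_splits):
        in_test = dates.isin(block)  # same time span for all cross-sections
        yield np.flatnonzero(~in_test), np.flatnonzero(in_test)

# Hypothetical unbalanced panel: USD has 24 months, TRY only the last 12
months = pd.date_range("2000-01-01", periods=24, freq="MS")
idx = pd.MultiIndex.from_tuples(
    [("USD", d) for d in months] + [("TRY", d) for d in months[12:]],
    names=["cid", "real_date"],
)

folds = list(rolling_kfold_panel_split(idx, n_splits=4))
all_dates = idx.get_level_values("real_date")
for train_pos, test_pos in folds:
    print("test span:", all_dates[test_pos].min().date(), "to",
          all_dates[test_pos].max().date(), "| test obs:", len(test_pos))
```

Because the split is made on dates rather than rows, a cross-section with a shorter history simply contributes fewer observations to the early folds.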

Optimized feature selection method

In this first example, the statistical learning method chooses sequentially an optimal method for selecting feature scores and then applies the best method to return the optimized selection for each recalibration date. Based on this optimal selection, the method calculates an average score. Thus, the optimal signal used at each recalibration date is an equally weighted mean of the subset recommended by the best model up to that date.

For sequential optimized selection (and all other optimization methods below), features and targets have been converted from daily frequency to monthly by taking the last information state of features and the sum of next month’s IRS fixed receiver returns. The monthly frequency is a realistic assumption of how often a trading system would update coefficients and trade on new parameters and balances the benefits of parameter optimization against transaction costs. Due to the minimum data requirements, optimal model and feature selection produce signals only from 2003.
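The frequency conversion can be sketched as follows for a single cross-section. The column names and the random data are assumptions for illustration; for a panel, the same steps would be applied per cross-section.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data for one cross-section
days = pd.bdate_range("2023-01-02", "2023-04-28")
rng = np.random.default_rng(1)
daily = pd.DataFrame(
    {"feature": rng.normal(size=len(days)),
     "ret": rng.normal(scale=0.5, size=len(days))},
    index=days,
)

period = daily.index.to_period("M")
monthly_x = daily["feature"].groupby(period).last()  # last information state of the month
monthly_r = daily["ret"].groupby(period).sum()       # sum of daily returns in the month

# Align each month-end feature with the NEXT month's return
target = monthly_r.shift(-1).rename("next_ret")
data = pd.concat([monthly_x, target], axis=1).dropna()
print(data)
```

The final month drops out because its subsequent return is not yet known, which is what a live signal would face as well.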

The choice is made over two principal models and a set of model hyperparameters for each.

The first principal approach uses a Least Absolute Shrinkage and Selection Operator (LASSO) to determine the set of features that has jointly been significant in predicting returns in a linear regression. The LASSO is a regression type that uses “L1 regularization”, i.e., it adds a penalty term to the regression loss function that is linearly proportional to the coefficients’ absolute values. Coefficients of features that display little predictive power are thus shrunk to zero. In scikit-learn, this is implemented through the Lasso class of the linear_model module. For macro feature selection, one should force all coefficients of presumed positive-impact features to be positive. This restriction ensures that only features that predict with the theoretically correct sign are considered. The main hyperparameter that requires empirical validation is the constant determining the linear penalty on coefficient size, commonly denoted as alpha. We consider alpha values of 10, 1, 0.1, and 0.01 for the hyperparameter grid.
The second principal approach assesses the significance of features through the Macrosynergy panel test and then selects at each date all features that have individually been significant return predictors. The panel test respects the data structure of features that are indexed by time and cross sections. Simply stacking data for regression leads to “pseudo-replication” and overestimated coefficient significance. This is avoided through panel regression models with period-specific random effects, which adjust targets and features of the predictive regression for common (global) influences (view post here). To apply this test for selection in a scikit-learn pipeline, one can use the MapSelector class of the Macrosynergy package’s learning module. The main hyperparameter that requires empirical validation is the acceptable threshold of the p-value, i.e., of the probability of a measured relation being accidental. We consider p-values of 1%, 5%, 10%, and 20% for the selection hyperparameter grid.
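The first approach can be sketched with plain scikit-learn. The example below is an illustration under stated assumptions: the data are simulated (only the first two of four features carry signal), a standard `KFold` splitter and the R-squared criterion stand in for the panel splitter and panel scorer, and features with a non-zero coefficient are treated as selected.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Hypothetical data: four sign-adjusted feature scores, two of them informative
rng = np.random.default_rng(42)
n = 400
X = rng.normal(size=(n, 4))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=1.0, size=n)

search = GridSearchCV(
    estimator=Lasso(positive=True),            # coefficients constrained to be >= 0
    param_grid={"alpha": [10, 1, 0.1, 0.01]},  # penalty grid from the text
    cv=KFold(n_splits=4),
    scoring="r2",
)
search.fit(X, y)

coefs = search.best_estimator_.coef_
selected = np.flatnonzero(coefs > 0)  # features that survive the L1 penalty
print("best alpha:", search.best_params_["alpha"])
print("selected feature columns:", selected.tolist())
```

With large penalties such as 10 or 1, all coefficients on unit-variance scores tend to be shrunk to zero, so the lower alpha values usually decide which features remain.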

The actual sequential model selection and optimized signal calculation can be executed by the Macrosynergy package’s SignalOptimizer class. This class uses scikit-learn’s GridSearchCV and RandomizedSearchCV but specifically handles the calculation of quantamental predictions based on adaptive hyperparameters and model selection. This means it optimizes the selection of the method over the full set of hyperparameter values of the two models. As an optimization criterion, one can use the probability of the significance of the optimized signal in the test sets using the panel_significance_probability score of the Macrosynergy package.
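The logic of sequential optimization can be sketched as an explicit loop. This is not the SignalOptimizer API but a simplified stand-in under stated assumptions: a single simulated time series instead of a panel, a LASSO-only grid, a standard `KFold` splitter, and the R-squared criterion instead of the panel significance score.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Hypothetical monthly data: y[t] is the return that follows the features X[t]
rng = np.random.default_rng(7)
periods = pd.period_range("2000-01", periods=120, freq="M")
X = pd.DataFrame(rng.normal(size=(120, 4)), index=periods, columns=list("ABCD"))
y = 0.4 * X["A"] + rng.normal(size=120)

min_periods = 36  # minimum history before the first signal is produced
signals, chosen = {}, {}
for t in range(min_periods, len(periods)):
    # Inner cross-validation uses only feature-target pairs realized before date t
    search = GridSearchCV(
        Lasso(positive=True),
        {"alpha": [1, 0.1, 0.01]},
        cv=KFold(n_splits=4),
        scoring="r2",
    )
    search.fit(X.iloc[:t], y.iloc[:t])
    chosen[periods[t]] = search.best_params_["alpha"]       # model chosen at date t
    signals[periods[t]] = search.predict(X.iloc[[t]])[0]    # signal for the next period

signals = pd.Series(signals)
chosen = pd.Series(chosen)
print(chosen.value_counts())
```

The `chosen` series records which hyperparameter won at each recalibration date, which is the kind of diagnostic used to judge the stability of model choice over time.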

Over time, there has been frequent change across six types of models, particularly in the first ten years of optimization. However, with a growing data set, the learning method clearly preferred a LASSO selector with a low penalty (0.01) and a panel test selector with a restrictive p-value threshold of 1%.

A trading signal based on optimized feature selection shares most medium-term fluctuations of the non-optimized simple feature score average. However, it also displays notable and protracted differences and a greater proclivity to abrupt changes and volatility due to model changes. Optimized signals were only calculated for cross-section periods where markets were liquid and tradable using the “tradability dummies” referenced above.

In this example, sequential optimization of feature selection has improved the composite trading signal’s predictive power and stylized economic value. For the 22 markets and period 2003-2023, monthly accuracy and balanced accuracy of the prediction of the direction of subsequent returns increased from roughly 54% for the non-optimized score to around 55% for the optimized selection.
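For reference, the two accuracy measures can be computed with scikit-learn as sketched below. The data are simulated with a positive return drift (an assumption for illustration) to show why balanced accuracy, the average of the hit rates for positive and negative returns, is a useful complement to plain accuracy when returns are mostly positive.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical signal and subsequent returns with a positive drift
rng = np.random.default_rng(5)
signal = rng.normal(size=500)
ret = 0.2 * signal + 0.3 + rng.normal(size=500)

y_true = ret > 0     # realized direction of subsequent returns
y_pred = signal > 0  # direction implied by the signal

acc = accuracy_score(y_true, y_pred)
bal = balanced_accuracy_score(y_true, y_pred)
print(f"accuracy: {acc:.3f}, balanced accuracy: {bal:.3f}")
```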

For the final evaluation of the optimization gains, we calculate naïve PnLs based on standard rules used in most Macrosynergy posts:

Positions are taken based on optimized and non-optimized feature scores in units of vol-targeted risk in IRS receivers. This means one unit of signal translates into one unit of expected PnL volatility in a specific cross section and holding period.
Positions are rebalanced monthly at the beginning of the month based on the last signal at the end of the previous month, allowing for a one-day slippage for trading.
The long-term volatility of the PnL for positions across all currency areas has been normalized to 10% annualized, mainly for the purpose of graphic displays.

This naïve PnL calculation method does not consider transaction costs or risk management rules, as those are specific to institutional settings.
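The rules above can be approximated at monthly frequency as in the sketch below. It is a simplification under stated assumptions: the function names are hypothetical, the data are simulated, the one-day slippage is ignored because the example works with monthly rather than daily data, and the volatility scaling uses the full-sample standard deviation purely for display purposes.

```python
import numpy as np
import pandas as pd

def naive_pnl(signal: pd.DataFrame, returns: pd.DataFrame,
              vol_target: float = 10.0) -> pd.Series:
    """Monthly PnL from month-end signals and vol-targeted returns (in %)."""
    positions = signal.shift(1)  # positions follow the previous month-end signal
    raw = (positions * returns).sum(axis=1, min_count=1).dropna()  # sum over markets
    ann_vol = raw.std() * np.sqrt(12)
    return raw * vol_target / ann_vol  # scale to 10% annualized volatility

def sharpe_ratio(pnl: pd.Series, periods_per_year: int = 12) -> float:
    return pnl.mean() / pnl.std() * np.sqrt(periods_per_year)

# Hypothetical signals and returns for three markets over 60 months
rng = np.random.default_rng(3)
months = pd.period_range("2003-01", periods=60, freq="M")
sig = pd.DataFrame(rng.normal(size=(60, 3)), index=months,
                   columns=["AUD", "EUR", "USD"])
ret = 0.2 * sig.shift(1) + rng.normal(size=(60, 3))

pnl = naive_pnl(sig, ret)
print(f"naive Sharpe ratio: {sharpe_ratio(pnl):.2f}")
```

Since the scaling is a constant multiplier, it changes the level of the PnL chart but not the Sharpe ratio.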

Compared to the naïve PnL of the simple average score signal, the optimized feature selection signal displays a modest but notable pickup in performance. The naïve Sharpe ratio increases from 1-1.1 to above 1.2, and value generation has been more even across time.

Optimized return prediction method

The second statistical learning method chooses sequentially an optimal prediction method of monthly target returns and then applies its predictions as signals. Thus, at the end of each month, the method uses the optimized hyperparameters and parameters to derive a signal for the next month.

To build a grid for sequential optimization, we consider two principal regression models and a set of related hyperparameters:

The first principal approach is ordinary least-squares regression implemented through scikit-learn’s LinearRegression. We use regression with positive coefficient constraints to secure theoretically plausible predictive relations. The main hyperparameter that requires empirical validation is the inclusion of the intercept. Conceptually, all features have a neutral level near zero. While this assumption is a bit rough, fitting a constant would bias predictions to the average performance of IRS receiver positions in the training sample, which may be specific to economic and policy conditions.
As a second principal approach, we choose the k nearest neighbors regression. This is a non-parametric regression that can be implemented with scikit-learn’s KNeighborsRegressor. It predicts the target return of a feature set based on the target values of the k nearest feature data sets. In practice, once the k closest training samples to an unlabeled test sample are found, the (weighted or unweighted) mean of the training label of those k samples is used as the forecast. Important hyperparameters that require validation include the number of neighbors (we consider powers of 2 with exponents from 1 to 12, i.e., up to 4096 neighbors) and the choice of weights. For the hyperparameter grid, we consider the neighbors’ uniform (equal) weights and distance-based weights.
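A grid over both regression approaches can be expressed in scikit-learn as a list of dictionaries, with the model itself treated as a hyperparameter of a one-step pipeline. The sketch below makes several assumptions: simulated data, a standard `KFold` splitter in place of the panel splitter, and a neighbor grid capped at 256 because the number of neighbors cannot exceed the size of the training folds of this small sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Hypothetical data: four sign-adjusted feature scores and a noisy target
rng = np.random.default_rng(11)
X = rng.normal(size=(600, 4))
y = X @ np.array([0.3, 0.2, 0.1, 0.0]) + rng.normal(size=600)

pipe = Pipeline([("model", LinearRegression())])
grid = [
    {   # OLS with non-negative coefficients, with and without intercept
        "model": [LinearRegression(positive=True)],
        "model__fit_intercept": [True, False],
    },
    {   # k nearest neighbors: powers of 2 (capped here) and two weighting schemes
        "model": [KNeighborsRegressor()],
        "model__n_neighbors": [2**k for k in range(1, 9)],
        "model__weights": ["uniform", "distance"],
    },
]

search = GridSearchCV(pipe, grid, cv=KFold(n_splits=4), scoring="r2")
search.fit(X, y)
print("best model:", search.best_params_)
```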

As for feature selection, the sequential model selection and resultant optimized predictions can be executed by the Macrosynergy package’s SignalOptimizer class. The standard optimization criterion for the regression context is the coefficient of determination (R2 score).

The history of chosen selection models shows greater model choice stability than the optimized selection method. Generally, the learning method preferred the restrictive OLS regression over the more flexible neighbors regression, plausibly reflecting the unfavorable bias-variance trade-off of macro features.

Optimized regression-based predictors show similar medium-term dynamics to the simple feature score averages but consistent differences in values and marked differences in short-term dynamics. The average long bias of the optimized prediction is 72%, much higher than the bias for the average score, reflecting the influence of positive intercepts in the regression and the prevalence of positive IRS receiver returns before the 2020s.

Like optimized feature selection, optimal prediction produces signals with a higher monthly accuracy of 55% versus 54% for the simple score averages. By contrast, forward correlation coefficients have been lower, albeit remaining significant at the 1% level.

The comparison of naïve PnLs shows no big difference in overall long-term performance. The optimized prediction signal has produced a long-term (2003-2023) Sharpe ratio of 1.03 versus 1.05 for the non-optimized score. Value generation has been more even, however, for the optimized case.

Optimized return classifier method

The third statistical learning application is the classification of rates market environments into “good” or “bad” with respect to subsequent monthly IRS receiver returns. The proposed method chooses an optimal classifier of the direction of market returns sequentially and then applies the class that is predicted by the best classifier, i.e., “positive” or “negative,” as a simple binary trading signal for each currency area.

The hyperparameter grid for this optimization process is based on two conceptually different classification approaches:

Logistic regression is a popular generalized linear model used for binary classification tasks. It assumes that the probability of a case belonging to the positive class is a logistic function (sigmoid function) of a linear combination of feature values. This implies a linear relationship between the features and the log odds of the probability of the positive class. The odds of an event are defined as the ratio of the probability of a positive outcome to the probability of a negative outcome. In scikit-learn, logistic regression is implemented through the LogisticRegression class. The main hyperparameter that requires empirical validation is the inclusion of the intercept.
The K-nearest neighbors algorithm is a particularly simple classification method. The method uses a training set with known classifications and features (annotated observations) to predict the class of new observations based on features alone. The prediction is the class of the nearest annotated observations. If the k neighbors belong to different classes, the choice is made by majority vote. In scikit-learn, this is implemented through the KNeighborsClassifier class. Hyperparameters that require validation include the number of neighbors and the choice of uniform (equal) weights of the neighbors versus distance-based weights.
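The two classification approaches can be combined in one grid and tuned on balanced accuracy, as in the sketch below. The data are simulated, the neighbor grid is an arbitrary choice for a small sample, and a standard `KFold` splitter stands in for the panel splitter. The predicted class is then mapped to a binary signal of +1 or -1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical data: four feature scores and the sign of the subsequent return
rng = np.random.default_rng(21)
X = rng.normal(size=(600, 4))
latent = X @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(size=600)
y = (latent > 0).astype(int)  # 1 = positive return, 0 = negative return

neighbors = [2**k for k in range(1, 8)]  # assumed grid: 2 to 128 neighbors
pipe = Pipeline([("model", LogisticRegression())])
grid = [
    {"model": [LogisticRegression()], "model__fit_intercept": [True, False]},
    {
        "model": [KNeighborsClassifier()],
        "model__n_neighbors": neighbors,
        "model__weights": ["uniform", "distance"],
    },
]

search = GridSearchCV(pipe, grid, cv=KFold(n_splits=4), scoring="balanced_accuracy")
search.fit(X, y)

signal = 2 * search.predict(X) - 1  # map classes {0, 1} to positions {-1, +1}
print("best classifier:", search.best_params_)
```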

Again, the sequential classifier selection and resultant optimized classifications are executed by the SignalOptimizer class of the Macrosynergy package. The optimization criterion is the balanced accuracy of the monthly return sign predictions.

The sequential optimization method reveals a strong preference for classification through logistic regression without intercept. It is the most restrictive model on the menu. Restrictive model outperformance in the macro context is a recurrent theme. It reflects the high costs of model flexibility in terms of model variance, i.e., sensitivity to small fluctuations or noise in the training data. Logistic regression without intercept has been the dominant choice for the last 15 years. Only in the early years with short data samples did the process dither about the model choice.

Optimized classification has differed notably from the implied signs of simple feature score averages across time and currency areas.

Optimized classification has produced accuracy statistics similar to the non-optimized averaged feature scores. Also, naïve value generation has been almost equal. The PnL of the optimized classification signal has been a bit higher and, from 2010, more consistent. Its long-term Sharpe ratio was 1.2 versus 1.1. The simple score average outperformed in the early years when the learning process was vacillating across models. However, over the past 15 years, the optimized classifiers have produced more risk-adjusted value.