
Nowcasting macro trends with machine learning


Nowcasting economic trends can draw on a broad range of machine learning methods. These serve not only prediction performance but also the replication of past information states of the market, which supports realistic backtesting. A practical framework for modern nowcasting is the three-step approach of (1) variable pre-selection, (2) orthogonalized factor formation, and (3) regression-based prediction. Various methods can be applied at each step, in accordance with the nature of the task. For example, pre-selection can be based on sure independence screening, t-stat-based selection, least-angle regression, or iterated Bayesian model averaging. Predictive models include many non-linear methods, such as Markov switching models, quantile regression, random forests, gradient boosting, macroeconomic random forests, and linear gradient boosting. There is some evidence that regression-based machine learning methods outperform standard tree-based methods in the short samples typical of macroeconomics.

Menzie Chinn, Baptiste Meunier, and Sebastian Stumpner (2023), “Nowcasting World Trade with Machine Learning: a Three-Step Approach”

Below are quotes from the paper. Emphasis, cursive text, and text in brackets have been added for clarity.

This post ties in with this site’s summary of quantitative methods for macro information efficiency.

A three-step approach to nowcasting with machine learning

“Real-time economic analysis often faces the fact that indicators are published with significant lags…A key novelty of this paper is the use of machine learning techniques for nowcasting…One key interest of such methods lies in their ability to handle non-linearities.”

“[We] propose a three-step approach for forecasting with machine learning and large datasets. The approach works sequentially: (step 1) a pre-selection technique identifies the most informative predictors among our dataset of 600 variables; (step 2) selected variables are summarized and orthogonalized into a few factors; and (step 3) factors are used as explanatory variables in the regression…using machine learning techniques. While such pre-selection and factor extraction have been already used in the literature, our contribution is to use them in a combined framework for machine learning.”

“[Also] we test a large number of different methods for pre-selection and factor extraction in order to assess the best-performing combination.”

“[The figure below] illustrates our three-step approach…Starting from a large dataset, the first step consists in selecting a few regressors with the highest predictive ability. The pre-selected dataset is then summarized in fewer factors, which are then used in non-linear regression models…The approach is straightforward to use and highly flexible. It can be applied seamlessly to any dataset and, once operational, withstand data changes (e.g. inclusion of new data).”
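To make the sequence concrete, below is a minimal Python sketch of the three-step flow under simplifying assumptions: a balanced monthly panel, scikit-learn implementations, and hypothetical function and variable names. It illustrates the logic of the approach rather than the authors' own code.

```python
# Minimal sketch of the three-step approach (illustrative, not the authors' code).
# X: DataFrame of candidate monthly predictors; y: target series (e.g. world trade growth).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lars
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

def nowcast_three_step(X: pd.DataFrame, y: pd.Series, n_select: int = 30, n_factors: int = 5):
    # Step 1: pre-selection - keep the predictors that enter the LARS path first
    lars = Lars(n_nonzero_coefs=n_select).fit(StandardScaler().fit_transform(X), y)
    selected = X.columns[lars.active_]
    # Step 2: summarize the pre-selected block into a few orthogonal factors (PCA)
    factors = PCA(n_components=n_factors).fit_transform(
        StandardScaler().fit_transform(X[selected]))
    # Step 3: regress the target on the factors with a (non-linear) learner
    model = RandomForestRegressor(n_estimators=500, random_state=0).fit(factors, y)
    return model, list(selected)
```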

A version of this approach is "two-stage supervised learning" as proposed in a previous post. The first stage is scouting features by applying an elastic net algorithm to available data sets during the regular release cycle, which identifies competitive features based on timeliness and predictive power. Sequential scouting gives feature vintages. The second stage evaluates various candidate models based on the concurrent feature vintages and selects at any point in time the one with the best historic predictive power.
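As an illustration of the first stage, the sketch below uses scikit-learn's ElasticNetCV to "scout" features: predictors with non-zero elastic-net coefficients are retained as candidates. The function name and penalty grid are hypothetical, and the tracking of release timing that the two-stage procedure relies on is omitted here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV

def scout_features(X: pd.DataFrame, y: pd.Series) -> list:
    # Stage 1: elastic net with cross-validated penalties; keep non-zero coefficients
    Xs = StandardScaler().fit_transform(X)
    enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0).fit(Xs, y)
    return X.columns[enet.coef_ != 0].tolist()
```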

Step 1: Feature selection methods

“When forecasting with a high-dimensional dataset, the literature generally concludes that the factor models are significantly more accurate when selecting fewer but more informative predictors. On a more theoretical ground [papers] find that larger datasets lead to poorer forecasting performances when idiosyncratic errors are cross-correlated or when the variables with higher predictive power are dominated.”

“The idea underlying pre-selection is to rank regressors based on a measure of their predictive power with respect to the target variable…Pre-selection is a data-driven step made automatically…It lifts the burden of selecting variables from the forecaster since data-driven pre-selection does not require any a priori knowledge. Instead of having to select variables by hand, the forecaster can feed the full dataset into the framework.”

“We consider four techniques from the literature:”

  • “The Sure Independence Screening: regressors are ranked based on their marginal [bivariate] correlation with the target variable…It has the sure screening property that all important variables survive after applying a variable screening procedure with probability tending to 1.”
    N.B.: This method is most suitable if the number of features is very large compared to the number of observations, as it saves computation time relative to other methods such as stepwise regression. A minimal code sketch of this and the following screening rule is shown after this list.
  • “T-stat-based [selection]: each regressor is ranked based on the absolute value of the t-statistic associated with its coefficient estimates in a univariate regression on the target variable. The univariate regression also includes four lags of the dependent variable to control for endogenous dynamics.”
  • “Least-Angle Regression (LARS): this [method] accounts for the presence of the other predictors. The LARS is an iterative forward selection algorithm. Starting with no predictors, it adds the predictor most correlated with the target variable and then moves [its regression] coefficient in the direction of its least-squares estimate so that the correlation of the regressor with the residual [of the difference between target and regression prediction] gets lower. It does so until another predictor has a similar correlation with [the regression-based] residuals. At this point [the second-best regressor] is added to the active set and the procedure continues, now moving both coefficients equiangularly in the direction of their least-squares estimates, until another predictor has as much correlation with the residual.”
  • “Iterated Bayesian Model Averaging (BMA) also accounts for the presence of other regressors. This technique works by making repeated calls to a BMA procedure. BMA applies a Bayesian framework on all possible models using the set of variables; the Bayes rule then allows computing the posterior model probability for each model… The BMA returns the set of models whose posterior model probability is the highest…When used for pre-selection, BMA runs iteratively through regressors by groups and following a pre-determined pecking order. Starting with the first [set of] regressors in pecking order, the BMA determines the posterior inclusion probability of each regressor. Those with probabilities higher than a threshold are kept while the others are replaced by the next regressors in the pecking order. The BMA is then run on this new batch of regressors, and so on and so forth until all regressors have been assessed.”
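Below is a minimal sketch of the first two screening rules (sure independence screening and t-stat-based selection) using pandas and statsmodels. The function names, the number of retained regressors k, and the inclusion of four lags of the target are illustrative assumptions consistent with the descriptions above.

```python
import pandas as pd
import statsmodels.api as sm

def sis_screen(X: pd.DataFrame, y: pd.Series, k: int = 60) -> list:
    # Sure independence screening: rank regressors by absolute marginal correlation
    corr = X.apply(lambda col: col.corr(y)).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()

def tstat_screen(X: pd.DataFrame, y: pd.Series, k: int = 60, lags: int = 4) -> list:
    # t-stat-based selection: univariate regressions that also include lags of the target
    y_lags = pd.concat({f"y_lag{i}": y.shift(i) for i in range(1, lags + 1)}, axis=1)
    tstats = {}
    for col in X.columns:
        data = pd.concat([y, X[col], y_lags], axis=1).dropna()
        res = sm.OLS(data.iloc[:, 0], sm.add_constant(data.iloc[:, 1:])).fit()
        tstats[col] = abs(res.tvalues[col])
    return pd.Series(tstats).sort_values(ascending=False).head(k).index.tolist()
```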

Step 2: Factor estimation methods

“Factor extraction has been shown to be a potent way to summarize an extensive amount of information in order to achieve parsimony and expel idiosyncratic noise from the data, ultimately leading to better performances in nowcasting…Machine learning techniques are more accurate when used in a factor model rather than when applied directly on all individual series…In addition, factor extraction produces orthogonal variables, thereby alleviating collinearity and enhancing the accuracy.”

“The econometric framework relies on a factor model. Formally, we assume that the pre-selected dataset X_t can be represented by a factor structure, X_t = L·f_t + e_t, with an r-dimensional factor vector f_t, a loadings matrix L, and an idiosyncratic component e_t unexplained by the common factors.”

“[Typically] static factors are extracted via Principal Components Analysis (PCA). PCA assumes that factors and idiosyncratic errors are independent and identically distributed. Factors can be estimated through maximum likelihood and are consistent estimators as long as factors are pervasive and the idiosyncratic dependence and cross-correlation in idiosyncratic error terms is weak. However, if common factors can no longer be assumed to be i.i.d. – most notably when they are serially dependent – the PCA might not be the most efficient factor extraction method as it will ignore this serial dependence. For this reason, alternative techniques for factor extraction are evaluated. [However, the evaluation] shows that accuracy is relatively similar across different techniques. Therefore, we elect PCA which has the double advantage of simplicity and lower computational need.”
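A minimal sketch of the factor-extraction step via PCA is below. It assumes the pre-selected data are standardized column by column before extraction, and the number of factors r is a forecaster choice here; the function name is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def extract_factors(X_sel: pd.DataFrame, r: int = 5):
    # Estimate the factor structure X_t = L*f_t + e_t by principal components:
    # returns the (T x r) factor estimates, the (N x r) loadings matrix, and
    # the share of variance explained by the r factors
    Z = StandardScaler().fit_transform(X_sel)
    pca = PCA(n_components=r).fit(Z)
    factors = pd.DataFrame(pca.transform(Z), index=X_sel.index,
                           columns=[f"f{i+1}" for i in range(r)])
    loadings = pd.DataFrame(pca.components_.T, index=X_sel.columns,
                            columns=factors.columns)
    return factors, loadings, pca.explained_variance_ratio_.sum()
```

The factors returned here are orthogonal by construction, which is the collinearity-reducing property emphasized in the quote above.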

Step 3: Predictive regression methods

“We distinguish between tree-based and regression-based techniques. The first category – tree-based – includes random forest and gradient boosting and is the most popular in the literature.”

“Non-linear techniques – including innovative machine learning techniques – have been used to improve accuracy relative to their linear counterparts particularly during crisis episodes or for volatile variables.”

“The focus [of the regression step is] on non-linear techniques…

  • Markov-switching (MS) that allows model parameters to differ across regimes. MS assumes that unobserved states are determined by a Markov chain. The framework is characterized by transition probabilities describing the likelihood to stay in the same regime or to switch to another. Model parameters are estimated using maximum likelihood…based on expectation-maximization. In the first step, the path of the unobserved variable (latent variable) is estimated. In the second, given the unobserved regime estimated in the first step, model parameters and transition probabilities are estimated. Both steps are iterated until convergence. Owing to its capacity to estimate the state of the business cycle, MS has been widely used in nowcasting…
  • The quantile regression (QR)…The non-linearity comes from the fact that it estimates conditional quantiles of interest of the dependent variable. The framework differs from the OLS in two main ways: (i) coefficients depend on the quantile, and (ii) rather than the sum of squared residuals as in OLS, the QR…gives asymmetric weights to the error depending on the quantile and the sign of the error. This is an extension of OLS which can be used when some conditions of the linear regression are not met (e.g. homoscedasticity, independence, normality). This method is notably employed in the growth-at-risk framework…
  • Random forest (RF) is an ensemble method using a large number of decision trees…Then, by averaging predictions over multiple noisy trees variance of the aggregate prediction is reduced. And since trees can also have relatively low bias, the aggregate prediction can exhibit both low variance and low bias. Key in this technique is the low correlation among trees: this is ensured by (i) growing each tree on a bootstrapped subsample of the initial dataset, and (ii) by restricting the number of variables considered at each node – only a random subset of variables is allowed…
  • Gradient Boosting (GB-T) is another…tree-based method which uses a combination of un-correlated weak learners. But contrary to random forest which averages multiple trees, GB-T works by adding a tree at each iteration. More specifically, the tree is added following the direction that minimizes the loss from the prior model (i.e. following the gradient). A pre-determined number of trees are added, or the algorithm stops once the loss falls below a threshold or no longer improves. Overfitting is alleviated by constraining trees and by stochastic gradient descent in which, at each iteration, only a subsample is used…
    N.B.: A code sketch applying several of these step-3 methods to extracted factors follows this list.
  • Macroeconomic random forest (MRF)…exploits the idea that [standard] random forests are too flexible and therefore might be inefficient for macroeconomic time series with a limited number of observations [and underlying economic cycles and events]. Instead of applying trees on the full sample – as in a random forest – the MRF sets a linear regression…But unlike in linear regression, coefficients of the linear part can vary through time according to a random forest. Formally, coefficients…are estimated based on a set of variables potentially different from the regressors. This can be viewed as a way to discipline the flexibility of the random forest by ensuring some linearity in the model. This adaptation combining random forest with linear regression can also be interpreted as a ‘generalized time-varying parameters’ [model]… The “generalized” comes from the fact that no law of motion (random walk, Markov process) has to be assumed a priori by the forecaster for the time-varying parameters…
  • We use the linear gradient boosting (GB-L) version [of] gradient boosting…The framework is the same as above for the GB-T, but a linear regression is used as the basic weak learner instead of a decision tree. To prevent over-fitting – that could arise more quickly with a linear regression than with a decision tree – the algorithm can include L1 and L2 regularizations [i.e. penalties on large coefficients].”
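The sketch below fits several of these step-3 models on the extracted factors and compares out-of-sample root mean squared errors over a holdout window. It covers OLS, quantile regression, random forest, and tree-based gradient boosting with standard statsmodels and scikit-learn implementations; the macroeconomic random forest and linear gradient boosting require dedicated implementations (for example, a gradient booster with linear base learners) and are omitted here. The function name and the 12-month holdout are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

def compare_step3_models(F: pd.DataFrame, y: pd.Series, test_size: int = 12) -> dict:
    # Hold out the last `test_size` observations and compare RMSEs across models
    F_tr, F_te = F.iloc[:-test_size], F.iloc[-test_size:]
    y_tr, y_te = y.iloc[:-test_size], y.iloc[-test_size:]
    preds = {}
    # OLS on factors (diffusion-index benchmark)
    ols = sm.OLS(y_tr, sm.add_constant(F_tr)).fit()
    preds["OLS"] = ols.predict(sm.add_constant(F_te))
    # Quantile regression at the median (QR)
    qr = sm.QuantReg(y_tr, sm.add_constant(F_tr)).fit(q=0.5)
    preds["QR"] = qr.predict(sm.add_constant(F_te))
    # Tree-based learners (RF and GB-T)
    preds["RF"] = RandomForestRegressor(n_estimators=500, random_state=0).fit(F_tr, y_tr).predict(F_te)
    preds["GB-T"] = GradientBoostingRegressor(random_state=0).fit(F_tr, y_tr).predict(F_te)
    return {m: float(np.sqrt(mean_squared_error(y_te, p))) for m, p in preds.items()}
```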

A practical application

“Estimates for global trade in volumes…are widely used among economists, but…are published around eight weeks after month end [by] the Dutch Centraal Plan Bureau (CPB). [We] exploit…early available information to provide advance estimates of trade in volumes ahead of the CPB releases. To this end, we assemble a large dataset of 600 variables based on the literature on nowcasting trade. Given publication delays for the CPB data, the purpose is not only to predict trade for the current month (“nowcasting”: prediction for month at which the forecaster is) but also in previous months (“back-casting” at months −2 and −1 for which CPB data have not been released yet). We also “forecast” at +1 to assess the informative content of our dataset about future developments.”

“Variables included in our dataset cover broad aspects of the trade outlook. Our target variable is the year-on-year growth rate of world trade from the CPB. Our set of explanatory variables is composed of 536 variables detailed in Annex 2 [of the paper]. To build this dataset, we have taken all variables included at some point in the literature on nowcasting trade.”

“In real-time, asynchronous publication dates across the different variables lead to a ‘ragged-edge’ pattern at the bottom of the dataset. To address this issue, we apply the ‘vertical realignment’ technique to variables that do not have values at the intended date of the forecast. For each variable, the last available point is taken as the contemporaneous value and the entire series is realigned accordingly… This straightforward procedure has been used in various nowcasting applications and has [in some cases] been shown to perform as well as other techniques… In addition to the baseline vertical realignment…we adjust for variables with observations available after the intended date of the forecast.”
N.B.: This vertical realignment is non-standard and would not be suitable if it is important that timestamps or observation periods are aligned. The more common approach is to fill the ragged edge of the dataset with predictions of the missing data based on the available ones. That way, information is combined according to the period of its origin.
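A minimal pandas sketch of the baseline vertical realignment is below: each column is shifted down so that its last available observation sits on the final row of the panel (the intended forecast date). The function name is hypothetical, and the additional adjustment for observations available after the forecast date is not covered.

```python
import numpy as np
import pandas as pd

def vertically_realign(panel: pd.DataFrame) -> pd.DataFrame:
    # Shift each column so that its last available value becomes the contemporaneous one
    n = len(panel)
    realigned = {}
    for col in panel.columns:
        vals = panel[col].dropna().to_numpy()[-n:]
        realigned[col] = np.concatenate([np.full(n - len(vals), np.nan), vals])
    return pd.DataFrame(realigned, index=panel.index)
```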

“The comparison across techniques is run out-of-sample on post-GFC [great financial crisis] trade, from January 2012 to April 2022 in a close-to-real-time set-up, as pre-selection, factor extraction and model parameters are re-estimated at each point. This aims at mimicking what a forecaster would have been capable of achieving with the information at his disposal at the time of the forecast. Hence, pre-selection is performed only with the in-sample data.”

“It should be noted that for machine learning methods, such a real-time set-up entails also the re-calibration of hyper-parameters – i.e. parameters that are not estimated by the model, but instead set by the forecaster, e.g. the number of trees in a random forest – at each date. To do so, we perform a cross-validation adapted for time series: the in-sample period is split between a “test” sample (the last 12 monthly observations) and a ‘train’ sample (the rest of the data).”
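The sketch below illustrates such a re-calibration for a single hyper-parameter (the number of trees in a random forest), with the last 12 in-sample observations serving as the ‘test’ sample and the rest as the ‘train’ sample; the grid and function name are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def tune_n_trees(F_in, y_in, grid=(100, 300, 500), test_size=12) -> int:
    # Time-series split: train on all but the last 12 months, test on the last 12
    F_tr, F_te = F_in[:-test_size], F_in[-test_size:]
    y_tr, y_te = y_in[:-test_size], y_in[-test_size:]
    rmse = {}
    for n in grid:
        fit = RandomForestRegressor(n_estimators=n, random_state=0).fit(F_tr, y_tr)
        rmse[n] = np.sqrt(mean_squared_error(y_te, fit.predict(F_te)))
    return min(rmse, key=rmse.get)  # hyper-parameter with the lowest test RMSE
```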

Empirical findings

“The three-step approach outperforms benchmarks significantly and consistently…It outperforms in particular the widely used ‘diffusion index’ method of Stock and Watson (2002) that uses two steps: factor extraction via Principal Components Analysis (PCA) and OLS regression on these factors. Our approach also outperforms a dynamic factor model, a technique widely used in the nowcasting literature. We also show that both pre-selection and factor extraction improve the accuracy of machine-learning techniques.”

“Looking at the gains step by step, we show that (i) pre-selecting regressors can enhance performances by around 10-15% on average and up to 40%; (ii) factor extraction entails around 10-15% further gains, including also for machine learning techniques despite the ability of such methods to accommodate for high-dimensional datasets; and (iii) using machine learning techniques can further improve the accuracy by around 15-20%.”

“The best-performing triplet is formed by the Least Angle Regression (LARS) for pre-selection, principal component analysis (PCA) for factor extraction, and the Macroeconomic Random Forest (MRF) for prediction.”

“The regression-based techniques – macroeconomic random forest and linear gradient boosting – provide the most accurate predictions…Regression-based ML methods significantly outperform tree-based methods despite sharing a similar framework, as the regression-based ML methods are generally an adaptation of their tree-based counterparts. This suggests that in the short time samples common in macroeconomics, forecasts might be better when relying on regression-based ML methods…Regression-based ML methods outperform…also more ‘traditional’ non-linear techniques (Markov-switching and quantile regression) and Ordinary Least Squares (OLS). They do so significantly and consistently across different horizons, real-time datasets, and states of the economy.”

Editor
Ralph Sueppel is managing director for research and trading strategies at Macrosynergy. He has worked in economics and finance since the early 1990s for investment banks, the European Central Bank, and leading hedge funds.