# Backtesting (rolling/expanding)

This page documents the `srvar backtest` command, which runs a reproducible rolling/expanding refit + forecast evaluation loop from a YAML config.

## Why backtest?

A backtest answers questions like:

- How stable are forecasts across time?
- Are predictive intervals well calibrated (coverage / PIT)?
- How does forecast accuracy evolve with horizon (CRPS/RMSE/MAE)?

## CLI usage

From the repository root:

```bash
# Validate the config (including backtest/evaluation blocks)
srvar validate config/backtest_demo_config.yaml

# Run backtest
srvar backtest config/backtest_demo_config.yaml

# Override output directory
srvar backtest config/backtest_demo_config.yaml --out outputs/my_backtest

# Useful flags
srvar backtest config/backtest_demo_config.yaml --quiet
srvar backtest config/backtest_demo_config.yaml --no-color
srvar backtest config/backtest_demo_config.yaml --verbose

# Paired legacy-vs-canonical Minnesota comparison
python scripts/compare_minnesota_backtests.py config/backtest_demo_config.yaml \
    --out-root outputs/minnesota_comparison
```

The comparison script writes paired `baseline/` and `candidate/` backtest directories plus a `metrics_comparison.csv` file built from the two `metrics.csv` tables.

To consolidate multiple paired comparison bundles into one table, run:

```bash
python scripts/summarize_minnesota_comparisons.py \
    --root outputs/minnesota_comparison \
    --out-csv outputs/minnesota_comparison/summary.csv \
    --out-md outputs/minnesota_comparison/summary.md
```

This writes one row per benchmark with baseline/candidate metric means and deltas.

To inspect one paired comparison bundle variable by variable across horizons, run:

```bash
python scripts/summarize_metrics_comparison_by_variable.py \
    outputs/minnesota_comparison/vintage_macro15_homoskedastic/metrics_comparison.csv
```

This writes `variable_summary.csv` and `variable_summary.md` alongside the input comparison file.
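The delta construction behind a `metrics_comparison.csv` can be illustrated with a minimal sketch. The join key `(variable, horizon)` and the column names below are illustrative assumptions, not the actual schema the comparison script emits:

```python
# Hypothetical sketch: assemble a per-(variable, horizon) delta table from two
# metrics tables. Column names are assumptions, not the real metrics.csv schema.

def compare_metrics(baseline_rows, candidate_rows, metric="crps"):
    """Join two lists of metric dicts on (variable, horizon) and add deltas."""
    candidate_by_key = {
        (row["variable"], row["horizon"]): row for row in candidate_rows
    }
    comparison = []
    for base in baseline_rows:
        key = (base["variable"], base["horizon"])
        cand = candidate_by_key.get(key)
        if cand is None:
            continue  # skip keys missing from the candidate run
        comparison.append({
            "variable": base["variable"],
            "horizon": base["horizon"],
            f"{metric}_baseline": base[metric],
            f"{metric}_candidate": cand[metric],
            # Negative delta => candidate scored lower (better, for CRPS).
            f"{metric}_delta": cand[metric] - base[metric],
        })
    return comparison

baseline = [{"variable": "CPIAUCSL", "horizon": 1, "crps": 0.42}]
candidate = [{"variable": "CPIAUCSL", "horizon": 1, "crps": 0.40}]
rows = compare_metrics(baseline, candidate)
```

The downstream summarizers then reduce such a table to per-benchmark means and per-variable slices.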
If the paired backtests were run with saved forecast artifacts, you can also compare predictive dispersion directly:

```bash
python scripts/compare_minnesota_backtests.py config/vintage_macro15_backtest_diagonal_sv.yaml \
    --out-root outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv \
    --save-forecasts

python scripts/compare_forecast_dispersion.py \
    outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/baseline/forecasts \
    outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/candidate/forecasts
```

This writes `forecast_dispersion_summary.csv` and `forecast_dispersion_summary.md` with variable/horizon averages for predictive standard deviation and central interval widths.

To compare forecast means against realized outcomes origin by origin from saved forecast bundles:

```bash
python scripts/compare_forecast_means_to_realized.py \
    outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/baseline \
    outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/candidate \
    --variables EXUSUK PPIACO CPIAUCSL M2SL
```

This writes `forecast_mean_summary.csv` and `forecast_mean_summary.md` with mean forecast centers, signed errors, absolute errors, and counts of origins where the candidate is closer to the realized value.
For a narrow origin-level deep dive, filter specific cases and write the raw rows:

```bash
python scripts/compare_forecast_means_to_realized.py \
    outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/baseline \
    outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/candidate \
    --cases EXUSUK:1 EXUSUK:2 CPIAUCSL:4 PPIACO:2 M2SL:2 \
    --out-detail-csv outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/case_detail.csv \
    --out-detail-md outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/case_detail.md
```

To reproduce one scheduled origin as paired baseline/candidate fits, save the fit artifacts, and compare coefficient means plus end-of-sample volatility states directly:

```bash
python scripts/diagnose_minnesota_origin.py \
    config/vintage_macro15_backtest_diagonal_sv.yaml \
    --out-root outputs/minnesota_origin_diagnostics/vintage_macro15_diagonal_sv_2009q1 \
    --origin-date 2009-01-01 \
    --variables EXUSUK PPIACO CPIAUCSL M2SL \
    --horizons 1 2 4
```

This writes paired `baseline/` and `candidate/` run directories with `fit_result.npz` and `forecast_result.npz`, plus `state_comparison.csv`, `forecast_comparison.csv`, `beta_mean_comparison.csv`, and compact Markdown summaries for the state table, forecast table, and top coefficient deltas.

To compare the full posterior draw distributions for selected coefficients from those paired fit artifacts, run:

```bash
python scripts/compare_fit_coefficients.py \
    outputs/minnesota_origin_diagnostics/vintage_macro15_diagonal_sv_2009q1/baseline \
    outputs/minnesota_origin_diagnostics/vintage_macro15_diagonal_sv_2009q1/candidate \
    --cases \
    EXUSUK:const EXUSUK:HOUST_lag1 EXUSUK:GS10_lag1 EXUSUK:FEDFUNDS_lag1 \
    PPIACO:const PPIACO:HOUST_lag2 PPIACO:GS10_lag1 PPIACO:CPIAUCSL_lag1
```

This writes `fit_coefficient_summary.csv` and `fit_coefficient_summary.md` with posterior means, central quantiles, sign probabilities, and simple overlap/flip diagnostics.
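The per-coefficient draw diagnostics can be sketched from first principles: a posterior mean, central quantiles, a sign probability, and a "flip" flag when the two runs disagree on the dominant sign. The exact definitions in `compare_fit_coefficients.py` may differ; this is only a hypothetical illustration:

```python
# Hypothetical sketch of per-coefficient posterior-draw diagnostics.

def draw_summary(draws, lo=0.05, hi=0.95):
    """Mean, crude empirical central quantiles, and P(coefficient > 0)."""
    s = sorted(draws)
    n = len(s)

    def q(p):
        return s[min(n - 1, int(p * n))]

    return {
        "mean": sum(s) / n,
        "q_lo": q(lo),
        "q_hi": q(hi),
        "p_pos": sum(d > 0 for d in s) / n,
    }

def sign_flip(base_draws, cand_draws, threshold=0.5):
    """Flag a flip when the runs disagree on which sign is more probable."""
    p_base = draw_summary(base_draws)["p_pos"]
    p_cand = draw_summary(cand_draws)["p_pos"]
    return (p_base > threshold) != (p_cand > threshold)

flip = sign_flip([0.2, 0.3, -0.1, 0.4], [-0.2, -0.3, 0.1, -0.4])
```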
Optional detail outputs can be requested to export the long-format draw table.

To inspect the prior-scale driver directly at one scheduled origin, compare the legacy and canonical Minnesota coefficient variances:

```bash
python scripts/diagnose_minnesota_prior_scales.py \
    config/vintage_macro15_backtest_diagonal_sv.yaml \
    --out-root outputs/minnesota_prior_scale_diagnostics/vintage_macro15_diagonal_sv_2009q1 \
    --origin-date 2009-01-01 \
    --cases \
    EXUSUK:const EXUSUK:HOUST_lag1 EXUSUK:GS10_lag1 EXUSUK:FEDFUNDS_lag1 EXUSUK:EXUSUK_lag1 \
    PPIACO:const PPIACO:HOUST_lag2 PPIACO:GS10_lag1 PPIACO:CPIAUCSL_lag1 PPIACO:PPIACO_lag1
```

This writes `prior_scale_comparison.csv` plus compact Markdown summaries showing how much looser or tighter the canonical prior is for each selected coefficient, along with the corresponding `sigma2` scale terms and the closed-form expected ratio implied by the legacy-vs-canonical construction.

For a bounded mitigation experiment, you can also run a three-way origin comparison with a tempered bridge prior between legacy and canonical Minnesota:

```bash
python scripts/experiment_tempered_minnesota_origin.py \
    config/vintage_macro15_backtest_diagonal_sv.yaml \
    --out-root outputs/minnesota_tempered_origin/vintage_macro15_diagonal_sv_2009q1_alpha05 \
    --origin-date 2009-01-01 \
    --alpha 0.5 \
    --variables EXUSUK PPIACO CPIAUCSL M2SL \
    --horizons 1 2 4
```

This writes paired `baseline/`, `canonical/`, and `tempered/` fit/forecast artifacts plus three-way `forecast_comparison.csv`, `state_comparison.csv`, and `beta_comparison.csv` tables. `alpha=0` reproduces the legacy variance map, while `alpha=1` reproduces the canonical variance map. This diagnostic experiment currently requires a diagonal-SV backtest config.
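One natural bridge that satisfies the stated endpoints is a log-linear (geometric) interpolation between the two variance maps. The interpolation form itself is an assumption for illustration; the text above only guarantees the endpoint behavior, and the script's actual map may differ:

```python
# Illustrative tempered bridge between two prior variance maps. Only the
# endpoints (alpha=0 -> legacy, alpha=1 -> canonical) come from the docs;
# the geometric form is an assumption, not necessarily what
# experiment_tempered_minnesota_origin.py implements.

def tempered_variance(v_legacy, v_canonical, alpha):
    """Log-linear interpolation between two positive prior variances."""
    assert 0.0 <= alpha <= 1.0 and v_legacy > 0 and v_canonical > 0
    return (v_legacy ** (1.0 - alpha)) * (v_canonical ** alpha)

v0 = tempered_variance(0.04, 0.25, 0.0)      # endpoint: legacy variance
v1 = tempered_variance(0.04, 0.25, 1.0)      # endpoint: canonical variance
v_half = tempered_variance(0.04, 0.25, 0.5)  # geometric mean in between
```

A geometric bridge keeps the tempered variance strictly between the two endpoints on a log scale, which is the usual choice when the two priors differ by multiplicative scale factors.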
For a larger fully local benchmark dataset, first build the quarterly term-spread/NFCI panel and then run the tracked config:

```bash
python scripts/prepare_term_nfci_benchmark.py --out data/cache/term_nfci_quarterly.csv
srvar backtest config/term_nfci_backtest.yaml --out outputs/term_nfci_backtest
srvar backtest config/term_nfci_backtest_homoskedastic.yaml \
    --out outputs/term_nfci_backtest_homoskedastic

python scripts/prepare_term_nfci_wuxia_benchmark.py --out data/cache/term_nfci_wuxia_quarterly.csv
srvar backtest config/term_nfci_wuxia_backtest.yaml --out outputs/term_nfci_wuxia_backtest

python scripts/prepare_vintage_macro15_benchmark.py \
    --vintage 2022Q3 \
    --out data/cache/vintage_macro15_quarterly.csv
srvar backtest config/vintage_macro15_backtest_homoskedastic.yaml \
    --out outputs/vintage_macro15_backtest_homoskedastic
srvar backtest config/vintage_macro15_backtest_diagonal_sv.yaml \
    --out outputs/vintage_macro15_backtest_diagonal_sv
```

The vintage-based benchmark uses the repo’s local quarterly vintage workbooks, fixes the source vintage at `2022Q3`, and applies simple benchmark transforms before backtesting. Tracked companion configs are provided for both homoskedastic and diagonal-SV comparisons.
## YAML schema (high level)

Backtesting uses the standard model blocks plus two additional sections:

- `backtest`: defines origins, refit policy, horizons, and forecast draw settings
- `evaluation`: defines which diagnostics/plots/exports to produce

See the comment-rich template:

- `config/backtest_demo_config.yaml`

### `backtest`

Common keys:

- `mode`: `expanding` or `rolling`
- `min_obs`: minimum training sample size at the first origin
- `step`: how far to advance the origin each iteration
- `horizons`: list of horizons to evaluate
- `draws`: predictive simulation draws per origin
- `quantile_levels`: quantiles to compute from draws

### `evaluation`

Common keys:

- `coverage`: empirical interval coverage by horizon
- `pit`: PIT histograms for selected variables/horizons
- `crps`: CRPS-by-horizon diagnostic
- `elb_censor`: ELB-censored evaluation (floor realized values and optionally forecasts)
- `metrics_table`: if true, writes `metrics.csv`

For details on scoring conventions (ELB censoring, latent-vs-observed scoring, and horizon semantics), see {doc}`evaluation`.

## Outputs

Backtest artifacts are written under `output.out_dir` (or `--out`):

- `config.yml`: exact config used
- `metrics.csv`: per-variable, per-horizon summary metrics
- `coverage_all.png`: coverage vs horizon averaged across variables
- `coverage_<var>.png`: coverage vs horizon for each variable
- `pit_<var>_h<h>.png`: PIT histograms for selected variables/horizons
- `crps_by_horizon.png`: CRPS aggregated by horizon
- `backtest_summary.json`: summary metadata (mode, origins, horizons, elapsed time)

Notes:

- `metrics.csv` includes horizons `1..max(backtest.horizons)` (even if `backtest.horizons` is sparse).

## ELB-censored evaluation

Some macro forecast evaluations treat interest rates as **censored at an effective lower bound (ELB)** when scoring forecasts (e.g., to match shadow-rate VAR conventions in the literature). Backtesting supports an **evaluation-time** ELB censoring step via `evaluation.elb_censor`.
This is distinct from `model.elb`:

- `model.elb`: affects estimation/forecasting (latent shadow rate + observed floor)
- `evaluation.elb_censor`: affects *only* how realized values and/or forecast draws are scored

Example:

```yaml
evaluation:
  elb_censor:
    enabled: true
    bound: 0.25
    variables: ["FEDFUNDS"]
    censor_realized: true
    censor_forecasts: false
```

Notes:

- When `censor_realized: true`, realized values are replaced by `max(y, bound)` for the selected variables.
- When `censor_forecasts: true`, forecast draws are floored at `bound` for the selected variables *before* metrics/plots are computed.

## Disabling diagnostics

The `enabled` flags control both **computation** and **outputs**:

- `evaluation.coverage.enabled: false` skips coverage computation and omits `coverage_*` columns/plots.
- `evaluation.crps.enabled: false` skips CRPS computation and writes `crps=NaN` in `metrics.csv`.
- `evaluation.pit.enabled: false` skips PIT plots.

## Memory and scaling

Backtests can be memory-intensive if you retain all per-origin forecasts in RAM (especially with many origins, draws, variables, and horizons). Control retention with:

```yaml
output:
  # If true, keeps all per-origin forecasts in memory (required for plots).
  # If false, metrics are computed in a streaming way (plots not supported yet).
  store_forecasts_in_memory: false
```

Recommended patterns:

- Heavy runs: `save_plots: false` + `store_forecasts_in_memory: false` (fast + memory-light; writes `metrics.csv`).
- Diagnostic runs: `save_plots: true` + `store_forecasts_in_memory: true` (plots + full retention).

## Interpreting the diagnostics

### Coverage

Coverage plots compare **empirical** coverage (y-axis) against the **nominal interval** in the legend.

- Above nominal: intervals are conservative (too wide)
- Below nominal: intervals are too narrow (overconfident)

### PIT

PIT histograms should be approximately uniform under calibration.
- U-shaped: predictive distribution too narrow
- Inverted-U: predictive distribution too wide
- Skew: biased forecasts

### CRPS

CRPS is a proper scoring rule for probabilistic forecasts.

- Lower is better
- CRPS typically increases with horizon
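Each of the three diagnostics above can be computed from a set of forecast draws and a realized value. A minimal self-contained sketch using the textbook definitions (not the package's internal implementation):

```python
# Textbook single-origin versions of the three diagnostics. Empirical coverage
# for a horizon is just the average of interval_hit over origins.

def interval_hit(draws, realized, level=0.9):
    """Did the central `level` interval of the draws cover the realized value?"""
    s = sorted(draws)
    n = len(s)
    lo = s[int((1 - level) / 2 * n)]
    hi = s[min(n - 1, int((1 + level) / 2 * n))]
    return lo <= realized <= hi

def pit(draws, realized):
    """Empirical PIT: fraction of draws at or below the realized value."""
    return sum(d <= realized for d in draws) / len(draws)

def crps(draws, realized):
    """Sample CRPS: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(draws)
    term1 = sum(abs(d - realized) for d in draws) / n
    term2 = sum(abs(a - b) for a in draws for b in draws) / (n * n)
    return term1 - 0.5 * term2
```

For a well-calibrated forecaster, `interval_hit` averages to the nominal level across origins, `pit` values look uniform on [0, 1], and `crps` shrinks as the predictive distribution concentrates on the realized value.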