Backtesting (rolling/expanding)

This page documents the srvar backtest command, which runs a reproducible rolling/expanding refit + forecast evaluation loop from a YAML config.

Why backtest?

A backtest answers questions like:

  • How stable are forecasts across time?

  • Are predictive intervals well calibrated (coverage / PIT)?

  • How does forecast accuracy evolve with horizon (CRPS/RMSE/MAE)?

CLI usage

From the repository root:

# Validate the config (including backtest/evaluation blocks)
srvar validate config/backtest_demo_config.yaml

# Run backtest
srvar backtest config/backtest_demo_config.yaml

# Override output directory
srvar backtest config/backtest_demo_config.yaml --out outputs/my_backtest

# Useful flags
srvar backtest config/backtest_demo_config.yaml --quiet
srvar backtest config/backtest_demo_config.yaml --no-color
srvar backtest config/backtest_demo_config.yaml --verbose

# Paired legacy-vs-canonical Minnesota comparison
python scripts/compare_minnesota_backtests.py config/backtest_demo_config.yaml \
  --out-root outputs/minnesota_comparison

The comparison script writes paired baseline/ and candidate/ backtest directories plus a metrics_comparison.csv file built from the two metrics.csv tables.
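The merge-and-delta step behind metrics_comparison.csv can be sketched with pandas. The column names used here (variable, horizon, rmse, mae, crps) are assumptions about the metrics.csv schema for illustration, not the tool's guaranteed layout:

```python
# Sketch: pair two metrics.csv tables and compute candidate-minus-baseline
# deltas. Column names are illustrative assumptions, not the exact schema.
import pandas as pd

def build_comparison(baseline_csv, candidate_csv) -> pd.DataFrame:
    base = pd.read_csv(baseline_csv)
    cand = pd.read_csv(candidate_csv)
    # Align on the (variable, horizon) key shared by both tables.
    merged = base.merge(cand, on=["variable", "horizon"],
                        suffixes=("_baseline", "_candidate"))
    for metric in ("rmse", "mae", "crps"):
        merged[f"{metric}_delta"] = (
            merged[f"{metric}_candidate"] - merged[f"{metric}_baseline"]
        )
    return merged
```

A negative delta means the candidate scored lower (better) on that metric for that variable/horizon cell.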

To consolidate multiple paired comparison bundles into one table, run:

python scripts/summarize_minnesota_comparisons.py \
  --root outputs/minnesota_comparison \
  --out-csv outputs/minnesota_comparison/summary.csv \
  --out-md outputs/minnesota_comparison/summary.md

This writes one row per benchmark with baseline/candidate metric means and deltas.

To inspect one paired comparison bundle variable by variable across horizons, run:

python scripts/summarize_metrics_comparison_by_variable.py \
  outputs/minnesota_comparison/vintage_macro15_homoskedastic/metrics_comparison.csv

This writes variable_summary.csv and variable_summary.md alongside the input comparison file.

If the paired backtests were run with saved forecast artifacts, you can also compare predictive dispersion directly:

python scripts/compare_minnesota_backtests.py config/vintage_macro15_backtest_diagonal_sv.yaml \
  --out-root outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv \
  --save-forecasts

python scripts/compare_forecast_dispersion.py \
  outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/baseline/forecasts \
  outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/candidate/forecasts

This writes forecast_dispersion_summary.csv and forecast_dispersion_summary.md with variable/horizon averages for predictive standard deviation and central interval widths.

To compare forecast means against realized outcomes, origin by origin, using saved forecast bundles:

python scripts/compare_forecast_means_to_realized.py \
  outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/baseline \
  outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/candidate \
  --variables EXUSUK PPIACO CPIAUCSL M2SL

This writes forecast_mean_summary.csv and forecast_mean_summary.md with mean forecast centers, signed errors, absolute errors, and counts of origins where the candidate is closer to the realized value.

For a narrow origin-level deep dive, filter specific cases and write the raw rows:

python scripts/compare_forecast_means_to_realized.py \
  outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/baseline \
  outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/candidate \
  --cases EXUSUK:1 EXUSUK:2 CPIAUCSL:4 PPIACO:2 M2SL:2 \
  --out-detail-csv outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/case_detail.csv \
  --out-detail-md outputs/minnesota_diagnostics/vintage_macro15_diagonal_sv/case_detail.md

To reproduce one scheduled origin as paired baseline/candidate fits, save the fit artifacts, and compare coefficient means plus end-of-sample volatility states directly:

python scripts/diagnose_minnesota_origin.py \
  config/vintage_macro15_backtest_diagonal_sv.yaml \
  --out-root outputs/minnesota_origin_diagnostics/vintage_macro15_diagonal_sv_2009q1 \
  --origin-date 2009-01-01 \
  --variables EXUSUK PPIACO CPIAUCSL M2SL \
  --horizons 1 2 4

This writes paired baseline/ and candidate/ run directories with fit_result.npz and forecast_result.npz, plus state_comparison.csv, forecast_comparison.csv, beta_mean_comparison.csv, and compact Markdown summaries for the state table, forecast table, and top coefficient deltas.

To compare the full posterior draw distributions for selected coefficients from those paired fit artifacts, run:

python scripts/compare_fit_coefficients.py \
  outputs/minnesota_origin_diagnostics/vintage_macro15_diagonal_sv_2009q1/baseline \
  outputs/minnesota_origin_diagnostics/vintage_macro15_diagonal_sv_2009q1/candidate \
  --cases \
    EXUSUK:const EXUSUK:HOUST_lag1 EXUSUK:GS10_lag1 EXUSUK:FEDFUNDS_lag1 \
    PPIACO:const PPIACO:HOUST_lag2 PPIACO:GS10_lag1 PPIACO:CPIAUCSL_lag1

This writes fit_coefficient_summary.csv and fit_coefficient_summary.md with posterior means, central quantiles, sign probabilities, and simple overlap/flip diagnostics. Optional detail outputs can be requested to export the long-format draw table.

To inspect the prior-scale driver directly at one scheduled origin, compare the legacy and canonical Minnesota coefficient variances:

python scripts/diagnose_minnesota_prior_scales.py \
  config/vintage_macro15_backtest_diagonal_sv.yaml \
  --out-root outputs/minnesota_prior_scale_diagnostics/vintage_macro15_diagonal_sv_2009q1 \
  --origin-date 2009-01-01 \
  --cases \
    EXUSUK:const EXUSUK:HOUST_lag1 EXUSUK:GS10_lag1 EXUSUK:FEDFUNDS_lag1 EXUSUK:EXUSUK_lag1 \
    PPIACO:const PPIACO:HOUST_lag2 PPIACO:GS10_lag1 PPIACO:CPIAUCSL_lag1 PPIACO:PPIACO_lag1

This writes prior_scale_comparison.csv plus compact Markdown summaries showing how much looser or tighter the canonical prior is for each selected coefficient. It also reports the corresponding sigma2 scale terms and the closed-form expected ratio implied by the legacy-vs-canonical construction.

For a bounded mitigation experiment, you can also run a three-way origin comparison with a tempered bridge prior between legacy and canonical Minnesota:

python scripts/experiment_tempered_minnesota_origin.py \
  config/vintage_macro15_backtest_diagonal_sv.yaml \
  --out-root outputs/minnesota_tempered_origin/vintage_macro15_diagonal_sv_2009q1_alpha05 \
  --origin-date 2009-01-01 \
  --alpha 0.5 \
  --variables EXUSUK PPIACO CPIAUCSL M2SL \
  --horizons 1 2 4

This writes paired baseline/, canonical/, and tempered/ fit/forecast artifacts plus three-way forecast_comparison.csv, state_comparison.csv, and beta_comparison.csv tables. alpha=0 reproduces the legacy variance map, while alpha=1 reproduces the canonical variance map. This diagnostic experiment currently requires a diagonal-SV backtest config.
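One bridge with the endpoint properties described above is a geometric interpolation of the two prior-variance maps. The sketch below is an illustrative assumption about the tempering rule, not necessarily the script's actual formula:

```python
import numpy as np

def tempered_prior_variance(v_legacy, v_canonical, alpha):
    """Geometric bridge between two prior-variance maps.

    alpha=0 returns the legacy variances and alpha=1 the canonical
    ones, matching the endpoints described in the docs. The geometric
    form itself is an assumption; the script may interpolate differently.
    """
    v_legacy = np.asarray(v_legacy, dtype=float)
    v_canonical = np.asarray(v_canonical, dtype=float)
    return v_legacy ** (1.0 - alpha) * v_canonical ** alpha
```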

For a larger fully local benchmark dataset, first build the quarterly term-spread/NFCI panel and then run the tracked config:

python scripts/prepare_term_nfci_benchmark.py --out data/cache/term_nfci_quarterly.csv
srvar backtest config/term_nfci_backtest.yaml --out outputs/term_nfci_backtest
srvar backtest config/term_nfci_backtest_homoskedastic.yaml \
  --out outputs/term_nfci_backtest_homoskedastic
python scripts/prepare_term_nfci_wuxia_benchmark.py --out data/cache/term_nfci_wuxia_quarterly.csv
srvar backtest config/term_nfci_wuxia_backtest.yaml --out outputs/term_nfci_wuxia_backtest
python scripts/prepare_vintage_macro15_benchmark.py \
  --vintage 2022Q3 \
  --out data/cache/vintage_macro15_quarterly.csv
srvar backtest config/vintage_macro15_backtest_homoskedastic.yaml \
  --out outputs/vintage_macro15_backtest_homoskedastic
srvar backtest config/vintage_macro15_backtest_diagonal_sv.yaml \
  --out outputs/vintage_macro15_backtest_diagonal_sv

The vintage-based benchmark uses the repo’s local quarterly vintage workbooks, fixes the source vintage at 2022Q3, and applies simple benchmark transforms before backtesting. Tracked companion configs are provided for both homoskedastic and diagonal-SV comparisons.

YAML schema (high level)

Backtesting uses the standard model blocks plus two additional sections:

  • backtest: defines origins, refit policy, horizons, and forecast draw settings

  • evaluation: defines which diagnostics/plots/exports to produce

See the comment-rich template:

  • config/backtest_demo_config.yaml

backtest

Common keys:

  • mode: expanding or rolling

  • min_obs: minimum training sample size at the first origin

  • step: how far to advance the origin each iteration

  • horizons: list of horizons to evaluate

  • draws: predictive simulation draws per origin

  • quantile_levels: quantiles to compute from draws
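Putting these keys together, a minimal backtest block might look like the following (values are illustrative, not recommendations; see config/backtest_demo_config.yaml for the tracked template):

```yaml
backtest:
  mode: expanding          # or: rolling
  min_obs: 80              # first origin requires at least 80 training observations
  step: 1                  # advance the origin one period per iteration
  horizons: [1, 2, 4, 8]
  draws: 2000              # predictive simulation draws per origin
  quantile_levels: [0.05, 0.16, 0.5, 0.84, 0.95]
```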

evaluation

Common keys:

  • coverage: empirical interval coverage by horizon

  • pit: PIT histograms for selected variables/horizons

  • crps: CRPS-by-horizon diagnostic

  • elb_censor: ELB-censored evaluation (floor realized values and optionally forecasts)

  • metrics_table: if true, writes metrics.csv
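A minimal evaluation block consistent with these keys and the enabled flags described below might look like this (layout illustrative; the tracked template in config/backtest_demo_config.yaml is authoritative):

```yaml
evaluation:
  coverage:
    enabled: true
  pit:
    enabled: true
  crps:
    enabled: true
  metrics_table: true
```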

For details on scoring conventions (ELB censoring, latent-vs-observed scoring, and horizon semantics), see Evaluation and scoring conventions.

Outputs

Backtest artifacts are written under output.out_dir (or --out):

  • config.yml: exact config used

  • metrics.csv: per-variable, per-horizon summary metrics

  • coverage_all.png: coverage vs horizon averaged across variables

  • coverage_<var>.png: coverage vs horizon for each variable

  • pit_<var>_h<h>.png: PIT histograms for selected variables/horizons

  • crps_by_horizon.png: CRPS aggregated by horizon

  • backtest_summary.json: summary metadata (mode, origins, horizons, elapsed time)

Notes:

  • metrics.csv includes horizons 1..max(backtest.horizons) (even if backtest.horizons is sparse).

ELB-censored evaluation

Some macro forecast evaluations treat interest rates as censored at an effective lower bound (ELB) when scoring forecasts (e.g., to match shadow-rate VAR conventions in the literature).

Backtesting supports an evaluation-time ELB censoring step via evaluation.elb_censor. This is distinct from model.elb:

  • model.elb: affects estimation/forecasting (latent shadow rate + observed floor)

  • evaluation.elb_censor: affects only how realized values and/or forecast draws are scored

Example:

evaluation:
  elb_censor:
    enabled: true
    bound: 0.25
    variables: ["FEDFUNDS"]
    censor_realized: true
    censor_forecasts: false

Notes:

  • When censor_realized: true, realized values are replaced by max(y, bound) for the selected variables.

  • When censor_forecasts: true, forecast draws are floored at bound for the selected variables before metrics/plots are computed.
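Both censoring steps amount to elementwise flooring at the bound, as in this NumPy sketch:

```python
import numpy as np

def censor_realized(y, bound):
    # With censor_realized: true, realized values become max(y, bound).
    return np.maximum(y, bound)

def censor_forecast_draws(draws, bound):
    # With censor_forecasts: true, forecast draws are floored at the bound
    # before metrics and plots are computed.
    return np.maximum(draws, bound)
```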

Disabling diagnostics

The enabled flags control both computation and outputs:

  • evaluation.coverage.enabled: false skips coverage computation and omits coverage_* columns/plots.

  • evaluation.crps.enabled: false skips CRPS computation and writes crps=NaN in metrics.csv.

  • evaluation.pit.enabled: false skips PIT plots.

Memory and scaling

Backtests can be memory-intensive if you retain all per-origin forecasts in RAM (especially with many origins, draws, variables, and horizons).

Control retention with:

output:
  # If true, keeps all per-origin forecasts in memory (required for plots).
  # If false, metrics are computed in a streaming way (plots not supported yet).
  store_forecasts_in_memory: false

Recommended patterns:

  • Heavy runs: save_plots: false + store_forecasts_in_memory: false (fast + memory-light; writes metrics.csv).

  • Diagnostic runs: save_plots: true + store_forecasts_in_memory: true (plots + full retention).

Interpreting the diagnostics

Coverage

Coverage plots compare empirical coverage (y-axis) against the nominal level of the interval shown in the legend.

  • Above nominal: intervals are conservative (too wide)

  • Below nominal: intervals are too narrow (overconfident)
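Empirical coverage of a central interval can be computed from forecast draws as in this NumPy sketch (the array shapes are assumptions for illustration):

```python
import numpy as np

def empirical_coverage(draws, realized, level=0.9):
    """Fraction of origins whose realized value falls inside the central
    `level` interval of the forecast draws.

    draws: (n_origins, n_draws); realized: (n_origins,).
    """
    lo_q, hi_q = (1 - level) / 2, 1 - (1 - level) / 2
    lo = np.quantile(draws, lo_q, axis=1)
    hi = np.quantile(draws, hi_q, axis=1)
    return np.mean((realized >= lo) & (realized <= hi))
```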

PIT

PIT histograms should be approximately uniform under calibration.

  • U-shaped: predictive distribution too narrow

  • Inverted-U: predictive distribution too wide

  • Skew: biased forecasts
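Empirical PIT values can be estimated from draws as the fraction of draws at or below the realized value; a minimal NumPy sketch:

```python
import numpy as np

def pit_values(draws, realized):
    """Empirical PIT per origin: the fraction of forecast draws at or
    below the realized value. draws: (n_origins, n_draws);
    realized: (n_origins,). Approximately uniform PIT values across
    origins indicate a calibrated predictive distribution."""
    return np.mean(draws <= realized[:, None], axis=1)
```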

CRPS

CRPS is a proper scoring rule for probabilistic forecasts.

  • Lower is better

  • The plot typically increases with horizon, since predictive uncertainty accumulates at longer horizons
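A standard sample-based CRPS estimator, E|X − y| − ½ E|X − X′| with the expectations taken over the forecast draws, can be sketched as:

```python
import numpy as np

def crps_from_draws(draws, y):
    """Sample-based CRPS estimate for one variable/origin/horizon cell:
    mean |X - y| minus half the mean pairwise distance between draws.
    Lower is better; a point forecast at value c scores exactly |c - y|."""
    draws = np.asarray(draws, dtype=float)
    term1 = np.mean(np.abs(draws - y))
    term2 = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
    return term1 - term2
```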