Methodology
The methodology — walk-forward, leakage-clean, deterministic — is the project's deliverable. Every detail below was tightened in response to a concrete failure documented in the carry-forward learnings (L1-L54).
Walk-forward only
No random train/test splits. Models are trained on a strict time prefix and evaluated on the strictly later holdout. Walk-forward retrains every 5 days during the holdout, so each prediction sees only information available at decision time.
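A minimal sketch of how such a split schedule could be generated. The function name `walk_forward_splits` and its parameters are hypothetical, and it assumes an expanding training prefix (everything before the decision day); the project may instead use a rolling window.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_days: int, holdout: int, retrain_every: int = 5
) -> Iterator[Tuple[range, range]]:
    """Yield (train_days, test_days) index ranges for a walk-forward run.

    Training always uses a strict time prefix; each test block is the next
    `retrain_every` days (fewer at the very end of the holdout).
    """
    if not 0 < holdout < n_days:
        raise ValueError("holdout must be positive and smaller than n_days")
    start = n_days - holdout  # first holdout day
    for t in range(start, n_days, retrain_every):
        # Train on days [0, t), predict days [t, t + retrain_every).
        yield range(0, t), range(t, min(t + retrain_every, n_days))


splits = list(walk_forward_splits(n_days=100, holdout=21))
```

Every training range ends exactly where its test range begins, so no prediction can see data from its own test block or later.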
Holdout sizes have varied through the phases — 21 days, 60 days, 180 days, and combinations evaluated together. Window stability across nested holdouts is itself a leakage signal (see L54 below).
The six leakage tests
Every strategy must pass all six before its verdict is trusted. They live in src/stock_core/spikes/spike_09_walk_forward/leakage_tests.py and stability_tests.py.
shuffled_target: scramble the target returns in time. A real time-edge should collapse to noise. Non-zero Sharpe means leakage somewhere. Bound is empirically calibrated, not closed-form (L53).
look_ahead_cheat: plant tomorrow's price movement as an explicit input column. Sharpe should rocket. If it doesn't, the test framework itself is broken and we can't detect leaks. Cheat columns travel through the same standardisation transform as honest features (L30).
future_news_cheat: shift news features +1 day so the model sees tomorrow's news today. Same logic: Sharpe should rocket. Uses a ratio threshold sharpe_cheat / sharpe_honest >= 5, not absolute (L27).
bit_exact_reproducibility: two runs with the same seed must produce identical P&L. If not, randomness has crept in and any result is suspect.
static_features_test (Phase 8.A): replace every feature column with its per-(stock) time-mean, held constant across the holdout. A time-edge model produces near-zero positions on constant inputs; a ticker-memoriser keeps its baseline allocations. Pass iff |sharpe| < 2·SE. Directly catches the L43 cross-stock memorisation that shuffled_target only hints at.
permutation_invariance (Phase 11.B): train on (features, returns), evaluate on (features with permuted ticker labels, returns). Sharpe should drop to noise.
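As an illustration of the static-features idea, the transform and its pass criterion might look roughly like the sketch below. The data layout (a dict of per-ticker day-by-feature rows) and both function names are assumptions made for the example, as is computing the time-mean over the same rows that are being replaced.

```python
from statistics import fmean


def static_features(features):
    """Replace each feature column with its per-stock time-mean.

    features: {ticker: [[f1, f2, ...] for each day]}  (hypothetical layout)
    Returns the same shape, with every day's row equal to the column means.
    """
    out = {}
    for ticker, rows in features.items():
        means = [fmean(col) for col in zip(*rows)]  # one mean per feature
        out[ticker] = [list(means) for _ in rows]   # constant across time
    return out


def passes_static_test(sharpe, se):
    """Pass iff the Sharpe on constant inputs is within 2 standard errors of zero."""
    return abs(sharpe) < 2.0 * se


toy = {
    "AAA": [[1.0, 10.0], [3.0, 30.0]],
    "BBB": [[5.0, 0.0], [5.0, 2.0]],
}
flat = static_features(toy)
```

After the transform the only information left in the inputs is which stock a row belongs to, so any surviving Sharpe has to come from stock identity rather than from time variation.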
In addition, the window-stability gate (window_robustness_test, Phase 6.B) runs walk-forward on two holdout sizes over the same data and passes only if |sharpe_short - sharpe_long| < 2·sqrt(SE²_short + SE²_long). A real edge survives the nested-window comparison; a window-position artifact does not.
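The gate condition is a two-sample comparison with the standard errors combined in quadrature. A small sketch follows; the function name `window_stable` is hypothetical and the inputs are assumed to be precomputed Sharpe estimates and their standard errors.

```python
import math


def window_stable(sharpe_short, se_short, sharpe_long, se_long):
    """True if the two holdout Sharpes agree within 2 combined standard errors."""
    bound = 2.0 * math.sqrt(se_short ** 2 + se_long ** 2)
    return abs(sharpe_short - sharpe_long) < bound


# Illustrative numbers only, not project results.
consistent = window_stable(0.6, 0.3, 0.4, 0.2)    # gap 0.2 vs bound ~0.72
inconsistent = window_stable(2.0, 0.3, 0.1, 0.2)  # gap 1.9 vs bound ~0.72
```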
Co-conditions on the cheat tests
In addition to a Sharpe lift, both look_ahead_cheat and future_news_cheat require |mean_pnl_cheat| / max(|mean_pnl_honest|, 1e-8) >= 10 (L31). This catches the negative_sharpe scale-invariance pathology, in which honest positions collapse to near zero and a ratio of one degenerate value to another carries no information.
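The magnitude co-condition can be written as a one-line check. The sketch below uses a hypothetical helper name, `pnl_magnitude_ok`, and takes the threshold and epsilon from the text; how it is combined with the Sharpe-lift condition in the real tests is not shown here.

```python
def pnl_magnitude_ok(mean_pnl_cheat, mean_pnl_honest, min_ratio=10.0, eps=1e-8):
    """Co-condition: cheat mean P&L must be at least `min_ratio` times the honest one.

    The eps floor keeps the ratio finite when honest P&L has collapsed to zero.
    """
    return abs(mean_pnl_cheat) / max(abs(mean_pnl_honest), eps) >= min_ratio


# Illustrative numbers only.
healthy = pnl_magnitude_ok(0.02, 0.001)   # ratio ~20: passes
too_small = pnl_magnitude_ok(0.005, 0.001)  # ratio ~5: fails
collapsed = pnl_magnitude_ok(1e-9, 0.0)   # both near zero: ratio 0.1, fails
```

The third case is the one the co-condition exists for: when both runs trade almost nothing, the cheat run's P&L is too small to clear the floor, so the test fails instead of passing on a meaningless ratio.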
Position-aware losses
The loss_fn(pnl) contract was structurally pnl-only — no loss could regulate position magnitude when it couldn't see positions. This was the L32 finding after 160 rejected nodes. Phase 6.E lifted it: walk_forward._call_loss(loss_fn, pnl, pos) dispatches via inspect.signature so legacy losses keep working but new losses can declare pos and get position tensors directly.
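One way signature-based dispatch of this kind can work is sketched below. This is not the project's `_call_loss`; the function `call_loss` and the two toy losses are hypothetical, and the rule "pass positions if the loss declares a parameter named `pos`" is an assumption about the dispatch criterion.

```python
import inspect


def call_loss(loss_fn, pnl, pos):
    """Call loss_fn with positions only if it declares a `pos` parameter."""
    params = inspect.signature(loss_fn).parameters
    if "pos" in params:
        return loss_fn(pnl, pos)  # position-aware loss
    return loss_fn(pnl)           # legacy pnl-only loss


def legacy_loss(pnl):
    # Toy legacy loss: negative total P&L.
    return -sum(pnl)


def position_aware_loss(pnl, pos):
    # Toy position-aware loss: negative total P&L plus gross exposure.
    return -sum(pnl) + sum(abs(p) for p in pos)


legacy_value = call_loss(legacy_loss, [1.0, 2.0], [0.5, -0.5])
aware_value = call_loss(position_aware_loss, [1.0, 2.0], [0.5, -0.5])
```

Because the decision is made from the signature, existing pnl-only losses need no changes and no wrapper.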
The canonical position-aware loss is sharpe_with_position_floor(pnl, pos, floor=0.05, alpha=0.5) (lib/losses.py). The penalty α·ReLU(floor − mean|pos|) has a direct gradient on positions, breaking the scale-invariance saddle. α=0.5 (softened from 5.0 in L37) keeps the penalty meaningful without overwhelming the Sharpe objective.
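A plain-Python sketch of the loss arithmetic, to make the penalty term concrete. It is not the code in lib/losses.py: the real loss presumably operates on tensors so that gradients flow to positions, and the sign convention (loss = negative Sharpe plus penalty), the unannualised Sharpe, and the `eps` guard are all assumptions.

```python
from statistics import fmean, pstdev


def sharpe_with_position_floor(pnl, pos, floor=0.05, alpha=0.5, eps=1e-8):
    """Negative Sharpe plus alpha * ReLU(floor - mean|pos|)."""
    sharpe = fmean(pnl) / (pstdev(pnl) + eps)
    mean_abs_pos = fmean([abs(p) for p in pos])
    penalty = alpha * max(0.0, floor - mean_abs_pos)  # ReLU
    return -sharpe + penalty


pnl = [0.01, -0.005, 0.02, 0.0]
loss_above_floor = sharpe_with_position_floor(pnl, [0.2, -0.2, 0.2, -0.2])
loss_below_floor = sharpe_with_position_floor(pnl, [0.01, -0.01, 0.01, -0.01])
# With identical pnl, the only difference is the penalty: 0.5 * (0.05 - 0.01).
penalty_gap = loss_below_floor - loss_above_floor
```

Sharpe alone is unchanged when positions are scaled by a constant, which is why it cannot push positions away from zero; the penalty term depends on mean|pos| directly and so does.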
Per-stock z-score (Phase 8)
To attack cross-stock memorisation (L43), features are preprocessed with per_stock_rolling_zscore(features, window=60) (lib/preprocessing.py) — a causal 60-day rolling z-score per (stock, feature). This removes the level component of stock-identity but not the covariance structure, so it is necessary but not sufficient on its own (L44).
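A stdlib-only sketch of a causal rolling z-score applied per (stock, feature). It is not the implementation in lib/preprocessing.py: the nested-dict layout, the helper `rolling_zscore`, the choice to include the current day in the window, and the `min_std` floor are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def rolling_zscore(series, window=60, min_std=1e-3):
    """Causal z-score: day t uses only values from days <= t."""
    out = []
    for t, x in enumerate(series):
        hist = series[max(0, t - window + 1): t + 1]  # trailing window, no future
        mu = fmean(hist)
        sd = max(pstdev(hist), min_std)  # floor avoids huge z-scores in warmup
        out.append((x - mu) / sd)
    return out


def per_stock_rolling_zscore(features, window=60):
    """features: {ticker: {feature_name: [value per day]}} (hypothetical layout)."""
    return {
        ticker: {name: rolling_zscore(vals, window) for name, vals in cols.items()}
        for ticker, cols in features.items()
    }


toy = {
    "AAA": {"close": [1.0, 2.0, 3.0]},
    "BBB": {"close": [101.0, 102.0, 103.0]},  # same shape, different level
}
z = per_stock_rolling_zscore(toy, window=2)
```

In the toy data both stocks map to the same z-scores, which is the "level component" being removed; anything that depends on how features co-move within a stock is left untouched.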
Phase 9.A added a per-stock fixed-prefix variant. L47 documents a subtle bug in the original implementation (clamping at 1e-8 blew up warmup-period features into 10^7+ z-scores that then drowned cheat columns) — fixed by clamp(min=1e-3) and by passing RAW features to the cheat tests inside _run_per_stock.
Deterministic by default
Seed everything that can be seeded — numpy, torch, the data shuffle path. Spike 9 explicitly tests this. Same seed → identical numbers, every time.
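A sketch of what seeding plus a bit-exact check could look like. The helper names `seed_everything` and `toy_run` are hypothetical, and the toy run stands in for a full walk-forward; numpy and torch are seeded only if they happen to be installed.

```python
import random


def seed_everything(seed):
    """Seed every RNG this sketch knows about; skip libraries that are absent."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def toy_run(seed, n=5):
    """Stand-in for a full run: returns a pseudo P&L series for the given seed."""
    seed_everything(seed)
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# Bit-exact reproducibility: compare with ==, not with a tolerance.
same_seed_identical = toy_run(7) == toy_run(7)
different_seed_differs = toy_run(7) != toy_run(8)
```

Comparing with exact equality rather than a tolerance is deliberate: any difference, however small, means some source of randomness is not under the seed's control.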
Realistic Sharpe ceiling — L23
Any reported Sharpe greater than 1.0 at this scale is treated as a bug to investigate, not a result to celebrate. This bar shaped every Phase 6+ design call.