Methodology

The methodology — walk-forward, leakage-clean, deterministic — is the project's deliverable. Every detail below was tightened in response to a concrete failure documented in the carry-forward learnings (L1-L54).

Walk-forward only

No random train/test splits. Models are trained on a strict time prefix and evaluated on the strictly later holdout. Walk-forward retrains every 5 days during the holdout, so each prediction sees only information available at decision time.
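A minimal sketch of the split logic described above, assuming an expanding training prefix and integer day indices. The function name walk_forward_splits and its parameters are hypothetical and are not taken from the project code.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_days: int, holdout_days: int, retrain_every: int = 5
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs that cover the holdout.

    Training always uses the prefix [0, start); testing uses the next
    block [start, stop) of at most `retrain_every` days. Assumption:
    the training window expands at each retrain.
    """
    if not 0 < holdout_days < n_days:
        raise ValueError("holdout_days must be in (0, n_days)")
    holdout_start = n_days - holdout_days
    for start in range(holdout_start, n_days, retrain_every):
        stop = min(start + retrain_every, n_days)
        # Every training index is strictly earlier than every test index.
        yield range(0, start), range(start, stop)
```

Each test block is predicted by a model that has seen only days before `start`, which is the property the section relies on.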

Holdout sizes have varied across the phases: 21 days, 60 days, 180 days, and combinations evaluated together. Instability of results across nested holdouts is itself a leakage signal (see L54 below).

The six leakage tests

Every strategy must pass all six before its verdict is trusted. They live in src/stock_core/spikes/spike_09_walk_forward/leakage_tests.py and stability_tests.py.

shuffled_target: scramble the target returns in time. A real time-edge should collapse to noise. A Sharpe outside the noise bound means leakage somewhere. The bound is empirically calibrated, not closed-form (L53).
look_ahead_cheat: plant tomorrow's price movement as an explicit input column. Sharpe should rocket. If it doesn't, the test framework itself is broken and cannot detect leaks. Cheat columns travel through the same standardisation transform as honest features (L30).
future_news_cheat: shift news features +1 day so the model sees tomorrow's news today. Same logic: Sharpe should rocket. Uses a ratio threshold, sharpe_cheat / sharpe_honest >= 5, rather than an absolute one (L27).
bit_exact_reproducibility: two runs with the same seed must produce identical P&L. If not, uncontrolled randomness has crept in and any result is suspect.
static_features_test (Phase 8.A): replace every feature column with its per-stock time-mean, held constant across the holdout. A time-edge model produces near-zero positions on constant inputs; a ticker-memoriser keeps its baseline allocations. Pass iff |sharpe| < 2·SE. Directly catches the L43 cross-stock memorisation that shuffled_target only hints at.
permutation_invariance (Phase 11.B): train on (features, returns), evaluate on (features with permuted ticker labels, returns). Sharpe should drop to noise.
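As an illustration of static_features_test, the sketch below replaces each stock's features with their time-mean and applies the 2·SE pass rule. It uses plain nested lists and computes the mean over whatever rows are passed in; both choices, along with the function names, are assumptions made for the example, and the Sharpe and SE values are taken as given inputs.

```python
from typing import Dict, List

Features = Dict[str, List[List[float]]]  # ticker -> rows of [feature values]


def to_static_features(features: Features) -> Features:
    """Replace every row with the per-ticker, per-column time-mean."""
    static: Features = {}
    for ticker, rows in features.items():
        n_rows = len(rows)
        # Column-wise mean over time for this ticker.
        means = [sum(col) / n_rows for col in zip(*rows)]
        # Same constant row repeated, so no time variation remains.
        static[ticker] = [list(means) for _ in range(n_rows)]
    return static


def static_features_pass(sharpe: float, se: float) -> bool:
    """Pass iff the Sharpe on constant inputs is within 2 standard errors of 0."""
    return abs(sharpe) < 2.0 * se
```

A model whose edge comes from time variation has nothing left to trade on after this transform, so a Sharpe that stays outside the bound points to ticker identity as the source.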

In addition, the window-stability gate (window_robustness_test, Phase 6.B) runs walk-forward on two nested holdout sizes over the same data and passes only if |sharpe_short - sharpe_long| < 2·sqrt(SE²_short + SE²_long). A real edge survives the nested-window comparison; a window-position artifact does not.
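The gate condition can be written as a small predicate. This is a sketch only: the function name is hypothetical, and the Sharpe values and their standard errors are assumed to be computed elsewhere.

```python
import math


def window_stability_pass(
    sharpe_short: float, se_short: float, sharpe_long: float, se_long: float
) -> bool:
    """True when the two holdout Sharpes agree within 2 combined standard errors."""
    tolerance = 2.0 * math.sqrt(se_short**2 + se_long**2)
    return abs(sharpe_short - sharpe_long) < tolerance
```

For example, Sharpes of 0.6 and 0.4 with standard errors 0.3 and 0.2 give a tolerance of about 0.72, so the gate passes; Sharpes of 2.5 and 0.1 with the same errors fail it.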

Co-conditions on the cheat tests

In addition to a Sharpe lift, both look_ahead_cheat and future_news_cheat require |mean_pnl_cheat| / max(|mean_pnl_honest|, 1e-8) >= 10 (L31). This catches the negative_sharpe scale-invariance pathology, in which honest positions collapse to near zero and a ratio of one degenerate quantity to another becomes uninformative.
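The P&L co-condition above can be sketched as follows. The function name and keyword parameters are hypothetical; only the formula, the 1e-8 floor, and the threshold of 10 come from the text.

```python
def pnl_lift_ok(
    mean_pnl_honest: float,
    mean_pnl_cheat: float,
    min_ratio: float = 10.0,
    eps: float = 1e-8,
) -> bool:
    """Check |mean_pnl_cheat| / max(|mean_pnl_honest|, eps) >= min_ratio."""
    # The eps floor keeps the ratio finite when honest P&L is (near) zero.
    denominator = max(abs(mean_pnl_honest), eps)
    return abs(mean_pnl_cheat) / denominator >= min_ratio
```

Because the denominator is floored, a cheat run whose own P&L is also vanishingly small cannot pass simply because the honest run collapsed.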

Position-aware losses

The loss_fn(pnl) contract was structurally pnl-only — no loss could regulate position magnitude when it couldn't see positions. This was the L32 finding after 160 rejected nodes. Phase 6.E lifted it: walk_forward._call_loss(loss_fn, pnl, pos) dispatches via inspect.signature so legacy losses keep working but new losses can declare pos and get position tensors directly.
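The dispatch idea can be illustrated with a short sketch. It assumes that a position-aware loss is recognised by having a parameter literally named pos; the actual rule used by walk_forward._call_loss is not shown in this document, so call_loss_sketch below is a hypothetical stand-in.

```python
import inspect
from typing import Any, Callable


def call_loss_sketch(loss_fn: Callable[..., Any], pnl: Any, pos: Any) -> Any:
    """Call loss_fn with positions only if its signature declares `pos`."""
    params = inspect.signature(loss_fn).parameters
    if "pos" in params:
        return loss_fn(pnl, pos)  # new-style, position-aware loss
    return loss_fn(pnl)  # legacy pnl-only loss keeps working unchanged


def legacy_loss(pnl):
    return -sum(pnl)


def position_aware_loss(pnl, pos):
    return -sum(pnl) + sum(abs(p) for p in pos)
```

Both `call_loss_sketch(legacy_loss, pnl, pos)` and `call_loss_sketch(position_aware_loss, pnl, pos)` work from the same call site, which is what lets old and new losses coexist.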

The canonical position-aware loss is sharpe_with_position_floor(pnl, pos, floor=0.05, alpha=0.5) (lib/losses.py). The penalty α·ReLU(floor − mean|pos|) has a direct gradient on positions, breaking the scale-invariance saddle. α=0.5 (softened from 5.0 in L37) keeps the penalty meaningful without overwhelming the Sharpe objective.
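To make the penalty term concrete, here is a forward-only sketch on plain Python floats. It assumes the loss is the negative Sharpe (mean over population standard deviation, not annualised) plus the penalty; the real lib/losses.py version operates on tensors, and its exact Sharpe definition is not given here. The _sketch suffix and the eps parameter are additions for the example.

```python
from statistics import fmean, pstdev
from typing import Sequence


def sharpe_with_position_floor_sketch(
    pnl: Sequence[float],
    pos: Sequence[float],
    floor: float = 0.05,
    alpha: float = 0.5,
    eps: float = 1e-8,
) -> float:
    """Negative Sharpe plus alpha * ReLU(floor - mean|pos|)."""
    sharpe = fmean(pnl) / (pstdev(pnl) + eps)
    mean_abs_pos = fmean([abs(p) for p in pos])
    # ReLU: the penalty is active only while positions sit below the floor.
    penalty = alpha * max(0.0, floor - mean_abs_pos)
    return -sharpe + penalty
```

With the same pnl, positions averaging 0.001 in magnitude incur a penalty of 0.5 · (0.05 − 0.001) = 0.0245, while positions averaging 0.1 incur none. The penalty therefore depends on position size even though the Sharpe term does not.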

Per-stock z-score (Phase 8)

To attack cross-stock memorisation (L43), features are preprocessed with per_stock_rolling_zscore(features, window=60) (lib/preprocessing.py) — a causal 60-day rolling z-score per (stock, feature). This removes the level component of stock-identity but not the covariance structure, so it is necessary but not sufficient on its own (L44).
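A causal rolling z-score for one (stock, feature) series might look like the sketch below. It assumes the window includes the current day and that the standard deviation is floored at a small constant; both are assumptions for illustration, and the function is not the project's per_stock_rolling_zscore.

```python
from statistics import fmean, pstdev
from typing import List


def rolling_zscore_causal(
    series: List[float], window: int = 60, min_std: float = 1e-3
) -> List[float]:
    """Z-score each value against the trailing window ending at that value."""
    out: List[float] = []
    for t, x in enumerate(series):
        # Trailing slice: includes day t, never anything after it.
        past = series[max(0, t - window + 1) : t + 1]
        mu = fmean(past)
        # Floor the std so short or flat warmup windows cannot explode.
        sd = max(pstdev(past), min_std)
        out.append((x - mu) / sd)
    return out
```

Applied per stock and per feature, this removes each series' own level and scale. Changing a future value never alters an earlier output, which is what makes the transform safe inside walk-forward.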

Phase 9.A added a per-stock fixed-prefix variant. L47 documents a subtle bug in the original implementation (clamping at 1e-8 blew up warmup-period features into 10^7+ z-scores that then drowned cheat columns) — fixed by clamp(min=1e-3) and by passing RAW features to the cheat tests inside _run_per_stock.

Deterministic by default

Seed everything that can be seeded — numpy, torch, the data shuffle path. Spike 9 explicitly tests this. Same seed → identical numbers, every time.
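A minimal seeding helper, as a sketch: it seeds the standard library generator and, when they are installed, numpy and torch. The names seed_everything and toy_run are hypothetical, and a real pipeline may need further settings (for example around GPU kernels) that are not covered here.

```python
import random
from typing import List


def seed_everything(seed: int) -> None:
    """Seed every random source this sketch knows about."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # numpy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # torch not installed; nothing to seed


def toy_run(seed: int) -> List[float]:
    """Stand-in for a full run: seeded, so repeat calls must match exactly."""
    seed_everything(seed)
    return [random.gauss(0.0, 1.0) for _ in range(5)]
```

A reproducibility check in this style is just an exact comparison: `toy_run(7) == toy_run(7)` must hold, with no tolerance.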

Realistic Sharpe ceiling — L23

Any reported Sharpe greater than 1.0 at this scale is treated as a bug to investigate, not a result to celebrate. This bar shaped every Phase 6+ design call.
