Fix: Validation leakage and unfair baseline causing inflated metrics
Four bugs combined to produce artificially inflated optimization scores (e.g. baseline 89% → optimized 100% with no real improvement):
1. Validation data leak in sequential field optimization (critical)
An off-by-one guard condition meant _optimize_single_field always set val_single = train_single, so DSPy's optimizer trained and validated on identical data, biasing candidate selection toward memorization. The guard is fixed so a real held-out split is used whenever one is possible.
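A minimal sketch of this failure mode, not dspydantic's actual code: the function name, split logic, and exact guard below are illustrative assumptions; the real effect was that the fallback branch always ran.

```python
# Hypothetical illustration of the leak: a guard that fires one step too
# eagerly sends datasets down the fallback path, so the "validation"
# set is literally the training set.
def split_for_field(examples, val_fraction=0.2):
    n_val = int(len(examples) * val_fraction)
    if n_val <= 1:                       # off-by-one: a 1-example val split
        return examples, examples        # is usable; val aliases train here
    return examples[:-n_val], examples[-n_val:]

# Corrected guard: fall back only when no validation example can be held out.
def split_for_field_fixed(examples, val_fraction=0.2):
    n_val = int(len(examples) * val_fraction)
    if n_val < 1:
        return examples, examples        # still overlapping; see bug 2
    return examples[:-n_val], examples[-n_val:]
```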
2. Silent empty-validation fallback
When too few examples existed for a proper train/val split, the code silently fell back to using the training set as validation. The fallback remains, but it now emits a UserWarning so the overlap is visible.
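A sketch of the new behavior, reusing the hypothetical split helper from bug 1; the function name and warning text are assumptions, only the UserWarning category comes from the fix:

```python
import warnings

def split_with_warning(examples, val_fraction=0.2):
    n_val = int(len(examples) * val_fraction)
    if n_val < 1:
        # Fall back as before, but make the train/val overlap visible.
        warnings.warn(
            "Not enough examples for a held-out validation split; "
            "validating on the training set. Scores may be inflated.",
            UserWarning,
            stacklevel=2,
        )
        return examples, examples
    return examples[:-n_val], examples[-n_val:]
```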
3. Prompt metric used wrong field descriptions
Phase 2 prompt optimization evaluated candidates against the original field descriptions instead of the optimized ones produced by Phase 1, so DSPy ranked prompts in an evaluation context that never occurs after Phase 1. Candidates are now scored with the optimized descriptions in place.
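A sketch of the corrected evaluation; run_extraction and metric are hypothetical callables standing in for dspydantic's internals:

```python
from typing import Callable, Mapping, Sequence

def score_prompt_candidate(
    prompt: str,
    optimized_descriptions: Mapping[str, str],   # Phase 1 output
    eval_examples: Sequence[dict],
    run_extraction: Callable[[str, Mapping[str, str], dict], dict],
    metric: Callable[[dict, dict], float],
) -> float:
    # The bug passed the *original* descriptions here, ranking candidates in
    # an evaluation context that never occurs once Phase 1 has run.
    scores = [
        metric(ex, run_extraction(prompt, optimized_descriptions, ex))
        for ex in eval_examples
    ]
    return sum(scores) / len(scores)
```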
4. Unfair baseline (no few-shot demos)
The baseline score was computed without few-shot demos, while every subsequent evaluation included up to 8 demos in the extraction prompt, so the apparent improvement came from the demos alone, not from description optimization. The baseline now includes the same demos for a fair comparison.
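A sketch of the fair baseline; the helper names are hypothetical, and max_demos=8 mirrors the demo budget described above:

```python
from typing import Callable, Sequence

def baseline_score(
    eval_examples: Sequence[dict],
    demos: Sequence[dict],
    run_extraction: Callable[[dict, Sequence[dict]], dict],  # (example, demos) -> prediction
    metric: Callable[[dict, dict], float],
    max_demos: int = 8,
) -> float:
    # Previously the baseline was scored with demos=[], while every candidate
    # was scored with up to 8 demos; now both sides get the same budget.
    demos = list(demos)[:max_demos]
    scores = [metric(ex, run_extraction(ex, demos)) for ex in eval_examples]
    return sum(scores) / len(scores)
```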
Install
uv pip install dspydantic==0.1.6