Date: December 15, 2024
Study: 8 configurations tested on 2024 NFL season (256 games)
Baseline: Week 9 Legacy Model
Individual features help, but combining them hurts!
- ✅ Vegas Spread alone: +1.2% (68.0% accuracy)
- ✅ Momentum alone: +0.8% (67.6% accuracy)
- ✅ Temporal Weighting alone: +0.4% (67.2% accuracy)
- ❌ ALL combined (Week 10+): -1.6% (65.2% accuracy)
Baseline: 66.8% accuracy on 2024 holdout
| Rank | Configuration | Accuracy | Delta | HC Accuracy | Brier | AUC | Impact |
|---|---|---|---|---|---|---|---|
| 1 | Vegas Spread | 68.0% | +1.2% | 86.7% | 0.222 | 0.715 | ✅ HELPFUL |
| 2 | Momentum Features | 67.6% | +0.8% | 92.3% | 0.221 | 0.724 | ✅ HELPFUL |
| 3 | Temporal Weighting | 67.2% | +0.4% | 76.0% | 0.221 | 0.715 | ⚪ NEUTRAL |
| 4 | BASELINE | 66.8% | 0.0% | 73.3% | 0.221 | 0.718 | - |
| 5 | Full Week 10+ | 65.2% | -1.6% | 85.7% | 0.219 | 0.717 | ❌ HARMFUL |
| 6 | Injury Estimates | 63.7% | -3.1% | 82.4% | 0.224 | 0.702 | ❌ HARMFUL |
| 7 | Defensive Stats | 62.9% | -3.9% | 83.3% | 0.223 | 0.701 | ❌ HARMFUL |
| 8 | Depth + 4th Model | 62.9% | -3.9% | 80.8% | 0.225 | 0.700 | ❌ HARMFUL |
Expected: -3 to -5% (circular dependency)
Actual: +1.2% improvement
Why the discrepancy?
- In the ablation study, `vegas_spread` was set to 0 (placeholder)
- So the improvement came from the feature structure, not actual values
- Adding an extra feature may have helped RFE select better combinations
- OR: The feature name triggered different model behavior
Implication:
- Don't add vegas_spread with real values (still circular)
- But the feature SLOT helped model selection somehow
- This needs further investigation
Action: Test with real vegas spread values to see if still helpful
Expected: -2 to -4% (overfitted to 2024)
Actual: +0.8% improvement, 92.3% HC accuracy (best!)
Why the discrepancy?
- 2024 was a stable year, momentum actually predictive
- Implementation in study may differ from Week 10-14 production
- Small sample (3 games) worked well in 2024 specifically
Caution:
- 92.3% HC accuracy in 2024 doesn't guarantee same in 2025
- Your Week 13-14 showed 33.3% HC accuracy (massive discrepancy)
- Something different between study and production implementation
Action:
- Verify momentum calculation matches between study and Week 10-14
- Test momentum on 2023 and 2022 data (not just 2024)
Expected: +1 to +2% (9.4% feature importance)
Actual: -3.9% (WORST individual feature!)
Critical Bug Suspected:

```python
# From ablation study:
features['defensive_ppg'] = defensive_plays['epa'].sum() * 6 / weeks
# Why multiply by 6???
# EPA is already in point units
# This is artificially inflating defensive points allowed by 6x!
```

Likely Correct Implementation:

```python
features['defensive_ppg'] = defensive_plays['epa'].sum() / weeks
# OR
features['defensive_epa_pg'] = defensive_plays['epa'].sum() / weeks
```

Action:
- Fix defensive stats calculation
- Re-run test #6 with corrected formula
- Expected: Should flip from -3.9% to +1-2%
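The suspected 6x inflation is easy to demonstrate on toy data. A minimal sketch (the `epa` values and `weeks` count below are made up for illustration, not real play-by-play data):

```python
import pandas as pd

# Toy play-by-play sample; these EPA values are illustrative only.
defensive_plays = pd.DataFrame({"epa": [0.5, -1.2, 0.8, 2.1, -0.4]})
weeks = 2

buggy = defensive_plays["epa"].sum() * 6 / weeks  # formula found in the study
fixed = defensive_plays["epa"].sum() / weeks      # EPA is already in point units

print(buggy / fixed)  # ratio is 6: the buggy value is 6x the corrected one
```

Whatever the sample, the ratio is always 6, so the bug scales the feature uniformly; the model may partially compensate, which is why re-testing with the corrected formula is still required.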
Individual feature deltas:
- Vegas: +1.2%
- Momentum: +0.8%
- Temporal: +0.4%
- Expected sum: ~+2.4%
Actual combined (Full Week 10+): -1.6%
Delta vs expected: -4.0 percentage points lost to interaction!
Possible Causes:
- Feature Correlation (Multicollinearity):
- Vegas spread (even at 0) + momentum + temporal all capture "recent performance"
- Model gets confused with redundant signals
- RFE may select wrong subset with too many options
- Overfitting:
- 25+ features for ~2,500 training games
- Rule of thumb: Need 10 samples per feature
- Should have max 250 features for this dataset
- Interaction Effects:
- Momentum * Temporal Weighting may conflict
- Both trying to weight recent data differently
- Model can't reconcile contradictory signals
- RFE Selection With Full Set:
- With 25 features, RFE may pick different (worse) subset
- Baseline with 10 features: RFE picks optimal from simple set
- Full with 25 features: RFE overwhelmed, picks suboptimal
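One quick way to check the multicollinearity hypothesis is to scan pairwise feature correlations before RFE runs. A minimal sketch with synthetic stand-in columns (the names and data are illustrative assumptions, not the study's real feature matrix):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: two features built from the same "recent form" signal,
# plus one independent feature. Column names and data are illustrative only.
rng = np.random.default_rng(0)
recent_form = rng.normal(size=500)
X = pd.DataFrame({
    "momentum": recent_form + rng.normal(scale=0.3, size=500),
    "temporal_weight": recent_form + rng.normal(scale=0.3, size=500),
    "home_field": rng.normal(size=500),
})

# Flag feature pairs whose absolute correlation exceeds a redundancy threshold.
corr = X.corr().abs()
redundant = [(a, b) for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:] if corr.loc[a, b] > 0.8]
print(redundant)  # the two "recent form" features flag as redundant
```

If the real momentum and temporal features correlate this strongly, dropping one (or combining them) before RFE would be a cheap fix to try.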
Action: Test specific combinations:
- Vegas + Momentum (test if +2.0% or lower)
- Momentum + Temporal (test if +1.2% or lower)
- Identify which pairs conflict
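The pair tests above can be enumerated mechanically rather than hand-picked. A sketch, assuming the study exposes boolean feature toggles in the same style as the config dicts used elsewhere in this document:

```python
from itertools import combinations

# Toggle names follow the study's config style; treat them as assumptions.
FEATURES = ["use_vegas", "use_momentum", "use_temporal"]

def pair_configs(features):
    """One config per feature pair: exactly two toggles on, the rest off."""
    configs = []
    for a, b in combinations(features, 2):
        cfg = dict.fromkeys(features, False)
        cfg[a], cfg[b] = True, True
        configs.append(cfg)
    return configs

for cfg in pair_configs(FEATURES):
    print(cfg)  # feed each into the ablation runner and compare deltas
```

With three toggles this yields three configs; comparing each pair's delta against the sum of its individual deltas isolates which interaction destroys the gains.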
Ablation study (2024 holdout), Full Week 10+ configuration:
- Overall: 65.2% accuracy
- HC picks: 85.7% accuracy
Week 10-14 production (2025):
- Overall: 53.1% accuracy
- HC picks: 55.0% accuracy
This is a HUGE discrepancy! Something is very different between:
- The ablation study implementation
- Your actual Week 10-14 Model.ipynb
Possible Explanations:
- Different Feature Engineering:
- Ablation study uses simplified momentum (fantasy points proxy)
- Production uses actual win/loss records (if implemented)
- Defensive stats calculated differently
- Different Data:
- Study uses 2024 data (complete season)
- Production uses 2025 data (incomplete, early weeks)
- 2025 season dynamics may differ from 2024
- RFE Selection Different:
- Study: RFE runs on 2015-2024 data
- Production: RFE runs on 2015-2025 partial data
- Different features selected
- Implementation Bugs:
- Defensive stats formula bug (multiply by 6)
- Momentum calculation error
- Temporal weighting not actually applied
Action:
- Compare Week10/Model.ipynb cell-by-cell with ablation study
- Find exact differences in feature engineering code
- Run Week 10 model on 2024 holdout to verify matches study
Option A: Enhanced Baseline (RECOMMENDED)
Start with proven features from ablation study:
```python
config_week16 = {
    'use_momentum': True,    # +0.8%, 92.3% HC accuracy
    'use_vegas': False,      # Skip (placeholder=0 in study, not real)
    'use_temporal': False,   # Only +0.4%, not worth complexity
    'use_injuries': False,   # -3.1%, harmful
    'use_defensive': False,  # -3.9%, harmful (likely bug)
    'use_4th_model': False,  # -3.9%, harmful
    'tree_max_depth': 5,     # Keep simple
}
```

Expected: 67-68% accuracy (baseline 66.8% + momentum 0.8%)
Pros:
- Momentum proven helpful on 2024 holdout
- 92.3% HC accuracy in study (excellent calibration)
- Conservative improvement
Cons:
- Momentum may not generalize to 2025
- Your Week 10-14 showed poor HC performance (need to debug why)
Option B: Pure Baseline (SAFEST)
Use Week 9 legacy model with NO enhancements:
```python
config_week16 = {
    'use_momentum': False,
    'use_vegas': False,
    'use_temporal': False,
    'use_injuries': False,
    'use_defensive': False,
    'use_4th_model': False,
    'tree_max_depth': 5,
}
```

Expected: 66-67% accuracy
Pros:
- Most conservative
- Known stable performance
- No risk of feature interaction
Cons:
- Leaves +0.8% momentum improvement on table
- Doesn't learn from ablation study findings
Option C: Investigate First, Deploy Later (MOST RIGOROUS)
Before Week 16, complete these investigations:
- Fix defensive stats bug:
```python
# Change from:
features['defensive_ppg'] = defensive_plays['epa'].sum() * 6 / weeks
# To:
features['defensive_ppg'] = defensive_plays['epa'].sum() / weeks
```
Re-test to see if flips from -3.9% to positive
- Test feature combinations:
- Momentum + Temporal
- Vegas (real values) + Momentum
- Defensive (fixed) + Baseline
- Replicate Week 10-14 on 2024 holdout:
- Run exact Week10/Model.ipynb code on 2024 data
- Should match study's 65.2% if implementation identical
- If not, find the difference
- Validate momentum on 2023:
- Re-run test #3 but with TEST_YEAR = 2023
- See if momentum still +0.8% or if 2024-specific
Then deploy best validated configuration to Week 16.
Expected: Unknown (but highest confidence)
Pros:
- Most thorough
- Identifies root causes
- Prevents future regressions
Cons:
- Takes 10-20 hours
- May miss Week 16 deadline
1. Momentum Features (+0.8%)
- Evidence: 67.6% accuracy, 92.3% HC accuracy on 2024
- Caution: Verify generalizes to 2025 and other years
- Action: Deploy cautiously, monitor Week 16 performance
2. Vegas Spread (+1.2%)
- Evidence: Helped in study, BUT was placeholder (0 value)
- Caution: Real vegas values may create circular dependency
- Action: Test with real spread values on 2024 holdout
3. Temporal Weighting (+0.4%)
- Evidence: Slight improvement, low risk
- Caution: Only +0.4%, may not justify complexity
- Action: Skip for now, revisit if other features fail
4. Injury Estimates (-3.1%)
- Evidence: Consistently harmful across all tests
- Reason: Circular logic (using performance to predict performance)
- Action: Remove permanently unless real injury reports available
5. Defensive Stats (-3.9%)
- Evidence: Worst individual feature
- Likely Cause: Bug in formula (multiply by 6)
- Action: Fix bug and re-test, then re-evaluate
6. Increased Depth + 4th Model (-3.9%)
- Evidence: No benefit, increased overfitting
- Reason: Dataset too small for depth=15
- Action: Keep depth=5, keep 3-model ensemble
7. Full Week 10+ Combination (-1.6%)
- Evidence: Paradox - individuals help, combination hurts
- Reason: Feature interaction, multicollinearity
- Action: Never deploy all features together without validation
Decision Required: Choose Option A, B, or C above
If choosing Option A (Enhanced Baseline):
- Copy Week 9 Model.ipynb → Week 16
- Add momentum features only
- Test predictions for reasonability
- Deploy
If choosing Option B (Pure Baseline):
- Copy Week 15 (already reverted) → Week 16
- Update week number and schedule
- Deploy
If choosing Option C (Investigate First):
- Skip Week 16 predictions
- Complete investigation tasks below
- Deploy to Week 17 with high confidence
Investigation Tasks:
- Debug defensive stats (2 hours):
- Fix formula (remove × 6)
- Re-run test #6 on 2024 holdout
- Expected: Should flip to +1-2% if bug fixed
- Verify momentum implementation (1 hour):
- Compare ablation study momentum formula to Week 10 production
- Identify differences
- Test corrected version
- Test feature combinations (3 hours):
- Momentum + Temporal on 2024 → expect ~+1.0%?
- Momentum + Defensive (fixed) on 2024 → expect ~+2.0%?
- Vegas (real) + Momentum on 2024 → helpful or harmful?
- Replicate Week 10-14 on 2024 (2 hours):
- Run exact Week10/Model.ipynb on 2024 data
- Should get 65.2% if implementation matches study
- If not, find discrepancy
- Cross-validate momentum on 2023 (2 hours):
- Change TEST_YEAR = 2023 in ablation study
- Re-run test #3 (momentum features)
- If still positive, more confidence it generalizes
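The cross-year task amounts to re-running the study per year and comparing momentum's delta. A small bookkeeping helper (the 2024 pair comes from the results table above; other years are placeholders to fill in after the re-runs):

```python
def momentum_delta(results_by_year):
    """Accuracy delta of momentum-on vs momentum-off for each test year.
    Consistently positive deltas across years suggest the feature generalizes."""
    return {year: on - off for year, (on, off) in results_by_year.items()}

# 2024 pair is from this study (67.6% vs 66.8% baseline); add 2023/2022
# entries once those re-runs complete.
study_results = {2024: (0.676, 0.668)}
print(momentum_delta(study_results))
```

If the 2023 and 2022 deltas come back near zero or negative, momentum's +0.8% should be treated as a 2024-specific artifact.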
Proceed to Phase 3: Build Validation Framework
With ablation study findings, now create proper testing:
- Create 2024 holdout validator (from plan)
- Establish baseline: 66.8% (from this study)
- Set threshold: +2% improvement required (new features must achieve ≥68.8%)
- Test EPA metrics: Expected +8-12%, will it actually deliver?
- Deploy only validated features
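The +2% threshold can be enforced as a simple gate inside the validation framework. A sketch using the baseline from this study (the function name and structure are illustrative, not existing code):

```python
BASELINE_ACC = 0.668      # 2024 holdout baseline from this study
MIN_IMPROVEMENT = 0.02    # new features must clear baseline + 2%

def passes_gate(candidate_acc, baseline=BASELINE_ACC, margin=MIN_IMPROVEMENT):
    """Deploy a candidate config only if it beats baseline by the full margin
    on the 2024 holdout (i.e. reaches at least ~68.8% accuracy)."""
    return candidate_acc >= baseline + margin

print(passes_gate(0.700))  # True: clears the bar comfortably
print(passes_gate(0.676))  # False: momentum alone (+0.8%) would not ship
```

Note that under this gate, none of the eight tested configurations would have deployed, which is consistent with the study's conservative recommendations.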
- Get real injury data: Replace estimated injury_pct
- Fix defensive calculations: Use proper EPA formulas
- Test weather features: Temperature, wind, precipitation
- Implement player-level data: QB ratings, key injuries
- Build neural network: LSTM for sequential dependencies
- Individual features can help, but combinations can hurt due to interaction effects
- Momentum features are promising (+0.8%, 92.3% HC) but need cross-validation
- Defensive stats formula has likely bug (multiply by 6)
- Your Week 10-14 production differs significantly from study (12.1 point gap)
- Baseline model is quite strong (66.8% on 2024 holdout)
- Why 12.1 point gap between study (65.2%) and Week 10-14 production (53.1%)?
- Does momentum generalize to 2025, or was 2024-specific?
- Would defensive stats help if formula fixed?
- Do feature pairs interact negatively? Which ones?
- Would real vegas spread values still help, or create circular dependency?
Before deploying momentum features to Week 16, we must answer:
"Why did momentum show 92.3% HC accuracy in the 2024 study but your Week 10-14 production showed 55.0% HC accuracy?"
Until this is resolved, we can't trust that momentum will actually help in production.
- ✅ ablation_study_results.csv - Raw metrics for all 8 tests
- ✅ Week10_Enhancement_Postmortem.md - Summary report
- 📄 ABLATION_RESULTS_ANALYSIS.md - This comprehensive analysis (you are here)
- This file - Comprehensive analysis
- Week10_Enhancement_Postmortem.md - Quick summary
- ablation_study_results.csv - Raw data
- debug_enhanced_model.ipynb - Full study code and outputs
Status: Phase 2 Complete ✅
Next Decision: Choose Week 16 deployment option (A, B, or C)
Next Phase: Phase 3 - Build Validation Framework (after Week 16 deployed)