
Commit 87724b7

miranov25 committed

feat: Add realistic TPC distortion synthetic data and validation

- Implement § 7.4 synthetic data specification
- Physical model with 8 ground truth parameters
- Alarm system with df.eval() validation
- Three-tier QA: OK / WARNING / ALARM
- Unit test and benchmark ready
1 parent 194142d commit 87724b7

4 files changed: +845 -144 lines changed

Lines changed: 295 additions & 0 deletions
@@ -0,0 +1,295 @@
# § 7.4 Synthetic-Data Test Specification (Realistic TPC Distortion Model)

**Version:** 2.1.0
**Phase:** M7.1 Sliding Window Regression
**Status:** APPROVED

---

## 7.4.1 Purpose

The synthetic dataset emulates the behavior of TPC distortion maps under controlled yet realistic conditions. It provides ground-truth relationships among drift length, radial coordinate, and sector offset to test:

1. **Correctness** of the sliding-window aggregation and fitting logic
2. **Recovery** of known calibration parameters
3. **Dependence** of statistical precision on neighborhood size and kernel width
4. **Alarm system** for quality assurance gates

This test constitutes the primary validation that M7.1 can recover true distortion fields from noisy measurements, as required for production TPC calibration.

---

## 7.4.2 Physical Model and Variable Definitions

Each synthetic entry represents a tracklet measurement within one TPC sector. All variables are generated with the same naming convention as real calibration data to ensure seamless integration with production workflows.

| Symbol | Column name | Definition | Typical range / units |
|--------|-------------|------------|----------------------|
| $r$ | `r` | Radius at pad row | 82–250 cm |
| $\mathrm{dr}$ | `xBin` | Discrete radial bin index (~1 cm spacing) | 0–170 |
| | `z2xBin` | Discrete drift coordinate (0=readout, 20=cathode) | 0–20 |
| | `y2xBin` | Sector coordinate index | 0–20 |
| $\mathrm{drift}$ | `drift` | Drift length along $z$ | $250 - \frac{\mathrm{z2xBin}}{20} \cdot r$ [cm] |
| $\mathrm{dsec}$ | `dsec` | Relative position to sector centre | $\frac{\mathrm{y2xBin} - 10}{20}$ |
| | `meanIDC` | Mean current density indicator | random $\sim \mathcal{N}(0, 1)$ |
| | `dX_true` | True distortion along $x$ | defined below |
| | `dX_meas` | Measured distortion (with noise) | defined below |
| | `weight` | Entry weight for fitting | 1.0 (uniform) |

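As a minimal sketch of how the derived columns follow from the bin indices, the snippet below applies the two formulas from the table with NumPy. The helper name `derive_coordinates` and the constant names are illustrative assumptions and are not taken from the generator itself.

```python
import numpy as np

DRIFT_MAX_CM = 250.0   # constant appearing in the drift formula of the table
N_Z2X_BINS = 20        # z2xBin runs from 0 to 20
N_Y2X_BINS = 20        # y2xBin runs from 0 to 20


def derive_coordinates(r, z2x_bin, y2x_bin):
    """Return (drift, dsec) from radius and bin indices (hypothetical helper)."""
    r = np.asarray(r, dtype=float)
    z2x_bin = np.asarray(z2x_bin, dtype=float)
    y2x_bin = np.asarray(y2x_bin, dtype=float)
    drift = DRIFT_MAX_CM - (z2x_bin / N_Z2X_BINS) * r   # 250 - z2xBin/20 * r
    dsec = (y2x_bin - 10.0) / N_Y2X_BINS                # (y2xBin - 10) / 20
    return drift, dsec


# Two example entries: (r=100, z2xBin=0, y2xBin=10) and (r=200, z2xBin=10, y2xBin=20)
drift, dsec = derive_coordinates(r=[100.0, 200.0], z2x_bin=[0, 10], y2x_bin=[10, 20])
```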
---

## 7.4.3 Distortion Model

The true distortion is modeled as a combination of linear and parabolic dependencies in the key physical variables:

$$
\begin{aligned}
dX_{\text{true}} &= dX_0 + a_{\text{drift}} \cdot \mathrm{drift} \cdot \big(a_{1,\text{dr}} \cdot \mathrm{dr} + a_{2,\text{dr}} \cdot \mathrm{dr}^2\big) \\
&\quad + a_{\text{drift-dsec}} \cdot \mathrm{drift} \cdot \big(a_{1,\text{dsec}} \cdot \mathrm{dsec} + a_{1,\text{dsec-dr}} \cdot \mathrm{dsec} \cdot \mathrm{dr}\big) \\
&\quad + a_{1,\text{IDC}} \cdot \mathrm{meanIDC}
\end{aligned}
$$

### Typical Parameter Values

These parameters are chosen to emulate realistic TPC distortion magnitudes and dependencies observed in ALICE O² production data:

| Parameter | Description | Example value |
|-----------|-------------|---------------|
| $dX_0$ | Global offset | 0.0 |
| $a_{\text{drift}}$ | Drift-scale factor | $1.0 \times 10^{-3}$ |
| $a_{1,\text{dr}}$, $a_{2,\text{dr}}$ | Linear / quadratic radial coefficients | $(1.5 \times 10^{-2}, -4 \times 10^{-5})$ |
| $a_{\text{drift-dsec}}$ | Drift-sector coupling | $5 \times 10^{-4}$ |
| $a_{1,\text{dsec}}$, $a_{1,\text{dsec-dr}}$ | Sector offset and radial coupling | $(0.8, 0.3)$ |
| $a_{1,\text{IDC}}$ | Mean-current sensitivity | $2 \times 10^{-3}$ |

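The model and the example values above can be combined into a short function. This is a sketch only: the dictionary keys and the name `dx_true` are assumptions chosen for illustration, and the actual generator may organise its parameters differently.

```python
import numpy as np

# Example values from the parameter table; key names are hypothetical
PARAMS = {
    "dX0": 0.0,
    "a_drift": 1.0e-3,
    "a1_dr": 1.5e-2,
    "a2_dr": -4.0e-5,
    "a_drift_dsec": 5.0e-4,
    "a1_dsec": 0.8,
    "a1_dsec_dr": 0.3,
    "a1_IDC": 2.0e-3,
}


def dx_true(drift, dr, dsec, mean_idc, p=PARAMS):
    """Evaluate the three terms of the distortion model and sum them."""
    drift, dr, dsec, mean_idc = (
        np.asarray(v, dtype=float) for v in (drift, dr, dsec, mean_idc)
    )
    radial = p["a_drift"] * drift * (p["a1_dr"] * dr + p["a2_dr"] * dr**2)
    sector = p["a_drift_dsec"] * drift * (
        p["a1_dsec"] * dsec + p["a1_dsec_dr"] * dsec * dr
    )
    current = p["a1_IDC"] * mean_idc
    return p["dX0"] + radial + sector + current


# Worked example: drift=100, dr=50, dsec=0.5, meanIDC=1
# radial = 0.065, sector = 0.395, current = 0.002, total = 0.462
value = float(dx_true(100.0, 50.0, 0.5, 1.0))
```

In this worked example the sector term dominates because the $\mathrm{dsec} \cdot \mathrm{dr}$ coupling grows with the radial bin index.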
### Measured Quantity

The measured quantity is obtained by adding zero-mean Gaussian noise with standard deviation $\sigma_{\text{meas}}$:

$$
dX_{\text{meas}} = dX_{\text{true}} + \mathcal{N}(0, \sigma_{\text{meas}}), \quad \sigma_{\text{meas}} \approx 0.02 \text{ cm}
$$

The noise level $\sigma_{\text{meas}} = 0.02$ cm is representative of the single-tracklet measurement resolution in the ALICE TPC.

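A minimal sketch of the noise step, assuming NumPy's `default_rng` as the random source. The fixed seed and the all-zero placeholder ground truth are illustrative choices, not part of the specification.

```python
import numpy as np

SIGMA_MEAS = 0.02  # cm, noise level quoted above

rng = np.random.default_rng(seed=42)     # fixed seed (assumption) for reproducibility
dx_true_vals = np.zeros(100_000)         # placeholder ground truth, zero everywhere
dx_meas = dx_true_vals + rng.normal(loc=0.0, scale=SIGMA_MEAS, size=dx_true_vals.shape)

# The spread of the added noise should be close to SIGMA_MEAS
noise = dx_meas - dx_true_vals
noise_mean = float(noise.mean())
noise_std = float(noise.std())
```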
### DataFrame Structure

The synthetic DataFrame includes:

```python
columns = [
    'xBin', 'y2xBin', 'z2xBin',      # Discrete bin indices (grouping)
    'r', 'dr', 'dsec', 'drift',      # Physical coordinates (predictors)
    'meanIDC',                       # Current density (predictor)
    'dX_true', 'dX_meas',            # Ground truth and measurement
    'weight'                         # Entry weights
]
```

Ground truth parameters are stored in `df.attrs['ground_truth_params']` for automated validation.

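The sketch below shows the `attrs` round trip on a toy DataFrame. The parameter keys are a hypothetical, truncated subset. Note that pandas documents `DataFrame.attrs` as experimental, so it is worth checking that the metadata survives whatever operations or file formats sit between generation and validation.

```python
import pandas as pd

# Illustrative subset of the ground truth; key names are assumptions
ground_truth = {"dX0": 0.0, "a_drift": 1.0e-3}

df = pd.DataFrame({"xBin": [0, 1], "dX_true": [0.0, 0.01]})
df.attrs["ground_truth_params"] = ground_truth

# A validation step can read the parameters back from the same object
recovered = df.attrs["ground_truth_params"]
```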
---

## 7.4.4 Evaluation Metrics

For each tested configuration of `window_spec` (neighborhood size) and kernel width (weighting), the following metrics are computed:

### Primary Metrics

1. **Fit coefficients** ($\hat{a}_i$) and their estimated uncertainties ($\sigma_{\hat{a}_i}$)
2. **Residuals**: $\Delta = dX_{\text{true}} - dX_{\text{pred}}$
3. **Normalized residuals**: $\Delta / \sigma_{\text{fit}}$
4. **RMS residuals**: $\text{RMS}(\Delta) = \sqrt{\langle \Delta^2 \rangle}$

### Derived Metrics

5. **Pull distribution**: $\text{Pull} = (dX_{\text{meas,mean}} - dX_{\text{true,mean}}) / \sigma_{\text{fit}}$
6. **Recovery precision**: Fraction of bins where $|\Delta| \leq 4\sigma_{\text{meas}}$
7. **Statistical error scaling**: $\sigma(\Delta)$ vs. effective sample size

### Diagnostic Outputs

- Scatter plots: $dX_{\text{true}}$ vs. $dX_{\text{pred}}$
- Residual distributions: $\Delta$ histograms
- RMS($\Delta$) vs. window size
- Normalized residual distributions (should be $\mathcal{N}(0,1)$)
- Evolution of coefficient uncertainties with neighborhood size

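The residual-based metrics can be sketched on toy arrays as follows. The prediction here is simulated by adding Gaussian error of known width to the truth, and `sigma_fit` is an assumed constant, so the numbers only illustrate the definitions rather than the behavior of the real fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
sigma_fit = 0.005                                         # assumed fit uncertainty (cm)
dx_true_vals = rng.uniform(-0.5, 0.5, n)                  # toy ground truth
dx_pred = dx_true_vals + rng.normal(0.0, sigma_fit, n)    # toy prediction

delta = dx_true_vals - dx_pred                  # residual, same sign convention as above
rms = float(np.sqrt(np.mean(delta**2)))         # RMS(delta) = sqrt(<delta^2>)
norm = delta / sigma_fit                        # normalized residuals
norm_mean = float(norm.mean())                  # should be close to 0
norm_std = float(norm.std())                    # should be close to 1
```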
---

## 7.4.5 Validation Rules and Alarm System

Quality validation uses a three-tier alarm system based on statistical significance levels. The alarm dictionary is computed using `df.eval()` for efficient vectorized checks.

### Alarm Criteria

| Check | Criterion | Status | Action |
|-------|-----------|--------|--------|
| **OK Range** | $\|\Delta\| \leq 4\sigma_{\text{meas}}$ | `OK` | No action |
| **Warning Range** | $4\sigma_{\text{meas}} < \|\Delta\| \leq 6\sigma_{\text{meas}}$ | `WARNING` | Monitor, report if >1% of bins |
| **Alarm Range** | $\|\Delta\| > 6\sigma_{\text{meas}}$ | `ALARM` | Investigation required |

### Additional Checks

| Check | Criterion | Purpose |
|-------|-----------|---------|
| Normalized residuals | $\mu \approx 0, \sigma \approx 1$ | Verify error estimation |
| RMS residuals | $\text{RMS}(\Delta) < 2 \times \sigma_{\text{expected}}$ | Check overall precision |
| Worst-case bins | Identify bins with $\max(\|\Delta\|)$ | Locate systematic issues |

When violations occur systematically, the alarm system emits warnings indicating possible:
- Local non-linearity in the distortion field
- Underestimated fit uncertainties
- Insufficient neighborhood size
- Edge effects or boundary artifacts

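For the worst-case-bin check, one possible approach is to take the row with the largest absolute residual and report its bin indices. The sketch below uses a toy result table; the `delta` column name matches the alarm code in this section, while the values are made up.

```python
import pandas as pd

# Toy per-bin residuals; bin column names follow the spec, values are illustrative
result = pd.DataFrame({
    "xBin":   [0, 0, 1, 1],
    "y2xBin": [0, 1, 0, 1],
    "z2xBin": [0, 0, 0, 0],
    "delta":  [0.01, -0.03, 0.002, 0.09],
})

worst_idx = result["delta"].abs().idxmax()      # row label of the largest |delta|
worst_bin = result.loc[worst_idx, ["xBin", "y2xBin", "z2xBin"]].to_dict()
worst_abs_delta = float(abs(result.loc[worst_idx, "delta"]))
```

If the worst bins cluster at the edges of the grid, that points to the boundary artifacts listed above rather than to a global problem with the fit.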
### Implementation

```python
# Example alarm check using df.eval().
# Assumes df has a 'delta' column holding the residual for each bin.
sigma_meas = 0.02  # cm, measurement noise from § 7.4.3

ok_mask = df.eval('abs(delta) <= 4 * @sigma_meas')
warning_mask = df.eval('(abs(delta) > 4 * @sigma_meas) & (abs(delta) <= 6 * @sigma_meas)')
alarm_mask = df.eval('abs(delta) > 6 * @sigma_meas')

# Count and fraction of bins in each tier
alarms = {
    'residuals_ok': {'count': int(ok_mask.sum()), 'fraction': float(ok_mask.mean())},
    'residuals_warning': {'count': int(warning_mask.sum()), 'fraction': float(warning_mask.mean())},
    'residuals_alarm': {'count': int(alarm_mask.sum()), 'fraction': float(alarm_mask.mean())}
}
```

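The unit test in § 7.4.7 reads a single status from `alarms['summary']['status']`, which the dictionary above does not define. The sketch below shows one way the three masks could be collapsed into such a status. The rule (any alarm bin gives `ALARM`, more than 1% warning bins gives `WARNING`) and the helper name `summarize` are assumptions inferred from the pass criteria, not the specification's definition. Plain Series comparisons are used instead of `df.eval()` to keep the example self-contained.

```python
import pandas as pd

sigma_meas = 0.02  # cm
df = pd.DataFrame({"delta": [0.01, -0.05, 0.07, 0.10, -0.13]})

abs_delta = df["delta"].abs()
warning_mask = (abs_delta > 4 * sigma_meas) & (abs_delta <= 6 * sigma_meas)
alarm_mask = abs_delta > 6 * sigma_meas


def summarize(alarm_mask, warning_mask, max_warning_fraction=0.01):
    """Collapse the masks into one status string (assumed rule, see lead-in)."""
    if alarm_mask.any():
        status = "ALARM"
    elif warning_mask.mean() > max_warning_fraction:
        status = "WARNING"
    else:
        status = "OK"
    return {"status": status}


# One entry exceeds 6 sigma (|-0.13| > 0.12), so the toy data ends up in ALARM
summary = summarize(alarm_mask, warning_mask)
```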
---

## 7.4.6 Test Cases and Requirements

### Minimal Test (Unit Test)

**Grid size:** 50 × 10 × 10 bins
**Entries per bin:** 50
**Window spec:** `{'xBin': 3, 'y2xBin': 2, 'z2xBin': 2}`
**Min entries:** 20
**Expected runtime:** <10 seconds

**Pass criteria:**
- ✅ No bins in ALARM range
- ✅ <1% bins in WARNING range
- ✅ Normalized residuals: $|\mu| < 0.1$, $|1 - \sigma| < 0.2$
- ✅ RMS residuals: $< 2\times$ expected

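The four pass criteria translate directly into boolean checks. The helper below is a hypothetical sketch: its name, argument names, and the example numbers are assumptions, while the thresholds are the ones listed above.

```python
def unit_test_passes(frac_warning, frac_alarm, norm_mean, norm_std, rms, rms_expected):
    """Apply the four unit-test pass criteria and return (passed, per-check results)."""
    checks = {
        "no_alarm_bins": frac_alarm == 0.0,
        "warning_below_1pct": frac_warning < 0.01,
        "normalized_residuals": abs(norm_mean) < 0.1 and abs(1.0 - norm_std) < 0.2,
        "rms_within_2x": rms < 2.0 * rms_expected,
    }
    return all(checks.values()), checks


# Illustrative inputs that satisfy every criterion
passed, checks = unit_test_passes(
    frac_warning=0.004, frac_alarm=0.0,
    norm_mean=0.03, norm_std=1.08,
    rms=0.006, rms_expected=0.005,
)
```

Returning the per-check dictionary alongside the overall flag makes it easier to see which criterion failed when a test run does not pass.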
### Full Benchmark Test

**Grid size:** 170 × 20 × 20 bins (production scale)
**Entries per bin:** 200
**Window spec:** Multiple configurations
**Expected runtime:** <5 minutes (numpy backend)

**Pass criteria:**
- ✅ All unit test criteria
- ✅ Parameter recovery within $1\sigma$ accuracy
- ✅ Scaling of errors with effective sample size
- ✅ Performance: >10k rows/sec

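The error-scaling criterion can be illustrated with a toy experiment, assuming the usual $\sigma / \sqrt{N_{\text{eff}}}$ behavior for independent Gaussian noise. The sketch estimates the spread of a simple mean rather than of fitted regression coefficients, so it shows the expected trend only; the function name and trial count are illustrative.

```python
import numpy as np

SIGMA_MEAS = 0.02  # cm
rng = np.random.default_rng(1)


def spread_of_mean(n_eff, n_trials=2000):
    """Empirical standard deviation of the mean of n_eff noisy measurements."""
    samples = rng.normal(0.0, SIGMA_MEAS, size=(n_trials, n_eff))
    return float(samples.mean(axis=1).std())


sizes = [25, 100, 400]
observed = [spread_of_mean(n) for n in sizes]
expected = [SIGMA_MEAS / np.sqrt(n) for n in sizes]   # sigma / sqrt(N_eff)

# Ratio observed / expected should stay near 1 for every sample size
ratios = [o / e for o, e in zip(observed, expected)]
```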
---

## 7.4.7 Integration with Test Suite

### File Structure

```
dfextensions/groupby_regression/
├── synthetic_tpc_distortion.py          # Data generator
├── tests/
│   ├── test_tpc_distortion_recovery.py  # Unit test (alarm-based)
│   ├── test_sliding_window_*.py         # Other unit tests
│   └── benchmark_tpc_distortion.py      # Full benchmark
└── validation/
    └── alarm_system.py                  # Reusable alarm utilities
```

### Usage in Unit Tests

```python
from synthetic_tpc_distortion import make_synthetic_tpc_distortion
from dfextensions.groupby_regression import make_sliding_window_fit
# validate_with_alarms must also be imported; its module is not stated here
# (validation/alarm_system.py in the tree above is a plausible location).

def test_distortion_recovery():
    # Generate data
    df = make_synthetic_tpc_distortion(...)

    # Run fit
    result = make_sliding_window_fit(df, ...)

    # Validate with alarms
    alarms = validate_with_alarms(result, df)

    # Assert
    assert alarms['summary']['status'] in ['OK', 'WARNING']
```

### Benchmark Usage

```python
import time

# Benchmark both speed and correctness
df = make_synthetic_tpc_distortion(n_bins_dr=170, entries_per_bin=200)

# perf_counter() is a monotonic clock, better suited to timing than time.time()
start = time.perf_counter()
result = make_sliding_window_fit(df, ...)
elapsed = time.perf_counter() - start

# Check speed
assert len(df) / elapsed > 10000  # rows/sec

# Check correctness
alarms = validate_with_alarms(result, df)
assert alarms['summary']['status'] == 'OK'
```

---

## 7.4.8 Outcome and Deliverables

The synthetic-data tests will:

1. **Confirm recovery** of known coefficients within $1\sigma$ accuracy
2. **Demonstrate scaling** of parameter errors with effective sample size
3. **Provide benchmark plots** for documentation and calibration validation
4. **Supply reproducible ground-truth** reference files (`synthetic_tpc_distortion.parquet`) for continuous-integration tests
5. **Validate alarm system** for production QA gates

### Expected Test Results

| Metric | Expected Value | Unit Test | Benchmark |
|--------|---------------|-----------|-----------|
| Bins in OK range | >99% | | |
| Bins in WARNING range | <1% | | |
| Bins in ALARM range | 0% | | |
| RMS residuals | <2× expected | | |
| Normalized residuals | $\mu=0 \pm 0.1$, $\sigma=1 \pm 0.2$ | | |
| Performance | >10k rows/sec | | |

---

## 7.4.9 Future Extensions (M7.2+)

- **Weighted fits**: Test with non-uniform entry weights
- **Boundary conditions**: Test edge/corner bins explicitly
- **Missing data**: Test with sparse/missing bins
- **Non-Gaussian noise**: Test robustness to outliers
- **Multi-target fits**: Test multiple distortion components simultaneously
- **Numba acceleration**: Benchmark speed improvements

---

**Status:** ✅ Specification approved, implementation ready
**Implementation files:** `synthetic_tpc_distortion.py`, `test_tpc_distortion_recovery.py`
**Integration:** Phase M7.1 unit tests and benchmark suite

