Commit eef2087 (parent c21812d), author miranov25
docs: apply Section 1 corrections (scope, measurements)

2 files changed: +323 −0 lines
UTILS/dfextensions/groupby_regression/docs/Q&A.md (whitespace-only changes)
# Sliding Window GroupBy Regression - Specification Document

**Authors:** Marian Ivanov (GSI/ALICE), Claude (Anthropic)
**Reviewers:** GPT-4, Gemini
**Date:** 2025-10-27
**Version:** 0.1 (Draft)

---

## 1. Motivation
### 1.1 The Core Challenge: Probability Density Function Estimation in High-Dimensional Spaces

In high-energy physics and detector calibration, we face a fundamental challenge: **estimating probability density functions (PDFs) and their statistical properties** (quantiles, moments, correlations) from data distributed across high-dimensional parameter spaces. This is not merely a function fitting problem: we must characterize the full statistical behavior of observables as they vary across multiple dimensions simultaneously.

**Note:** While examples in this specification are drawn from ALICE tracking and calibration (including TPC distortions, tracking performance, and combined detector calibration), the underlying statistical challenge of estimating local PDFs in high-dimensional sparse data is generic to many scientific domains, including medical imaging, climate modeling, and financial risk analysis.

**The statistical estimation problem:** Given measurements distributed in a *d*-dimensional binned space, we need to extract reliable statistical estimators (mean, median, RMS (Root Mean Square), MAD (Median Absolute Deviation), quantiles, higher moments) for each bin. However, as dimensionality increases, the **curse of dimensionality** manifests in two critical ways:

1. **Exponential sparsity:** With *n* bins per dimension, we face *n^d* total bins. Even with billions of events (e.g., ALICE collects 5×10^6 tracks/second × 10-15 hours = 180-270 billion tracks/day), many bins remain empty or contain insufficient statistics for reliable PDF characterization.

2. **Unbalanced distributions:** Physical observables often follow highly skewed distributions (exponential mass spectra, power-law transverse momentum), making naive sampling wasteful and leaving critical regions of parameter space under-represented.
**Example from ALICE TPC calibration:**

```
Spatial distortion map binning:
- 3D spatial bins: 152 (x) × 20 (y/x) × 28 (z/x) × 18 (sectors) = ~1.5M bins
- Time evolution: × 90 time slices = 135M total bins
- Target observables: dX, dY, dZ corrections (vector field)
- Even with 270 billion tracks/day, average statistics per bin: ~2000 events
- After quality cuts and balanced sampling: O(10-100) events per bin
```

**Example from performance parameterization:**

```
Track segment resolution as a function of (pT, η, φ, occupancy, time):
- 5D parameter space: 50 × 40 × 36 × 20 × 100 = 144M bins
- Measurements: TPC-ITS track difference (bias and resolution),
  TPC-vertex (bias and resolution)
- Common approach: TPC-vertex and angular matching for QA parameterization
- Similar challenges: V0 reconstruction, PID (Particle IDentification) resolution
- Used for MC-to-data remapping and QA (Quality Assurance) variable calibration
```

For bins with <10 events, standard statistical estimators (mean, RMS) have large uncertainties, making robust PDF characterization impossible without additional assumptions.
**Figure 1: Sparse 3D Spatial Bins with ±1 Neighborhood Aggregation**

```
[Placeholder for figure showing:
- 3D grid of spatial bins (xBin × y2xBin × z2xBin)
- Center bin highlighted with sparse data (<10 events)
- ±1 neighbors in each dimension (3×3×3 = 27 bins total)
- Aggregated data providing sufficient statistics
- Visual representation of local smoothness assumption]
```
*Figure to be added: Illustration of how the sliding window aggregates sparse neighboring bins to enable reliable PDF estimation.*
### 1.2 The Local Smoothness Assumption and Functional Approximation

To overcome statistical sparsity, we must incorporate **prior knowledge** about the physical behavior of our observables. The fundamental assumption is **local smoothness**: physical quantities vary continuously in parameter space, exhibiting correlations between neighboring regions.

This assumption enables **functional approximation** through sliding window aggregation:

**Approach 1: Local constant approximation**
Aggregate statistics from neighboring bins, assuming the PDF properties are approximately constant within a local neighborhood:
$$\mu(\mathbf{x}_0) \approx \text{mean}\{y_i \mid \mathbf{x}_i \in \text{neighborhood}(\mathbf{x}_0)\}$$

**Approach 2: Weighted smoothing**
Assign distance-based weights to neighbors, giving higher influence to bins closer to the center:
$$\mu(\mathbf{x}_0) \approx \frac{\sum_i w_i(\|\mathbf{x}_i - \mathbf{x}_0\|) \cdot y_i}{\sum_i w_i(\|\mathbf{x}_i - \mathbf{x}_0\|)}$$
where common weight functions include the Gaussian $w(d) = \exp(-d^2/\sigma^2)$ and the inverse distance $w(d) = 1/(1+d)$.
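As an illustration, the weighted-smoothing estimator above reduces to a few lines of numpy (a minimal sketch; the function name and toy values are illustrative, not part of the framework API):

```python
import numpy as np

def weighted_mean(y, dist, sigma=1.0):
    """Distance-weighted mean with Gaussian weights w(d) = exp(-d^2 / sigma^2)."""
    w = np.exp(-(dist / sigma) ** 2)
    return np.sum(w * y) / np.sum(w)

# Neighbors at distances 0, 1, 2 bins from the window center:
y = np.array([10.0, 12.0, 20.0])
d = np.array([0.0, 1.0, 2.0])
mu = weighted_mean(y, d, sigma=1.0)  # pulled toward the center bin's value
```

With all weights equal the estimator degenerates to the local constant approximation of Approach 1; the kernel width σ controls how quickly the influence of distant bins dies off.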
**Approach 3: Local kernel regression**
Fit parametric functions (linear, polynomial) within the neighborhood, capturing local trends:
$$y(\mathbf{x}) \approx \beta_0 + \beta_1 \cdot (\mathbf{x} - \mathbf{x}_0) + \ldots \quad \text{within neighborhood}(\mathbf{x}_0)$$
where the $\beta$ coefficients are fit using weighted least squares over the local window.
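A minimal numpy sketch of such a local fit, using Gaussian distance weights around a center `x0` (names are illustrative):

```python
import numpy as np

def local_linear_fit(x, y, x0, sigma=1.0):
    """Fit y ≈ beta0 + beta1*(x - x0) by weighted least squares,
    with Gaussian distance weights centered on x0."""
    w = np.exp(-((x - x0) / sigma) ** 2)
    sw = np.sqrt(w)                       # WLS = OLS on sqrt(w)-scaled rows
    X = np.column_stack([np.ones_like(x), x - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta  # beta[0]: local value at x0, beta[1]: local slope

x = np.linspace(-2.0, 2.0, 41)
y = 3.0 + 0.5 * x                         # exactly linear toy data
b0, b1 = local_linear_fit(x, y, x0=0.0)   # recovers intercept 3.0, slope 0.5
```

Because the design matrix is centered on $\mathbf{x}_0$, `beta[0]` is directly the smoothed estimate at the window center, which is the quantity stored per bin.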
This sliding window methodology transforms the problem from:
- **"Estimate the PDF at each isolated bin"** (fails in sparse regions)
to:
- **"Estimate a smooth PDF field using local information"** (succeeds under local smoothness)

### 1.3 Beyond Simple Smoothing: PDF Estimation and Model Factorization

The sliding window approach serves a deeper purpose in the **RootInteractive** framework [[Ivanov et al. 2024, arXiv:2403.19330]](https://arxiv.org/abs/2403.19330): enabling iterative, multidimensional PDF estimation and analytical model validation.

#### 1.3.1 Balanced Semi-Stratified Sampling

To handle massive ALICE data volumes (>100 TB/day) while maintaining statistical power across parameter space:

1. **Original data:** Highly unbalanced (exponential/power-law distributions in mass, pT, PID)
2. **Balanced sampling:** Pre-sample using **"balanced semi-stratified sampling"** (density-aware resampling that flattens highly imbalanced distributions such as pT or particle identification, enabling uniform coverage of the full parameter space)
3. **Volume reduction:** 10× to 10^4× reduction (typical: 10^2-10^3) depending on use case
   - Distortion maps: ~10× reduction (need high spatial statistics)
   - Performance parameterization: ~10^3× reduction (broader phase space coverage)
4. **Store weights:** Enable post-hoc reweighting to the original distribution

**Example:** For track resolution studies across the 5D phase space (pT, η, occupancy, time, PID), sampling from 10^11 tracks down to 10^8 events provides sufficient statistics per bin while enabling interactive analysis with a <4GB memory footprint.

**Note on sampling schemes:** For distortion map creation, uniform spatial sampling is under development; current production primarily uses time-based balanced sampling. For performance studies and particle identification, balanced sampling across kinematic variables is standard practice.

**Result:** Process 0.01-10% of the data with full statistical coverage, enabling iterative analysis and the rapid feedback cycles essential for calibration workflows.
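The per-bin capping and weight bookkeeping can be sketched in pandas (a toy illustration; the bin width, cap `n_max`, and column names are assumptions, and the production scheme is more elaborate):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Steeply falling, pT-like spectrum: the kind of unbalanced input described above.
df = pd.DataFrame({"pt": rng.exponential(scale=1.0, size=100_000)})
df["ptBin"] = np.minimum((df["pt"] / 0.5).astype(int), 9)

n_max = 500  # flatten: keep at most n_max entries per bin
parts = []
for _, group in df.groupby("ptBin"):
    kept = group.sample(n=min(len(group), n_max), random_state=0).copy()
    kept["weight"] = len(group) / len(kept)  # stored for post-hoc reweighting
    parts.append(kept)
balanced = pd.concat(parts, ignore_index=True)
```

Summing `weight` over the balanced sample recovers the original event count, so downstream estimators can be reweighted back to the physical distribution.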
#### 1.3.2 Functional Decomposition and Factorization

Real-world calibrations rarely have simple analytical models for the full multidimensional behavior. However, we often have models for **normalized deltas** and **factorized components**.

**Example: TPC distortion modeling**
```
Full model (unknown): d(x, y, z, t, φ, rate, ...)

Factorization approach:
1. Extract spatial base map: d₀(x, y, z) [from sliding window fits]
2. Model temporal delta: δd(t) = A·exp(-t/τ₁) + B·exp(-t/τ₂) [analytical]
   - Typical temporal resolution: 5-10 minute averaged maps (90 samples/day)
   - For fast fluctuations: O(1s) resolution requires coarser spatial binning
3. Exploit symmetry: φ-independence for space charge (electric charge accumulation from ionization) effects
4. Rate dependence: Normalize by IDC (Integrated Digital Currents, a proxy for detector occupancy and space charge density)

Composed model: d(x,y,z,t,φ,rate) = d₀(x,y,z) · δd(t) · f(IDC) + symmetry checks
```

**Sliding window role:** Extract the non-parametric base functions (d₀) from sparse data, then validate factorization assumptions and fit parametric delta models on the normalized residuals.

**Note on RootInteractive:** The RootInteractive tool [[Ivanov et al. 2024, arXiv:2403.19330]](https://arxiv.org/abs/2403.19330) provides interactive visualization and client-side analysis of the extracted aggregated data. Sliding window regression is the *server-side* preprocessing step that prepares binned statistics and fit parameters for subsequent interactive exploration and model validation.
#### 1.3.3 Symmetries, Invariants, and Alarm Systems

After normalization and factorization, physical symmetries should be restored:
- **Temporal invariance:** Corrections stable across runs (after rate normalization)
- **Spatial symmetry:** φ-independence for space charge effects
- **Magnetic field symmetry:** Consistent behavior for ±B fields

**Alarm logic:** If `(data - model) / σ > N` for an expected symmetry, then either:
- Data quality issue → flag for investigation
- Model inadequacy → symmetry-breaking effect discovered
- Calibration drift → update correction maps

**Sliding window enables:** Computing the local statistics needed for σ estimation and symmetry validation across all dimensions.
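A toy version of this alarm check (the threshold, arrays, and interpretation in the comments are illustrative assumptions):

```python
import numpy as np

def symmetry_alarms(data, model, sigma, n_sigma=5.0):
    """Flag bins whose normalized residual |data - model| / sigma exceeds n_sigma."""
    pull = (data - model) / sigma
    return np.abs(pull) > n_sigma

data  = np.array([1.00, 1.02, 1.60])   # e.g. per-bin ratio of +B to -B corrections
model = np.array([1.00, 1.00, 1.00])   # symmetry expectation: ratio == 1
sigma = np.array([0.05, 0.05, 0.05])   # local spread from the sliding window
flags = symmetry_alarms(data, model, sigma)   # only the third bin trips the alarm
```

The σ entering the pull is exactly the local spread that the sliding window estimates per bin, which is why reliable PDF characterization is a prerequisite for the alarm system.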
### 1.4 The Software Engineering Challenge: A Generic Solution

While the statistical methodology is well-established (kernel regression, local polynomial smoothing), applying it to real-world detector calibration requires:

**Dimensional flexibility:**
- Integer bin indices (xBin, y2xBin, z2xBin)
- Float coordinates (time, momentum, angles)
- Mixed types in the same analysis
- Dimensions ranging from 3D to 6D+ (typical use cases)
- **Note:** Actual dimensionality and bin counts depend on use case and memory constraints (e.g., Grid central productions have memory limits affecting maximum binning)

**Boundary conditions:**
- Spatial boundaries: mirror/truncate/extrapolate
- Periodic dimensions (φ angles): wrap-around
- Physical boundaries: zero padding
- Per-dimension configuration
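Such per-dimension rules can be expressed as a small index-mapping helper (a sketch; the mode names follow the list above, the function itself is hypothetical):

```python
def map_neighbor(idx, n_bins, mode):
    """Map a raw neighbor index to a valid bin index for one dimension.
    Returns None when 'truncate' drops an out-of-range neighbor."""
    if 0 <= idx < n_bins:
        return idx                      # inside the dimension: no mapping needed
    if mode == "mirror":                # reflect across the boundary
        return -idx - 1 if idx < 0 else 2 * n_bins - idx - 1
    if mode == "periodic":              # wrap around, e.g. phi sectors
        return idx % n_bins
    if mode == "truncate":              # simply drop the neighbor
        return None
    raise ValueError(f"unknown boundary mode: {mode}")
```

For an 18-sector φ dimension, `map_neighbor(-1, 18, "periodic")` wraps to sector 17, while a mirror boundary reflects index -1 back onto bin 0.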
**Integration with existing tools:**
- Must work with pandas DataFrames (standard scientific Python)
- Leverage existing groupby-regression engines (v4 with Numba JIT)
- Support pre-aggregated data from batch jobs
- Enable client-side interactive analysis (RootInteractive dashboards)

**Performance requirements:**
- Process 405k rows × 5 maps with a ±1 window: <1 minute (typical TPC spatial case)
- Scale to 7M rows × 90 maps: <30 minutes (full temporal evolution)
- Memory efficient: avoid 27-125× expansion where possible; <4GB per session target
- Parallel execution across cores
- **Note:** Specific targets depend on use case, hardware, and dataset characteristics

**Reusability imperative:**
- One implementation for TPC distortions, particle ID, mass spectra, ...
- User-defined fit functions (linear, polynomial, non-linear, simple statistics)
- Configurable weighting schemes
- Documented, tested, maintainable

**Translating theory into practice:** Turning these statistical concepts into working software requires a framework that maintains dimensional flexibility while remaining computationally efficient and memory-bounded (<4GB per analysis session). Past C++ and Python implementations demonstrated the value of this approach but had limitations in extensibility and performance (see Section 5 for a detailed history). This specification defines requirements for a production-ready, general-purpose solution that addresses these limitations.
### 1.5 Scope and Goals of This Specification

This document defines a **Sliding Window GroupBy Regression** framework that:

1. **Supports arbitrary dimensionality** (3D-6D typical, extensible to higher)
2. **Handles mixed data types** (integer bins, float coordinates, categorical groups)
3. **Offers flexible window configuration** (per-dimension sizes, asymmetric, distance-based)
4. **Applies systematic boundary handling** (mirror, truncate, periodic, per-dimension rules)
5. **Accepts user-defined aggregations** (linear fits, statistics, custom functions)
6. **Performs at scale** (millions of rows, thousands of bins, <30 min runtime)
7. **Integrates with RootInteractive** (pandas I/O, client-side visualization)
8. **Provides a production-quality implementation** (tested, documented, maintainable)

**Primary use cases:**
- **ALICE TPC distortion maps:** Spatial corrections with temporal evolution
- **ALICE tracking performance:** Combined detector calibration and tracking quality
  - Track segment resolution: TPC-ITS, TPC-vertex matching (bias and resolution)
  - Angular matching and vertex constraints
  - V0 reconstruction resolution and biases
  - PID (Particle Identification) resolution and systematic uncertainties
  - Efficiency maps for various reconstruction algorithms
  - QA variables (χ², cluster counts, dE/dx) across parameter space
  - MC-to-data remapping corrections
  - **Future development:** Combined tracking performance parameterization and ALICE calibration integration
- **Particle physics:** Invariant mass spectra in multi-dimensional kinematic bins
- **Generic:** Any binned analysis requiring PDF estimation in high dimensions (3D-6D+)

**Success criteria:**
- Replaces existing C++ implementations with a cleaner API
- Enables new analyses previously infeasible (6D+ spaces)
- Reduces analysis time from hours/days to minutes
- Becomes a standard tool in the ALICE calibration workflow

**Intended audience:**
- ALICE tracking and calibration experts (primary: TPC, ITS, tracking performance)
- Particle physics data analysts (secondary)
- Scientific Python community (general reusability)

**Next steps:** Section 2 describes the representative datasets and validation scenarios that illustrate these concepts with concrete examples from ALICE TPC calibration and performance studies.
---

## 2. Example Data

[To be written in next iteration]

---

## 3. Example Use Cases

[To be written in next iteration]

---

## 4. Goal - Functional Representation

[To be written in next iteration]

---

## 5. Past Implementations
### 5.1 C++ Implementation (2015-2024)

**Overview:** The original sliding window implementation was developed in C++ within the ALICE AliRoot/O2 framework, using N-dimensional histograms as input structures.

**Key features:**
- Multi-dimensional histogram-based approach using ROOT's THnSparse
- Efficient kernel lookups via histogram bin navigation
- Support for various boundary conditions (mirror, truncate, periodic)
- Integrated with the ALICE offline analysis framework

**Strengths:**
- Proven in production for TPC calibration (distortion maps, 2015-2024)
- Computationally efficient for large datasets
- Well-tested and reliable

**Limitations:**
- Rigid configuration: adding new fit functions required C++ code changes
- Complex API: required deep knowledge of ROOT histogram internals
- Limited extensibility: difficult to prototype new methods
- Tight coupling to ALICE-specific data structures
- Challenging for non-experts to use or modify
### 5.2 Python Implementation v1 (2024)

**Overview:** Initial Python prototype using DataFrame expansion to aggregate neighboring bins.

**Approach:**
```python
# For a ±1 window in 3D:
# replicate each row to all neighbor combinations,
# (xBin±1) × (y2xBin±1) × (z2xBin±1) = 3³ = 27 copies per row,
# then use a standard pandas groupby on the expanded DataFrame.
```
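For concreteness, the expansion trick looks like this in one dimension (a runnable toy; the 3D case crosses three such offset sets for 27 copies per row):

```python
import pandas as pd

df = pd.DataFrame({"xBin": [0, 1, 2], "y": [1.0, 2.0, 4.0]})

# Replicate every row once per window offset, shifting the group key so each
# row contributes to its own bin and to both neighbors.
expanded = pd.concat(
    [df.assign(xBin=df["xBin"] + off) for off in (-1, 0, 1)],
    ignore_index=True,
)

# A standard groupby now aggregates each bin together with its ±1 neighbors.
window_means = expanded.groupby("xBin")["y"].mean()
```

The memory cost is visible immediately: `expanded` holds 3× the rows here, and 27× in the 3D case, which is exactly the limitation described below.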
**Strengths:**
- Simple conceptual model
- Leverages existing pandas/numpy ecosystem
- Easy to prototype and modify
- Works with standard groupby-regression tools (v4 engine)

**Limitations:**
- **Memory explosion:** 27× expansion for ±1 window, 125× for ±2 window
- **Performance:** Slow for large datasets due to data replication overhead
- **Scalability:** Infeasible for ±3 windows (343×) or high-dimensional spaces
- Not production-ready for ALICE scale (7M rows × 90 maps × 27 = 17B rows)
### 5.3 Lessons Learned

**From the C++ experience:**
- Kernel-based approaches are computationally efficient
- N-dimensional histogram indexing provides fast neighbor lookups
- Flexibility for user-defined fit functions is essential
- API complexity limits adoption and experimentation

**From the Python v1 experience:**
- A DataFrame-native approach integrates well with the scientific Python ecosystem
- The expansion method is intuitive but not scalable
- A balance between simplicity and performance is needed

**Requirements for this specification:**
- Combine C++ performance with Python flexibility
- Efficient aggregation without full DataFrame expansion
- User-definable fit functions and weighting schemes
- Clean API accessible to non-experts
- Production-scale performance (<4GB memory, <30 min runtime)
---

## 6. Specifications - Requirements

[To be written in next iteration]

---

## References

- Ivanov, M., Ivanov, M., Eulisse, G. (2024). "RootInteractive tool for multidimensional statistical analysis, machine learning and analytical model validation." arXiv:2403.19330v1 [hep-ex]
- [ALICE TPC references to be added]
- [Statistical smoothing references to be added]

---

**End of Section 1 Draft**
