@@ -220,7 +220,239 @@ This document defines a **Sliding Window GroupBy Regression** framework that:
220220
221221## 2. Example Data
222222
223- [ To be written in next iteration]
223+ This section describes representative datasets used to motivate and validate the sliding window regression framework. These examples span ALICE tracking, calibration, and performance studies, illustrating the range of dimensionalities, bin structures, and statistical challenges the framework must address.
224+
225+ ### 2.1 Dataset Overview
226+
227+ Three primary dataset categories demonstrate the framework's applicability:
228+
229+ 1. **TPC Spatial Distortion Maps** (current test data)
230+ 2. **TPC Temporal Evolution** (production scale)
231+ 3. **Tracking Performance Parameterization** (multi-dimensional)
232+
233+ Each dataset exhibits the characteristic challenges of high-dimensional sparse data requiring local aggregation through sliding window techniques.
234+
235+ ---
236+
237+ ### 2.2 Dataset A: TPC Spatial Distortion Maps (Test Data)
238+
239+ **Purpose:** Validate spatial sliding window aggregation with realistic detector calibration data.
240+
241+ **Data source:** ALICE TPC sector 3 distortion corrections from 5 time slices, an example for distortion vs. integrated digital current (IDC) calibration.
242+
243+ #### 2.2.1 Structure
244+
245+ **File:** `tpc_realistic_test.parquet` (14 MB Parquet file for 1 sector: 5 maps, i.e. time slices, for distortion vs. current fits)
246+
247+ **Dimensions:**
248+ ```
249+ Rows: 405,423
250+ Columns: O(20)
251+
252+ Spatial binning:
253+ - xBin: 152 bins [0 to 151] (radial direction in TPC)
254+ - y2xBin: 20 bins [0 to 19] (pad-row normalized y)
255+ - z2xBin: 28 bins [0 to 27] (drift-direction normalized z)
256+ - bsec: 1 value [3] (sector 3 only in test data)
257+
258+
259+ Temporal structure:
260+ - run: 1 unique run
261+ - medianTimeMS: 5 unique time points
262+ - firstTFTime: 5 time slices
263+ ```
264+
265+ #### 2.2.2 Target Variables (Fit Targets)
266+
267+ **Distortion corrections (primary):**
268+ - `dX`: Radial distortion [-4.4 to +5.0 cm]
269+ - `dY`: Pad-row direction distortion [-1.4 to +2.0 cm]
270+ - `dZ`: Drift direction distortion [-2.0 to +3.6 cm]
271+
272+ **Derived quantities:**
273+ - `EXYCorr`: Combined XY correction magnitude [-0.84 to +0.89]
274+ - `D3`: 3D distortion magnitude [0.23 to 4.85 cm]
275+
276+ All target variables are fully populated (405,423 non-null values).
277+
278+ #### 2.2.3 Features (Fit Predictors)
279+
280+ **Detector state:**
281+ - `meanIDC`: Mean integrated digital current (IDC) [mean: 1.89, median: 1.97]
282+ - `medianIDC`: Median IDC [mean: 1.89, median: 1.97]
283+ - `deltaIDC`: IDC variation with respect to the fill average
284+ - `meanCTP`, `medianCTP`: QA variables, an independent current proxy
285+
286+
287+ **Statistics:**
288+ - `entries`: Entries per bin [median: 2840]
289+ - `weight`: Statistical weight
290+
291+ **Quality:**
292+ - `flags`: Quality flags (value: 7 in test data)
293+
294+
295+ **Memory footprint** (using per-sector splitting):
296+ - In-memory (pandas): 45.6 MB
297+ - Per-row overhead: 113 bytes
298+
299+ #### 2.2.5 Use Case
300+
301+ This dataset validates:
302+ - **Spatial sliding window** aggregation (±1 in xBin, y2xBin, z2xBin)
303+ - **Integer bin indexing** with boundary handling
304+ - **Linear regression** within sliding windows (dX, dY, dZ ~ meanIDC)
305+ - **Multi-target fitting** (simultaneous fits for dX, dY, dZ)
306+
307+
308+ **Expected workflow:**
309+ 1. For each center bin (xBin, y2xBin, z2xBin)
310+ 2. Aggregate data from ±1 neighbors (3×3×3 = 27 bins)
311+ 3. Fit linear model: `dX ~ meanIDC` (and similarly for dY, dZ)
312+ 4. Extract coefficients, uncertainties, and diagnostics per center bin
313+ 5. Result: Smoothed distortion field with improved statistics
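
The workflow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the helper name `sliding_window_fit`, the brute-force loop over center bins, the truncation at the grid edges, and the use of `numpy.polyfit` are all choices made for this example. Only the column names (`xBin`, `y2xBin`, `z2xBin`, `meanIDC`, `dX`) come from Dataset A. Uncertainties and diagnostics (step 4) are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd


def sliding_window_fit(df, bin_cols, x_col, y_col, half_width=1, min_points=3):
    """Fit y ~ x in a +/-half_width window around every occupied center bin.

    Hypothetical helper for illustration; windows are truncated at grid edges.
    """
    bins = df[bin_cols].to_numpy()
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)
    rows = []
    for center in np.unique(bins, axis=0):
        # Select rows whose bin index is within half_width in every dimension.
        mask = np.all(np.abs(bins - center) <= half_width, axis=1)
        n = int(mask.sum())
        record = dict(zip(bin_cols, center.tolist()), n_points=n,
                      slope=np.nan, intercept=np.nan)
        # A straight-line fit needs enough points and a non-zero spread in x.
        if n >= min_points and np.ptp(x[mask]) > 0:
            slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
            record.update(slope=slope, intercept=intercept)
        rows.append(record)
    return pd.DataFrame(rows)


# Toy data (not real detector data): 3x3x3 grid, 5 time slices per bin,
# with an exact linear relation dX = 2 * meanIDC + 1.
grid = np.array(np.meshgrid(range(3), range(3), range(3), indexing="ij")).reshape(3, -1).T
toy = pd.DataFrame(np.repeat(grid, 5, axis=0), columns=["xBin", "y2xBin", "z2xBin"])
toy["meanIDC"] = np.tile(np.linspace(1.0, 2.0, 5), len(grid))
toy["dX"] = 2.0 * toy["meanIDC"] + 1.0

fit = sliding_window_fit(toy, ["xBin", "y2xBin", "z2xBin"], "meanIDC", "dX")
```

In this toy grid the central bin collects all 27 neighbors, while a corner bin sees only 8 of them, which already shows why boundary bins end up with fewer points. The loop scales poorly with the number of bins and is meant only to make the aggregation step explicit.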
314+
315+ ---
316+
317+
318+ ### 2.4 Dataset C: Tracking Performance Parameterization
319+
320+ **Purpose:** Multi-dimensional performance metrics requiring combined spatial, kinematic, and temporal aggregation.
321+
322+ #### 2.4.1 Track Segment Resolution
323+ To provide a comprehensive characterization of tracking performance,
324+ we analyze track segment residuals and QA variables as functions of multiple kinematic and detector conditions.
325+ Variables are usually transformed, e.g. instead of binning in pT we use q/pT for better linearity and to minimize the number of bins,
326+ and thereby get enough statistics per bin.
327+ **Measurement:** TPC-ITS matching and TPC-vertex constraints
328+
329+ ** Dimensions:**
330+ ```
331+ 5D parameter space:
332+ - q/pT: 200 bins [-8 to +8 c/GeV] (charge over pT)
333+ - η: 20 bins [-1.0 to +1.0] (pseudorapidity)
334+ - φ: 180 bins [0 to 2π] (azimuthal angle)
335+ - sqrt(occupancy): 5-10 bins (number of tracks in TPC volume)
336+ - rate (kHz): 5-10 bins [0 to 50 kHz] (detector load)
337+
338+ Total bins (upper estimate): 200 × 20 × 180 × 10 × 10 = 72,000,000
339+
340+ ```
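
As a quick check of the bin-count arithmetic, the product of the per-axis bin counts can be evaluated directly. The sketch below takes 10 bins for both the occupancy and the rate axis, as in the product written above. The dense-storage figure assumes one 8-byte float per bin and per target, which is an assumption of this example and not a statement about how the framework stores results.

```python
import math

# Upper-end bin counts per axis (occupancy and rate taken as 10 bins each).
n_bins = {"q/pT": 200, "eta": 20, "phi": 180, "sqrt(occupancy)": 10, "rate": 10}

total_bins = math.prod(n_bins.values())

# Assumed dense storage: one float64 (8 bytes) per bin for a single target.
dense_gb_per_target = total_bins * 8 / 1e9

print(f"total bins: {total_bins:,}")
print(f"dense storage per target: {dense_gb_per_target:.3f} GB")
```

With these inputs the product is 72,000,000 bins, or roughly 0.58 GB per densely stored target.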
341+
342+ **Targets:**
343+ - Track segment residuals: mean bias, RMS, quantiles (10%, 50%, 90%)
344+ - Angular matching: Δθ, Δφ at vertex
345+ - DCA (Distance of Closest Approach): XY and Z components
346+ - χ² distributions per track type
347+ - Efficiency
348+ - PID: dE/dx, dE/dx per region and per species
349+
350+
351+
352+
353+ ### 2.5 Dataset Comparison Summary
354+ <!-- MI-SECTION: Note for later review --> To be updated by Claude.
355+
356+ Data volumes here are approximate. Usually I am limited by the 2 GB limit on THN size in ROOT.
357+
358+ | **Dataset** | **Dimensions** | **Bins** | **Rows** | **Memory** | **Sparsity** | **Window Type** |
359+ |-------------|----------------|----------|----------|------------|--------------|-----------------|
360+ | **A: TPC Spatial** | 3D (x,y,z) | 85k | 405k | 46 MB | 26% occupied | Integer ±1-2 |
361+ | **B: TPC Temporal** | 4D (x,y,z,t) | 1.5M | 7-10M | 0.8-1.5 GB | 20-30% | Integer + time |
362+ | **C: Track Resolution** | 5D (pT,η,φ,occ,t) | 72M | 100M-1B | 10-100 GB | 50-70% sparse | Float ±1-3 |
363+ | **C: Efficiency** | 4D (pT,η,φ,occ) | 3.2M | 10M-100M | 1-10 GB | 30-50% | Float ±1-2 |
364+ | **C: PID** | 3D (p,dE/dx,occ) | 200k | 1M-10M | 0.1-1 GB | 40-60% | Float ±2-5 |
365+
366+ **Key observations:**
367+ - **Dimensionality:** 3D to 6D (if combining parameters)
368+ - **Bin counts:** 10⁴ to 10⁸ (memory and compute constraints vary)
369+ - **Sparsity:** 20-70% of bins have insufficient individual statistics
370+ - **Window types:** Integer (spatial bins), float (kinematic variables), mixed
371+ - **Memory range:** 50 MB (test) to 100 GB (full production without sampling)
372+
373+ ---
374+
375+ ### 2.6 Data Characteristics Relevant to Sliding Window Design
376+
377+ #### 2.6.1 Bin Structure Types
378+ <!-- MI-SECTION: Note for later review --> To be updated by Claude.
379+
380+ **Observed in ALICE data:**
381+
382+ 1. **Uniform integer grids** (TPC spatial bins)
383+    - Regular spacing, known bin IDs
384+    - Efficient neighbor lookup: bin ± 1, ± 2
385+    - Example: xBin ∈ [0, 151], step=1
386+
387+ 2. **Non-uniform float coordinates** (kinematic variables, time)
388+    - Variable bin widths (e.g., logarithmic pT binning)
389+    - Neighbors defined by distance, not index
390+    - Example: pT bins = [0.1, 0.15, 0.2, 0.3, 0.5, 0.7, 1.0, ...]
391+
392+ 3. **Periodic dimensions** (φ angles)
393+    - Wrap-around at boundaries: φ=0 ≡ φ=2π
394+    - Requires special boundary handling
395+
396+ 4. **Mixed types** (combined analyses)
397+    - Spatial (integer) + kinematic (float) + temporal (float)
398+    - Requires flexible window specification per dimension
399+
400+ #### 2.6.2 Statistical Properties
401+
402+ ** From Dataset A analysis:**
403+
404+ ``` python
405+ # Bin-level statistics (before sliding window):
406+ entries_per_bin = [1, 1, 1, 2, 1, 1, ...]  # median: 1
407+ mean_IDC = [1.89, 1.92, 1.88, ...]  # varies per bin
408+ dX_values = [-2.1, 0.5, -1.8, ...]  # target distortions
409+
410+ # Challenge: Cannot reliably fit dX ~ meanIDC with n=1-2 points per bin
411+ # Solution: Sliding window aggregates 27-125 neighbors → sufficient stats
412+ ```
413+
414+ **Statistical needs:**
415+ - **Minimum for mean/median:** ~10 points (robust estimates)
416+ - **Minimum for RMS/quantiles:** ~30 points (stable tail estimates)
417+ - **Minimum for linear fit:** ~50 points (reliable slope, uncertainty)
418+ - **Typical window provides:** 27 (±1 in 3D) to 343 (±3 in 3D) potential bins
419+
420+ **Reality check:** Not all neighbor bins are populated; the effective N is often 20-60% of the theoretical maximum due to sparsity.
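
To make the numbers above concrete, the following sketch computes the theoretical window size (2w+1)^d and scales it by an occupied fraction. The function names and the default of one point per occupied bin are assumptions made for this illustration; real bins may hold more points.

```python
def window_bins(half_width, n_dims):
    """Number of bins in a +/-half_width window in n_dims dimensions."""
    return (2 * half_width + 1) ** n_dims


def effective_points(half_width, n_dims, occupied_fraction, points_per_bin=1):
    """Expected points in the window, assuming a uniform occupied fraction.

    points_per_bin=1 is an assumption of this sketch, not a measured value.
    """
    return window_bins(half_width, n_dims) * occupied_fraction * points_per_bin


for w in (1, 2, 3):
    low = effective_points(w, 3, 0.20)
    high = effective_points(w, 3, 0.60)
    print(f"±{w} in 3D: {window_bins(w, 3)} bins, effective N ≈ {low:.1f} to {high:.1f}")
```

Under these assumptions a ±1 window yields roughly 5 to 16 points, which is below the ~50 points quoted for a linear fit, while a ±3 window yields roughly 69 to 206.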
421+
422+ #### 2.6.3 Boundary Effects
423+
424+ **Spatial boundaries (TPC geometry):**
425+ - xBin=0: Inner field cage (mirror or truncate)
426+ - xBin=151: Outer field cage (mirror or truncate)
427+ - z2xBin=0,27: Readout planes (asymmetric, truncate)
428+ - 3 internal boundaries (stack edges at rows 63, 100, ...): no smoothing across them
429+ - φ: Periodic (wrap-around)
430+
431+
432+ **Implications for sliding window:**
433+ - Must support per-dimension boundary rules
434+ - Cannot use one-size-fits-all approach
435+ - Boundary bins have fewer neighbors → adjust weighting or normalization
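
A per-dimension boundary rule could look like the sketch below. The function `neighbor_bins` and the rule names `"truncate"`, `"mirror"`, and `"wrap"` are hypothetical and chosen for this example; the mirror rule here reflects about the first or last bin center, which is only one of several possible conventions. Internal stack boundaries are not handled in this sketch.

```python
import numpy as np


def neighbor_bins(center, half_width, n_bins, rule="truncate"):
    """Return neighbor bin indices along one axis under a boundary rule.

    Hypothetical helper; assumes half_width < n_bins.
    """
    idx = center + np.arange(-half_width, half_width + 1)
    if rule == "truncate":
        # Drop indices that fall outside the grid.
        return idx[(idx >= 0) & (idx < n_bins)]
    if rule == "mirror":
        # Reflect about the first/last bin center: -1 -> 1, n_bins -> n_bins - 2.
        idx = np.where(idx < 0, -idx, idx)
        return np.where(idx > n_bins - 1, 2 * (n_bins - 1) - idx, idx)
    if rule == "wrap":
        # Periodic axis: indices wrap around modulo n_bins.
        return idx % n_bins
    raise ValueError(f"unknown boundary rule: {rule!r}")


# One (n_bins, rule) pair per dimension; the phi axis is an illustrative example.
rules = {"xBin": (152, "truncate"), "z2xBin": (28, "truncate"), "phi": (180, "wrap")}
edge_neighbors = {name: neighbor_bins(0, 1, n, rule).tolist()
                  for name, (n, rule) in rules.items()}
print(edge_neighbors)
```

With truncation, an edge bin returns fewer indices than an interior bin, so any normalization or weighting has to use the actual neighbor count and not the nominal window size.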
436+
437+ ---
438+
439+ ### 2.7 Data Availability and Access for Benchmarking
440+
441+ **Test dataset (Dataset A):**
442+ - File: `benchmarks/data/tpc_realistic_test.parquet` (14 MB)
443+ - Format: Apache Parquet (optimized) or pickle (compatibility)
444+ - Source: ALICE TPC sector 3, 5 time slices, anonymized for testing
445+ - Public: Yes (within O2DPG repository for development and validation)
446+
447+
448+ **Synthetic data generation:**
449+ - For testing and benchmarking: Can generate representative synthetic data
450+ - Preserves statistical structure without real detector specifics
451+ - Script: `benchmarks/data/generate_synthetic_tpc_data.py` (to be added)
452+
453+ ---
454+
455+ **Next steps:** Section 3 describes concrete use cases and workflows that leverage these datasets to demonstrate the sliding window framework's capabilities.
224456
225457---
226458