Commit 6a797c7

Author: miranov25

docs: Section 5 accuracy corrections

- Updated project status (v2.0 complete, Phase 7 in progress)
- Corrected implementation timeline
- Updated feature status tracking
- Added Q&A historical content

Reviewed-by: MI
Section: 5

1 parent 5b4ddb5 commit 6a797c7

File tree

2 files changed: +283 −12 lines changed

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
# Sliding Window GroupBy Regression - Q&A Document

**Status:** Living document
**Last updated:** 2025-10-27
**Purpose:** Track complex concepts, design decisions, and review feedback

---

## 2. Example Data - Iteration 1 (2025-10-27 11:00)
Your version is too long and includes parts that do not reflect the reality of the project. The main purpose of the document is to motivate the development of a generic interface.

I am not sure how to proceed. I suggest asking GPT and Gemini to review the conceptual part of Section 2. Please provide a question based on my considerations below. Before proceeding, we need to resolve the issues with the scope, purpose, and length of this section.

Additionally, in this particular case, it may be simpler if I edit it directly. Should I do that?

The section *Dataset A: TPC Spatial Distortion Maps (Test Data)* was based on my example, so it closely matches our actual situation.

Section 2.3, *Dataset B: TPC Temporal Evolution (Production Scale)*, was not described by me, so it does not reflect reality. I can prepare a shortened version. In this section, I want to highlight one important aspect of real practice: I use modified variables of interest - for example, instead of pt I use q/pt, as many QA variables are more linear in q/pt.

## Motivation - Iteration 1 (2025-10-27 07:00)
Before answering the questions, I would like to describe in more detail what is being done and why.

* 0.) We are trying not only to describe a multidimensional function but also to estimate statistical properties of the probability density function (PDF) itself (e.g. using quantiles).
* 1.) LHC/my specific: We are working with both unbinned and binned data, as well as machine learning algorithms, depending on data availability. In the case of ALICE, we usually have a huge amount of data. For example, for tracks we have 500 kHz × 10 → 5 × 10^6 tracks per second, measuring for O(10-15 hours) per day. This data is either histogrammed in multidimensional histograms or, by default, we sample it using "balanced semi-stratified" sampling, populating the variables of interest homogeneously (e.g. flat pt, flat PID). This is very important, as the PDFs of pt and PID are highly unbalanced (exponential, power-law, etc.). With this approach, we reduce the input data volume by an order of magnitude and enable iterative refinement of the PDF estimation.
* 2.) Extracting PDF properties in multidimensional space has the advantage of enabling post-fitting of analytical models for normalised data. Quite often, we do not have analytical models for the full distortion in (3D + time), but we can have an analytical model for the delta-distortion time evolution. In my current studies, for example, we are fitting a two-exponential phi-symmetric model of distortion due to common electric field modification.
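The "balanced semi-stratified" sampling idea above can be sketched in a few lines of pandas. This is a minimal illustration, not the actual ALICE sampler: the function name `balanced_sample`, the column names, and the per-stratum cap are all hypothetical.

```python
import numpy as np
import pandas as pd

def balanced_sample(df, strata_cols, n_per_stratum, seed=0):
    """Keep at most n_per_stratum rows per stratum, flattening highly
    unbalanced distributions (e.g. steeply falling pt spectra)."""
    # Shuffle once so the per-stratum truncation is a random subsample.
    shuffled = df.sample(frac=1.0, random_state=seed)
    keep = shuffled.groupby(strata_cols, observed=True).cumcount() < n_per_stratum
    return shuffled[keep]

# Toy example: exponential pt spectrum, roughly flat in pt bins after sampling.
rng = np.random.default_rng(42)
df = pd.DataFrame({"pt": rng.exponential(1.0, 100_000),
                   "pid": rng.integers(0, 3, 100_000)})
df["pt_bin"] = pd.cut(df["pt"], bins=np.linspace(0, 5, 11))
flat = balanced_sample(df, ["pt_bin", "pid"], n_per_stratum=100)
```

A real sampler would also carry per-row weights so that the original PDF can be recovered after downsampling; that bookkeeping is omitted here.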
### Initial Questions (Iteration 1)

**Q1:** Does this capture your motivation accurately?
**A:** Several factors must be considered. Often we have large data but are limited by memory/CPU. Using >4 GB in memory is problematic. Pre-sampling helps, as the original data is statistically highly unbalanced. The problem is not only sparsity - the data is "random" and we need substantial statistics per bin.

**Q2:** Should I emphasize more?
**A:** Rewrite to emphasize statistical/mathematical considerations - PDF estimation and functional decomposition using partial models and factorization. Show ALICE examples. The software must be reusable.

**Q3:** Tone - mathematical vs practical?
**A:** Will ask GPT/Gemini. Some mathematics would be good, but we need balance.

**Q4:** Missing key points?
**A:** Emphasize the statistical estimation problem. The motivation should be grounded in defined problems with ALICE examples. Highlight reusability and API design. Note: presented at forums but difficult to explain - people did not understand the statistical estimation, the factorization, and the usage in analytical model fitting with data renormalization.

**Q5:** Add diagram?
**A:** Yes, sparse 3D bins with a ±1 neighborhood would help.
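Until a diagram exists, the ±1 neighborhood on sparse 3D bins can be illustrated in code. This is a minimal numpy sketch; the dict-of-bins layout and the function name `neighborhood_mean` are hypothetical, not the project's actual data model.

```python
import itertools
import numpy as np

def neighborhood_mean(bins, width=1):
    """For each occupied bin (sparse dict: index tuple -> array of values),
    aggregate the bin together with its existing +-width neighbors."""
    offsets = list(itertools.product(range(-width, width + 1), repeat=3))
    out = {}
    for idx in bins:
        # Collect values from the centre bin and any occupied neighbor bins.
        vals = [bins[n] for off in offsets
                if (n := tuple(i + o for i, o in zip(idx, off))) in bins]
        out[idx] = np.mean(np.concatenate(vals))
    return out

# Toy sparse map: only a few of the 3D bins are occupied.
bins = {(0, 0, 0): np.array([1.0, 2.0]),
        (1, 0, 0): np.array([3.0]),
        (5, 5, 5): np.array([10.0])}
smoothed = neighborhood_mean(bins)
```

Note how the isolated bin (5, 5, 5) keeps its own mean, while the two adjacent bins pool their statistics - exactly the behaviour the sliding window is meant to provide in sparse regions.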
---

## Motivation - Iteration 2 (2025-10-27 09:00)

### Additional Use Cases Added

* Distortion maps (already in use)
* Performance parameterization (e.g. track pT resolution as a function of pT, eta, occupancy, time)
* Track matching resolution and biases
* V0 resolution and biases
* PID resolution and biases
* Efficiency maps
* QA variables (chi2, number of clusters, etc.)
* Usage in MC-to-data remapping
* Note: RootInteractive is only a small subproject for interactive visualisation of extracted data

### Review Questions (Iteration 2)
**Q1: Does Section 1 now accurately capture the key concepts?**

*PDF estimation focus?*
- More or less OK ✓

*Balanced sampling strategy?*
- Mentioned, but more details are needed
- In some use cases we sample down by a factor of 10³-10⁴ to obtain a manageable data size
- **Action:** Added range 10×-10⁴× with typical 10²-10³× in Section 1.3.1 ✓

*Factorization approach?*
- Explained with TPC example
- **Action:** Added note about temporal resolution (5-10 min maps vs O(s) for fluctuations) ✓

*Connection to RootInteractive?*
- RootInteractive is just one subproject for interactive visualization
- **Action:** Added clarification that the sliding window is server-side preprocessing ✓

**Q2: Tone and depth**

*Is the mathematical level appropriate?*
- Will ask GPT/Gemini for feedback → **See REVIEW_REQUEST_SECTION1.md**

*Should I add equations?*
- Yes, they would enhance clarity
- But ask GPT/Gemini first → **See REVIEW_REQUEST_SECTION1.md**

*Is the ALICE example clear?*
- Need both distortion map AND performance parameterization examples
- **Action:** Added performance parameterization example in Section 1.1 ✓
- **Action:** Expanded use cases in Section 1.5 ✓

**Q3: Missing elements**

*Key concepts still missed?*
- Performance parameterization case added at the beginning
- Can mention in motivation categories and later in example sections
- **Action:** Added to Sections 1.1 and 1.5 ✓

**Q4: Structure**

*Are subsections (1.1-1.5) logical?*
- Structure OK for now
- Will ask GPT/Gemini → **See REVIEW_REQUEST_SECTION1.md**

**Q5: Next steps**

*Send to GPT/Gemini or continue to Section 2?*
- **Decision:** Need GPT/Gemini review BEFORE proceeding to Section 2
- **Action:** Created REVIEW_REQUEST_SECTION1.md with detailed questions ✓
---

## Status Summary

**Section 1 - Motivation:**
- Iteration 2 draft complete
- Incorporates all user feedback from 2025-10-27 09:00
- Ready for external review

**Next Steps:**
1. Send to GPT-4 for review
2. Send to Gemini for review
3. Address critical issues from both reviewers
4. Finalize Section 1
5. Proceed to Section 2 (Example Data)

**Files:**
- `SLIDING_WINDOW_SPEC_DRAFT.md` - Main specification document
- `REVIEW_REQUEST_SECTION1.md` - Review questions for GPT/Gemini
- `Q_A.md` - This file (Q&A tracking)

---

## Active Questions for Next Iterations

[None currently - awaiting GPT/Gemini feedback]

---

## Design Decisions Log

[To be populated during Section 6 discussion]

---

## Archived Questions

[To be populated as questions are resolved]

UTILS/dfextensions/groupby_regression/docs/SLIDING_WINDOW_SPEC_DRAFT.md

Lines changed: 118 additions & 12 deletions
@@ -472,25 +472,130 @@ dX_values = [-2.1, 0.5, -1.8, ...] # target distortions
### 5.1 C++ Implementation (2015-2024)
**Overview:** The original sliding window implementation was developed in C++ within the ALICE AliRoot framework, using N-dimensional histograms as input structures. The code has not yet been ported to the Run 3 O2 framework, and until recently it was used for Run 3 data with AliRoot as a side package.

It was used for performance and dE/dx parameterisation, as well as for the initial implementation of the TPC distortion maps in 2015. Q/q, track delta, and efficiency variables were grouped into histograms with the same binning. Several binning versions with different granularity and focus were used, in order to bypass the ROOT internal limitation of 1 GB.

Detector-based summary binning versions:
* Kinematical variables (q/pt, tgl)
* ~ occupancy
* Phi/sector modulation (90 or 180 bins in the full phi range, or 10-20 bins per sector assuming sector symmetry)

**Key features:**
- Multi-dimensional histogram-based approach using ROOT's THnSparse (1 GB limit)
- O(10) variable types × 5 binning types used (see comment above)
- Aggregation using sampled data on a server (bash parallel command), or on a farm for larger productions
- Sliding window implemented as a postprocessing step, together with groupby regression
- Kernel-based neighbor aggregation using histogram bin indexing
- In addition to the sliding window statistics (mean, median, std, MAD, LTM) of the variables of interest (dE/dx, efficiency, track deltas), also the mean of the variables used for binning (q/pt, eta, phi, occupancy) is calculated
- Weighting schemes: uniform, distance-based (inverse distance, Gaussian)
- User-defined fit functions (linear, polynomial, custom)
- Integrated with ALICE offline analysis framework
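The weighting schemes listed above (uniform, inverse-distance, Gaussian) amount to a small kernel over the neighbor-bin offsets. The sketch below illustrates the idea under assumed conventions (normalisation to unit sum, inverse distance regularised at the centre bin); the exact conventions in the C++ code may differ.

```python
import numpy as np

def kernel_weights(offsets, scheme="gaussian", sigma=1.0):
    """Weights for neighbor bins at integer offsets from the centre bin."""
    d = np.linalg.norm(np.asarray(offsets, dtype=float), axis=1)
    if scheme == "uniform":
        w = np.ones_like(d)
    elif scheme == "inverse":
        w = 1.0 / (1.0 + d)          # regularised so the centre bin is finite
    elif scheme == "gaussian":
        w = np.exp(-0.5 * (d / sigma) ** 2)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()               # normalise to unit sum

# 2D +-1 window: 9 offsets, the centre bin (0, 0) gets the largest weight.
offsets = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
w = kernel_weights(offsets, scheme="gaussian")
```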

#### 5.1.1 C++ Function Signatures

```C++
/// Create list of histograms specified by selection
/// Should be a rough equivalent of the "ALICE train" TTree->Draw();
/// a.) Data are read only once
/// b.) value expressions are reused (evaluated only once)
/// c.) Axis labelling and names of variables extracted from the tree metadata (.AxisTitle)
/// * default cut
/// * default selection applied common for all histograms (can be empty)
///
/// * hisString : - semicolon-separated string
/// * his0;his1; ...; hisN
/// * histogram syntax:
/// * var0:var1:...:<#weight>>>hisName(bins0,min0,max0,bins1,min1,max1, minValue,maxValue)
/// * Syntax:
/// * var_i are histogramming expressions
/// * the weight (or cut) entry is optional
/// * the default cut is always applied, the weight is applied on top
/// * ranges syntax:
/// * nbins,max,min where max and min are doubles or format strings
/// * in case a format string % is specified using (Fraction, mean, meanFraction, rms, rmsFraction)
/// * %fraction.sigma
/// * #cumulant
/// * the range for the bin content can be specified in the same format (by default it is not set)
/*!
##### CPU time to process one histogram or a set of histograms (in the particular case of esdTrack queries) is the same - and it is determined (90 %) by tree->GetEntry
\code
THn * his0= (THn*)hisArray->At(0);
his0->Projection(0)->Draw("");
tree->SetLineColor(2);
TStopwatch timer; tree->Draw("esdTrack.Pt()","(esdTrack.fFlags&0x40)>0&&esdTrack.fTPCncls>70","same",60000); timer.Print();
\endcode
*/
/// \param tree - input tree
/// \param hisString - selection string
/// \param defaultCut - default selection applied common for all histograms (can be empty)
/// \param firstEntry - first entry to process
/// \param lastEntry - last entry to process
/// \param chunkSize - chunk size
/// \param verbose - verbosity
/// \return - TObjArray of N-dimensional histograms
/*!
#### Example usage:
\code
chunkSize=10000;
verbose=7;
chinput=gSystem->ExpandPathName("$NOTES/JIRA/PWGPP-227/data/2016/LHC16t/000267161/pass1_CENT_wSDD/filteredLocal.list");
TString defaultCut="esdTrack.fTPCncls>70";
TTree *tree=(TTree*)AliXRDPROOFtoolkit::MakeChain(chinput, "highPt", 0, 1000000000,0);
TString hisString="";
hisString+="esdTrack.Pt():#esdTrack.fTPCncls>70>>hisPtAll(100,0,30);";
hisString+="esdTrack.GetAlpha():#esdTrack.fTPCncls>70>>hisAlpha(90,-3.2,3.2);";
hisString+="esdTrack.GetTgl():#esdTrack.fTPCncls>70>>hisTgl(20,-1.2,1.2);";
hisString+="esdTrack.Pt():esdTrack.GetAlpha():esdTrack.GetTgl():#esdTrack.fTPCncls>70>>hisPtPhiThetaAll(100,0,30,90,-3.2,3.2,20,-1.2,1.2);";
hisString+="esdTrack.Pt():#(esdTrack.fFlags&0x4)>0>>hisPtITS(100,1,10);";
hisString+="esdTrack.fIp.Pt():#(esdTrack.fFlags&0x4)>0>>hisPtTPCOnly(100,1,10);";
TStopwatch timer; hisArray = AliTreePlayer::MakeHistograms(tree, hisString, "(esdTrack.fFlags&0x40)>0&&esdTrack.fTPCncls>70",0,60000,100000); timer.Print();
\endcode
*/
TObjArray * AliTreePlayer::MakeHistograms(TTree * tree, TString hisString, TString defaultCut, Int_t firstEntry, Int_t lastEntry, Int_t chunkSize, Int_t verbose){
```
```C++
/// TStatToolkit::MakePDFMap function to calculate statistics from the N-dimensional PDF map
/// Original implementation - a copy of MakeDistortionMapFast
/// \param histo - input n-dimensional histogram
/// \param pcstream - output stream to store tree with PDF statistic maps
/// \param projectionInfo -
/// \param options - option - parameterize statistics to extract
/// \param verbose - verbosity of extraction
/// Example:
/// options["exportGraph"]="1";
/// options["exportGraphCumulative"]="1";
/// options["LTMestimators"]="0.6:0.5:0.4";
/// options["LTMFitRange"]="0.6:5:1";
void TStatToolkit::MakePDFMap(THnBase *histo, TTreeSRedirector *pcstream, TMatrixD &projectionInfo, std::map<std::string, std::string> pdfOptions, Int_t verbose)
```
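For comparison with the Python implementations discussed below, the two-step C++ flow (MakeHistograms to bin the data, MakePDFMap to extract per-bin statistics) has a rough pandas equivalent. The column names, binning, and toy model here are invented for illustration; this is not the project's API.

```python
import numpy as np
import pandas as pd

# Step 1 (analogous to MakeHistograms): bin the data in the grouping variables.
rng = np.random.default_rng(0)
df = pd.DataFrame({"qpt": rng.uniform(-5, 5, 50_000),
                   "tgl": rng.uniform(-1, 1, 50_000)})
df["dedx"] = 50 + 5 * df["tgl"] ** 2 + rng.normal(0, 2, len(df))
df["qpt_bin"] = pd.cut(df["qpt"], 20)
df["tgl_bin"] = pd.cut(df["tgl"], 10)

# Step 2 (analogous to MakePDFMap): per-bin PDF statistics of the variable
# of interest (mean, median, std, MAD, entry count).
stats = (df.groupby(["qpt_bin", "tgl_bin"], observed=True)["dedx"]
           .agg(mean="mean", median="median", std="std",
                mad=lambda x: (x - x.median()).abs().median(),
                entries="count"))
```

The sliding window step would then pool each bin's statistics with its neighbors before any regression, which is exactly the part the C++ code implemented on top of the histogram bin indexing.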
**Strengths:**
- Proven in production for global tracking and calibration QA (distortion maps, 2015-2024)
- Computationally efficient for large datasets
- Well-tested and reliable
- Used for expert QA
487589

488590
**Limitations:**
489-
- Rigid configuration: adding new fit functions required C++ code changes
490-
- Complex API: required deep knowledge of ROOT histogram internals
491-
- Limited extensibility: difficult to prototype new methods
492-
- Tight coupling to ALICE-specific data structures
493-
- Challenging for non-experts to use or modify
591+
- Tight coupling with ROOT - addopting ROT string based configuration for describing histograms
592+
- Using C++11 - not easy configuration - preferied not to rely on templates
593+
- Rigid configuration: string based API to define histograms and mapping (in pythyo using dictionaries)
594+
- Limited extensibility: difficult to add new fit functions
595+
- Relying on the AliRoot framework - not directly usable in O2 or scientific Python ecosystem
596+
597+
598+
494599

### 5.2 Python Implementation v1 (2024)
496601

@@ -508,7 +613,7 @@ dX_values = [-2.1, 0.5, -1.8, ...] # target distortions
- Simple conceptual model
- Leverages existing pandas/numpy ecosystem
- Easy to prototype and modify
- Works with standard groupby-regression tools

**Limitations:**
- **Memory explosion:** 27× expansion for ±1 window, 125× for ±2 window
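The quoted factors follow directly from the window geometry: every row is replicated into each neighbor bin it contributes to, i.e. (2w + 1)^d copies for a ±w window in d dimensions.

```python
def expansion_factor(width, ndim=3):
    """Row-duplication factor of the naive 'expand then groupby' approach
    for a +-width sliding window in ndim dimensions."""
    return (2 * width + 1) ** ndim

assert expansion_factor(1) == 27    # +-1 window in 3D
assert expansion_factor(2) == 125   # +-2 window in 3D
```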
@@ -536,7 +641,8 @@ dX_values = [-2.1, 0.5, -1.8, ...] # target distortions
- Clean API accessible to non-experts
- Production-scale performance (<4 GB memory, <30 min runtime)

## 6. Specifications - Requirements
