Commit 6a797c7

Author: miranov25

docs: Section 5 accuracy corrections

- Updated project status (v2.0 complete, Phase 7 in progress)
- Corrected implementation timeline
- Updated feature status tracking
- Added Q&A historical content

Reviewed-by: MI
Section: 5

1 parent 5b4ddb5 commit 6a797c7

File tree

2 files changed: +283 −12 lines changed

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
# Sliding Window GroupBy Regression - Q&A Document

**Status:** Living document
**Last updated:** 2025-10-27
**Purpose:** Track complex concepts, design decisions, and review feedback

---

## 2. Example Data - Iteration 1 (2025-10-27 11:00)
Your version is too long and includes parts that do not reflect the reality of the project. The main purpose of the document is to motivate the development of a generic interface.

I am not sure how to proceed. I suggest asking GPT and Gemini to review the conceptual part of Section 2. Please provide a question based on my considerations below. Before proceeding, we need to resolve the issues with the scope, purpose, and length of this section.

Additionally, in this particular case, it may be simpler if I edit it directly. Should I do that?

The section *Dataset A: TPC Spatial Distortion Maps (Test Data)* was based on my example, so it closely matches our actual situation.

Section 2.3, *Dataset B: TPC Temporal Evolution (Production Scale)*, was not described by me, so it does not reflect reality. I can prepare a shortened version. In this section, I want to highlight one important aspect of real practice: I use modified variables of interest - for example, instead of pt I use q/pt, as many QA variables are more linear in q/pt.

## Motivation - Iteration 1 (2025-10-27 07:00)
Before answering the questions, I would like to describe in more detail what is being done and why.

* 0.) We are trying not only to describe a multidimensional function but also to estimate statistical properties of the probability density function (PDF) itself (e.g. using quantiles).
* 1.) LHC/my specific: We are working with both unbinned and binned data, as well as machine learning algorithms, depending on data availability. In the case of ALICE, we usually have a huge amount of data. For example, for tracks we have 500 kHz × 10 → 5 × 10^6 tracks per second, measuring for O(10-15 hours) per day. This data is either histogrammed in multidimensional histograms or, by default, we sample it using "balanced semi-stratified" sampling, populating the variables of interest homogeneously (e.g. flat pt, flat PID). This is very important, as the PDFs of pt and PID are highly unbalanced (exponential, power-law, etc.). With this approach, we reduce the input data volume by an order of magnitude and enable iterative refinement of the PDF estimation.
* 2.) Extracting PDF properties in multidimensional space has the advantage of enabling post-fitting of analytical models for normalised data. Quite often, we do not have analytical models for the full distortion in (3D + time), but we can have an analytical model for the delta-distortion time evolution. In my current studies, for example, we are fitting a two-exponential phi-symmetric model of distortion due to common electric field modification.
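The "balanced semi-stratified" sampling idea above can be sketched in a few lines of pandas. This is a minimal illustration, not the actual ALICE sampler: the function name `balanced_sample`, the column names, and the per-stratum cap are all hypothetical.

```python
import numpy as np
import pandas as pd

def balanced_sample(df, strata_cols, n_per_stratum, seed=0):
    """Keep at most n_per_stratum rows per stratum, flattening highly
    unbalanced distributions (e.g. steeply falling pt spectra)."""
    # Shuffle once so the per-stratum truncation is a random subsample.
    shuffled = df.sample(frac=1.0, random_state=seed)
    keep = shuffled.groupby(strata_cols, observed=True).cumcount() < n_per_stratum
    return shuffled[keep]

# Toy example: exponential pt spectrum, roughly flat in pt bins after sampling.
rng = np.random.default_rng(42)
df = pd.DataFrame({"pt": rng.exponential(1.0, 100_000),
                   "pid": rng.integers(0, 3, 100_000)})
df["pt_bin"] = pd.cut(df["pt"], bins=np.linspace(0, 5, 11))
flat = balanced_sample(df, ["pt_bin", "pid"], n_per_stratum=100)
```

A real sampler would also carry per-row weights so that the original PDF can be recovered after downsampling; that bookkeeping is omitted here.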
### Initial Questions (Iteration 1)

**Q1:** Does this capture your motivation accurately?
**A:** Several factors must be considered. Often we have large data but are limited by memory/CPU. Using >4 GB in memory is problematic. Pre-sampling helps, as the original data is statistically highly unbalanced. The problem is not only sparsity - the data is "random" and we need substantial statistics per bin.

**Q2:** Should I emphasize more?
**A:** Rewrite to emphasize statistical/mathematical considerations - PDF estimation and functional decomposition using partial models and factorization. Show ALICE examples. The software must be reusable.

**Q3:** Tone - mathematical vs practical?
**A:** Will ask GPT/Gemini. Some mathematics would be good, but we need balance.

**Q4:** Missing key points?
**A:** Emphasize the statistical estimation problem. The motivation should be grounded in defined problems with ALICE examples. Highlight reusability and API design. Note: presented at forums but difficult to explain - people did not understand the statistical estimation, the factorization, and the usage in analytical model fitting with data renormalization.

**Q5:** Add diagram?
**A:** Yes, sparse 3D bins with a ±1 neighborhood would help.
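Until a diagram exists, the ±1 neighborhood on sparse 3D bins can be illustrated in code. This is a minimal numpy sketch; the dict-of-bins layout and the function name `neighborhood_mean` are hypothetical, not the project's actual data model.

```python
import itertools
import numpy as np

def neighborhood_mean(bins, width=1):
    """For each occupied bin (sparse dict: index tuple -> array of values),
    aggregate the bin together with its existing +-width neighbors."""
    offsets = list(itertools.product(range(-width, width + 1), repeat=3))
    out = {}
    for idx in bins:
        # Collect values from the centre bin and any occupied neighbor bins.
        vals = [bins[n] for off in offsets
                if (n := tuple(i + o for i, o in zip(idx, off))) in bins]
        out[idx] = np.mean(np.concatenate(vals))
    return out

# Toy sparse map: only a few of the 3D bins are occupied.
bins = {(0, 0, 0): np.array([1.0, 2.0]),
        (1, 0, 0): np.array([3.0]),
        (5, 5, 5): np.array([10.0])}
smoothed = neighborhood_mean(bins)
```

Note how the isolated bin (5, 5, 5) keeps its own mean, while the two adjacent bins pool their statistics - exactly the behaviour the sliding window is meant to provide in sparse regions.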
---

## Motivation - Iteration 2 (2025-10-27 09:00)

### Additional Use Cases Added

* Distortion maps (already in use)
* Performance parameterization (e.g. track pT resolution as a function of pT, eta, occupancy, time)
* Track matching resolution and biases
* V0 resolution and biases
* PID resolution and biases
* Efficiency maps
* QA variables (chi2, number of clusters, etc.)
* Usage in MC-to-data remapping
* Note: RootInteractive is only a small subproject for interactive visualisation of extracted data

### Review Questions (Iteration 2)
**Q1: Does Section 1 now accurately capture the key concepts?**

*PDF estimation focus?*
- More or less OK ✓

*Balanced sampling strategy?*
- Mentioned, but more details are needed
- In some use cases we sample down by a factor of 10³-10⁴ to obtain a manageable data size
- **Action:** Added range 10×-10⁴× with typical 10²-10³× in Section 1.3.1 ✓

*Factorization approach?*
- Explained with TPC example
- **Action:** Added note about temporal resolution (5-10 min maps vs O(s) for fluctuations) ✓

*Connection to RootInteractive?*
- RootInteractive is just one subproject for interactive visualization
- **Action:** Added clarification that the sliding window is server-side preprocessing ✓

**Q2: Tone and depth**

*Is the mathematical level appropriate?*
- Will ask GPT/Gemini for feedback → **See REVIEW_REQUEST_SECTION1.md**

*Should I add equations?*
- Yes, they would enhance clarity
- But ask GPT/Gemini first → **See REVIEW_REQUEST_SECTION1.md**

*Is the ALICE example clear?*
- Need both distortion map AND performance parameterization examples
- **Action:** Added performance parameterization example in Section 1.1 ✓
- **Action:** Expanded use cases in Section 1.5 ✓

**Q3: Missing elements**

*Key concepts still missed?*
- Performance parameterization case added at the beginning
- Can mention in motivation categories and later in example sections
- **Action:** Added to Sections 1.1 and 1.5 ✓

**Q4: Structure**

*Are subsections (1.1-1.5) logical?*
- Structure OK for now
- Will ask GPT/Gemini → **See REVIEW_REQUEST_SECTION1.md**

**Q5: Next steps**

*Send to GPT/Gemini or continue to Section 2?*
- **Decision:** Need GPT/Gemini review BEFORE proceeding to Section 2
- **Action:** Created REVIEW_REQUEST_SECTION1.md with detailed questions ✓
---

## Status Summary

**Section 1 - Motivation:**
- Iteration 2 draft complete
- Incorporates all user feedback from 2025-10-27 09:00
- Ready for external review

**Next Steps:**
1. Send to GPT-4 for review
2. Send to Gemini for review
3. Address critical issues from both reviewers
4. Finalize Section 1
5. Proceed to Section 2 (Example Data)

**Files:**
- `SLIDING_WINDOW_SPEC_DRAFT.md` - Main specification document
- `REVIEW_REQUEST_SECTION1.md` - Review questions for GPT/Gemini
- `Q_A.md` - This file (Q&A tracking)

---

## Active Questions for Next Iterations

[None currently - awaiting GPT/Gemini feedback]

---

## Design Decisions Log

[To be populated during Section 6 discussion]

---

## Archived Questions

[To be populated as questions are resolved]

UTILS/dfextensions/groupby_regression/docs/SLIDING_WINDOW_SPEC_DRAFT.md

Lines changed: 118 additions & 12 deletions
@@ -472,25 +472,130 @@ dX_values = [-2.1, 0.5, -1.8, ...] # target distortions
### 5.1 C++ Implementation (2015-2024)
**Overview:** The original sliding window implementation was developed in C++ within the ALICE AliRoot framework, using N-dimensional histograms as input structures. The code has not yet been ported to the Run 3 O2 framework, and until recently it was used for Run 3 data with AliRoot as a side package.

It was used for performance and dE/dx parameterisation, as well as for the initial implementation of the TPC distortion maps in 2015. Q/q, track delta, and efficiency variables were grouped into histograms with the same binning. Several binning versions with different granularity and focus were used, in order to bypass the ROOT internal limitation of 1 GB.

Detector-based summary binning versions:
* Kinematical variables (q/pt, tgl)
* ~ occupancy
* Phi/sector modulation (90 or 180 bins in the full phi range, or 10-20 bins per sector assuming sector symmetry)

**Key features:**
- Multi-dimensional histogram-based approach using ROOT's THnSparse (1 GB limit)
- O(10) variable types × 5 binning types used (see comment above)
- Aggregation using sampled data on a server (bash parallel command), or on a farm for larger productions
- Sliding window implemented as a postprocessing step, together with groupby regression
- Kernel-based neighbor aggregation using histogram bin indexing
- In addition to the sliding window statistics (mean, median, std, MAD, LTM) of the variables of interest (dE/dx, efficiency, track deltas), also the mean of the variables used for binning (q/pt, eta, phi, occupancy) is calculated
- Weighting schemes: uniform, distance-based (inverse distance, Gaussian)
- User-defined fit functions (linear, polynomial, custom)
- Integrated with ALICE offline analysis framework
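The weighting schemes listed above (uniform, inverse-distance, Gaussian) amount to a small kernel over the neighbor-bin offsets. The sketch below illustrates the idea under assumed conventions (normalisation to unit sum, inverse distance regularised at the centre bin); the exact conventions in the C++ code may differ.

```python
import numpy as np

def kernel_weights(offsets, scheme="gaussian", sigma=1.0):
    """Weights for neighbor bins at integer offsets from the centre bin."""
    d = np.linalg.norm(np.asarray(offsets, dtype=float), axis=1)
    if scheme == "uniform":
        w = np.ones_like(d)
    elif scheme == "inverse":
        w = 1.0 / (1.0 + d)          # regularised so the centre bin is finite
    elif scheme == "gaussian":
        w = np.exp(-0.5 * (d / sigma) ** 2)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()               # normalise to unit sum

# 2D +-1 window: 9 offsets, the centre bin (0, 0) gets the largest weight.
offsets = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
w = kernel_weights(offsets, scheme="gaussian")
```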

#### 5.1.1 C++ Function Signatures

```C++
/// Create list of histograms specified by selection
/// Should be a rough equivalent of the "ALICE train" TTree->Draw();
/// a.) Data are read only once
/// b.) value expressions are reused (evaluated only once)
/// c.) Axis labelling and names of variables extracted from the tree metadata (.AxisTitle)
/// * default cut
/// * default selection applied common for all histograms (can be empty)
///
/// * hisString : - semicolon-separated string
/// * his0;his1; ...; hisN
/// * histogram syntax:
/// * var0:var1:...:<#weight>>>hisName(bins0,min0,max0,bins1,min1,max1, minValue,maxValue)
/// * Syntax:
/// * var_i are histogramming expressions
/// * the weight (or cut) entry is optional
/// * the default cut is always applied, the weight is applied on top
/// * ranges syntax:
/// * nbins,max,min where max and min are doubles or format strings
/// * in case a format string % is specified using (Fraction, mean, meanFraction, rms, rmsFraction)
/// * %fraction.sigma
/// * #cumulant
/// * the range for the bin content can be specified in the same format (by default it is not set)
/*!
##### CPU time to process one histogram or a set of histograms (in the particular case of esdTrack queries) is the same - and it is determined (90 %) by tree->GetEntry
\code
THn * his0= (THn*)hisArray->At(0);
his0->Projection(0)->Draw("");
tree->SetLineColor(2);
TStopwatch timer; tree->Draw("esdTrack.Pt()","(esdTrack.fFlags&0x40)>0&&esdTrack.fTPCncls>70","same",60000); timer.Print();
\endcode
*/
/// \param tree - input tree
/// \param hisString - selection string
/// \param defaultCut - default selection applied common for all histograms (can be empty)
/// \param firstEntry - first entry to process
/// \param lastEntry - last entry to process
/// \param chunkSize - chunk size
/// \param verbose - verbosity
/// \return - TObjArray of N-dimensional histograms
/*!
#### Example usage:
\code
chunkSize=10000;
verbose=7;
chinput=gSystem->ExpandPathName("$NOTES/JIRA/PWGPP-227/data/2016/LHC16t/000267161/pass1_CENT_wSDD/filteredLocal.list");
TString defaultCut="esdTrack.fTPCncls>70";
TTree *tree=(TTree*)AliXRDPROOFtoolkit::MakeChain(chinput, "highPt", 0, 1000000000,0);
TString hisString="";
hisString+="esdTrack.Pt():#esdTrack.fTPCncls>70>>hisPtAll(100,0,30);";
hisString+="esdTrack.GetAlpha():#esdTrack.fTPCncls>70>>hisAlpha(90,-3.2,3.2);";
hisString+="esdTrack.GetTgl():#esdTrack.fTPCncls>70>>hisTgl(20,-1.2,1.2);";
hisString+="esdTrack.Pt():esdTrack.GetAlpha():esdTrack.GetTgl():#esdTrack.fTPCncls>70>>hisPtPhiThetaAll(100,0,30,90,-3.2,3.2,20,-1.2,1.2);";
hisString+="esdTrack.Pt():#(esdTrack.fFlags&0x4)>0>>hisPtITS(100,1,10);";
hisString+="esdTrack.fIp.Pt():#(esdTrack.fFlags&0x4)>0>>hisPtTPCOnly(100,1,10);";
TStopwatch timer; hisArray = AliTreePlayer::MakeHistograms(tree, hisString, "(esdTrack.fFlags&0x40)>0&&esdTrack.fTPCncls>70",0,60000,100000); timer.Print();
\endcode
*/
TObjArray * AliTreePlayer::MakeHistograms(TTree * tree, TString hisString, TString defaultCut, Int_t firstEntry, Int_t lastEntry, Int_t chunkSize, Int_t verbose){
```
```C++
/// TStatToolkit::MakePDFMap function to calculate statistics from the N-dimensional PDF map
/// Original implementation - a copy of MakeDistortionMapFast
/// \param histo - input n-dimensional histogram
/// \param pcstream - output stream to store tree with PDF statistic maps
/// \param projectionInfo -
/// \param options - option - parameterize statistics to extract
/// \param verbose - verbosity of extraction
/// Example:
/// options["exportGraph"]="1";
/// options["exportGraphCumulative"]="1";
/// options["LTMestimators"]="0.6:0.5:0.4";
/// options["LTMFitRange"]="0.6:5:1";
void TStatToolkit::MakePDFMap(THnBase *histo, TTreeSRedirector *pcstream, TMatrixD &projectionInfo, std::map<std::string, std::string> pdfOptions, Int_t verbose)
```
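For comparison with the Python implementations discussed below, the two-step C++ flow (MakeHistograms to bin the data, MakePDFMap to extract per-bin statistics) has a rough pandas equivalent. The column names, binning, and toy model here are invented for illustration; this is not the project's API.

```python
import numpy as np
import pandas as pd

# Step 1 (analogous to MakeHistograms): bin the data in the grouping variables.
rng = np.random.default_rng(0)
df = pd.DataFrame({"qpt": rng.uniform(-5, 5, 50_000),
                   "tgl": rng.uniform(-1, 1, 50_000)})
df["dedx"] = 50 + 5 * df["tgl"] ** 2 + rng.normal(0, 2, len(df))
df["qpt_bin"] = pd.cut(df["qpt"], 20)
df["tgl_bin"] = pd.cut(df["tgl"], 10)

# Step 2 (analogous to MakePDFMap): per-bin PDF statistics of the variable
# of interest (mean, median, std, MAD, entry count).
stats = (df.groupby(["qpt_bin", "tgl_bin"], observed=True)["dedx"]
           .agg(mean="mean", median="median", std="std",
                mad=lambda x: (x - x.median()).abs().median(),
                entries="count"))
```

The sliding window step would then pool each bin's statistics with its neighbors before any regression, which is exactly the part the C++ code implemented on top of the histogram bin indexing.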
**Strengths:**
- Proven in production for global tracking and calibration QA (distortion maps, 2015-2024)
- Computationally efficient for large datasets
- Well-tested and reliable
- Used for expert QA
487589

488590
**Limitations:**
489-
- Rigid configuration: adding new fit functions required C++ code changes
490-
- Complex API: required deep knowledge of ROOT histogram internals
491-
- Limited extensibility: difficult to prototype new methods
492-
- Tight coupling to ALICE-specific data structures
493-
- Challenging for non-experts to use or modify
591+
- Tight coupling with ROOT - addopting ROT string based configuration for describing histograms
592+
- Using C++11 - not easy configuration - preferied not to rely on templates
593+
- Rigid configuration: string based API to define histograms and mapping (in pythyo using dictionaries)
594+
- Limited extensibility: difficult to add new fit functions
595+
- Relying on the AliRoot framework - not directly usable in O2 or scientific Python ecosystem
596+
597+
598+
494599

### 5.2 Python Implementation v1 (2024)
496601

@@ -508,7 +613,7 @@ dX_values = [-2.1, 0.5, -1.8, ...] # target distortions
- Simple conceptual model
- Leverages existing pandas/numpy ecosystem
- Easy to prototype and modify
- Works with standard groupby-regression tools

**Limitations:**
- **Memory explosion:** 27× expansion for ±1 window, 125× for ±2 window
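The quoted factors follow directly from the window geometry: every row is replicated into each neighbor bin it contributes to, i.e. (2w + 1)^d copies for a ±w window in d dimensions.

```python
def expansion_factor(width, ndim=3):
    """Row-duplication factor of the naive 'expand then groupby' approach
    for a +-width sliding window in ndim dimensions."""
    return (2 * width + 1) ** ndim

assert expansion_factor(1) == 27    # +-1 window in 3D
assert expansion_factor(2) == 125   # +-2 window in 3D
```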
@@ -536,7 +641,8 @@ dX_values = [-2.1, 0.5, -1.8, ...] # target distortions
- Clean API accessible to non-experts
- Production-scale performance (<4 GB memory, <30 min runtime)

## 6. Specifications - Requirements
