Commit 5fd11a5 (parent 6a797c7), authored and committed by miranov25

Commit message: docs: move glossary to Appendix A + corrections

Structural:
- Moved glossary from before Section 1 to Appendix A
- Added footnote about glossary location

Date corrections (MI feedback):
- AliRoot: ~2000-present (not 2008-2024)
- O2: 2022+ (not 2021+)

Technical enhancements (GPT review):
- Added C++03, FairRoot/DPL, TPC specs
- Clarified THnSparse 2³¹ bins limit
- Enhanced all definitions with technical detail
- Improved formatting consistency

Reviewed-by: MI, GPT-4, Gemini
Applied-by: Claude
File tree: 2 files changed (+215, -47 lines)

docs/SLIDING_WINDOW_SPEC_DRAFT.md: 144 additions, 0 deletions
## Motivation - Iteration 1 (2025-10-27 07:00)

Before answering the questions, I would like to describe in more detail what is being done and why.

* 0.) We are trying not only to describe a multidimensional function but also to estimate statistical properties of the probability density function (PDF) itself (e.g. using quantiles).
* 1.) LHC/my specific: We are working with both unbinned and binned data, as well as machine learning algorithms, depending on data availability. In the case of ALICE, we usually have a huge amount of data. For example, for tracks we have 500 kHz × 10 → 5 × 10^6 tracks per second, measuring for O(10–15 hours) per day. This data is either histogrammed in multidimensional histograms or, by default, we sample it using "balanced semi-stratified" sampling, populating the variables of interest homogeneously (e.g. flat pT, flat PID). This is very important, as the PDFs of pT and PID are highly unbalanced (exponential, power-law, etc.). With this approach, we reduce the input data volume by an order of magnitude and enable iterative refinement of the PDF estimation.
* 2.) Extracting PDF properties in multidimensional space has the advantage of enabling post-fitting of analytical models for normalised data. Quite often, we do not have analytical models for the full distortion in (3D+time), but we can have an analytical model for the delta-distortion time evolution. In my current studies, for example, we are fitting a two-exponential phi-symmetric model of distortion due to common electric-field modification.
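The "balanced semi-stratified" sampling above can be sketched as follows. This is an illustrative toy, not the actual ALICE pipeline: the binning, `target_per_bin`, and the exponential toy spectrum are all invented for the example. The idea is to keep each track with a probability inversely proportional to the local pT occupancy, so the kept sample is approximately flat in pT.

```python
# Illustrative sketch (NOT the ALICE code) of balanced semi-stratified
# downsampling: keep tracks with probability ~ 1/(local pT occupancy),
# flattening a steeply falling spectrum.
import numpy as np

rng = np.random.default_rng(42)

# Toy input: steeply falling pT spectrum, mimicking the unbalanced PDF.
pt = rng.exponential(scale=1.0, size=1_000_000)

# Estimate the per-bin occupancy of the raw spectrum.
counts, edges = np.histogram(pt, bins=50, range=(0.0, 5.0))
bin_idx = np.clip(np.digitize(pt, edges) - 1, 0, len(counts) - 1)

# Keep-probability ~ target/occupancy, capped at 1 for sparse bins.
target_per_bin = 2000
keep_prob = np.minimum(1.0, target_per_bin / np.maximum(counts[bin_idx], 1))
kept = pt[rng.random(pt.size) < keep_prob]

# "kept" is roughly flat in pT and an order of magnitude smaller than the
# input: reduced volume with homogeneous coverage of the variable of interest.
```

In the real workflow the weighting would run over several variables at once (pT, PID, ...), but the mechanism per variable is the same.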
### Q1: Does this capture your motivation accurately?

- There are several factors we must consider, as described above.
- Quite often (but not always), we have a large amount of data. Frequently, we are limited by memory and CPU for processing (see above). Normally, I try to parallelise if the data sets are independent, but using more than 4 GB of data in memory is problematic. Using pre-sampling for unbinned-data scenarios helps, as the original data are statistically highly unbalanced (exponential (mass) - PID, power-law (pt), etc.).
- In many cases, the problem is not only the sparsity of the data. Our data are "random". To obtain a reasonable estimate of the characterisation of the corresponding PDF, we need substantial statistics for each bin. That is our major obstacle, which we are trying to address.
### Q2: GPT question: should I emphasise more?

- The statistics/sparsity problem (mathematical angle)
- The physics context (ALICE TPC, particle physics)
- The software engineering angle (reusability, API design)
- Balance is good as-is

* After my comments above, I think the motivation section will be rewritten. We have to emphasise statistical and mathematical considerations as I described above: estimation of the PDF and later functional decomposition using partial models and some kind of factorisation.
* We should show examples from ALICE.
* The software has to be reusable, as the problem is generic, and we need a generic solution.
### Q3: The tone is currently technical but general. Should it be: (question for Gemini and GPT)

- More mathematical (equations, formal notation)
- More practical (concrete examples upfront)
- Current level is appropriate

I am not sure; I will ask GPT and Gemini about this. Some mathematics would be good, but I have a markdown file with limited mathematical capabilities. I think we should balance mathematics and practical examples.
### Q4: Any missing key points or mis-characterizations?

* We should place greater emphasis on the statistical estimation problem; refer to my introduction.
* The motivation should be grounded in these defined problems, with the ALICE examples serving to support this.
* For software aspects, we should highlight reusability and API design, as the problem is generic and requires a generic solution.
* I presented the problem previously in several forums (internal meetings, discussions with the ROOT team, and ML conferences), but it was difficult to explain. People did not understand the statistical estimation problem, the possible factorisation, and the later usage in analytical (physical-model) fitting using some data renormalisation, as I described above.
* We do not have models for everything, but quite often we have models for normalised dlas-ratios in multidimensional space.
### Q5: Should I add a diagram/figure placeholder (e.g., "Figure 1: Sparse 3D bins with ±1 neighborhood")?

- Yes, a diagram would be helpful.
- A figure illustrating sparse 3D bins with a ±1 neighborhood would effectively convey the concept of sparsity and the challenges associated with estimating PDF properties in such scenarios. But I am not sure how to do it.
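Until a figure exists, the ±1-neighborhood idea can at least be illustrated in code. A minimal sketch, assuming a dict-backed sparse histogram (all names, the bin layout, and the toy data are invented for illustration, not taken from the spec): only occupied bins are stored, and the 3×3×3 neighborhood of a bin is pooled to gain the statistics needed for a local PDF estimate.

```python
# Illustrative sketch of pooling a +-1 neighborhood in a sparse 3D histogram
# to gain statistics for a local PDF property (here: a median).
from collections import defaultdict
from itertools import product
import numpy as np

rng = np.random.default_rng(0)

# Sparse 3D histogram: only occupied bins are stored (bin index -> samples).
sparse_bins = defaultdict(list)
points = rng.uniform(0, 10, size=(5000, 3))   # ~5 entries per unit bin
values = rng.normal(size=5000)
for (x, y, z), v in zip(points.astype(int).tolist(), values):
    sparse_bins[(x, y, z)].append(float(v))

def neighborhood_samples(center, width=1):
    """Pool the samples of every occupied bin within +-width of `center`."""
    cx, cy, cz = center
    pooled = []
    for dx, dy, dz in product(range(-width, width + 1), repeat=3):
        pooled.extend(sparse_bins.get((cx + dx, cy + dy, cz + dz), ()))
    return pooled

# A single bin holds only a handful of entries, while the 3x3x3 neighborhood
# pools roughly 27x more, enough for a usable local quantile estimate.
local = neighborhood_samples((5, 5, 5))
median_estimate = float(np.median(local))
```

The same pooling generalises to wider windows (`width=2`, ...) at the cost of resolution, which is exactly the bias/variance trade-off the figure should convey.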
## Motivation - Iteration 2 (2025-10-27 09:00)

Before answering the questions, I would like to add some use cases:

* Distortion maps (already in use)
* Performance parameterisation (e.g. track pT resolution as a function of pT, eta, occupancy, time)
* Track-matching resolution and biases
* V0 resolution and biases
* PID resolution and biases
* Efficiency maps
* QA variables (chi2, number of clusters, etc.)
* Usage in MC-to-data remapping

Keep in mind that RootInteractive is only a small subproject for interactive visualisation of the data.
### Q1: Does Section 1 now accurately capture:

* The PDF estimation focus?
* Balanced sampling strategy?
* Factorization approach?
* Connection to RootInteractive?

===>

* I think it is more or less OK.
* A balanced sampling strategy is mentioned, but we need more details. In some use cases, we sample down by a factor of \(10^3\)–\(10^4\) to obtain a manageable data size, making further processing feasible.
* RootInteractive is just one subproject for interactive visualisation of extracted data.
* Comment on the current-version example: In a particular case, I use 90 samples for distortion maps. In reality, we use 5–10 minute maps, but in some cases we have to go to O(s) to follow fluctuations. Obviously, we cannot do this with full spatial granularity, so some factorisation will be used.
### Q2: Tone and depth:

* Is the mathematical level appropriate?
  * I will ask GPT/Gemini for feedback on this.
* Should I add equations (e.g., kernel weighting formula)?
  * Yes, adding equations would enhance clarity. However, we should ask GPT and Gemini.
* Is the ALICE example clear and compelling?
  * We need distortion-map examples and performance-parameterisation examples to make it clearer.
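As a possible starting point for the kernel-weighting equation (an assumed textbook form to be validated with GPT/Gemini, not taken from the existing draft): a sliding-window estimate centred at bin \(\mathbf{x}_0\) can weight the entries \(y_j\) of neighbouring bins \(\mathbf{x}_j\) with a kernel \(K\) of bandwidth \(h\):

```math
w_j = K\!\left(\frac{\lVert \mathbf{x}_j - \mathbf{x}_0 \rVert}{h}\right),
\qquad K(u) = e^{-u^2/2},
\qquad
\hat{y}(\mathbf{x}_0) = \frac{\sum_j w_j \, y_j}{\sum_j w_j}.
```

A box kernel with \(h\) equal to one bin width reduces this to the plain ±1-neighborhood average; the Gaussian form above is just one common choice.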
### Q3: Missing elements:

* Any key concepts I still missed?
* Should I reference specific equations from your paper?
* Need more or less technical detail?

I included something at the beginning (the performance-parametrisation case), but I am not sure how much we can emphasise it without losing the audience. However, it can be mentioned in the motivation section (categories) and later in the example sections.
### Q4: Structure:

* Are the subsections (1.1-1.5) logical?
* Should I reorganize anything?

I think the structure is OK for now. We can also ask GPT/Gemini for feedback on this.
### Q5: Next steps:

* Should we send Section 1 to GPT/Gemini now?
* Or continue to Section 2 first?

We need a GPT/Gemini review before proceeding to Section 2.
