## Motivation - Iteration 1 (2025-10-27 07:00)

Before answering the questions, I would like to describe in more detail what is being done and why.

* 0.) We are trying not only to describe a multidimensional function but also to estimate statistical
  properties of the probability density function (PDF) itself (e.g. using quantiles).
* 1.) LHC/my specific: We work with both unbinned and binned data, as well as machine learning
  algorithms, depending on data availability. In the case of ALICE, we usually have a huge amount of data.
  For example, for tracks we have 500 kHz × 10 → 5 × 10^6 tracks per second, measuring for O(10–15 hours) per
  day. These data are either histogrammed in multidimensional histograms or, by default, sampled using
  "balanced semi-stratified" sampling, populating the variables of interest homogeneously (e.g. flat pT, flat PID).
  This is very important, as the PDFs of pT and PID are highly unbalanced (exponential, power-law, etc.).
  With this approach, we reduce the input data volume by an order of magnitude and enable iterative refinement
  of the PDF estimation.
* 2.) Extracting PDF properties in multidimensional space has the advantage of enabling post-fitting of
  analytical models for normalised data. Quite often, we do not have analytical models for the full distortion
  in (3D+time), but we can have an analytical model for the delta-distortion time evolution.
  In my current studies, for example, we are fitting a two-exponential phi-symmetric model of distortion
  due to common electric field modification.

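To make the "balanced semi-stratified" sampling concrete, here is a minimal sketch of the idea (the toy exponential pT spectrum, the bin ranges, and the per-bin `target` cap are my illustrative assumptions, not the production configuration): each track is accepted with probability inversely proportional to the occupancy of its pT bin, which flattens the spectrum and reduces the data volume.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the real track data: a steeply falling (exponential) pT spectrum.
pt = rng.exponential(scale=1.0, size=1_000_000)

# Histogram the variable of interest to estimate its (unbalanced) density.
n_bins = 50
counts, edges = np.histogram(pt, bins=n_bins, range=(0.0, 5.0))
bin_idx = np.clip(np.digitize(pt, edges) - 1, 0, n_bins - 1)

# Accept each track with probability inversely proportional to its bin occupancy,
# capped so every bin keeps at most `target` entries -> approximately flat pT.
target = 2_000
p_accept = np.where(counts[bin_idx] > 0,
                    np.minimum(1.0, target / counts[bin_idx]), 0.0)
keep = rng.random(pt.size) < p_accept
pt_flat = pt[keep]

print(f"kept {pt_flat.size} of {pt.size} tracks "
      f"(reduction factor ~{pt.size / pt_flat.size:.0f})")
```

In practice the same inverse-occupancy acceptance can be applied jointly in several dimensions (e.g. flat in pT and PID species at once), and the acceptance probabilities can be stored as weights so the original PDF remains recoverable.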
### Q1: Does this capture your motivation accurately?

- There are several factors we must consider, as described above.
- Quite often (but not always), we have a large amount of data. Frequently, we are limited by memory and
  CPU for processing (see above). Normally, I try to parallelise if the data sets are independent,
  but using more than 4 GB of data in memory is problematic. Pre-sampling helps in unbinned data scenarios,
  as the original data are statistically highly unbalanced (exponential in mass for PID, power-law in pT, etc.).
- In many cases, the problem is not only the sparsity of the data. Our data are "random":
  to obtain a reasonable estimate of the characterisation of the corresponding PDF, we need substantial
  statistics in each bin. That is our major obstacle, which we are trying to address.

### Q2 (GPT question): Should I emphasize more?

* The statistics/sparsity problem (mathematical angle)
* The physics context (ALICE TPC, particle physics)
* The software engineering angle (reusability, API design)
* Balance is good as-is

===>

* After my comments above, I think the motivation section will be rewritten. We have to emphasise
  statistical and mathematical considerations as I described above: estimation of the PDF and later
  functional decomposition using partial models and some kind of factorisation.
* We should show examples from ALICE.
* The software has to be reusable, as the problem is generic and we need a generic solution.

### Q3: The tone is currently technical but general. Should it be: (question for Gemini and GPT)

* More mathematical (equations, formal notation)
* More practical (concrete examples upfront)
* Current level is appropriate

===>

I am not sure; I will ask GPT and Gemini about this. Some mathematics would be good, but this is a Markdown
file with limited mathematical capabilities. I think we should balance mathematics and practical examples.

| 58 | + |
| 59 | +### Q4: Any missing key points or mis-characterizations? |
| 60 | + |
| 61 | +* We should place greater emphasis on the statistical estimation problem; refer to my introduction. |
| 62 | + |
| 63 | +* The motivation should be grounded in these defined problems, with the ALICE examples serving to support this. |
| 64 | + |
| 65 | +* For software aspects, we should highlight reusability and API design, as the problem is generic and requires a |
| 66 | + generic solution. |
| 67 | + |
| 68 | +* I presented the problem previously in several forums – internal meetings, discussions with the ROOT team, and ML |
| 69 | + conferences several times – but it was difficult to explain. People did not understand the statistical estimation |
| 70 | + problem, possible factorisation, and later usage in analytical (physical model fitting) using some data |
| 71 | + renormalisation as I described above. |
| 72 | + |
| 73 | +* We do not have models for everything, but quite often we have models for normalised dlas-ratios in multidimensional space. |
| 74 | + |
| 75 | + |
### Q5: Should I add a diagram/figure placeholder (e.g., "Figure 1: Sparse 3D bins with ±1 neighborhood")?

- Yes, a diagram would be helpful.
- A figure illustrating sparse 3D bins with a ±1 neighborhood would effectively convey the concept
  of sparsity and the challenges of estimating PDF properties in such scenarios. But I am not sure how to do it.

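While I am not sure how to draw the figure either, the message it should carry ("most bins are empty or low-statistics, and pooling the ±1 neighbourhood helps") can be quantified with a small toy study (the grid size, point count, and toy distribution are my illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: points concentrated near one corner of a 3D space, as for a
# steeply falling spectrum, histogrammed on a 20x20x20 grid.
points = rng.exponential(scale=0.15, size=(20_000, 3))
hist, _ = np.histogramdd(points, bins=(20, 20, 20), range=[(0.0, 1.0)] * 3)

occupied = np.count_nonzero(hist)
print(f"occupied bins: {occupied}/{hist.size} "
      f"({100.0 * occupied / hist.size:.1f}%)")

# Sum each bin's +-1 neighbourhood (3x3x3 = 27 bins) via shifted, zero-padded views.
padded = np.pad(hist, 1)
pooled = sum(
    padded[1 + dx:21 + dx, 1 + dy:21 + dy, 1 + dz:21 + dz]
    for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
)

# Pooling trades spatial resolution for statistics: per-query counts grow substantially.
print(f"median content of occupied bins:    {np.median(hist[hist > 0]):.0f}")
print(f"median +-1 neighbourhood sum there: {np.median(pooled[hist > 0]):.0f}")
```

The same counting, done on the real binning, would also give concrete numbers for the figure caption.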
## Motivation - Iteration 1 (2025-10-27 09:00)

Before answering the questions, I would like to add some use cases:
* Distortion maps, already in use
* Performance parameterisation (e.g. track pT resolution as a function of pT, eta, occupancy, time)
  * track matching resolution and biases
  * V0 resolution and biases
  * PID resolution and biases
  * efficiency maps
  * QA variables (chi2, number of clusters, etc.)
  * usage in MC-to-data remapping

* Keep in mind that RootInteractive is only a small subproject for interactive visualisation of the data.

### Q1: Does Section 1 now accurately capture:
* The PDF estimation focus?
* Balanced sampling strategy?
* Factorization approach?
* Connection to RootInteractive?

===>

* I think it is more or less OK.
* The balanced sampling strategy is mentioned, but we need more details. In some use cases, we sample down by a
  factor of \(10^3\)–\(10^4\) to obtain a manageable data size, making further processing feasible.
* RootInteractive is just one subproject, for interactive visualisation of the extracted data.
* Comment on the current version example: in a particular case, I use 90 samples for distortion maps. In reality,
  we use 5–10 minute maps, but in some cases we have to go to O(s) to follow fluctuations. Obviously, we cannot do
  this with full spatial granularity, so some factorisation will be used.

### Q2: Tone and depth:

* Is the mathematical level appropriate?
  * I will ask GPT/Gemini for feedback on this.
* Should I add equations (e.g., kernel weighting formula)?
  * Yes, adding equations would enhance clarity. However, we should ask GPT and Gemini.
* Is the ALICE example clear and compelling?
  * We need distortion map examples and performance parameterisation examples to make it clearer.

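As a placeholder for the kernel weighting equation, a generic local kernel-regression form could be used (this is my assumed notation for a kernel-weighted estimate over the ±1 bin neighbourhood; it should be checked against the actual implementation before inclusion):

\[
\hat{f}(x_0) = \frac{\sum_i w_i\, y_i}{\sum_i w_i},
\qquad
w_i = K\!\left(\frac{\lVert x_i - x_0 \rVert}{h}\right),
\qquad
K(u) = \max\bigl(0,\, 1 - u^2\bigr),
\]

where \(x_0\) is the query bin centre, \(x_i\) are the centres of the neighbouring bins, \(y_i\) the local estimates (e.g. means or quantiles), \(h\) the bandwidth, and \(K\) an Epanechnikov-type kernel.
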
### Q3: Missing elements:

* Any key concepts I still missed?
* Should I reference specific equations from your paper?
* Need more or less technical detail?

===>

I included something at the beginning (the performance parameterisation case), but I am not sure how much we
can emphasise it without losing the audience. It can, however, be mentioned in the motivation section (as a
category) and later in the example sections.

### Q4: Structure:

* Are the subsections (1.1-1.5) logical?
* Should I reorganize anything?

===>

* I think the structure is OK for now. We can also ask GPT/Gemini for feedback on this.

### Q5: Next steps:

* Should we send Section 1 to GPT/Gemini now?
* Or continue to Section 2 first?

===>

We need a GPT/Gemini review before proceeding to Section 2.