
Conversation

@Haoning-Sun opened this pull request:

This commit adds a new interface to the tracer, which can be used to dynamically update the sampling frequency.
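
For illustration, a minimal sketch of how such an interface might be used from a host agent, assuming the method is exposed on the tracer as UpdateSamplingFrequency. The constructor, config type, and import path below are assumptions based on this discussion, not the exact API of the PR:

```go
package sketch

import (
	"context"

	log "github.com/sirupsen/logrus"

	"go.opentelemetry.io/ebpf-profiler/tracer" // assumed import path
)

func run(ctx context.Context, cfg *tracer.Config) error {
	// Assumed constructor; the real setup may differ.
	trc, err := tracer.NewTracer(ctx, cfg)
	if err != nil {
		return err
	}

	// Later, raise the sampling frequency without recreating the tracer.
	if err := trc.UpdateSamplingFrequency(120); err != nil {
		log.Errorf("failed to update sampling frequency: %v", err)
	}
	return nil
}
```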

Haoning-Sun requested review from a team as code owners on October 24, 2025, 01:18.

The linux-foundation-easycla bot commented on Oct 24, 2025:

CLA Signed

The committers listed above are authorized under a signed CLA.

@florianl (Contributor) left a comment:

Thanks for your contribution.
Do you have a particular use case for this?


// Calculate the new sampling period in nanoseconds
// period = 1 second / frequency
period := uint64(time.Second.Nanoseconds() / int64(newSamplesPerSecond))

Contributor commented:

This conversion is tricky, I think. Can we be sure that the event rate is always in nanoseconds (on both x86 and arm64)?

@Haoning-Sun (Author) commented on Oct 24, 2025:

The period is calculated the same way during perf_event initialization. It is a software event, so perhaps its definition is independent of the hardware architecture:
https://github.com/torvalds/linux/blob/6fab32bb6508abbb8b7b1c5498e44f0c32320ed5/kernel/events/core.c#L11850

Contributor commented:

How did you verify the proposed changes?
When setting the sampling frequency initially to 20 Hz and printing the resulting period on every call in native_tracer_entry(), it shows 50000000, which is expected, as 1000000000 / 20 = 50000000.

But when updating the period some time later to 100 samples per second with the proposed function, I expected a value like 10000000. Instead I got 50000000 (the same value as for 20 Hz), and the number of reported events also did not increase on the user-space side.

Member commented:

It works for me locally (15 cores):

WARN[0004] 374 EV/s                
WARN[0005] 300 EV/s                 
WARN[0006] 375 EV/s                  
WARN[0007] 374 EV/s          
WARN[0008] 300 EV/s              
INFO[0008] Sampling frequency updated to 120 Hz         
WARN[0009] 1845 EV/s               
WARN[0011] 2085 EV/s             
WARN[0012] 1920 EV/s                 
WARN[0013] 1875 EV/s               
WARN[0014] 1800 EV/s               
WARN[0015] 1845 EV/s                 
WARN[0016] 2264 EV/s             
WARN[0017] 2408 EV/s                
WARN[0018] 2006 EV/s          
WARN[0020] 1889 EV/s 

@Haoning-Sun (Author) commented:

I used it in Parca and observed the change in the number of samples.

@Haoning-Sun (Author) commented:

> Thanks for your contribution. Do you have a particular use case for this?

We hope to change the sampling frequency without recreating the tracer.


var updateErrors []error
for i, event := range *events {
if err := event.UpdatePeriod(period); err != nil {
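
For background, a runtime period change on a perf event is usually applied with the PERF_EVENT_IOC_PERIOD ioctl; whether event.UpdatePeriod in this PR maps onto exactly that call is an assumption. A minimal sketch of the mechanism on a raw file descriptor is below. Note that the kernel interprets the passed value as a frequency if the event was opened with attr.freq set, and as a period otherwise, which matters when debugging why an update does not seem to take effect.

```go
package sketch

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

// setPerfPeriod applies a new sample period to an already-open perf event fd.
// For the CPU-clock software event the period is expressed in nanoseconds.
func setPerfPeriod(fd int, period uint64) error {
	_, _, errno := unix.Syscall(
		unix.SYS_IOCTL,
		uintptr(fd),
		uintptr(unix.PERF_EVENT_IOC_PERIOD),
		uintptr(unsafe.Pointer(&period)),
	)
	if errno != 0 {
		return fmt.Errorf("PERF_EVENT_IOC_PERIOD: %w", errno)
	}
	return nil
}
```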

@christos68k (Member) commented on Oct 24, 2025:

This doesn't change the size of the per-CPU buffers where the events are written, so we can expect to see lots of dropped events with higher sampling frequencies and full CPU utilization.

@Haoning-Sun (Author) commented:

So, considering the overall situation, it is not appropriate to update the events directly like this, right?

@christos68k (Member) commented on Oct 27, 2025:

One way is to introduce a maxSamplingFrequency configuration option and size the per-CPU buffers by it (worst-case). Then in the UpdateSamplingFrequency method you would not allow the frequency to exceed the configured max.

Keep in mind though that the profiler isn't currently designed with very high (e.g. in the thousands) sampling frequencies in mind and there will be unexpected issues when that's the case.
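
A rough sketch of what that check could look like, with assumed field and helper names (maxSamplesPerSecond coming from the new configuration option, applyPeriod standing in for the per-event update loop):

```go
package sketch

import (
	"fmt"
	"time"
)

// Tracer is a minimal stand-in for the real tracer type; fields are illustrative.
type Tracer struct {
	maxSamplesPerSecond int
	applyPeriod         func(period uint64) error
}

// UpdateSamplingFrequency rejects frequencies the per-CPU buffers were not sized for.
func (t *Tracer) UpdateSamplingFrequency(samplesPerSecond int) error {
	if samplesPerSecond <= 0 {
		return fmt.Errorf("sampling frequency must be positive, got %d", samplesPerSecond)
	}
	if samplesPerSecond > t.maxSamplesPerSecond {
		return fmt.Errorf("sampling frequency %d exceeds configured maximum %d",
			samplesPerSecond, t.maxSamplesPerSecond)
	}
	period := uint64(time.Second.Nanoseconds() / int64(samplesPerSecond))
	return t.applyPeriod(period)
}
```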

@Haoning-Sun (Author) commented:

Good idea, I've added the configuration for it.

Haoning-Sun and others added 2 commits October 28, 2025 17:42
Use errors.Join to combine updateErrors

Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>

@florianl (Contributor) commented:

Please also make sure that package reporter also updates its SamplesPerSecond and reports the correct frequency.

@Haoning-Sun (Author) commented:

> Please also make sure that package reporter also updates its SamplesPerSecond and reports the correct frequency.

Added it in controller and reporter.
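
For reference, one way the reporter side could track the live value is with an atomic that the controller writes and the reporting loop reads. This is a sketch with illustrative names, not the PR's actual code:

```go
package sketch

import "sync/atomic"

// samplesPerSecond holds the sampling frequency currently attached to reported profiles.
var samplesPerSecond atomic.Int64

// UpdateSamplingFrequency is the reporter-side hook called by the controller.
func UpdateSamplingFrequency(freq int) {
	samplesPerSecond.Store(int64(freq))
}

// currentFrequency is read by the reporting loop when building the next profile batch.
func currentFrequency() int64 {
	return samplesPerSecond.Load()
}
```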

cli_flags.go Outdated
const (
// Default values for CLI flags
defaultArgSamplesPerSecond = 20
defaultArgMaxSamplesPerSecond = 40

Contributor commented:

We should not set such a low default for the maximum number of samples per second. Overall, I think it would be best not to set a maximum at all; the maximum should be determined by the CPU core frequency and perf, not limited this way.

Suggested change: remove the line `defaultArgMaxSamplesPerSecond = 40`.

On a regular basis I run the profiler at 503 Hz and it is totally fine. It generates, as expected, way more data than at 19 Hz, but it also provides more insight.

Member commented:

We need a maximum, as it will be used to size the maps (at least for now, without restarting the tracer). The default can be zero, which means the internal maximum is set to args.SamplesPerSecond, essentially disallowing runtime sampling-frequency changes (identical behavior to what we have today).

If the user sets a value greater than zero, we need to validate that it is also greater than args.SamplesPerSecond.
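
In validation terms, that rule might look like the following sketch; the flag and field names are assumptions:

```go
package sketch

import "fmt"

// resolveMaxSamplesPerSecond turns the CLI flags into the internal maximum.
// A zero max falls back to samplesPerSecond, i.e. runtime changes stay disabled.
func resolveMaxSamplesPerSecond(samplesPerSecond, maxSamplesPerSecond int) (int, error) {
	if maxSamplesPerSecond == 0 {
		return samplesPerSecond, nil
	}
	if maxSamplesPerSecond <= samplesPerSecond {
		return 0, fmt.Errorf("max-samples-per-second (%d) must be greater than samples-per-second (%d)",
			maxSamplesPerSecond, samplesPerSecond)
	}
	return maxSamplesPerSecond, nil
}
```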

@Haoning-Sun (Author) commented:

Yeah, that makes sense.

@florianl (Contributor) commented on Oct 30, 2025:

Maybe a different approach: what about taking the configured sampling frequency as the upper limit and only allowing the sampling frequency to be reduced?

Allowing only reductions of the sampling frequency also makes sure that all caches are correctly sized for the highest configured frequency.

@christos68k (Member) commented on Oct 30, 2025:

Sizing the caches by maxFrequency, as in this PR, has the same effect: the caches are correctly sized for the highest possible frequency, but in addition we don't default to the more CPU-expensive mode of operation.

Contributor commented:

From a user perspective it might be easier to just allow downsampling as described. It also reduces logic complexity in various places.

Haoning-Sun and others added 3 commits October 30, 2025 09:10
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
}

// UpdateSamplingFrequency updates the sampling frequency used in profile generation.
func (p *Pdata) UpdateSamplingFrequency(samplesPerSecond int) {

Contributor commented:

If updating the sampling frequency gets supported, some kind of "report current frames now" mechanism might be required.
Otherwise we end up in a situation where package reporter already holds some data (with the old frequency), the sampling frequency gets updated, and with the next reporting cycle a mix of data with the old and new sampling frequency is sent. This introduces an unfortunate challenge for UIs and backends if they can't rely on the sampling frequency.

@Haoning-Sun (Author) commented:

A Flush method was added to the reporter, but it doesn't completely solve the problem.

@christos68k (Member) commented:

I don't think Flush is the right solution here. It introduces semantic complexity around who initiates reporting operations. Previously there was a single, clear owner of the reporting operation: the runloop goroutine.

Now, we're introducing another owner who may call reportProfile.

A better way to solve the sampling frequency mismatch is to annotate the samples with the sampling frequency in the kernel. But I don't think we need to expand the scope of this PR to include that, it can be done separately. In many cases, the temporary sampling frequency mismatch resulting from UpdateFrequency will be noise.
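
A rough illustration of the annotation idea (purely hypothetical, not part of this PR): each sample would carry the frequency that was in effect when it was captured, so a reporting interval can mix samples taken at different frequencies without ambiguity.

```go
package sketch

// sampleMeta is a hypothetical per-sample annotation; field names are illustrative.
type sampleMeta struct {
	TimestampNS uint64
	// SamplingFrequencyHz is the sampling frequency active when the sample was taken.
	SamplingFrequencyHz uint32
}
```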

if c.tracer == nil {
return fmt.Errorf("tracer is not initialized")
}

Contributor commented:

Should the controller be responsible for checking the configuration before it is passed on to other parts, like the tracer or the reporter?

@Haoning-Sun (Author) commented:

Added the check condition.
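
For reference, the kind of controller-level guard meant here might look like the following sketch; the types, field names, and wiring are stand-ins, and the actual check in the PR may differ:

```go
package sketch

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the real tracer and reporter types.
type tracerIface interface{ UpdateSamplingFrequency(int) error }
type reporterIface interface{ UpdateSamplingFrequency(int) }

type Controller struct {
	tracer              tracerIface
	reporter            reporterIface
	maxSamplesPerSecond int
}

// UpdateSamplingFrequency validates the request before fanning it out, so the
// tracer and the reporter never see an out-of-range value.
func (c *Controller) UpdateSamplingFrequency(samplesPerSecond int) error {
	if c.tracer == nil {
		return errors.New("tracer is not initialized")
	}
	if samplesPerSecond <= 0 || samplesPerSecond > c.maxSamplesPerSecond {
		return fmt.Errorf("sampling frequency %d out of range [1, %d]",
			samplesPerSecond, c.maxSamplesPerSecond)
	}
	if err := c.tracer.UpdateSamplingFrequency(samplesPerSecond); err != nil {
		return err
	}
	c.reporter.UpdateSamplingFrequency(samplesPerSecond)
	return nil
}
```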

Haoning-Sun force-pushed the update-sampling-frequency branch from ae81668 to 894e4cf on November 3, 2025, 15:22.
Haoning-Sun force-pushed the update-sampling-frequency branch from 90e8e4f to 7a51bed on November 3, 2025, 15:35.

The github-actions bot commented on Jan 8, 2026:

This PR was marked stale due to lack of activity. It will be closed in 14 days.

The github-actions bot added the Stale label on Jan 8, 2026.