Skip to content

Third round of OpenCL maintenance for 5.6#20383

Merged
TurboGit merged 6 commits intodarktable-org:masterfrom
jenshannoschwalm:opencl_maintenance_56_3
Feb 25, 2026
Merged

Third round of OpenCL maintenance for 5.6#20383
TurboGit merged 6 commits intodarktable-org:masterfrom
jenshannoschwalm:opencl_maintenance_56_3

Conversation

@jenshannoschwalm
Copy link
Copy Markdown
Collaborator

  1. b4ea535 introduces dt_opencl_enqueue_kernel_2d_local_args() as an equivalent to dt_opencl_enqueue_kernel_2d_args() for improved readability
  2. 84ad991 makes use of above macro
  3. In 1089656 we have various minor OpenCL kernel improvements with subtle performance gains or maintenance by using available functions.
  4. 9a7631e will help to investigate performance issues related to host<->cldevice memory interactions.

@jenshannoschwalm jenshannoschwalm added this to the 5.6 milestone Feb 23, 2026
@jenshannoschwalm jenshannoschwalm added scope: codebase making darktable source code easier to manage OpenCL Related to darktable OpenCL code scope: debugging labels Feb 23, 2026
This macro with it's backend dt_opencl_enqueue_kernel_2d_local_args_internal() is used
to call the kernel including locals, sizes and all kernel parameters for simplification.
Consequently using kernel calling _args() variants.
1. Use CLFARRAY for calc_Y0_mask (dual blend), denoiseprofile and hazeremoval for a simpler interface
2. Use clipf() macro in some more places
3. The weight function in atrous.cl gets some subtle performance boost by being an inline.
   Tested also for fast exp() variant, there seems to be no performance gain as the native function
   is equally performant.
4. Use OpenCL mix() function instead of _interpolatef()
5. Use two macros in rcd demosaicer
6. Make use of dt_fast_hypot()
1. For improved debugging of OpenCL performance we want information about clmem read/write/copy
   actions via the -d verbose switch
2. due to float->double conversion issues we sometimes had bad negative timings.
The maximum number of OpenCL events that can be handled within the driver/device is not
exposed in the OpenCL API, in case this exceeds device-internal resources an error
code would be returned which is handled and reported by darktable OpenCL interface.

We still try to keep event resources within sensible limits to reduce stress.

The per-device dt_opencl_device_t struct now only has a flag use_events, the max number
of events is defined as DT_OPENCL_EVENTS and the log has been updated.
The roundup magics for width&height are mainly relevant for kernels called without locals
as good values generally improve performance.

Profiling these magics is simply not worth the effort, we can do a very good guess based on
maximum workgroup size for the device.
Tests to do this per kernel via dt_opencl_get_kernel_work_group_size() shows that the
overhead decreases performance for those simple kernels so we go the easy way.

Please note:
We still write calculated data to the per-device conf for now to avoid confusion for people
reading that conf but very likely we will go for a cl version bump for less options offered.
@jenshannoschwalm
Copy link
Copy Markdown
Collaborator Author

Two commits added
5. 95507ef simplifies the OpenCL handling conf options, no problems expected at all
6. 7d191f2 is another simplification of the per-device conf, the roundups just don't need to be exposed.

Copy link
Copy Markdown
Member

@TurboGit TurboGit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, some comments.

What about the integration tests? Do you foresee some diff?

Comment thread data/kernels/atrous.cl
Comment thread src/common/bilateral.c
Comment thread data/kernels/demosaic_rcd.cl
@jenshannoschwalm
Copy link
Copy Markdown
Collaborator Author

What about the integration tests? Do you foresee some diff?

Unfortunately my integration tests on latest fedora has significant problems and i simply don't understand enough python to did into this and the depending libraries.

But - no, i would concider any new difference as a regression/bug. The PR is again pretty large but it's principally only about the way we call OpenCL kernels for maintenance. The late 2 commits don't interfere with results but might give a subtle perf gain if the user didn't tune ...

@TurboGit
Copy link
Copy Markdown
Member

Unfortunately my integration tests on latest fedora has significant problems and i simply don't understand enough python to did into this and the depending libraries.

No problem I have started the regression test. I'll report back.

@TurboGit
Copy link
Copy Markdown
Member

All tests are OK with previous testsuite.

I have added a new check to find regressions on the count of diff count pixels between the CPU & GPU run. In this new testsuite we have a slight increase of the count for the following tests:

  • 0064-demosaic-xtrans-vng
  • 0066-demosaic-mark3
  • 0068-rawdenoise-xtrans
  • 0100-invert-xtrans
  • 0145-lens-metadata-xtransIV-modversion-6
  • 0165-demosaic-markesteijn-vng
  • 0173-capture-dual-markesteijn

Anyway I have updated the baseline for pixel diff for now. We will be able to detect regressions now.

Fact is that we have some tests with a big pixel diff count (> 900000 on one test for example).

Copy link
Copy Markdown
Member

@TurboGit TurboGit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! And sorry for my stupid questions :)

@TurboGit TurboGit merged commit a433d9d into darktable-org:master Feb 25, 2026
9 of 10 checks passed
@jenshannoschwalm jenshannoschwalm deleted the opencl_maintenance_56_3 branch February 25, 2026 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation: pending a documentation work is required OpenCL Related to darktable OpenCL code release notes: pending scope: codebase making darktable source code easier to manage scope: debugging

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants