Third round of OpenCL maintenance for 5.6#20383
Third round of OpenCL maintenance for 5.6#20383TurboGit merged 6 commits intodarktable-org:masterfrom
Conversation
9a7631e to
78efa38
Compare
This macro with it's backend dt_opencl_enqueue_kernel_2d_local_args_internal() is used to call the kernel including locals, sizes and all kernel parameters for simplification.
Consequently using kernel calling _args() variants.
1. Use CLFARRAY for calc_Y0_mask (dual blend), denoiseprofile and hazeremoval for a simpler interface 2. Use clipf() macro in some more places 3. The weight function in atrous.cl gets some subtle performance boost by being an inline. Tested also for fast exp() variant, there seems to be no performance gain as the native function is equally performant. 4. Use OpenCL mix() function instead of _interpolatef() 5. Use two macros in rcd demosaicer 6. Make use of dt_fast_hypot()
1. For improved debugging of OpenCL performance we want information about clmem read/write/copy actions via the -d verbose switch 2. due to float->double conversion issues we sometimes had bad negative timings.
The maximum number of OpenCL events that can be handled within the driver/device is not exposed in the OpenCL API, in case this exceeds device-internal resources an error code would be returned which is handled and reported by darktable OpenCL interface. We still try to keep event resources within sensible limits to reduce stress. The per-device dt_opencl_device_t struct now only has a flag use_events, the max number of events is defined as DT_OPENCL_EVENTS and the log has been updated.
The roundup magics for width&height are mainly relevant for kernels called without locals as good values generally improve performance. Profiling these magics is simply not worth the effort, we can do a very good guess based on maximum workgroup size for the device. Tests to do this per kernel via dt_opencl_get_kernel_work_group_size() shows that the overhead decreases performance for those simple kernels so we go the easy way. Please note: We still write calculated data to the per-device conf for now to avoid confusion for people reading that conf but very likely we will go for a cl version bump for less options offered.
78efa38 to
7d191f2
Compare
TurboGit
left a comment
There was a problem hiding this comment.
First pass, some comments.
What about the integration tests? Do you foresee some diff?
Unfortunately my integration tests on latest fedora has significant problems and i simply don't understand enough python to did into this and the depending libraries. But - no, i would concider any new difference as a regression/bug. The PR is again pretty large but it's principally only about the way we call OpenCL kernels for maintenance. The late 2 commits don't interfere with results but might give a subtle perf gain if the user didn't tune ... |
No problem I have started the regression test. I'll report back. |
|
All tests are OK with previous testsuite. I have added a new check to find regressions on the count of diff count pixels between the CPU & GPU run. In this new testsuite we have a slight increase of the count for the following tests:
Anyway I have updated the baseline for pixel diff for now. We will be able to detect regressions now. Fact is that we have some tests with a big pixel diff count (> 900000 on one test for example). |
TurboGit
left a comment
There was a problem hiding this comment.
Thanks! And sorry for my stupid questions :)
dt_opencl_enqueue_kernel_2d_local_args()as an equivalent todt_opencl_enqueue_kernel_2d_args()for improved readability