Summary
The sample implementation's vxLaplacianPyramidKernel and vxLaplacianReconstructKernel are 4-line stubs that just return VX_SUCCESS. The actual pyramid computation has been hoisted into the corresponding *Initializer callbacks, which run once during vxVerifyGraph(). As a result:
- The output pyramid pixels are computed exactly once, against whatever input data was in the source image at the moment vxVerifyGraph() ran.
- Every subsequent vxProcessGraph() call is a no-op (~1 µs on FHD).
- If the application mutates the input image between vxProcessGraph() calls (the standard streaming/video pipeline pattern), the output pyramid is stale: it still reflects the input from vxVerifyGraph() time, not the current input.
This is a spec violation: the OpenVX 1.3 specification defines the kernel Initializer as the place to "initialize data once all the parameters have been validated" (per-node local buffers, child sub-graphs, etc.), not the place to perform the kernel's computation. vxProcessGraph() is the execution step and is required to process the graph against current input data on each call.
Spec citations
From api-docs/include/VX/vx_types.h:
/*!
* \brief The pointer to the kernel initializer. If the host code requires a call
* to initialize data once all the parameters have been validated, this function
* is called if not NULL.
*/
typedef vx_status(VX_CALLBACK *vx_kernel_initialize_f)(
vx_node node, const vx_reference *parameters, vx_uint32 num);
From the OpenVX 1.3 spec graph-execution lifecycle:
Call the vx_kernel_validate_f callback. Call the vx_kernel_initialize_f callback (if not NULL): if VX_KERNEL_LOCAL_DATA_SIZE == 0, the callback is allowed to set VX_NODE_LOCAL_DATA_SIZE and VX_NODE_LOCAL_DATA_PTR. […]
The Initializer's documented purpose is local-data setup, not kernel computation.
From the spec text for vxProcessGraph:
This function causes the synchronous processing of a graph. […] After the graph verifies successfully then processing occurs [*REQ-0606*]. If the graph was previously verified via vxVerifyGraph or vxProcessGraph then the graph is processed [*REQ-0607*].
"Processing occurs" / "the graph is processed" each call. There is no allowance for skipping execution when inputs haven't changed (and the implementation has no way to know they haven't — callers can mutate input pixels via vxMapImagePatch / vxCopyImagePatch between graph runs without notifying anything).
Where the bug lives
Bug 1: vxLaplacianPyramidKernel
sample/targets/c_model/vx_pyramid.c (and the identical pattern in sample/targets/venum/vx_laplacianpyramid.c):
// lines 692-702 (c_model)
static vx_status VX_CALLBACK vxLaplacianPyramidKernel(
vx_node node, const vx_reference parameters[], vx_uint32 num)
{
vx_status status = VX_FAILURE;
(void)node;
if (num == dimof(laplacian_pyramid_kernel_params))
{
status = VX_SUCCESS;
}
return status;
}
… while the matching Initializer (lines 859-948) does the entire computation:
static vx_status VX_CALLBACK vxLaplacianPyramidInitializer(...)
{
...
gaussian = vxCreatePyramid(context, levels + 1, VX_SCALE_PYRAMID_HALF, ...);
vxuGaussianPyramid(context, input, gaussian); // <-- builds the temp
// Gaussian pyramid
conv = vxCreateGaussian5x5Convolution(context);
for (lev = 0; lev < levels; lev++) {
pyr_gauss_curr_level_filtered = vxCreateImage(...);
upsampleImage(context, ..., gauss_next, conv, // <-- 5x5 convolve
pyr_gauss_curr_level_filtered, ...);
pyr_laplacian_curr_level = vxGetPyramidLevel(laplacian, ...);
status |= vxuSubtract(context, gauss_cur, // <-- L[i] = G[i] -
pyr_gauss_curr_level_filtered, // upsampled(G[i+1])
policy, pyr_laplacian_curr_level);
...
}
...
}
Bug 2: vxLaplacianReconstructKernel
Same pattern, same file (lines 989-998 stub kernel, 1141+ Initializer doing the work).
The same file already shows the correct pattern
Right next to these two buggy kernels, vxGaussianPyramidKernel does it the spec-compliant way (lines 464-476):
static vx_status VX_CALLBACK vxGaussianPyramidKernel(
vx_node node, const vx_reference parameters[], vx_uint32 num)
{
vx_status status = VX_FAILURE;
(void)parameters;
if (num == dimof(gaussian_pyramid_kernel_params))
{
vx_graph subgraph = ownGetChildGraphOfNode(node);
status = vxProcessGraph(subgraph); // <-- real work each call
}
return status;
}
The matching vxGaussianPyramidInitializer (lines 580-636) builds a child sub-graph (a chain of vxConvolveNode + vxScaleImageNode per level), stores it on the node via ownSetChildGraphOfNode, and returns. The kernel callback then re-executes that child sub-graph against the current input image on every vxProcessGraph() call. That is the correct use of the Initializer hook — it sets up the persistent data (the child sub-graph) once per node lifetime, while the actual kernel work happens in the kernel callback every call.
Reproducer
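A minimal reproducer sketch that fails on the unfixed sample (graph setup is elided; read_pixels and the result buffers, written here with C++-style .data()/.size() accessors, are application-side helpers rather than OpenVX APIs):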
// Build a graph with vxLaplacianPyramidNode.
// Fill input with all 100s, then run.
vxCopyImagePatch(input, &rect, 0, &addr, fill_100, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);
vxVerifyGraph(graph);
vxProcessGraph(graph); // first run
vx_image lvl0 = vxGetPyramidLevel(laplacian_pyr, 0);
read_pixels(lvl0, &result_first);
// Mutate the input — overwrite every pixel with 200.
vxCopyImagePatch(input, &rect, 0, &addr, fill_200, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);
vxProcessGraph(graph); // second run, *no* re-verify
read_pixels(lvl0, &result_second);
// Spec says the Laplacian pyramid is a function of the current input,
// so for two different inputs the two outputs MUST differ. With the
// Khronos sample, result_first == result_second byte-for-byte, because
// the per-iteration kernel callback was a no-op stub and the cached
// output from Initializer is still in place.
assert(memcmp(result_first.data(), result_second.data(),
result_first.size()) != 0); // FAILS on unfixed sample
The same reproducer passes when vxGaussianPyramidNode (or vxBox3x3Node, vxAddNode, etc.) is substituted; only the two Laplacian kernels carry this bug.
Why CTS does not catch it
cts/test_conformance/test_laplacianpyramid.c calls vxProcessGraph(graph) exactly once per test:
ASSERT_VX_OBJECT(node = vxLaplacianPyramidNode(graph, input, laplacian, output), VX_TYPE_NODE);
VX_CALL(vxSetNodeAttribute(node, VX_NODE_BORDER, &border, sizeof(border)));
VX_CALL(vxVerifyGraph(graph));
VX_CALL(vxProcessGraph(graph)); // <-- only one call
VX_CALL(vxReleaseNode(&node));
VX_CALL(vxReleaseGraph(&graph));
Same shape in the LaplacianReconstruct test. Because the conformance test never invokes vxProcessGraph a second time on the same graph after mutating the input, the staleness of the cached output is invisible. A test that did process → mutate input via vxCopyImagePatch → process → diff outputs would fail on the current sample and would be the right way to gate any fix.
Practical impact: misleading benchmark numbers
Beyond the correctness issue, this bug also distorts every benchmark that compares another OpenVX implementation against the Khronos sample. With a typical for (i=0; i<N; i++) vxProcessGraph(g); measurement loop, the sample reports:
- LaplacianPyramid on FHD (1920×1080): ~1 µs/iteration ≈ 1.4 million MP/s. This is physically impossible; it is just the cost of the kernel-stub function call.
- GaussianPyramid (correct sub-graph pattern) on the same FHD: ~159 ms/iteration ≈ 13 MP/s, which is the real cost.
Spec-compliant implementations that recompute the pyramid on every vxProcessGraph (e.g. rustVX) honestly report tens of milliseconds and end up looking many orders of magnitude slower than the sample, even though they are doing the same amount of work per iteration and producing fresh, correct output. The comparison row reads as "the Khronos sample is a million times faster" when in reality it is silently caching a stale result from verify-time.
This trips up:
- Implementers porting OpenVX to a new target: they see their kernel "underperforming" against the reference and may waste days trying to optimize it before realising the reference isn't doing the work in the timed window. The vx_perf snapshot that callers get via vxQueryNode(node, VX_NODE_PERFORMANCE, ...) is also 0.000 ms for these two kernels in the sample, which is the telltale sign: for every other Khronos kernel where the callback contains real work, wall-clock and vx_perf agree to within 0.1 ms; only LaplacianPyramid and LaplacianReconstruct disagree by orders of magnitude.
- Benchmark authors / users: the openvx-mark FHD comparison was misreporting Khronos LaplacianPyramid ≈ 1.4 million MP/s. We've worked around it on the bench side (kiritigowda/openvx-mark#4) by rebuilding+verifying+processing inside the timed window, but the underlying sample-impl bug is what makes the workaround necessary.
- Anyone reading the sample as reference code for how to write an OpenVX kernel: the file teaches two contradictory patterns side-by-side. New target back-ends derived from this sample (there are several in the wild; AMD MIVisionX's c_model-derived path is one) inherit the bug.
Suggested fix
Refactor vxLaplacianPyramidKernel/vxLaplacianPyramidInitializer (and vxLaplacianReconstruct*) to use the same child-sub-graph pattern that the same file already uses correctly for vxGaussianPyramidKernel:
- Initializer: build a child sub-graph (vxGaussianPyramidNode for the temp pyramid, plus per-level vxConvolveNode for the 5×5 upsample and vxSubtractNode for L[i] = G[i] - upsampled(G[i+1]), with the appropriate scale/offset glue). Store it on the node with ownSetChildGraphOfNode(node, subgraph). Verify the sub-graph here. Do not write any output pixels at this step.
- Kernel callback: return vxProcessGraph(ownGetChildGraphOfNode(node)); exactly as vxGaussianPyramidKernel does today.
- Deinitializer: release the child sub-graph (vxGaussianPyramidDeinitializer is a one-line template).
The same diff applies in both sample/targets/c_model/vx_pyramid.c and sample/targets/venum/vx_laplacianpyramid.c. Estimated size: ~80 lines per kernel, replacing the 90-line monolithic Initializer with a sub-graph builder of comparable size, and turning the no-op stub kernel into a one-liner.
Recommended companion change: extend cts/test_conformance/test_laplacianpyramid.c (and the equivalent reconstruct test) with a "process twice with different inputs and assert outputs differ" check, so this regression cannot reappear.
References
rustVX (a Rust reimplementation that respects the spec and recomputes per vxProcessGraph): https://github.com/kiritigowda/rustvx. The original observation came from rustVX's LaplacianPyramid benchmark looking "57 ms vs 0.001 ms" worse than the Khronos sample, which was the prompt to investigate.