vxLaplacianPyramidKernel and vxLaplacianReconstructKernel hoist computation into Initializer, returning stale output across vxProcessGraph calls

## Summary

The sample implementation's `vxLaplacianPyramidKernel` and `vxLaplacianReconstructKernel` are stubs that only check the parameter count and return `VX_SUCCESS`. The actual pyramid computation has been hoisted into the corresponding `*Initializer` callbacks, which run **once** during `vxVerifyGraph()`. As a result:

- The output pyramid pixels are computed exactly once, against whatever input data was in the source image at the moment `vxVerifyGraph()` ran.
- Every subsequent `vxProcessGraph()` call is effectively a no-op (~1 µs on FHD).
- If the application mutates the input image between `vxProcessGraph()` calls — the standard streaming/video pipeline pattern — the output pyramid is **stale**: it still reflects the input from `vxVerifyGraph()` time, not the current input.

This is a spec violation: the OpenVX 1.3 specification defines the kernel Initializer as the place to "initialize data once all the parameters have been validated" (per-node local buffers, child sub-graphs, etc.), not the place to perform the kernel's computation. `vxProcessGraph()` is the execution step and is required to process the graph against current input data on each call.
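The failure mode can be shown without OpenVX at all. The sketch below is a toy model, not the sample's code: `toy_node`, `toy_verify`, and `toy_process` are hypothetical stand-ins for a node, `vxVerifyGraph()`, and `vxProcessGraph()`, and the "kernel" simply doubles each input element.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LEN 4

/* Hypothetical stand-in for a graph node: borrowed input, owned output. */
typedef struct {
    const int *input;
    int output[TOY_LEN];
} toy_node;

/* The "real work" of this toy kernel: out[i] = 2 * in[i]. */
static void toy_compute(toy_node *n)
{
    assert(n->input != NULL);
    for (size_t i = 0; i < TOY_LEN; i++)
        n->output[i] = 2 * n->input[i];
}

/* Buggy layout: the initializer computes, the kernel callback is a stub. */
static void buggy_initializer(toy_node *n) { toy_compute(n); }
static int  buggy_kernel(toy_node *n)      { (void)n; return 0; }

/* Stand-ins for vxVerifyGraph() (once) and vxProcessGraph() (per frame). */
static void toy_verify(toy_node *n)  { buggy_initializer(n); }
static int  toy_process(toy_node *n) { return buggy_kernel(n); }

/* Returns 1 if the output still reflects the first frame after the
 * input buffer has been overwritten and the node processed again. */
static int toy_output_is_stale(void)
{
    int frame[TOY_LEN] = {1, 2, 3, 4};
    toy_node n = { .input = frame, .output = {0} };

    toy_verify(&n);
    toy_process(&n);                 /* first run: output is {2, 4, 6, 8} */

    for (size_t i = 0; i < TOY_LEN; i++)
        frame[i] = 10;               /* next frame lands in the same buffer */
    toy_process(&n);                 /* second run: the stub does nothing */

    return n.output[0] == 2;         /* unchanged since verify time */
}
```

In this model `toy_output_is_stale()` returns 1: nothing in the per-frame path ever reads the updated buffer.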

## Spec citations

From `api-docs/include/VX/vx_types.h`:

```c
/*!
 * \brief The pointer to the kernel initializer. If the host code requires a call
 * to initialize data once all the parameters have been validated, this function
 * is called if not NULL.
 */
typedef vx_status(VX_CALLBACK *vx_kernel_initialize_f)(
    vx_node node, const vx_reference *parameters, vx_uint32 num);
```

From the OpenVX 1.3 spec graph-execution lifecycle:

> Call the `vx_kernel_validate_f` callback. Call the `vx_kernel_initialize_f` callback (if not NULL): if `VX_KERNEL_LOCAL_DATA_SIZE == 0`, the callback is allowed to set `VX_NODE_LOCAL_DATA_SIZE` and `VX_NODE_LOCAL_DATA_PTR`. […]

The Initializer's documented purpose is **local-data setup**, not kernel computation.

From the spec text for `vxProcessGraph`:

> This function causes the synchronous processing of a graph. […] After the graph verifies successfully then processing occurs `[*REQ-0606*]`. If the graph was previously verified via `vxVerifyGraph` or `vxProcessGraph` then the graph is processed `[*REQ-0607*]`.

"Processing occurs" and "the graph is processed" apply to every call. There is no allowance for skipping execution when inputs haven't changed (and the implementation has no way to know they haven't, because callers can mutate input pixels via `vxMapImagePatch` / `vxCopyImagePatch` between graph runs without notifying anything).

## Where the bug lives

### Bug 1: `vxLaplacianPyramidKernel`

`sample/targets/c_model/vx_pyramid.c` (and the identical pattern in `sample/targets/venum/vx_laplacianpyramid.c`):

```c
// lines 692-702 (c_model)
static vx_status VX_CALLBACK vxLaplacianPyramidKernel(
    vx_node node, const vx_reference parameters[], vx_uint32 num)
{
    vx_status status = VX_FAILURE;
    (void)node;

    if (num == dimof(laplacian_pyramid_kernel_params))
    {
        status = VX_SUCCESS;
    }
    return status;
}
```

… while the matching Initializer (lines 859-948) does the entire computation:

```c
static vx_status VX_CALLBACK vxLaplacianPyramidInitializer(...)
{
    ...
    gaussian = vxCreatePyramid(context, levels + 1, VX_SCALE_PYRAMID_HALF, ...);
    vxuGaussianPyramid(context, input, gaussian);          // <-- builds the temp
                                                           //     Gaussian pyramid
    conv = vxCreateGaussian5x5Convolution(context);

    for (lev = 0; lev < levels; lev++) {
        pyr_gauss_curr_level_filtered = vxCreateImage(...);
        upsampleImage(context, ..., gauss_next, conv,      // <-- 5x5 convolve
                      pyr_gauss_curr_level_filtered, ...);
        pyr_laplacian_curr_level = vxGetPyramidLevel(laplacian, ...);
        status |= vxuSubtract(context, gauss_cur,          // <-- L[i] = G[i] -
                              pyr_gauss_curr_level_filtered, //   upsampled(G[i+1])
                              policy, pyr_laplacian_curr_level);
        ...
    }
    ...
}
```
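For readers unfamiliar with the relation in the comments above, `L[i] = G[i] - upsampled(G[i+1])`, here is a one-dimensional sketch of a single level. It is a simplification under stated assumptions, not the sample's `upsampleImage`: it works on a 1D float signal, uses the binomial kernel `[1 4 6 4 1] / 16` with replicated borders, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Binomial 5-tap kernel [1 4 6 4 1] / 16, the 1D factor of a 5x5 Gaussian. */
static const float W5[5] = { 1 / 16.0f, 4 / 16.0f, 6 / 16.0f, 4 / 16.0f, 1 / 16.0f };

/* Replicate-border indexing: clamp i into [0, n-1]. */
static size_t clamp_index(long i, size_t n)
{
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* G[i+1]: blur G[i] and keep every second sample.
 * `next` must hold (n + 1) / 2 floats. */
static void reduce_level(const float *cur, size_t n, float *next)
{
    size_t m = (n + 1) / 2;
    for (size_t j = 0; j < m; j++) {
        float acc = 0.0f;
        for (int k = 0; k < 5; k++)
            acc += W5[k] * cur[clamp_index((long)(2 * j) + k - 2, n)];
        next[j] = acc;
    }
}

/* upsampled(G[i+1]): treat odd positions as inserted zeros, blur, and
 * multiply by 2 to restore the gain lost to the zeros (1D case). */
static void expand_level(const float *next, size_t n, float *up)
{
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int k = 0; k < 5; k++) {
            size_t p = clamp_index((long)i + k - 2, n);
            if (p % 2 == 0)              /* only even positions carry data */
                acc += W5[k] * next[p / 2];
        }
        up[i] = 2.0f * acc;
    }
}

/* L[i] = G[i] - upsampled(G[i+1]) for one level of length n. */
static void laplacian_level(const float *cur, size_t n,
                            float *next, float *up, float *lap)
{
    assert(n >= 1);
    reduce_level(cur, n, next);
    expand_level(next, n, up);
    for (size_t i = 0; i < n; i++)
        lap[i] = cur[i] - up[i];
}
```

Every output sample depends only on the current input, which is why a result cached at verify time cannot stay valid once the input changes. One property worth keeping in mind when choosing test data: a uniform input yields zeros away from the borders in this sketch, so inputs with some texture are the more telling probe.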

### Bug 2: `vxLaplacianReconstructKernel`

The same pattern appears in the same file: the stub kernel is at lines 989-998, and the Initializer that does the work starts at line 1141.

### The same file already shows the correct pattern

Right next to these two buggy kernels, `vxGaussianPyramidKernel` does it the spec-compliant way (lines 464-476):

```c
static vx_status VX_CALLBACK vxGaussianPyramidKernel(
    vx_node node, const vx_reference parameters[], vx_uint32 num)
{
    vx_status status = VX_FAILURE;
    (void)parameters;

    if (num == dimof(gaussian_pyramid_kernel_params))
    {
        vx_graph subgraph = ownGetChildGraphOfNode(node);
        status = vxProcessGraph(subgraph);                 // <-- real work each call
    }

    return status;
}
```

The matching `vxGaussianPyramidInitializer` (lines 580-636) builds a child sub-graph (a chain of `vxConvolveNode` + `vxScaleImageNode` per level), stores it on the node via `ownSetChildGraphOfNode`, and returns. The kernel callback then re-executes that child sub-graph against the current input image on every `vxProcessGraph()` call. **That is the correct use of the Initializer hook** — it sets up the persistent data (the child sub-graph) once per node lifetime, while the actual kernel work happens in the kernel callback every call.

## Reproducer

A minimal C sketch that fails on the unfixed sample (graph construction and the `read_pixels` helper are elided):

```c
// Build a graph with vxLaplacianPyramidNode.
// Fill the input with a non-uniform pattern, then run. (A uniform fill is a
// weak probe: G[0] - upsampled(G[1]) is close to zero for any constant image.)
vxCopyImagePatch(input, &rect, 0, &addr, pattern_a, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);
vxVerifyGraph(graph);
vxProcessGraph(graph);                  // first run
vx_image lvl0 = vxGetPyramidLevel(laplacian_pyr, 0);
read_pixels(lvl0, &result_first);

// Mutate the input: overwrite every pixel with a different non-uniform pattern.
vxCopyImagePatch(input, &rect, 0, &addr, pattern_b, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);
vxProcessGraph(graph);                  // second run, *no* re-verify
read_pixels(lvl0, &result_second);

// Spec says the Laplacian pyramid is a function of the current input,
// so for two different inputs the two outputs MUST differ. With the
// Khronos sample, result_first == result_second byte-for-byte, because
// the per-iteration kernel callback was a no-op stub and the cached
// output from Initializer is still in place.
assert(memcmp(result_first.data, result_second.data,
              result_first.size) != 0);     // FAILS on unfixed sample
```

The same reproducer passes with `vxGaussianPyramidNode` substituted (and likewise with `vxBox3x3Node`, `vxAddNode`, etc.); only the two Laplacian kernels carry this bug.

## Why CTS does not catch it

`cts/test_conformance/test_laplacianpyramid.c` calls `vxProcessGraph(graph)` exactly once per test:

```c
ASSERT_VX_OBJECT(node = vxLaplacianPyramidNode(graph, input, laplacian, output), VX_TYPE_NODE);
VX_CALL(vxSetNodeAttribute(node, VX_NODE_BORDER, &border, sizeof(border)));

VX_CALL(vxVerifyGraph(graph));
VX_CALL(vxProcessGraph(graph));        // <-- only one call
VX_CALL(vxReleaseNode(&node));
VX_CALL(vxReleaseGraph(&graph));
```

Same shape in the LaplacianReconstruct test. Because the conformance test never invokes `vxProcessGraph` a second time on the same graph after mutating the input, the staleness of the cached output is invisible. A test that did `process → mutate input via vxCopyImagePatch → process → diff outputs` would fail on the current sample and would be the right way to gate any fix.
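The shape of such a test can be sketched independently of the CTS macros. Everything below is a hypothetical mock (`mock_graph`, `output_tracks_input`, and the inverting "kernel" are illustrative names, not CTS or OpenVX API); the point is only the ordering of verify, process, mutate, process, compare.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_LEN 8

/* Hypothetical stand-in for a graph with one input and one output buffer. */
typedef struct mock_graph {
    unsigned char input[MOCK_LEN];
    unsigned char output[MOCK_LEN];
    void (*verify)(struct mock_graph *);   /* runs once  */
    void (*process)(struct mock_graph *);  /* runs per frame */
} mock_graph;

/* Toy kernel body: invert every byte of the input. */
static void mock_invert(mock_graph *g)
{
    for (size_t i = 0; i < MOCK_LEN; i++)
        g->output[i] = (unsigned char)(255 - g->input[i]);
}

static void mock_noop(mock_graph *g) { (void)g; }

/* process -> mutate input -> process -> diff outputs.
 * Returns 1 if the second output differs from the first, 0 if it is stale. */
static int output_tracks_input(mock_graph *g,
                               const unsigned char *a,
                               const unsigned char *b)
{
    unsigned char first[MOCK_LEN];

    assert(memcmp(a, b, MOCK_LEN) != 0);   /* the two inputs must differ */

    memcpy(g->input, a, MOCK_LEN);
    g->verify(g);
    g->process(g);
    memcpy(first, g->output, MOCK_LEN);

    memcpy(g->input, b, MOCK_LEN);         /* mutate input, no re-verify */
    g->process(g);

    return memcmp(first, g->output, MOCK_LEN) != 0;
}
```

A mock that computes in `process` passes this check, and one that computes in `verify` fails it. "Outputs differ" is only a necessary condition, so a real conformance test would more likely compare the second output against a reference computed from the second input.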

## Practical impact: misleading benchmark numbers

Beyond the correctness issue, this bug also distorts every benchmark that compares another OpenVX implementation against the Khronos sample. With a typical `for (i=0; i<N; i++) vxProcessGraph(g);` measurement loop, the sample reports:

- `LaplacianPyramid` on FHD (1920×1080): ~1 µs/iteration ≈ **1.4 million MP/s** ← physically impossible; just the cost of the kernel-stub function call.
- `GaussianPyramid` (correct sub-graph pattern) on the same FHD: ~159 ms/iteration ≈ 13 MP/s — the real cost.
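Both throughput figures follow from dividing the FHD frame size (1920 × 1080 = 2.0736 megapixels) by the time per iteration. The helpers below are a small illustrative sketch with hypothetical names; they also show that 1.4 million MP/s corresponds to roughly 1.5 µs per iteration, consistent with the rounded "~1 µs".

```c
#include <assert.h>

/* Throughput in megapixels per second for one frame processed in `seconds`. */
static double megapixels_per_second(unsigned width, unsigned height,
                                    double seconds)
{
    assert(seconds > 0.0);
    return ((double)width * (double)height / 1e6) / seconds;
}

/* Inverse: seconds per frame implied by a throughput figure in MP/s. */
static double seconds_per_frame(unsigned width, unsigned height,
                                double mp_per_s)
{
    assert(mp_per_s > 0.0);
    return ((double)width * (double)height / 1e6) / mp_per_s;
}
```

For example, `megapixels_per_second(1920, 1080, 0.159)` is about 13 MP/s, and `seconds_per_frame(1920, 1080, 1.4e6)` is about 1.5e-6 s.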

Spec-compliant implementations that recompute the pyramid on every `vxProcessGraph` (e.g. rustVX) honestly report tens of milliseconds and end up looking many orders of magnitude slower than the sample, even though they are doing the *same amount of work* per iteration and producing fresh, correct output. The comparison row reads as "the Khronos sample is a million times faster" when in reality it is silently caching a stale result from verify-time.

This trips up:

- **Implementers porting OpenVX to a new target**: they see their kernel "underperforming" against the reference and may waste days trying to optimize it before realising the reference isn't doing the work in the timed window. The same `vx_perf` snapshot that callers get via `vxQueryNode(node, VX_NODE_PERFORMANCE, ...)` is also `0.000 ms` for these two kernels in the sample, which is the smoke signal — every other Khronos kernel where the callback contains real work has wall-clock and `vx_perf` agreeing to within 0.1 ms; only `LaplacianPyramid` and `LaplacianReconstruct` disagree by orders of magnitude.
- **Benchmark authors / users**: the [openvx-mark](https://github.com/kiritigowda/openvx-mark) FHD comparison was misreporting `Khronos LaplacianPyramid ≈ 1.4 million MP/s`. We've worked around it on the bench side ([kiritigowda/openvx-mark#4](https://github.com/kiritigowda/openvx-mark/pull/4)) by rebuilding+verifying+processing inside the timed window, but the underlying sample-impl bug is what makes the workaround necessary.
- **Anyone reading the sample as reference code** for how to write an OpenVX kernel: the file teaches two contradictory patterns side-by-side. New target back-ends derived from this sample (and there are several in the wild — AMD MIVisionX's `c_model`-derived path is one) inherit the bug.

## Suggested fix

Refactor `vxLaplacianPyramidKernel`/`vxLaplacianPyramidInitializer` (and `vxLaplacianReconstruct*`) to use the same child-sub-graph pattern that the same file already uses correctly for `vxGaussianPyramidKernel`:

1. **Initializer**: build a child sub-graph (`vxGaussianPyramidNode` for the temp pyramid, plus per-level `vxConvolveNode` for the 5×5 upsample and `vxSubtractNode` for `L[i] = G[i] - upsampled(G[i+1])`, with the appropriate scale/offset glue). Store it on the node with `ownSetChildGraphOfNode(node, subgraph)`. Verify the sub-graph here. **Do not write any output pixels** at this step.
2. **Kernel callback**: `return vxProcessGraph(ownGetChildGraphOfNode(node));` — exactly as `vxGaussianPyramidKernel` does today.
3. **Deinitializer**: release the child sub-graph (`vxGaussianPyramidDeinitializer` is a one-line template).

The same diff applies in both `sample/targets/c_model/vx_pyramid.c` and `sample/targets/venum/vx_laplacianpyramid.c`. Estimated size: ~80 lines per kernel, replacing the 90-line monolithic Initializer with a sub-graph builder of comparable size, and turning the no-op stub kernel into a one-liner.
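The division of labour across the three callbacks can be illustrated with a mock. This is not the proposed patch: `fix_node` and `fix_plan` are hypothetical types, with `fix_plan` playing the role of the child sub-graph, and the "work" is a trivial multiply.

```c
#include <assert.h>
#include <stdlib.h>

#define FIX_LEN 4

/* Hypothetical persistent per-node state (stands in for the child sub-graph). */
typedef struct {
    int gain;
    unsigned runs;     /* how many times the kernel has executed */
} fix_plan;

typedef struct {
    const int *input;
    int output[FIX_LEN];
    fix_plan *plan;    /* set by the initializer, freed by the deinitializer */
} fix_node;

/* 1. Initializer: build the persistent plan; write no output. */
static int fix_initializer(fix_node *n)
{
    n->plan = malloc(sizeof *n->plan);
    if (n->plan == NULL)
        return -1;
    n->plan->gain = 2;
    n->plan->runs = 0;
    return 0;
}

/* 2. Kernel callback: run the plan against the current input, every call. */
static int fix_kernel(fix_node *n)
{
    assert(n->plan != NULL);
    for (size_t i = 0; i < FIX_LEN; i++)
        n->output[i] = n->plan->gain * n->input[i];
    n->plan->runs++;
    return 0;
}

/* 3. Deinitializer: release what the initializer built. */
static void fix_deinitializer(fix_node *n)
{
    free(n->plan);
    n->plan = NULL;
}
```

With this split, the output stays untouched after initialization, and each kernel call reflects whatever the input buffer holds at that moment.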

Recommended companion change: extend `cts/test_conformance/test_laplacianpyramid.c` (and the equivalent reconstruct test) with a "process twice with different inputs and assert outputs differ" check, so this regression cannot reappear.

## References

- Detailed analysis (with the exact bench-runner timing breakdown that surfaced this): https://github.com/kiritigowda/openvx-mark/pull/4
- The bench-side workaround (rebuild graph per iteration so the sample's Initializer-time work lands inside the timed window): same PR.
- `rustVX` (a Rust reimplementation that respects the spec and recomputes per `vxProcessGraph`): https://github.com/kiritigowda/rustvx — the original observation came from rustVX's `LaplacianPyramid` benchmark looking "57 ms vs 0.001 ms" worse than the Khronos sample, which was the prompt to investigate.
