
feat(openfeature): emit server-side EVP flagevaluation #3984

Draft
leoromanovsky wants to merge 18 commits into master from leo.romanovsky/ffl-2446-evp-flagevaluation-php



@leoromanovsky (Contributor) commented Jun 14, 2026

Motivation

PHP server-side flag evaluations need to emit aggregated EVP flagevaluation payloads through the native Rust/C and libdatadog sidecar path, while preserving the existing OTel metric and exposure paths. The emitted events must match the live flageval-worker contract: schema-visible aggregation dimensions only, no OpenFeature reason, bounded context before buffering, and targeting_rule.key only when real targeting-rule metadata exists.

Changes

  • Adds native PHP EVP flagevaluation aggregation using schema-visible dimensions: flag key, variant key, allocation key, runtime-default state, error message, real targeting rule key when present, targeting key, and bounded context.
  • Prunes evaluation context before it participates in aggregation keys or queued snapshots.
  • Keeps OpenFeature reason out of EVP payloads and aggregation keys.
  • Preserves visible variant/allocation/error dimensions when overflow folds into degraded events.
  • Flushes the aggregated flagevaluation batch before exposure and OTel metric sidecar actions at request shutdown.
  • Splits native-to-sidecar flagevaluation batches into bounded 512-event IPC chunks.
  • Bumps libdatadog to 53f81e56510f9981dd7db95a9dc7dec592cb9678, which adds sidecar degraded coalescing and bounded EVP POST chunks.
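The 512-event IPC chunking described in the list above could look roughly like the following. This is a minimal sketch, not the PR's code: `FlagEvalEvent`, `IPC_CHUNK_SIZE`, and `split_into_ipc_chunks` are hypothetical names, and only the 512 limit comes from the description.

```rust
/// Maximum number of events per native-to-sidecar IPC message.
/// The value 512 is taken from the PR description; the constant name is assumed.
const IPC_CHUNK_SIZE: usize = 512;

/// Hypothetical stand-in for one aggregated flagevaluation event.
#[derive(Debug, Clone, PartialEq)]
struct FlagEvalEvent {
    flag_key: String,
    count: u64,
}

/// Splits a drained batch into order-preserving chunks of at most
/// `IPC_CHUNK_SIZE` events. Each inner Vec would be sent as one sidecar action.
fn split_into_ipc_chunks(events: Vec<FlagEvalEvent>) -> Vec<Vec<FlagEvalEvent>> {
    events
        .chunks(IPC_CHUNK_SIZE)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Under these assumptions a drained batch of 1100 events yields three chunks of 512, 512, and 76 events, and an empty batch yields no chunks at all.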

Decisions

  • EVP cardinality is defined only by fields the worker receives; reason is not a hidden aggregate key.
  • Degraded events omit targeting key and context but keep the visible dimensions needed for backend counts.
  • PHP keeps the sidecar delivery architecture instead of adding a direct EVP HTTP writer to the extension.
  • System-test validation covers the production per-flag degradation threshold instead of relying only on SDK unit caps.
  • This PR remains draft until the cross-SDK rollout and review pass are complete.

Validation Evidence

Validated on 2026-06-17 with PHP SDK commit 696c397f3ad7fd9b8b2135e16c9424c49a55da1f, libdatadog commit 53f81e56510f9981dd7db95a9dc7dec592cb9678, and system-tests PHP enable commit a29bfcfa1b2fa172adfee5f61a2a3424456763ae based on #7146 commit 2e7ad616bc525ee76698f34a686da549701b1914.

  • cargo test -p datadog-sidecar ffe_flagevaluation_flusher --lib from dd-trace-php/libdatadog - PASS, 14 passed.
  • cargo +nightly test -p datadog-php ffe:: --lib from dd-trace-php - PASS, 21 passed.
  • docker run --rm -w /var/app -v "$PWD":/var/app datadog/dd-trace-ci:php-8.2_centos-7 bash -lc 'export CARGO_HOME=$PWD/tmp/cargo_home; make -j$(nproc)' - PASS.
  • ./build.sh php -w apache-mod-8.2 from system-tests-php - PASS.
  • ./run.sh FEATURE_FLAGGING_AND_EXPERIMENTATION --library php --weblog apache-mod-8.2 -k "Flagevaluation_Degradation" - PASS, 1 passed, 2584 deselected in 34.38s.
  • ./run.sh FEATURE_FLAGGING_AND_EXPERIMENTATION --library php --weblog apache-mod-8.2 -k "test_flag_eval_evp" - PASS, 8 passed, 2577 deselected in 63.53s.
  • Degradation payload evidence from the passing full run: events=10001, total=10050, full_total=10000, degraded_count=1, degraded_total=50.
  • Agent 7.80.1; library php@1.21.0; weblog apache-mod-8.2; local PHP 8.2 aarch64 ddtrace.so installed through the system-tests /binaries/ddtrace.so override.
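The degradation numbers quoted above reconcile if the full tier stopped at the per-flag cap and the remainder folded into one degraded event. The check below is a sketch under that reading; the struct, its field names, and the single-flag assumption are illustrative and not taken from the system-tests code.

```rust
/// Summary of one drained payload, mirroring the fields quoted in the evidence line.
struct DegradationSummary {
    events: u64,
    total: u64,
    full_total: u64,
    degraded_count: u64,
    degraded_total: u64,
}

/// Returns true when the counts add up: full-tier events (all events minus the
/// degraded ones) stay within the per-flag cap, and the total evaluation count
/// is the sum of the two tiers. Assumes the payload covers a single flag.
fn is_consistent(s: &DegradationSummary, per_flag_cap: u64) -> bool {
    let full_events = match s.events.checked_sub(s.degraded_count) {
        Some(n) => n,
        None => return false, // more degraded events than events overall
    };
    full_events <= per_flag_cap && s.total == s.full_total + s.degraded_total
}
```

With the reported values (events=10001, total=10050, full_total=10000, degraded_count=1, degraded_total=50) and a per-flag cap of 10000, the check passes: 10000 full-tier events plus 1 degraded event, and 10000 + 50 evaluations.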

…ith PREP-01 libdatadog

- Enable 'flagevaluation-evp' feature on datadog-ffe dep (FfeFlagEvaluationBatch type now compiled)
- Fix components-rs/bytes.rs: update 4x VecMap::remove() -> remove_slow() for libdatadog compat post-commit 74284cac7 (VecMap API renamed); this unblocks compilation against the PREP-01 libdatadog ref
…patch

- Two-tier aggregation in components-rs/ffe.rs: full→degraded→drop-counted
  with caps GLOBAL_CAP=131072/PER_FLAG_CAP=10000/DEGRADED_CAP=32768
- Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default: on) via
  evp_enabled() in Rust and isEvpEnabled() in EvaluationMetricRecorder.php
- ddog_ffe_flush_flag_evaluation_batch() Rust C-export dispatches
  SidecarAction::FfeFlagEvaluationBatch via sidecar_blocking::enqueue_actions
- ddtrace_ffe_flush_flag_evaluation_batch() C wrapper in tracer/ffe.c
  mirrors existing exposure/metric flush pattern with sidecar globals
- RSHUTDOWN call added in tracer/ddtrace.c after existing flush calls
- 11 Rust unit tests covering both tiers, overflow, drain, killswitch
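The full→degraded→drop-counted flow in this commit message can be sketched as below. This is an illustration only, not the code in components-rs/ffe.rs: the types and field names are hypothetical, the keys carry fewer dimensions than the real ones, and the caps are parameters so the behavior can be exercised with small numbers.

```rust
use std::collections::HashMap;

/// Caps for the sketch. The commit message lists GLOBAL_CAP=131072,
/// PER_FLAG_CAP=10000 and DEGRADED_CAP=32768.
struct Caps {
    global: usize,
    per_flag: usize,
    degraded: usize,
}

/// Full-tier key: visible dimensions plus the targeting key (context omitted here).
#[derive(Hash, PartialEq, Eq)]
struct FullKey {
    flag: String,
    variant: String,
    targeting_key: String,
}

/// Degraded-tier key: targeting key and context are dropped.
#[derive(Hash, PartialEq, Eq)]
struct DegradedKey {
    flag: String,
    variant: String,
}

struct Aggregator {
    caps: Caps,
    full: HashMap<FullKey, u64>,
    full_buckets_per_flag: HashMap<String, usize>,
    degraded: HashMap<DegradedKey, u64>,
    dropped_degraded_overflow: u64,
}

impl Aggregator {
    fn new(caps: Caps) -> Self {
        Aggregator {
            caps,
            full: HashMap::new(),
            full_buckets_per_flag: HashMap::new(),
            degraded: HashMap::new(),
            dropped_degraded_overflow: 0,
        }
    }

    /// Records one evaluation, falling through the tiers as caps are hit.
    fn record(&mut self, flag: &str, variant: &str, targeting_key: &str) {
        let key = FullKey {
            flag: flag.to_string(),
            variant: variant.to_string(),
            targeting_key: targeting_key.to_string(),
        };
        // An existing full-tier bucket only needs its count bumped.
        if let Some(count) = self.full.get_mut(&key) {
            *count += 1;
            return;
        }
        // A new full-tier bucket is allowed only under both caps.
        let flag_buckets = self.full_buckets_per_flag.get(flag).copied().unwrap_or(0);
        if self.full.len() < self.caps.global && flag_buckets < self.caps.per_flag {
            self.full.insert(key, 1);
            *self.full_buckets_per_flag.entry(flag.to_string()).or_insert(0) += 1;
            return;
        }
        // Otherwise fold into the degraded tier, keeping the visible dimensions.
        let degraded_key = DegradedKey {
            flag: flag.to_string(),
            variant: variant.to_string(),
        };
        if let Some(count) = self.degraded.get_mut(&degraded_key) {
            *count += 1;
        } else if self.degraded.len() < self.caps.degraded {
            self.degraded.insert(degraded_key, 1);
        } else {
            // Both tiers are full: count the loss instead of dropping silently.
            self.dropped_degraded_overflow += 1;
        }
    }
}
```

Keeping a separate per-flag bucket counter is one way to enforce the per-flag cap without scanning the full map on every insert; whether the PR does it this way is not stated.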
…EVP aggregator race

ddog_ffe_evaluate() records into the global EVP_AGGREGATOR; without
EVP_TEST_LOCK the test ran concurrently with degraded_tier_overflow
tests, causing dropped_degraded_overflow to be 2 instead of 1.
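A common way to serialize tests that share a process-wide aggregator is a static mutex taken at the top of each test. The sketch below assumes EVP_TEST_LOCK is a plain `Mutex<()>`; its real definition is not shown in the PR text, and `lock_for_test` is a hypothetical helper.

```rust
use std::sync::{Mutex, MutexGuard};

/// Process-wide lock shared by every test that touches the global aggregator.
/// Assumed shape; the PR only names the lock.
static EVP_TEST_LOCK: Mutex<()> = Mutex::new(());

/// Acquires the test lock and recovers from poisoning, so one panicking test
/// does not cascade into failures in every later test that takes the lock.
fn lock_for_test() -> MutexGuard<'static, ()> {
    EVP_TEST_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

Each test would bind the guard for its whole body (`let _guard = lock_for_test();`), so counters such as dropped_degraded_overflow are only mutated by one test at a time.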
… + regen Cargo.lock

Points dd-trace-php's libdatadog submodule at the local PREP-01 commit
containing the flagevaluation EVP emitter (FfeFlagEvaluationBatch), so
components-rs builds against it via the datadog-ffe path dep with the
flagevaluation-evp feature. NOTE: 89a2ba7fc is local/unpushed — re-point
to the merged upstream libdatadog SHA before any PR.

The Rust C-export ddog_ffe_flush_flag_evaluation_batch (components-rs/ffe.rs)
was added without a matching prototype in the committed cbindgen header
components-rs/datadog.h. tracer/ffe.c calls it, so PHP8's stricter toolchain
fails with -Werror=implicit-function-declaration (ddtrace.so link Error 2).
PHP7 only warned and linked, masking the bug. Prototype matches the Rust
signature (SidecarTransport**/InstanceId*/QueueId*/CharSlice x3).
…ow drops

The full-tier EVP flagevaluation drain previously emitted context: None and
drained the degraded-overflow drop count silently.

- Full tier now carries the pruned evaluation context (shared prune_context
  bounds: <=256 fields, string values >256 bytes skipped) plus context.dd.service,
  matching the degraded tier's cap enforcement. The pruned context is captured
  once per bucket at insertion and carried verbatim into the drained event.
- The degraded-tier overflow drop counter is read-and-reset at drain and logged
  via tracing::warn when non-zero, so an undersized degradedCap is observable
  instead of a silent loss of legitimate counts.
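The two behaviors above, bounded pruning and a read-and-reset drop counter, can be sketched as follows. The bounds (at most 256 fields, string values over 256 bytes skipped) come from the commit message; `ContextValue`, the order in which the two bounds are applied, and `take_overflow_drops` are assumptions, and the caller is expected to do the logging that the commit performs with tracing::warn.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_FIELDS: usize = 256;
const MAX_STRING_BYTES: usize = 256;

/// Hypothetical context value type for the sketch.
#[derive(Debug, Clone, PartialEq)]
enum ContextValue {
    Str(String),
    Num(f64),
    Bool(bool),
}

/// Skips string values longer than MAX_STRING_BYTES, then keeps at most
/// MAX_FIELDS of the remaining fields in their original order.
/// Applying the string bound before the field bound is an assumption.
fn prune_context(input: &[(String, ContextValue)]) -> Vec<(String, ContextValue)> {
    input
        .iter()
        .filter(|(_, value)| match value {
            ContextValue::Str(s) => s.len() <= MAX_STRING_BYTES,
            _ => true,
        })
        .take(MAX_FIELDS)
        .cloned()
        .collect()
}

/// Reads and resets the overflow drop counter in one atomic step.
/// Returns Some(n) only when drops occurred, so the caller can log a warning.
fn take_overflow_drops(counter: &AtomicU64) -> Option<u64> {
    match counter.swap(0, Ordering::Relaxed) {
        0 => None,
        dropped => Some(dropped),
    }
}
```

Using `swap` rather than a separate load and store means an increment that races with the drain is either reported now or kept for the next drain, never lost.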
…low surfacing

- ddog_ffe_evaluate_populates_evp_aggregator_for_flush / _respects_killswitch:
  drive the real FFI entry point ddog_ffe_evaluate (the function the PHP/C layer
  calls) and assert it feeds the aggregator that the sidecar flush drains, closing
  the 'unit-green but emits nothing' gap that earlier tests left uncovered.
- full_tier_event_carries_pruned_context / _prunes_oversized_string_values /
  _empty_context_emits_no_context_object: assert the full tier carries the pruned
  context and enforces the field/value bounds.
- drain_resets_degraded_overflow_drop_counter: assert drain reads-and-resets the
  observable overflow drop counter.
…ncode-safe wire + reliable enqueue)

Bump the libdatadog submodule to the bincode-safe flagevaluation fix (DataDog/libdatadog#2117): the worker->sidecar IPC is bincode, which the old serde_json::Value + skip_serializing_if wire types could not deserialize, so the sidecar silently dropped every batch.

- Stringify the pruned full-tier context (JSON object string) at drain so the bincode wire stays plain; the sidecar flusher re-expands it into a JSON object for the POST.

- Use sidecar_blocking::enqueue_actions_reliable for the one-shot RSHUTDOWN flush.
@datadog-official (bot) commented Jun 14, 2026

Pipelines


⚠️ Warnings

🚦 12 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | ASAN test_c with multiple observers: [8.0]

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.4]

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.2]

(3 of the 12 failed jobs are listed here.)

ℹ️ Info

🎯 Code Coverage
Patch Coverage: 100.00%
Overall Coverage: 54.11% (+0.03%)

🔗 Commit SHA: 696c397

@pr-commenter (bot) commented Jun 16, 2026

Benchmarks [ tracer ]

Benchmark execution time: 2026-06-17 08:16:43

Comparing candidate commit 696c397 in PR branch leo.romanovsky/ffl-2446-evp-flagevaluation-php with baseline commit 4342700 in branch master.

Some scenarios are present only in baseline or only in candidate runs. If you didn't create or remove some scenarios in your branch, this may be a sign of crashed benchmarks 💥💥💥
Check the GitLab CI job log to see whether any benchmark has crashed.

Scenarios present only in candidate:

  • FlagEvaluationBench/benchEvaluateWithoutCounting-opcache
  • FlagEvaluationBench/benchEvaluateWithoutCounting
  • FlagEvaluationBench/benchEvaluateDistinctContexts
  • FlagEvaluationBench/benchEvaluateDistinctContexts-opcache
  • FlagEvaluationBench/benchEvaluateSplit
  • FlagEvaluationBench/benchEvaluateSplit-opcache
  • FlagEvaluationBench/benchEvaluateTargetingMatch-opcache
  • FlagEvaluationBench/benchEvaluateTargetingMatch

Found 2 performance improvements and 0 performance regressions! Performance is the same for 192 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because changes are often small:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
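The rule illustrated by the two diagrams can be written as a small classifier. This is a sketch of the stated rule only, not the benchmarking platform's code: it assumes a lower-is-better metric such as execution time, a threshold applied symmetrically around 0%, and hypothetical names throughout.

```rust
/// Outcome of comparing a confidence interval against the threshold.
#[derive(Debug, PartialEq)]
enum Verdict {
    Better,
    Worse,
    Same,
}

/// Classifies a CI over the relative difference of means, in percent.
/// The change is significant only when the whole interval lies beyond the
/// threshold on one side; otherwise performance is treated as unchanged.
fn classify(ci_lower_pct: f64, ci_upper_pct: f64, threshold_pct: f64) -> Verdict {
    if ci_lower_pct > threshold_pct {
        Verdict::Worse // entire CI above +threshold: slower
    } else if ci_upper_pct < -threshold_pct {
        Verdict::Better // entire CI below -threshold: faster
    } else {
        Verdict::Same
    }
}
```

With the 1% threshold used in the second diagram, the interval [1.3%; 3.1%] classifies as worse and [-0.6%; +1.2%] as unchanged. The threshold actually configured for this run is not stated in the comment.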

scenario:MessagePackSerializationBench/benchMessagePackSerialization

  • 🟩 execution_time [-4.450µs; -2.750µs] or [-4.125%; -2.549%]

scenario:MessagePackSerializationBench/benchMessagePackSerialization-opcache

  • 🟩 execution_time [-3.940µs; -2.400µs] or [-3.661%; -2.229%]
