
feat(FeatureFlags): FFE APM feature-flag span enrichment (experimental, gated) #3996

Draft
leoromanovsky wants to merge 12 commits into master from leo.romanovsky/ffe-apm-span-enrichment

leoromanovsky (Contributor) commented Jun 16, 2026

feat(FeatureFlags): FFE APM feature-flag span enrichment

⚠️ Experimental, opt-in, gated behind DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED (off by default).

Summary

Adds Feature Flag Events (FFE) span enrichment to the feature-flag integration. When feature
flags are evaluated, the evaluation metadata is attached to the root APM span so APM customers
can filter traces and errors by active flag variant, and the FFE/Experimentation platform can
correlate spans with experiments. The wire format matches the merged reference implementation
(dd-trace-js#8343) so backend/Trino decode is identical.

How it works

  1. A flag is evaluated (via the OpenFeature DataDogProvider or the native DDTrace\FeatureFlags\Client).
  2. Each evaluation is accumulated inline against the current root span.
  3. At root-span close, the accumulated state is encoded and written as ffe_* tags.

Configuration

Opt-in, off by default:

DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED=true

This is distinct from DD_EXPERIMENTAL_FLAGGING_PROVIDER_ENABLED.

Span tags added

| Tag | Description | Format |
| --- | --- | --- |
| ffe_flags_enc | All evaluated flag serial IDs | base64 delta-varint |
| ffe_subjects_enc | Subject → flags mapping (when doLog=true) | JSON { sha256(key): encodedIds } |
| ffe_runtime_defaults | Fallback values for flags not in UFC | JSON { flagKey: value } |

Limits: 200 serial IDs, 10 subjects, 20 experiments/subject, 5 runtime defaults, 64 chars/runtime-default value (UTF-8-safe truncation).
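
To make the ffe_flags_enc format concrete, here is a minimal Python sketch of a delta-varint codec, not the PHP implementation from this PR. It assumes the pipeline described in the commit notes (dedupe, sort ascending, delta against the previous ID starting from 0, ULEB128 varints, standard base64). The function names are hypothetical, and the 200-ID cap is left out because the page does not say how overflow is chosen.

```python
import base64
from typing import Iterable, List


def uleb128(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if n < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_serial_ids(ids: Iterable[int]) -> str:
    """Dedupe, sort, delta-encode, varint-encode, then base64 (hypothetical helper)."""
    out = bytearray()
    prev = 0
    for sid in sorted(set(ids)):
        out += uleb128(sid - prev)
        prev = sid
    return base64.b64encode(bytes(out)).decode("ascii")


def decode_serial_ids(encoded: str) -> List[int]:
    """Inverse of encode_serial_ids: rebuild the ascending ID list."""
    ids: List[int] = []
    current = delta = shift = 0
    for byte in base64.b64decode(encoded):
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += delta
            ids.append(current)
            delta = shift = 0
    return ids
```

Under these assumptions the sketch reproduces the golden vector quoted in the validation notes: the bytes behind ZAgUAg== are 100, 8, 20, 2, which are the deltas of [100, 108, 128, 130].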

Changes

  • Gate + config: add DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED (ext/configuration.h), off by default; thread serial_id (split into a value and a presence flag) from Rust through C to the PHP mapper (components-rs/ffe.rs, tracer/ffe.c/.h, ResultMapper.php).
  • Codec + accumulator: inline span-enrichment accumulation (SpanEnrichmentAccumulator.php) with delta-varint serial IDs + SHA256-hashed subject keys; write ffe_* tags at root-span close (tracer/span.c).
  • SpanEnrichmentBinder: binds enrichment to the native DDTrace\FeatureFlags\Client path in addition to the OpenFeature DataDogProvider, so non-OpenFeature consumers are enriched too.
  • Tests: accumulator + result-mapper unit tests; .phpt ext tests for native bridge, serial-id passthrough, eval metrics, and remote-config lifecycle.

Decisions

  • Inline accumulation (not a finally hook): PHP OpenFeature does not pass ResolutionDetails to finally hooks, so enrichment is accumulated inline.
  • No idle per-span overhead when the gate is off — the accumulator is absent and spans carry no ffe_* tags.
  • Lifecycle: accumulator reset on the root-span boundary.
  • ffe_* are bare tag names on span meta (not _dd.-prefixed); subject keys are SHA256 hashes emitted only when logging is authorized.

Validation

FFE dogfooding app

Validated live against the ffe-dogfooding app via a trace-intake tee-proxy that captures the raw /v0.4/traces payload and decodes the ffe_* tags. Flag ffe-dogfooding-string-flag (serial 2312):

  • Gate ON: the root span (web.request, auto-instrumented web SAPI) carried ffe_flags_enc decoding to serial [2312], plus ffe_subjects_enc keyed by the SHA256-hashed subject and decoding to [2312].
  • Gate OFF — span flushed with zero ffe_* tags.

Local system-tests run

Ran the frozen system-tests parametric suite (tests/parametric/test_ffe/test_span_enrichment.py, unchanged) against this branch's tracer (dd-library-php-1.21.0, C extension built from source for aarch64-linux-gnu, PHP 8.2 NTS):

TEST_LIBRARY=php ./run.sh PARAMETRIC -k span_enrichment
Library: php@1.21.0
============================= 18 passed in 49.99s ==============================

All 18 cases pass: ffe_flags_enc aggregates serial IDs across evaluations and propagates from child spans to the root (ZAgUAg== → [100,108,128,130]); ffe_subjects_enc carries SHA256-hashed targeting keys gated on doLog; ffe_runtime_defaults is added for not-found flags with 64-char truncation; and all frozen limits are enforced. The SpanEnrichmentBinder change above was required so the native DDTrace\FeatureFlags\Client path (used by the parametric server) is enriched. The system-tests enablement (parametric server.php + manifests/php.yml) is a separate draft PR against DataDog/system-tests.

Full dogfooding matrix + fix (2026-06-17)

Re-validated end-to-end through the real OpenFeature provider path behind the trace-intake
tee-proxy, decoding ffe_* with scripts/decode_ffe_span_tags.py (root span web.request,
service ffe-dogfooding-php8-openfeature, extension built from this branch):

| Scenario | Result |
| --- | --- |
| Gate ON (serial 2312) | ffe_flags_enc → [2312]; ffe_subjects_enc = {sha256(targeting key): ids} only when do_log |
| Gate OFF | zero ffe_* tags; no binder constructed |
| Aggregation | multiple flags + 2 subjects on one root → ffe_flags_enc = [829,1442,2311,2312], nothing overwritten (shared SpanEnrichmentRegistry) |
| Unicode + object runtime defaults | ffe_runtime_defaults raw UTF-8 (héllo-wörld-☃-日本語-Ω, こんにちは, 🎉), valid JSON, values truncated to 64 |
| Codec parity | ZAgUAg== → [100,108,128,130] |

Fix found by the matrix (commit fix(ffe): emit ffe_* JSON as raw UTF-8 with unescaped slashes):
the unicode scenario showed ffe_runtime_defaults was \uXXXX-escaped (and object values were
truncated mid-escape-sequence, yielding invalid JSON) because json_encode() was called without
flags. Added JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES at the three json_encode sites in
SpanEnrichmentAccumulator.php so the emitted bytes match the frozen Node JSON.stringify
contract (raw UTF-8, bare /). Existing accumulator unit tests json_decode the tags (normalizing
escapes) so they are unaffected; the dogfooding loop is what surfaced the divergence.
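
The escaping divergence can be illustrated with Python's json module, used here only as an analogy for PHP's json_encode(). Python's default ensure_ascii=True escapes non-ASCII to \uXXXX much like flagless json_encode(), while ensure_ascii=False emits raw UTF-8. Python never escapes "/", so that half of the PHP behaviour is not reproduced in this sketch.

```python
import json

value = {"greeting": "こんにちは", "path": "a/b"}

# Escaped form: every non-ASCII character becomes a 6-character \uXXXX sequence.
escaped = json.dumps(value)

# Raw form: non-ASCII characters are emitted as-is.
raw = json.dumps(value, ensure_ascii=False)

# Both forms decode to the same data, which is why tests that json_decode
# the tag before asserting cannot see the difference.
same_after_decode = json.loads(escaped) == json.loads(raw)

# The escaped form is longer, so a fixed character budget covers less data.
inflation = len(escaped) - len(raw)
```

Comparing the emitted bytes rather than the decoded value is what exposes the mismatch, which matches the note above that the dogfooding loop, not the unit tests, caught it.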

System-tests re-confirmed against a tarball rebuilt from this branch's source: 18 passed
(TEST_LIBRARY=php ./run.sh PARAMETRIC -k span_enrichment, library php@1.21.0).

…n-enrichment gate

- Add serial_id (i64) + has_serial_id (bool) to the Rust FfeResult struct and
  populate from assignment.serial_id (unwrap_or(0) + is_some()) in all ctors;
  regenerate the cbindgen common.h ABI to match.
- Surface serialId as a nullable int on the DDTrace\FfeResult object in the C
  reader (tracer/functions.c), guarded by has_serial_id so absence stays null
  (Pattern B: missing != 0); update the stub + arginfo.
- Thread serialId into ResultMapper::exposureData (only when present).
- Add the gate CONFIG(BOOL, DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED,
  "false") to ext/configuration.h (distinct from the provider-enabled gate).
- Update existing FFE phpt EXPECT blocks for the new serialId field.
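
The "missing != 0" rule from the list above can be modelled in a few lines. This is a Python sketch with hypothetical names (FfeResultAbi, from_optional, to_nullable); the real types are the Rust FfeResult struct and the PHP DDTrace\FfeResult object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FfeResultAbi:
    """Stand-in for the C-ABI struct: a plain integer plus a presence flag."""
    serial_id: int
    has_serial_id: bool


def from_optional(serial_id: Optional[int]) -> FfeResultAbi:
    # Mirrors unwrap_or(0) + is_some(): the value defaults to 0,
    # and the flag records whether it was really present.
    return FfeResultAbi(
        serial_id=serial_id if serial_id is not None else 0,
        has_serial_id=serial_id is not None,
    )


def to_nullable(result: FfeResultAbi) -> Optional[int]:
    # Reader side: trust the flag, not the value, so absence stays None.
    return result.serial_id if result.has_serial_id else None
```

Without the flag, a missing serial ID and a genuine serial ID of 0 would be indistinguishable once they cross the ABI boundary.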
…oot-close write

- Add DDTrace\FeatureFlags\SpanEnrichmentAccumulator: per-root-span accumulator
  + ULEB128 delta-varint/base64/SHA256 codec ported verbatim from the frozen
  Node reference (dd-trace-js#8343). Limits 200/10/20/5/64, dedupe+sort, object
  defaults via json_encode, UTF-8-safe 64-char truncation; tag shapes ffe_flags_enc
  (bare base64), ffe_subjects_enc / ffe_runtime_defaults (JSON objects).
- DataDogProvider: accumulate INLINE in resolve() right after recordEvaluationMetric
  (DG-004, no finally hook); gate-gated lazy accumulator (DG-005 zero-idle); error
  isolation via try/catch(\Throwable); runtime-default detection via missing variant.
- Native request-scoped staging store in tracer/ffe.c (+ ddtrace_globals.h) flushed
  into the root span meta on the ddtrace_close_span root branch and cleared on root
  close / RSHUTDOWN (no cross-request leak); gate-off path does no work.
- Add DDTrace\Internal\set_ffe_span_enrichment_tags() PHP-callable staging fn.
- Tests: SpanEnrichmentAccumulatorTest (7 required L0 cases incl. gate-off control +
  codec golden round-trip), serial_id_passthrough.phpt (C bridge), ResultMapper
  serialId threading cases.
…ry (CR-01)

The per-provider SpanEnrichmentAccumulator was only ever added to, never
reset: clear() had zero production callers and accumulateSpanEnrichment()
re-staged the FULL accumulated set on every resolve(). After a root
span closed, the next root span re-staged the prior root's serial ids /
hashed subjects / runtime defaults (within-request multi-root
contamination), and because OpenFeature providers are process-level
singletons the accumulator leaked across requests in persistent SAPIs --
a privacy leak of SHA256 subject keys.

Fix: reset the PHP accumulator on the root-span boundary, in lockstep
with the native close-span flush (which already clears the native
staging slots on the same ddtrace_close_span root branch + RSHUTDOWN):
- Track the active root span id (spl_object_id of DDTrace\root_span()).
  On any boundary transition, clear the accumulator + native staging
  store so a dropped/abandoned root (which never runs its onClose) and a
  new request both start clean.
- Bind a one-shot accumulator clear to the root span's $onClose so the
  PHP object is reset when the root closes (mirrors the frozen Node
  reference #onSpanFinish cleanup).
- Lifecycle is injectable (rootIdResolver / rootCloseScheduler) so the
  pure-PHP L0 suite can drive root transitions without the extension.
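
A rough Python model of the root-boundary reset described above, with the two injectable seams. The class and method names are hypothetical, and accumulating evaluations when no root span is active is an assumption of this sketch rather than documented behaviour.

```python
class EnrichmentLifecycle:
    """Accumulates serial IDs for exactly one root span at a time."""

    def __init__(self, root_id_resolver, root_close_scheduler):
        self._resolve_root_id = root_id_resolver          # returns current root id or None
        self._schedule_root_close = root_close_scheduler  # registers a close callback
        self._active_root_id = None
        self.serial_ids = set()

    def record(self, serial_id):
        root_id = self._resolve_root_id()
        if root_id != self._active_root_id:
            # Boundary transition: drop everything staged for the previous
            # root, including a root that was abandoned and never closed.
            self.serial_ids.clear()
            self._active_root_id = root_id
            if root_id is not None:
                self._schedule_root_close(self._on_root_close)
        self.serial_ids.add(serial_id)

    def _on_root_close(self):
        # Forget the root id as well, so a later root that reuses the same
        # object id is still treated as a new boundary.
        self.serial_ids.clear()
        self._active_root_id = None
```

Because both seams are plain callables, a test can simulate root transitions with a mutable variable and a list of captured callbacks, which is the same idea as driving the pure-PHP suite without the extension.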

Regression tests (fail-before / pass-after): two sequential root spans
in one request -> root 2 stages only its own serial ids/subjects/
defaults; dropped-root and cross-request reset -> no carryover incl. no
leaked hashed subject keys; root close clears the accumulator with no
subsequent eval. Plus a Node String(value) runtime-default parity test
(null/true/false/scalars/objects). Native ABI passthrough, codec
(ZAgUAg==), limits, gate-off DG-005, and DG-004 inline accumulation are
unchanged.

datadog-prod-us1-4 Bot commented Jun 16, 2026



⚠️ Warnings

🚦 14 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | benchmarks-tracer

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.1]

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.4]

(11 further failed jobs are not listed here.)

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 54.08% (-0.04%)

🔗 Commit SHA: 156726d


pr-commenter Bot commented Jun 16, 2026


Benchmarks [ tracer ]

Benchmark execution time: 2026-06-17 08:47:40

Comparing candidate commit 3229c80 in PR branch leo.romanovsky/ffe-apm-span-enrichment with baseline commit a65b400 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 194 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
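
One way to read the significance rule above, as a small Python sketch. The function name and the symmetric treatment of improvements are assumptions; the benchmarking platform's actual implementation is not shown on this page.

```python
def classify_execution_time_change(ci_lower: float, ci_upper: float, threshold: float) -> str:
    """Classify a relative-difference CI (in percent) for an execution time metric.

    A change counts as significant only if the whole interval lies beyond
    the threshold; for execution time, higher means worse.
    """
    if ci_lower > threshold:
        return "worse"    # entire CI above +threshold
    if ci_upper < -threshold:
        return "better"   # entire CI below -threshold (assumed symmetric)
    return "same"         # CI overlaps the [-threshold, +threshold] band
```

With the numbers from the two diagrams and a 1% threshold, the interval (1.3%, 3.1%) classifies as worse, while (-0.6%, 1.2%) straddles zero and classifies as same.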

Long-running CLI servers (parametric test apps) starve the SIGVTALRM-driven
remote-config refresh because the process is mostly blocked in IO rather than
burning CPU time, so an FFE evaluation issued right after the agent ACKs a
pushed UFC config still sees no config and falls back to defaults. Add a
dd_trace_internal_fn('await_ffe_config') testing hook that actively pumps
remote configs (mirrors await_agent_info) until ddog_ffe_has_config() is true.

Enables the FROZEN system-tests span-enrichment parametric suite to load UFC
via Remote Config in the long-running PHP parametric server.
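
The "actively pump until config is present" idea can be sketched generically in Python. Here has_config and pump are hypothetical callables standing in for ddog_ffe_has_config() and the remote-config refresh; the 5-second default reflects the "blocking up to 5s" figure from the review notes, and the polling interval is an arbitrary choice.

```python
import time


def await_config(has_config, pump, timeout_s=5.0, interval_s=0.01,
                 clock=time.monotonic, sleep=time.sleep):
    """Pump remote config until has_config() is true or the timeout elapses.

    Returns True if config arrived in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if has_config():
            return True
        if clock() >= deadline:
            return False
        pump()             # drive the refresh explicitly instead of waiting for a timer signal
        sleep(interval_s)  # brief pause so the loop does not spin
```

Injecting clock and sleep keeps the helper testable without real waiting; it is still a blocking call, which is the reason the review asks for it to be compiled out of production builds.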
Span enrichment was accumulated only inside the OpenFeature DataDogProvider
(DG-004 inline path). The native DDTrace\FeatureFlags\Client evaluates flags
without going through the provider, so consumers on the native path (the
parametric system-tests app, and any non-OpenFeature caller) produced ffe_*
tags on the root span for OpenFeature but NOT for the native Client.

Extract the per-root-span accumulate/encode/root-boundary lifecycle into a
reusable PHP7-compatible SpanEnrichmentBinder and bind it on Client::evaluate(),
so both the provider and the native Client stage identical ffe_* tags from the
same EvaluationDetails and stay in lockstep with the native close-span write.
Honours the FROZEN contract (limits 200/10/20/5/64, delta-varint, SHA256
subjects, runtime-default detection). DG-005: no-op with the gate off.
…ment gate

Register DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED in
metadata/supported-configurations.json by running
tooling/generate-supported-configurations.sh. The config was added to
ext/configuration.h but the generated metadata was not regenerated,
causing the Configuration Consistency CI check to fail.
assertIsInt() is only available in PHPUnit 7.5+, so the new serialId
exposure-data test errored on the PHP 7.0 API unit-test job (older
PHPUnit). assertInternalType() is unavailable too (removed in PHPUnit 9,
and the matrix runs up to PHPUnit <10). Replace with
assertTrue(is_int(...)), which works across the whole 7.0-8.5 matrix.
The preceding strict assertSame already enforces the integer type.
…-only

PR review (#3996), two native findings:

- should-fix: DDTrace\root_span() calls dd_ensure_root_span(), which CREATES
  an autoroot span when none exists. Resolving the root id while merely
  evaluating a feature flag must not have that side effect. Add a non-creating
  DDTrace\Internal\peek_root_span_id() that reads DDTRACE_G(active_stack)->
  root_span directly (no dd_ensure_root_span) and returns its object handle,
  identical to spl_object_id(\DDTrace\root_span()) but without trace-state
  creation. Wired into the stub + committed arginfo (phpize build uses the
  committed header as-is; no CI stub-hash gate).

- should-fix: await_ffe_config sits in the production dd_trace_internal_fn
  dispatcher and actively pumps Remote Config, blocking up to 5s. Guard it
  behind a new DD_TEST_HELPERS compile flag (config.m4, defined for the
  standard CI/test/package builds the system-tests + ffe-dogfooding harnesses
  run against) so a hardened production build can compile the heavyweight
  test helper out of the dispatcher entirely.

ZTS-safe (DDTRACE_G accessor); no allocation, no refcount changes.
… all paths

PR review (#3996) blocker + should-fix.

blocker: tracer/ffe.c set_ffe_span_enrichment_tags() REPLACES the three
request-global tag slots on every call. Both DataDogProvider and each
FeatureFlags\Client/SpanEnrichmentBinder owned a SEPARATE accumulator and
staged independently, so two clients, two providers, or a mixed OpenFeature +
native-client evaluation under ONE root span would OVERWRITE earlier serial
ids / hashed subjects / runtime defaults instead of aggregating them.

Fix: introduce SpanEnrichmentRegistry, a single request-scoped accumulator
that ALL PHP evaluation paths feed. The staged tag set is now the union of
every evaluation on the active root span, matching the frozen Node contract.
No tag/encoding/limit semantics changed.
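
The overwrite-versus-union behaviour can be reproduced with a toy model. All class names below are hypothetical stand-ins (StagingStore for the native replace-on-write slots, Registry for the shared accumulator, Binder for a per-client adapter), and only serial IDs are modelled.

```python
class StagingStore:
    """Stand-in for the request-global slot: every write replaces the last."""
    def __init__(self):
        self.flags = []

    def set_tags(self, flags):
        self.flags = list(flags)


class Registry:
    """Accumulator that re-stages its full set on each evaluation."""
    def __init__(self, store):
        self._store = store
        self._serial_ids = set()

    def record(self, serial_id):
        self._serial_ids.add(serial_id)
        self._store.set_tags(sorted(self._serial_ids))


class Binder:
    """Thin adapter that forwards evaluations to a registry."""
    def __init__(self, registry):
        self._registry = registry

    def on_evaluation(self, serial_id):
        self._registry.record(serial_id)


def stage(evaluations, shared):
    """Replay (owner, serial_id) evaluations and return what ends up staged."""
    store = StagingStore()
    shared_registry = Registry(store)
    binders = {}
    for owner, serial_id in evaluations:
        if owner not in binders:
            # shared=False reproduces the bug: one private accumulator per owner.
            binders[owner] = Binder(shared_registry if shared else Registry(store))
        binders[owner].on_evaluation(serial_id)
    return store.flags
```

With evaluations interleaved between a client and a provider, the private-accumulator variant stages only the last writer's IDs, while the shared variant stages the union.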

should-fix (per-binder onClose retention): the lifecycle is centralized in the
registry, which binds AT MOST ONE root-close reset per root span (tracked by
rootCloseBoundRootId). Many short-lived clients under one long-lived root no
longer each retain a closure + accumulator. SpanEnrichmentBinder is now a thin
gate-checked adapter; DataDogProvider drops its inline accumulator + lifecycle.

should-fix (gate-off not inert): Client and DataDogProvider now construct NO
binder unless DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED is on,
and evaluate()/resolve() skip the enrichment call entirely when the binder is
absent — no per-evaluation config read with the gate off (DG-005).
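
A minimal Python sketch of the "gate read once, no binder when off" pattern. The truthy-value parsing and the Client and Binder shapes are assumptions for illustration; only the environment variable name comes from this PR.

```python
import os

GATE = "DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED"


def gate_enabled(env) -> bool:
    # Assumption: "true" or "1" enable the gate; the tracer's real parser may differ.
    return env.get(GATE, "false").strip().lower() in ("true", "1")


class Binder:
    """Hypothetical enrichment adapter; only exists when the gate is on."""
    def __init__(self):
        self.recorded = []

    def record(self, flag_key):
        self.recorded.append(flag_key)


class Client:
    def __init__(self, env=None):
        env = os.environ if env is None else env
        # The config is read once here, never per evaluation.
        self.binder = Binder() if gate_enabled(env) else None

    def evaluate(self, flag_key, default):
        result = default  # stand-in for the real flag evaluation
        if self.binder is not None:
            self.binder.record(flag_key)
        return result
```

With the gate off, evaluate() pays for a single None check and nothing else, which is the inertness the review asked for.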

should-fix (root side effect): the registry resolves the root id via the new
non-creating DDTrace\Internal\peek_root_span_id(), falling back to the
(creating) DDTrace\root_span() only on older extensions.
…non-creating root

PR review (#3996) regression coverage.

- SpanEnrichmentRegistryTest (PHPUnit, runs without the native ext): two
  binders (standing in for two clients / a client + a provider) under one
  simulated root AGGREGATE their serial ids, hashed subjects, and runtime
  defaults into one staged payload rather than overwriting; CR-01 per-root
  reset still holds; at most ONE root-close reset is bound across many
  short-lived binders; the root-close reset clears the shared accumulator.

- ClientTest: gate-off Client allocates no SpanEnrichmentBinder and evaluate()
  short-circuits enrichment without error.

- SpanEnrichmentAccumulatorTest: rewired the DG-004 inline + CR-01 multi-root
  harness to drive the shared registry's seams (the lifecycle moved out of the
  provider); gate-off assertions now check spanEnrichmentBinder is null.

- peek_root_span_id_non_creating.phpt (orchestrator L2, needs built ext):
  proves peek_root_span_id() returns null without creating a root span
  (active_span() stays null) and otherwise equals spl_object_id(root_span()).
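
The difference between a creating accessor and a non-creating peek, modelled in Python. The Tracer class is a hypothetical stand-in for the extension's span stack, and id() plays the role of spl_object_id().

```python
class Tracer:
    """Toy tracer with one creating accessor and one non-creating peek."""

    def __init__(self):
        self._root = None

    def root_span(self):
        # Creating accessor: materializes a root span on first use.
        if self._root is None:
            self._root = object()
        return self._root

    def peek_root_span_id(self):
        # Non-creating accessor: reports the root's id, or None if there is none.
        return None if self._root is None else id(self._root)

    def has_root(self):
        return self._root is not None
```

The property the .phpt test checks corresponds to two observations here: peeking on a fresh tracer returns None and leaves it without a root, and once a root exists the peeked id equals the id of the span returned by root_span().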
…arity)

json_encode() without flags escaped non-ASCII to \uXXXX and '/' to '\/',
diverging from the frozen Node JSON.stringify contract for ffe_subjects_enc
and ffe_runtime_defaults. For object/struct runtime defaults the \uXXXX
inflation also pushed the value past the 64-char limit so the truncation cut
mid-escape-sequence, yielding invalid JSON inside the tag.

Add JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES at all three json_encode
sites (toSpanTags subjects + runtime defaults, and stringifyDefault for
object/array values) so the emitted bytes match Node exactly. Verified via the
ffe-dogfooding unicode scenario: decoded ffe_runtime_defaults is now raw UTF-8
(héllo-wörld-☃-日本語-Ω / こんにちは / 🎉), valid JSON, codepoint-safe 64-char truncation.
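
A hedged Python sketch of the runtime-default stringification and truncation described above. The helper names are hypothetical, the cap is applied per code point (Python string slicing), and the first-come handling of the 5-default limit is an assumption. Python's str() also differs from Node's String() for some numbers (for example 1.0), so this is not byte-exact parity.

```python
import json

MAX_DEFAULTS = 5        # limit quoted in the PR description
MAX_VALUE_CHARS = 64    # limit quoted in the PR description


def stringify_default(value) -> str:
    """Turn a fallback value into a string, then cap it at 64 code points."""
    if value is None:
        text = "null"
    elif value is True:
        text = "true"
    elif value is False:
        text = "false"
    elif isinstance(value, (dict, list)):
        # Raw UTF-8, compact separators: no \uXXXX inflation before truncation.
        text = json.dumps(value, ensure_ascii=False, separators=(",", ":"))
    else:
        text = str(value)
    # Slicing a Python str never splits a multi-byte character.
    return text[:MAX_VALUE_CHARS]


def build_runtime_defaults_tag(defaults: dict) -> str:
    """Encode up to MAX_DEFAULTS {flagKey: value} pairs as a JSON object."""
    kept = list(defaults.items())[:MAX_DEFAULTS]  # assumption: first-come cap
    payload = {key: stringify_default(value) for key, value in kept}
    return json.dumps(payload, ensure_ascii=False, separators=(",", ":"))
```

Because each truncated value is re-encoded as a JSON string inside the outer object, the tag as a whole stays valid JSON even when an object-valued default is cut short.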