Add replay PBT CI proof of concept by nubtron · Pull Request #23827 · DataDog/integrations-core

nubtron · 2026-05-25T09:16:28Z

Summary

This PR adds a replay-PBT CI proof of concept for cached integration replay testing.

Highlights:

- Rename compare-check concepts from old/new to record/replay while retaining old flag aliases.
- Add replay-PBT fixture/target refs and cache-only probing.
- Add a small branch-gated PR smoke job for KrakenD replay-PBT.
- Repurpose the existing manual zz-test-worker-poc.yaml workflow as a parallel Replay PBT POC runner.
- Add a replay-PBT matrix script with sharding and fail-loud truncation protection.

CI POC behavior

The manual POC workflow supports:

- changed, all-cached, and all-declared modes
- optional cache seeding
- smoke vs all property sets
- per-target cache restore/save
- sharding to avoid the GitHub 256-job matrix limit
- summary collection as JSON/TSV plus a GitHub step summary table
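
The sharding and fail-loud behavior can be illustrated with a minimal sketch. It is not the code in replay-pbt-matrix.py: the function name `select_shard`, the stride-based split, and the error wording are assumptions made for illustration. Only the 256-job limit and the `max_targets` / `shard_count` names come from this description.

```python
from __future__ import annotations

GITHUB_MATRIX_LIMIT = 256  # GitHub Actions caps a single matrix at 256 jobs


def select_shard(
    targets: list[str],
    shard_index: int,
    shard_count: int,
    max_targets: int = GITHUB_MATRIX_LIMIT,
) -> list[str]:
    """Return the targets for one shard, failing loudly instead of truncating.

    Hypothetical helper: the real matrix script may split targets differently.
    """
    if shard_count < 1 or not 0 <= shard_index < shard_count:
        raise ValueError(f"invalid shard {shard_index}/{shard_count}")
    # Sort first so every shard sees the same ordering, then take every Nth item.
    shard = sorted(targets)[shard_index::shard_count]
    if len(shard) > max_targets:
        needed = -(-len(targets) // max_targets)  # ceiling division
        raise SystemExit(
            f"{len(shard)} targets exceed max_targets={max_targets}; "
            f"rerun with shard_count>={needed}"
        )
    return shard
```

Raising instead of slicing the list down to the limit means an over-large run cannot silently skip targets; the caller has to opt in to sharding.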

Security review notes

Before opening this draft, I reviewed the changed workflows for common workflow security issues:

- No pull_request_target usage.
- All external actions in changed workflow paths are pinned to full commit SHAs.
- The POC workflow uses only contents: read permissions.
- No GitHub secrets are used by the new POC workflow.
- No github-script, dynamic script execution, eval, or curl-to-shell pattern added.
- Manual ref inputs are validated before use in git fetch / matrix generation.
- Matrix-derived artifact names use a sanitized artifact_slug.
- Matrix target path components are validated in the matrix script before being used in cache paths.
- Matrix truncation is disabled by default; over-large runs fail with shard guidance instead of silently dropping targets.
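
For reviewers, the kind of input validation described above might look roughly like the sketch below. The function names, regular expressions, and length cap are illustrative assumptions, not the checks actually implemented in the workflow or the matrix script; only the `artifact_slug` name appears in this PR.

```python
import re

# Deliberately conservative patterns; this is not the full git check-ref-format rule set.
_REF_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/-]{0,199}")
_COMPONENT_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def validate_ref(ref: str) -> str:
    """Reject refs that could be read as options or that contain traversal sequences."""
    if not _REF_RE.fullmatch(ref) or ".." in ref or ref.endswith("/"):
        raise ValueError(f"unsafe ref: {ref!r}")
    return ref


def validate_path_component(component: str) -> str:
    """Allow a single path segment only: no separators, no leading dot, no '..'."""
    if not _COMPONENT_RE.fullmatch(component) or ".." in component:
        raise ValueError(f"unsafe path component: {component!r}")
    return component


def artifact_slug(*parts: str) -> str:
    """Join parts into a lowercase name limited to [a-z0-9._-]."""
    raw = "-".join(parts).lower()
    slug = re.sub(r"[^a-z0-9._-]+", "-", raw).strip("-.")
    if not slug:
        raise ValueError("empty artifact slug")
    return slug[:100]
```

Requiring an alphanumeric first character is what keeps a value such as `--upload-pack=...` from reaching `git fetch` as an option.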

Local validation

YAML parse for changed workflows passed.
replay-pbt-matrix.py compiles.
all-declared without sharding fails loudly when over max_targets.
all-declared with shard_count=2 emits 192 targets for shard 0.
ddev lint/format passed.
Replay PBT unit/cache tests passed: 17 passed, 10 skipped.
Local cached KrakenD smoke run passed: 13 passed, 6 skipped.

Notes

This is intentionally a draft POC. The full manual replay-PBT workflow should be exercised first with a tiny capped run before any full seeded or all-property run.

dd-octo-sts · 2026-05-27T08:57:08Z

Validation Report

All 21 validations passed.

Validation	Description	Status
`agent-reqs`	Verify check versions match the Agent requirements file	✅
`ci`	Validate CI configuration and Codecov settings	✅
`codeowners`	Validate every integration has a CODEOWNERS entry	✅
`config`	Validate default configuration files against spec.yaml	✅
`dep`	Verify dependency pins are consistent and Agent-compatible	✅
`http`	Validate integrations use the HTTP wrapper correctly	✅
`imports`	Validate check imports do not use deprecated modules	✅
`integration-style`	Validate check code style conventions	✅
`jmx-metrics`	Validate JMX metrics definition files and config	✅
`labeler`	Validate PR labeler config matches integration directories	✅
`legacy-signature`	Validate no integration uses the legacy Agent check signature	✅
`license-headers`	Validate Python files have proper license headers	✅
`licenses`	Validate third-party license attribution list	✅
`metadata`	Validate metadata.csv metric definitions	✅
`models`	Validate configuration data models match spec.yaml	✅
`openmetrics`	Validate OpenMetrics integrations disable the metric limit	✅
`package`	Validate Python package metadata and naming	✅
`qa-label`	Validate the pull request declares whether it needs QA for the next Agent release	✅
`readmes`	Validate README files have required sections	✅
`saved-views`	Validate saved view JSON file structure and fields	✅
`version`	Validate version consistency between package and changelog	✅

Replace the report's top-of-page meta sections (How to read this report, What this job is doing, Conceptual model, This batch at a glance, Check inventory, Validation families, Validation status by target, Outcome summary, Triage view, Failure categories) with a single Summary table that names the actionable buckets in plain language and lists the likely owner.

Rename Actionable failed targets to Failures to fix, Property findings to Failed checks, and Setup/cache target details to Targets that did not run. Collapse the five identical-error sub-buckets in the latter into one flat list since the Short error column is the same generic 'no replay cache' string for all of them.

When several targets in the same Failures to fix bucket share the exact same failing-check set, render the list once with a 'likely one root cause' note instead of repeating the same N-check explosion per row. Stop nesting <details> inside table cells in the failed-checks column; the rich detail lives in report.html.

Move the Mermaid flow diagram, glossary, check inventory, and validation-family taxonomy into one collapsed 'About this report' block at the very bottom, with header levels rewritten to nest cleanly. No schema changes; build_html, the combiner, and all JSON/TSV outputs are untouched.

…tric evidence

Round 2 of the readability cleanup. The previous iteration still had five problems:

1. 'Failures to fix' and 'Failed checks' rendered the same data twice.
2. Finding groups were keyed on the asset 'path', so calico appeared four times for the same monitor check and kong.http.status.count showed up twice byte-for-byte identically.
3. 'Failures to fix' rows did not say which metric was offending, so the reader had to scroll to 'Failed checks' to act.
4. 'Review warnings (296)' duplicated the fixture-coverage story already told by 'OpenMetrics fixture coverage'.
5. cassandra_nodetool x3 also appeared as three 'Changed outputs' rows downstream of the harness failure.

Changes:

- Drop path from group_actionable_findings key. One (target, check) is now one row with metrics aggregated.
- Delete the separate Failed checks section. The Failures to fix table now carries an Evidence column with the offending metrics inline, scraped from structured findings or from the assertion's short_errors as a fallback (covers metadata-contract failures whose evidence lives in the AssertionError text).
- Collapse rows within a bucket that produce identical Evidence (e.g. three marklogic envs failing the same metric set become one row).
- Replace the OpenMetrics coverage section header with a Fixture coverage section that also folds in the dashboard-query-tag warning count, so fixture-quality signals live in exactly one place.
- Suppress Latest release comparison rows whose target failed in the harness bucket above; show the count as a one-line note alongside the unchanged-output count, dropping the 100-row collapsed table.
- Pass artifact_name='replay-pbt-combined-report' from the combine script so the 'Detailed dashboard' line points at the right artifact.
- Replace the in-report Check inventory table with a link to replay-validation-README.md (static data, not run-specific).
- Tighten the headline sentence to '... N need attention and M never ran.' so the call to action is obvious.
- Reword the 'Targets that did not run' disclaimer to surface that each row already links to its specific shard job.
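
The regrouping described in this commit (key on target and check only, then merge rows whose Evidence is identical) can be sketched as follows. The finding dictionaries, their field names, and both function names are hypothetical; the real group_actionable_findings may use a different data shape.

```python
from collections import defaultdict


def group_findings(findings):
    """Produce one row per (target, check); 'path' is deliberately not part of the key."""
    grouped = defaultdict(set)
    for finding in findings:
        grouped[(finding["target"], finding["check"])].add(finding["metric"])
    return [
        {"target": target, "check": check, "evidence": sorted(metrics)}
        for (target, check), metrics in sorted(grouped.items())
    ]


def collapse_identical_evidence(rows):
    """Merge rows that fail the same check with exactly the same metric set."""
    merged = defaultdict(list)
    for row in rows:
        merged[(row["check"], tuple(row["evidence"]))].append(row["target"])
    return [
        {"targets": targets, "check": check, "evidence": list(evidence)}
        for (check, evidence), targets in merged.items()
    ]
```

With this keying, the same finding reported from several asset paths contributes one metric to one row instead of producing a row per path.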

…roperties

Add five new replay-validation properties and supporting helpers.

Replay-regression (assert invariants over normalized check output):

* histogram-bucket-monotonicity: for Prometheus cumulative histograms, group .bucket metrics by (name, tags excluding upper_bound) and assert values are non-decreasing as upper_bound grows.
* histogram-inf-equals-count: by definition the +Inf bucket count equals the histogram's total observation count, so the emitted .bucket value at upper_bound:+Inf must equal the corresponding .count metric for the same series.

Replay-metamorphic (mutate captured request bodies/headers and assert normalized output is unchanged):

* openmetrics-line-endings: toggle LF <-> CRLF in OpenMetrics request capture bodies; equivalent record separators in Prometheus exposition.
* openmetrics-sample-whitespace: widen/collapse the whitespace separating sample name/labels from value; Prometheus exposition allows any run of spaces/tabs.
* http-response-header-casing: flip case of HTTP response header names in request captures; HTTP/1.1 field names are case-insensitive per RFC 7230 §3.2.

Implementation:

* New body mutators (toggle_line_endings, expand_sample_whitespace) in datadog_checks_dev/datadog_checks/dev/replay/pbt/openmetrics.py with semantics-preservation property tests covering both sample and non-sample bodies plus concrete round-trips.
* New cache-level mutators in pbt/cache.py: mutate_request_capture_line_endings, mutate_request_capture_sample_whitespace, and mutate_request_capture_header_casing (with a _flip_header_case helper that safely declines on case-insensitive name collisions).
* Histogram-output helpers in ddev/tests/cli/env/test_replay_pbt.py with full unit-test coverage of accept, reject, group-by-other-tags, skip-when-missing, and skip-when-no-eligible-series paths.
* Five new pytest entry points wired into the existing _assert_mutated_cache_matches_original_output and compare-check invocation patterns.
* Properties registered with families in pbt/properties.py: 19 -> 23 total properties.
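
The two histogram invariants can be sketched as below. This is an illustration only: it assumes each normalized sample is a dict with `name`, `value`, and a `tags` list of `key:value` strings, which may not match the shape used by the helpers in test_replay_pbt.py, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def _split_upper_bound(tags):
    """Return (upper_bound as float or None, remaining tags as a sorted tuple)."""
    bound, rest = None, []
    for tag in tags:
        key, _, value = tag.partition(":")
        if key == "upper_bound":
            bound = float(value)  # float() parses "+Inf" as infinity
        else:
            rest.append(tag)
    return bound, tuple(sorted(rest))


def bucket_series(samples):
    """Group .bucket samples by (name, tags excluding upper_bound), sorted by bound."""
    series = defaultdict(list)
    for sample in samples:
        if sample["name"].endswith(".bucket"):
            bound, rest = _split_upper_bound(sample["tags"])
            if bound is not None:
                series[(sample["name"], rest)].append((bound, sample["value"]))
    return {key: sorted(buckets) for key, buckets in series.items()}


def monotonicity_violations(samples):
    """Series whose cumulative bucket values decrease as upper_bound grows."""
    bad = []
    for key, buckets in bucket_series(samples).items():
        values = [value for _, value in buckets]
        if any(later < earlier for earlier, later in zip(values, values[1:])):
            bad.append(key)
    return bad


def inf_count_mismatches(samples):
    """Series whose +Inf bucket differs from the matching .count metric."""
    counts = {
        (s["name"], tuple(sorted(s["tags"]))): s["value"]
        for s in samples
        if s["name"].endswith(".count")
    }
    bad = []
    for (name, rest), buckets in bucket_series(samples).items():
        bound, value = buckets[-1]
        count = counts.get((name[: -len(".bucket")] + ".count", rest))
        # A missing +Inf bucket or a missing .count is skipped, not reported.
        if bound == math.inf and count is not None and value != count:
            bad.append((name, rest))
    return bad
```

Series without a +Inf bucket or without a matching .count are skipped in this sketch, mirroring the skip-when-missing path mentioned above.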

…erties

Mirror the five properties introduced in bc18f2b into the report renderer's parallel tables:

- PROPERTY_DEFINITIONS: human labels for openmetrics-line-endings, openmetrics-sample-whitespace, http-response-header-casing, histogram-bucket-monotonicity, histogram-inf-equals-count.
- PROPERTY_VALIDATION_FAMILIES: classify the three OpenMetrics/HTTP mutations as replay-metamorphic and the two histogram invariants as replay-regression so the appendix family table counts them correctly.
- TEST_DEFINITIONS: labels for the five new pytest entry points so the collapsed 'same N checks' hint and the Evidence fallback render with crafted names instead of auto-titlecased snake_case.
- classify(): add substring matches so failures of the new tests land in 'openmetrics-input-invariance' (mutations) or 'invalid-metric-values' (histograms) rather than the catch-all 'other-failed' bucket.
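
A substring-based classifier of the kind described for classify() could look like the sketch below. The three bucket names come from this commit message; the substrings, the test names they are matched against, and the function name `classify_failure` are assumptions for illustration.

```python
# Ordered (substring, bucket) rules; the first match wins.
# The substrings are hypothetical; only the bucket names come from the commit message.
_BUCKET_RULES = (
    ("line_endings", "openmetrics-input-invariance"),
    ("sample_whitespace", "openmetrics-input-invariance"),
    ("header_casing", "openmetrics-input-invariance"),
    ("histogram_bucket_monotonicity", "invalid-metric-values"),
    ("histogram_inf_equals_count", "invalid-metric-values"),
)


def classify_failure(test_name: str) -> str:
    """Map a failing pytest entry point to a report bucket by substring match."""
    for needle, bucket in _BUCKET_RULES:
        if needle in test_name:
            return bucket
    return "other-failed"  # catch-all for tests with no matching rule
```

Any new test without a matching rule still lands in 'other-failed', which is why the renderer tables have to be updated whenever properties are added.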

New ddev env compare-agent command + datadog_checks_dev replay/agent subpackage.

Drives the real Datadog Agent binary against two images, recording HTTP/subprocess/etc. via an in-Agent monkeypatch shim and capturing three probe outputs per run:

- agent integration freeze -> freeze.diff.json (IR-53148 oracle)
- agent diagnose inventory -> inventory.diff.json
- agent check --check-rate -> check.diff.json (behavioural)

The shim is a self-contained ddev_shim package mounted into the Agent container's embedded3 site-packages; sitecustomize.py + a .pth file activate it at every interpreter start. Adapter modules are copied live from datadog_checks_dev/datadog_checks/dev/replay/adapters/ with their internal imports rewritten so the no-Agent and in-Agent code paths cannot drift.

Verified locally: openmetrics Python check end-to-end (record/replay deterministic, 2 HTTP scrapes captured); IR-53148 negative control reproduces the five missing manifestless integrations across datadog/agent:7.78.0 -> 7.78.1.

- replay-pbt-matrix.py: new --runner / --record-image / --replay-image flags. Each matrix row carries 'runner', a runner-prefixed artifact_slug, and a runner-segmented cache_key so agent and no-agent caches do not cross-pollinate. JMX integrations filtered out of the agent runner (the in-Agent shim has no insertion point in JMXFetch).
- zz-test-worker-poc.yaml: new 'runner' / 'record_image' / 'replay_image' workflow_dispatch inputs. New steps 'Pull Agent images (agent runner only)' and 'Run compare-agent (agent runner only)'; existing replay-pbt cache + run steps gated to runner != agent.

Selected count: 227 sibling targets for the agent runner. Splittable across batches with the existing dispatch_batches=true machinery.

…dispatcher

The dispatch-batches step constructs baseInputs from a fixed list of env-passthrough fields. When I added the runner/record_image/replay_image workflow inputs in the previous commit, I forgot to also extend the dispatcher's env + baseInputs to forward them to the spawned batch runs. Without this, batches dispatched with runner=agent inherit the empty default and fall through to the no-agent code path.

Fixes the gate observed on run group replay-pbt-26528436414 where the 'Run compare-agent' step was skipped on all 8 batches.

The compare-agent runner now supports the same --replay-cache option as compare-check, with 'latest'/'auto' semantics scanning .ddev/replay/<integration>/<environment>/*. When a cache is provided, both Agent images run in replay mode against the same seeded fixture instead of recording fresh:

- no dd_environment startup needed,
- both Agents see identical inputs (pure behavioural diff),
- a single seed run (via compare-check) can feed both runners.

The dispatcher cache_key drops the runner segment so agent and no-agent share the cache namespace. Worker workflow gains a Restore + Seed pair for the agent runner that mirrors the no-agent flow; the Save step preserves freshly seeded caches.

Validated locally: openmetrics:cachetest with --replay-cache against an existing fixture, record image 7.76.2, replay image 7.77.0. freeze.diff shows real cross-version package delta; check.diff equal because both Agents replayed the same fixture; fixture_source=cache in run_summary.json confirming the record run was skipped.
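
One way to read the 'latest'/'auto' cache lookup is sketched below. Only the .ddev/replay/<integration>/<environment>/* layout comes from this commit message. The function name, the assumption that cache directory names sort chronologically, and the split in behavior between 'latest' (fail when nothing is cached) and 'auto' (fall back to recording) are guesses for illustration and may not match compare-agent.

```python
from __future__ import annotations

from pathlib import Path


def resolve_replay_cache(
    root: Path, integration: str, environment: str, selector: str
) -> Path | None:
    """Resolve a --replay-cache style value to a cache directory.

    Hypothetical semantics: 'latest' requires a cache, 'auto' returns None
    when nothing is cached so the caller can record instead, and any other
    value is treated as an explicit path.
    """
    if selector not in ("latest", "auto"):
        return Path(selector)
    base = root / ".ddev" / "replay" / integration / environment
    # Assumes directory names (for example timestamps) sort chronologically.
    candidates = sorted(p for p in base.iterdir() if p.is_dir()) if base.is_dir() else []
    if candidates:
        return candidates[-1]
    if selector == "auto":
        return None
    raise FileNotFoundError(f"no replay cache under {base}")
```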

The previous worker change made the 'Seed cache if missing' step exit 1 on compare-check failure (set -euo pipefail + ddev env compare-check). GitHub Actions' default success() gating then skipped 'Run compare-agent' entirely, removing the record-mode fallback that allowed agent jobs to succeed when no cache existed.

Fix:

- Seed step: continue-on-error: true; explicit branching on compare-check exit code that writes seeded=false and exits 1 (still surfaces the failure in the UI but does not block downstream).
- Compare-agent step: if condition explicitly accepts seed conclusion of 'success' or 'failure'. The script's use_cache logic already handles both cases.

This restores the 89% pass rate baseline observed before cache-reuse was introduced, while keeping cache-reuse for cold-warmed targets.

nubtron added 30 commits May 25, 2026 08:06

Add cilium monkeypatch HTTP replay tests

e1485ae

Add live cilium monkeypatch replay e2e test

c27ea86

Move cilium replay helpers into dev platform

db2e01f

Add no-agent compare-check command

e9eedbb

Skip generic tag validation in compare-check containers

af7a69b

Intentionally alter cilium metric for compare-check validation

ea6ca0d

Revert "Intentionally alter cilium metric for compare-check validation"

d52cd66

This reverts commit 7fd75bfd43df2e362e11a5c4a50e9c06d5ce9bce.

Wait for Argo CD metrics endpoints in e2e setup

ca15104

Remove cilium-specific replay smoke tests

8b1ff3e

Infer check class in compare-check

ffe6b78

Default compare-check new side to working tree

b73e37f

Create timestamped compare-check artifact runs

f1d5dad

Add platform tests for replay helpers

76550a1

Apply replay helper formatting

b4fb6b2

Wait for Istio metrics endpoint in e2e setup

cd335ce

Merge remote-tracking branch 'origin/master' into nubtron/metadata-e2…

da26379

…e-monkeypatch-replay # Conflicts: # cilium/tests/conftest.py # istio/tests/conftest.py

Allow compare-check to select multiple environments

11fd8b9

Continue compare-check when one side fails

e569aae

Add Hatch compare-check runner and subprocess replay

074fa1b

Allow compare-check to replay cached fixtures

cc106f2

Auto-discover compare-check replay caches

2c967c5

Add CI-grade compare-check fixture keys

754f377

Store portable compare-check cache provenance

c53d3ed

Add TCP replay support for memcache and ZooKeeper

0cde24b

Add replay adapter dispatch seam

ff016be

Add process and psycopg replay adapters

1d6fdb9

Capture base agent outputs in replay

dd782ac

Load all replay adapters by default

53ada3d

Add replay adapter property tests

867554e

Add replay cache property tests

a0e6c1a

nubtron added 14 commits May 26, 2026 17:08

Trim Replay PBT target report footer

ebd0533

Rename Replay PBT flow report node

6510b7e

Simplify Replay PBT findings label

e3d709c

Use human-readable Replay PBT statuses

79e6381

Explain Replay PBT failed checks

fc381d5

Tweak Replay PBT dashboard copy

073e5f0

Collapse extra Replay PBT coverage rows

84d8327

Split asset findings by source type

44dee69

Refine Replay PBT failure categories

b5c3988

Add latest release differential Replay PBT

d556dbb

Surface release differential summaries

7ca72a5

Clarify Replay PBT report triage

0b00f76

Restructure Replay PBT report findings

689df66

Refine Replay PBT report sections

43db8a3

nubtron added 15 commits May 27, 2026 10:54

Clarify replay validation POC reports

f337a94

Measure OpenMetrics coverage empirically

7e4ebad

Improve replay PBT harness coverage

731e740

compare-agent report: surface probe diffs

d8d79cf

compare-agent worker: locate diff run directory for summaries

f257b64

replay-pbt report: clarify failure tables

91bfc7f

nubtron wants to merge 118 commits into master from nubtron/metadata-e2e-monkeypatch-replay
