Add replay PBT CI proof of concept #23827

Draft
nubtron wants to merge 118 commits into master from nubtron/metadata-e2e-monkeypatch-replay

nubtron (Contributor) commented May 25, 2026

Summary

This PR adds a replay-PBT CI proof of concept for cached integration replay testing.

Highlights:

  • Rename compare-check concepts from old/new to record/replay while retaining old flag aliases.
  • Add replay-PBT fixture/target refs and cache-only probing.
  • Add a small branch-gated PR smoke job for KrakenD replay-PBT.
  • Repurpose the existing manual zz-test-worker-poc.yaml workflow as a parallel Replay PBT POC runner.
  • Add a replay-PBT matrix script with sharding and fail-loud truncation protection.

CI POC behavior

The manual POC workflow supports:

  • changed, all-cached, and all-declared modes
  • optional cache seeding
  • smoke vs all property sets
  • per-target cache restore/save
  • sharding to avoid the GitHub 256-job matrix limit
  • summary collection as JSON/TSV plus a GitHub step summary table
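The sharding and fail-loud behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in replay-pbt-matrix.py: the function name `select_shard`, the exception `MatrixTooLargeError`, the round-robin split, and the default limit of 256 are all assumptions made for the example.

```python
from typing import List, Sequence


class MatrixTooLargeError(RuntimeError):
    """Raised instead of silently truncating the target list (hypothetical name)."""


def select_shard(
    targets: Sequence[str],
    shard_index: int = 0,
    shard_count: int = 1,
    max_targets: int = 256,
) -> List[str]:
    """Return the targets that belong to one shard, or fail loudly."""
    if shard_count < 1 or not 0 <= shard_index < shard_count:
        raise ValueError(f"invalid shard {shard_index} of {shard_count}")

    # Sort first so every shard job computes the same partition.
    ordered = sorted(targets)
    # Round-robin split: shard i takes items i, i + n, i + 2n, ...
    shard = ordered[shard_index::shard_count]

    if len(shard) > max_targets:
        # Ceiling division gives the smallest shard count that would fit.
        suggested = -(-len(ordered) // max_targets)
        raise MatrixTooLargeError(
            f"{len(shard)} targets exceed max_targets={max_targets}; "
            f"rerun with shard_count>={suggested}"
        )
    return shard
```

Failing with a suggested shard count keeps an over-large run from quietly covering only part of the declared targets.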

Security review notes

Before opening this draft, I reviewed the changed workflows for common workflow security issues:

  • No pull_request_target usage.
  • All external actions in changed workflow paths are pinned to full commit SHAs.
  • The POC workflow uses only contents: read permissions.
  • No GitHub secrets are used by the new POC workflow.
  • No github-script, dynamic script execution, eval, or curl-to-shell pattern added.
  • Manual ref inputs are validated before use in git fetch / matrix generation.
  • Matrix-derived artifact names use a sanitized artifact_slug.
  • Matrix target path components are validated in the matrix script before being used in cache paths.
  • Matrix truncation is disabled by default; over-large runs fail with shard guidance instead of silently dropping targets.
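For reviewers, the input validation and slug sanitization described in these notes could look something like the sketch below. The allow-list patterns and the helper names (`validate_ref`, `validate_path_component`, `artifact_slug`) are assumptions for illustration; the actual rules in the workflow and matrix script may be stricter or different.

```python
import re

# Allow-lists (assumed): refs may contain slashes, path components may not.
_SAFE_REF = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/-]*")
_SAFE_COMPONENT = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def validate_ref(ref: str) -> str:
    """Reject refs that could be read as options or escape the ref namespace."""
    if (
        not _SAFE_REF.fullmatch(ref)  # a leading '-' fails the first-character rule
        or ".." in ref
        or "//" in ref
        or ref.endswith("/")
    ):
        raise ValueError(f"unsafe git ref: {ref!r}")
    return ref


def validate_path_component(component: str) -> str:
    """Accept a single directory name only; no separators, no '.' or '..'."""
    if not _SAFE_COMPONENT.fullmatch(component):
        raise ValueError(f"unsafe path component: {component!r}")
    return component


def artifact_slug(*parts: str) -> str:
    """Collapse anything outside the allow-list into single hyphens."""
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", "-".join(parts)).strip("-.")
    if not slug:
        raise ValueError("artifact slug is empty after sanitizing")
    return slug
```

Validation rejects bad input outright, while the slug helper rewrites it; the difference matters because a rewritten ref or cache path would point somewhere the user did not ask for.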

Local validation

  • YAML parse for changed workflows passed.
  • replay-pbt-matrix.py compiles.
  • all-declared without sharding fails loudly when over max_targets.
  • all-declared with shard_count=2 emits 192 targets for shard 0.
  • ddev lint/format passed.
  • Replay PBT unit/cache tests passed: 17 passed, 10 skipped.
  • Local cached KrakenD smoke run passed: 13 passed, 6 skipped.

Notes

This is intentionally a draft POC. The full manual replay-PBT workflow should be exercised first with a tiny capped run before any full seeded or all-property run.

nubtron added 30 commits May 25, 2026 08:06
This reverts commit 7fd75bfd43df2e362e11a5c4a50e9c06d5ce9bce.
…e-monkeypatch-replay

# Conflicts:
#	cilium/tests/conftest.py
#	istio/tests/conftest.py

dd-octo-sts Bot (Contributor) commented May 27, 2026

Validation Report

All 21 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and Codecov settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| qa-label | Validate the pull request declares whether it needs QA for the next Agent release | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |


nubtron added 15 commits May 27, 2026 10:54
Replace the report's top-of-page meta sections (How to read this report,
What this job is doing, Conceptual model, This batch at a glance, Check
inventory, Validation families, Validation status by target, Outcome
summary, Triage view, Failure categories) with a single Summary table
that names the actionable buckets in plain language and lists the likely
owner.

Rename 'Actionable failed targets' to 'Failures to fix', 'Property findings'
to 'Failed checks', and 'Setup/cache target details' to 'Targets that did
not run'. Collapse the five identical-error sub-buckets in the latter
into one flat list, since the Short error column shows the same generic
'no replay cache' string for all of them.

When several targets in the same Failures to fix bucket share the exact
same failing-check set, render the list once with a 'likely one root
cause' note instead of repeating the same N-check explosion per row.
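The collapsing rule in the paragraph above amounts to grouping targets by their set of failing checks. A minimal sketch, assuming the input is a plain mapping from target name to failing check names (the report's real data model is not shown here, and `collapse_by_failing_checks` is a hypothetical name):

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def collapse_by_failing_checks(
    failures: Dict[str, Set[str]],
) -> List[Tuple[List[str], List[str]]]:
    """Return one (targets, checks) row per distinct failing-check set."""
    groups = defaultdict(list)
    for target, checks in failures.items():
        # frozenset makes the check set hashable and order-independent.
        groups[frozenset(checks)].append(target)

    rows = [(sorted(targets), sorted(checks)) for checks, targets in groups.items()]
    rows.sort()  # stable output regardless of input order
    return rows
```

A row with more than one target is the case that would carry the 'likely one root cause' note.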

Stop nesting <details> inside table cells in the failed-checks column;
the rich detail lives in report.html.

Move the Mermaid flow diagram, glossary, check inventory, and
validation-family taxonomy into one collapsed 'About this report'
block at the very bottom, with header levels rewritten to nest cleanly.

No schema changes; build_html, the combiner, and all JSON/TSV outputs
are untouched.
…tric evidence

Round 2 of the readability cleanup. The previous iteration still had five
problems:

1. 'Failures to fix' and 'Failed checks' rendered the same data twice.
2. Finding groups were keyed on the asset 'path', so calico appeared four
   times for the same monitor check and kong.http.status.count showed up
   twice byte-for-byte identically.
3. 'Failures to fix' rows did not say which metric was offending, so the
   reader had to scroll to 'Failed checks' to act.
4. 'Review warnings (296)' duplicated the fixture-coverage story already
   told by 'OpenMetrics fixture coverage'.
5. cassandra_nodetool x3 also appeared as three 'Changed outputs' rows
   downstream of the harness failure.

Changes:

- Drop path from group_actionable_findings key. One (target, check) is
  now one row with metrics aggregated.
- Delete the separate Failed checks section. The Failures to fix table
  now carries an Evidence column with the offending metrics inline,
  scraped from structured findings or from the assertion's short_errors
  as a fallback (covers metadata-contract failures whose evidence lives
  in the AssertionError text).
- Collapse rows within a bucket that produce identical Evidence (e.g.
  three marklogic envs failing the same metric set become one row).
- Replace the OpenMetrics coverage section header with a Fixture coverage
  section that also folds in the dashboard-query-tag warning count, so
  fixture-quality signals live in exactly one place.
- Suppress Latest release comparison rows whose target failed in the
  harness bucket above; show the count as a one-line note alongside the
  unchanged-output count, dropping the 100-row collapsed table.
- Pass artifact_name='replay-pbt-combined-report' from the combine
  script so the 'Detailed dashboard' line points at the right artifact.
- Replace the in-report Check inventory table with a link to
  replay-validation-README.md (static data, not run-specific).
- Tighten the headline sentence to '... N need attention and M never
  ran.' so the call to action is obvious.
- Reword the 'Targets that did not run' disclaimer to surface that each
  row already links to its specific shard job.
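The first change in this list, keying groups on (target, check) rather than on the asset path, can be illustrated with the sketch below. The `Finding` dataclass and the function name are hypothetical stand-ins; the real group_actionable_findings operates on the report's own structures, which this commit message does not show.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one structured finding."""
    target: str
    check: str
    path: str    # asset path; deliberately ignored when grouping
    metric: str


def group_by_target_and_check(
    findings: Iterable[Finding],
) -> Dict[Tuple[str, str], List[str]]:
    """One row per (target, check), with offending metrics deduplicated."""
    grouped = defaultdict(set)
    for finding in findings:
        grouped[(finding.target, finding.check)].add(finding.metric)
    return {key: sorted(metrics) for key, metrics in sorted(grouped.items())}
```

Because the metrics are collected into a set, the same metric reported from several asset paths appears once in the resulting row.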
…roperties

Add five new replay-validation properties and supporting helpers.

Replay-regression (assert invariants over normalized check output):

* histogram-bucket-monotonicity: for Prometheus cumulative histograms,
  group .bucket metrics by (name, tags excluding upper_bound) and
  assert values are non-decreasing as upper_bound grows.
* histogram-inf-equals-count: by definition the +Inf bucket count
  equals the histogram's total observation count, so the emitted
  .bucket value at upper_bound:+Inf must equal the corresponding
  .count metric for the same series.
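As a rough illustration of the two histogram invariants, the sketch below checks them over a simplified sample format. It assumes each normalized sample is a `(name, value, tags)` tuple with tags as `"key:value"` strings; that shape and all function names are assumptions, not the helpers added in test_replay_pbt.py.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Sample = Tuple[str, float, Sequence[str]]
SeriesKey = Tuple[str, Tuple[str, ...]]


def _split_upper_bound(tags: Sequence[str]) -> Tuple[Optional[float], Tuple[str, ...]]:
    """Separate the upper_bound tag from the tags that identify the series."""
    bound, rest = None, []
    for tag in tags:
        key, _, value = tag.partition(":")
        if key == "upper_bound":
            bound = float(value)  # float() accepts "+Inf"
        else:
            rest.append(tag)
    return bound, tuple(sorted(rest))


def bucket_series(samples: Iterable[Sample]) -> Dict[SeriesKey, List[Tuple[float, float]]]:
    """Group .bucket samples by (name, other tags), ordered by upper bound."""
    series = defaultdict(list)
    for name, value, tags in samples:
        if name.endswith(".bucket"):
            bound, rest = _split_upper_bound(tags)
            if bound is not None:
                series[(name, rest)].append((bound, value))
    return {key: sorted(points) for key, points in series.items()}


def non_monotonic_series(samples: Sequence[Sample]) -> List[SeriesKey]:
    """Series whose cumulative bucket values decrease as upper_bound grows."""
    bad = []
    for key, points in bucket_series(samples).items():
        values = [value for _, value in points]
        if any(a > b for a, b in zip(values, values[1:])):
            bad.append(key)
    return bad


def inf_count_mismatches(samples: Sequence[Sample]) -> List[SeriesKey]:
    """Series whose +Inf bucket differs from the matching .count metric."""
    counts = {
        (name[: -len(".count")], tuple(sorted(tags))): value
        for name, value, tags in samples
        if name.endswith(".count")
    }
    bad = []
    for (name, rest), points in bucket_series(samples).items():
        bound, value = points[-1]
        expected = counts.get((name[: -len(".bucket")], rest))
        # Skip series with no +Inf bucket or no .count metric.
        if bound == float("inf") and expected is not None and value != expected:
            bad.append((name, rest))
    return bad
```

Both checks skip series they cannot evaluate rather than failing, which mirrors the skip-when-missing behavior mentioned for the real helpers.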

Replay-metamorphic (mutate captured request bodies/headers and assert
normalized output is unchanged):

* openmetrics-line-endings: toggle LF <-> CRLF in OpenMetrics request
  capture bodies; equivalent record separators in Prometheus
  exposition.
* openmetrics-sample-whitespace: widen/collapse the whitespace
  separating sample name/labels from value; Prometheus exposition
  allows any run of spaces/tabs.
* http-response-header-casing: flip case of HTTP response header names
  in request captures; HTTP/1.1 field names are case-insensitive per
  RFC 7230 §3.2.
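The two body mutations could be sketched as below. The function names mirror the mutators named in this commit message, but the bodies are an independent, simplified illustration: exemplars, escaped label edge cases beyond quoted values, and mixed line endings are handled only as far as the comments say.

```python
from typing import Optional


def toggle_line_endings(body: str) -> str:
    """Convert CRLF to LF if any CRLF is present, otherwise LF to CRLF."""
    if "\r\n" in body:
        return body.replace("\r\n", "\n")
    return body.replace("\n", "\r\n")


def _name_end(sample: str) -> Optional[int]:
    """Index just past the metric name and optional {label set}."""
    in_quotes = escaped = in_labels = False
    for i, ch in enumerate(sample):
        if in_quotes:
            # Inside a quoted label value: only an unescaped quote ends it.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == "{":
            in_labels = True
        elif ch == "}":
            return i + 1
        elif ch in " \t" and not in_labels:
            return i
    return None


def expand_sample_whitespace(body: str, filler: str = "  ") -> str:
    """Replace the separator before each sample value with `filler`."""
    out = []
    for line in body.split("\n"):
        stripped = line.lstrip(" \t")
        lead = line[: len(line) - len(stripped)]
        # Comment lines (# HELP, # TYPE) and blank lines pass through untouched.
        if stripped and not stripped.startswith("#"):
            end = _name_end(stripped)
            if end is not None:
                rest = stripped[end:]
                value = rest.lstrip(" \t")
                if value and len(value) < len(rest):
                    line = lead + stripped[:end] + filler + value
        out.append(line)
    return "\n".join(out)
```

The metamorphic property is then that a check replayed against the mutated body emits the same normalized output as the original capture.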

Implementation:

* New body mutators (toggle_line_endings, expand_sample_whitespace) in
  datadog_checks_dev/datadog_checks/dev/replay/pbt/openmetrics.py with
  semantics-preservation property tests covering both sample and
  non-sample bodies plus concrete round-trips.
* New cache-level mutators in pbt/cache.py:
  mutate_request_capture_line_endings,
  mutate_request_capture_sample_whitespace, and
  mutate_request_capture_header_casing (with a _flip_header_case
  helper that safely declines on case-insensitive name collisions).
* Histogram-output helpers in ddev/tests/cli/env/test_replay_pbt.py
  with full unit-test coverage of accept, reject, group-by-other-tags,
  skip-when-missing, and skip-when-no-eligible-series paths.
* Five new pytest entry points wired into the existing
  _assert_mutated_cache_matches_original_output and compare-check
  invocation patterns.
* Properties registered with families in pbt/properties.py: 19 -> 23
  total properties.
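One way to read "safely declines on case-insensitive name collisions" for the header-casing mutator is shown below. This is an interpretation, not the PR's _flip_header_case: the function name, the dict-based header model, and the choice of `str.swapcase` as the flip are assumptions.

```python
from typing import Dict, Optional


def flip_header_case(headers: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return headers with name case flipped, or None if that is ambiguous."""
    lowered = [name.lower() for name in headers]
    if len(set(lowered)) != len(lowered):
        # Two names already differ only by case; flipping could merge or
        # reorder them, so decline rather than change the captured meaning.
        return None
    # Values are left untouched; only field names are case-insensitive.
    return {name.swapcase(): value for name, value in headers.items()}
```

A caller would treat `None` as "no mutation possible for this capture" and skip the property for that request.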
…erties

Mirror the five properties introduced in bc18f2b into the report
renderer's parallel tables:

- PROPERTY_DEFINITIONS: human labels for openmetrics-line-endings,
  openmetrics-sample-whitespace, http-response-header-casing,
  histogram-bucket-monotonicity, histogram-inf-equals-count.
- PROPERTY_VALIDATION_FAMILIES: classify the three OpenMetrics/HTTP
  mutations as replay-metamorphic and the two histogram invariants as
  replay-regression so the appendix family table counts them correctly.
- TEST_DEFINITIONS: labels for the five new pytest entry points so the
  collapsed 'same N checks' hint and the Evidence fallback render with
  crafted names instead of auto-titlecased snake_case.
- classify(): add substring matches so failures of the new tests land
  in 'openmetrics-input-invariance' (mutations) or 'invalid-metric-values'
  (histograms) rather than the catch-all 'other-failed' bucket.

New ddev env compare-agent command + datadog_checks_dev replay/agent
subpackage. Drives the real Datadog Agent binary against two images,
recording HTTP/subprocess/etc. via an in-Agent monkeypatch shim and
capturing three probe outputs per run:

- agent integration freeze    -> freeze.diff.json (IR-53148 oracle)
- agent diagnose inventory    -> inventory.diff.json
- agent check --check-rate    -> check.diff.json (behavioural)

The shim is a self-contained ddev_shim package mounted into the Agent
container's embedded3 site-packages; sitecustomize.py + a .pth file
activate it at every interpreter start. Adapter modules are copied
live from datadog_checks_dev/datadog_checks/dev/replay/adapters/ with
their internal imports rewritten so the no-Agent and in-Agent code
paths cannot drift.

Verified locally: openmetrics Python check end-to-end (record/replay
deterministic, 2 HTTP scrapes captured); IR-53148 negative control
reproduces the five missing manifestless integrations across
datadog/agent:7.78.0 -> 7.78.1.

- replay-pbt-matrix.py: new --runner / --record-image / --replay-image
  flags. Each matrix row carries 'runner', a runner-prefixed
  artifact_slug, and a runner-segmented cache_key so agent and no-agent
  caches do not cross-pollinate. JMX integrations filtered out of the
  agent runner (the in-Agent shim has no insertion point in JMXFetch).

- zz-test-worker-poc.yaml: new 'runner' / 'record_image' / 'replay_image'
  workflow_dispatch inputs. New steps 'Pull Agent images (agent runner
  only)' and 'Run compare-agent (agent runner only)'; existing
  replay-pbt cache + run steps gated to runner != agent.

Selected count: 227 sibling targets for the agent runner. Splittable
across batches with the existing dispatch_batches=true machinery.
…dispatcher

The dispatch-batches step constructs baseInputs from a fixed list of
env-passthrough fields. When I added the runner/record_image/replay_image
workflow inputs in the previous commit, I forgot to also extend the
dispatcher's env + baseInputs to forward them to the spawned batch
runs. Without this, batches dispatched with runner=agent inherit the
empty default and fall through to the no-agent code path.

Fixes the gate observed on run group replay-pbt-26528436414 where the
'Run compare-agent' step was skipped on all 8 batches.

The compare-agent runner now supports the same --replay-cache option as
compare-check, with 'latest'/'auto' semantics scanning
.ddev/replay/<integration>/<environment>/*. When a cache is provided,
both Agent images run in replay mode against the same seeded fixture
instead of recording fresh:

- no dd_environment startup needed,
- both Agents see identical inputs (pure behavioural diff),
- a single seed run (via compare-check) can feed both runners.
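The cache lookup described above might be resolved along these lines. The sketch assumes that 'latest' and 'auto' both pick the most recently modified directory under the integration/environment path, that any other selector is an explicit path, and that a missing cache returns `None` so the caller can fall back to recording. None of these details are confirmed by the commit message, and `resolve_replay_cache` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def resolve_replay_cache(
    root: Path, integration: str, environment: str, selector: str
) -> Optional[Path]:
    """Map a --replay-cache style selector to a cache directory, if any."""
    if selector not in {"latest", "auto"}:
        # Assumed: anything else is an explicit cache directory.
        path = Path(selector)
        return path if path.is_dir() else None

    base = root / integration / environment
    if not base.is_dir():
        return None

    candidates = [entry for entry in base.iterdir() if entry.is_dir()]
    if not candidates:
        return None
    # Newest by modification time; the name breaks ties deterministically.
    return max(candidates, key=lambda entry: (entry.stat().st_mtime, entry.name))
```

Here `root` would be the `.ddev/replay` directory of the checkout.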

The dispatcher cache_key drops the runner segment so agent and no-agent
share the cache namespace. Worker workflow gains a Restore + Seed pair
for the agent runner that mirrors the no-agent flow; the Save step
preserves freshly seeded caches.

Validated locally: openmetrics:cachetest with --replay-cache against an
existing fixture, record image 7.76.2, replay image 7.77.0. freeze.diff
shows real cross-version package delta; check.diff equal because both
Agents replayed the same fixture; fixture_source=cache in
run_summary.json confirming the record run was skipped.

The previous worker change made the 'Seed cache if missing' step exit 1
on compare-check failure (set -euo pipefail + ddev env compare-check).
GitHub Actions' default success() gating then skipped 'Run compare-agent'
entirely, removing the record-mode fallback that allowed agent jobs to
succeed when no cache existed.

Fix:

- Seed step: continue-on-error: true; explicit branching on compare-check
  exit code that writes seeded=false and exits 1 (still surfaces the
  failure in the UI but does not block downstream).
- Compare-agent step: if condition explicitly accepts seed conclusion of
  'success' or 'failure'. The script's use_cache logic already handles
  both cases.

This restores the 89% pass rate baseline observed before cache-reuse was
introduced, while keeping cache-reuse for cold-warmed targets.