Add replay PBT CI proof of concept #23827
Draft
nubtron wants to merge 118 commits into
This reverts commit 7fd75bfd43df2e362e11a5c4a50e9c06d5ce9bce.
…e-monkeypatch-replay # Conflicts: # cilium/tests/conftest.py # istio/tests/conftest.py
Validation Report: All 21 validations passed.
Replace the report's top-of-page meta sections (How to read this report, What this job is doing, Conceptual model, This batch at a glance, Check inventory, Validation families, Validation status by target, Outcome summary, Triage view, Failure categories) with a single Summary table that names the actionable buckets in plain language and lists the likely owner.

- Rename Actionable failed targets to Failures to fix, Property findings to Failed checks, and Setup/cache target details to Targets that did not run.
- Collapse the five identical-error sub-buckets in the latter into one flat list, since the Short error column is the same generic 'no replay cache' string for all of them.
- When several targets in the same Failures to fix bucket share the exact same failing-check set, render the list once with a 'likely one root cause' note instead of repeating the same N-check explosion per row.
- Stop nesting <details> inside table cells in the failed-checks column; the rich detail lives in report.html.
- Move the Mermaid flow diagram, glossary, check inventory, and validation-family taxonomy into one collapsed 'About this report' block at the very bottom, with header levels rewritten to nest cleanly.

No schema changes; build_html, the combiner, and all JSON/TSV outputs are untouched.
…tric evidence

Round 2 of the readability cleanup. The previous iteration still had five problems:

1. 'Failures to fix' and 'Failed checks' rendered the same data twice.
2. Finding groups were keyed on the asset 'path', so calico appeared four times for the same monitor check and kong.http.status.count showed up twice, byte-for-byte identical.
3. 'Failures to fix' rows did not say which metric was offending, so the reader had to scroll to 'Failed checks' to act.
4. 'Review warnings (296)' duplicated the fixture-coverage story already told by 'OpenMetrics fixture coverage'.
5. cassandra_nodetool x3 also appeared as three 'Changed outputs' rows downstream of the harness failure.

Changes:

- Drop path from the group_actionable_findings key. One (target, check) is now one row with metrics aggregated.
- Delete the separate Failed checks section. The Failures to fix table now carries an Evidence column with the offending metrics inline, scraped from structured findings or from the assertion's short_errors as a fallback (covers metadata-contract failures whose evidence lives in the AssertionError text).
- Collapse rows within a bucket that produce identical Evidence (e.g. three marklogic envs failing the same metric set become one row).
- Replace the OpenMetrics coverage section header with a Fixture coverage section that also folds in the dashboard-query-tag warning count, so fixture-quality signals live in exactly one place.
- Suppress Latest release comparison rows whose target failed in the harness bucket above; show the count as a one-line note alongside the unchanged-output count, dropping the 100-row collapsed table.
- Pass artifact_name='replay-pbt-combined-report' from the combine script so the 'Detailed dashboard' line points at the right artifact.
- Replace the in-report Check inventory table with a link to replay-validation-README.md (static data, not run-specific).
- Tighten the headline sentence to '... N need attention and M never ran.' so the call to action is obvious.
- Reword the 'Targets that did not run' disclaimer to surface that each row already links to its specific shard job.
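The regrouping described above (one row per (target, check), with metrics aggregated and the asset path ignored) can be pictured with a minimal sketch. The finding field names and the helper name `group_findings` are assumptions made for illustration; this is not the PR's group_actionable_findings.

```python
from collections import defaultdict


def group_findings(findings):
    """Group findings by (target, check), deliberately ignoring 'path'.

    Hypothetical helper: the dict keys used here are illustrative only.
    """
    grouped = defaultdict(set)
    for finding in findings:
        key = (finding["target"], finding["check"])
        grouped[key].add(finding["metric"])
    # One row per key, with the offending metrics sorted for stable output.
    return {key: sorted(metrics) for key, metrics in grouped.items()}


# Made-up findings: the same check reported under several asset paths.
findings = [
    {"target": "calico", "check": "monitor", "path": "a.json", "metric": "calico.x"},
    {"target": "calico", "check": "monitor", "path": "b.json", "metric": "calico.y"},
    {"target": "kong", "check": "metadata", "path": "c.json", "metric": "kong.http.status.count"},
    {"target": "kong", "check": "metadata", "path": "d.json", "metric": "kong.http.status.count"},
]
rows = group_findings(findings)
```

Keying on the pair alone is what removes the duplicate rows: two findings that differ only in path fold into the same entry.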
…roperties

Add five new replay-validation properties and supporting helpers.

Replay-regression (assert invariants over normalized check output):

* histogram-bucket-monotonicity: for Prometheus cumulative histograms, group .bucket metrics by (name, tags excluding upper_bound) and assert values are non-decreasing as upper_bound grows.
* histogram-inf-equals-count: by definition the +Inf bucket count equals the histogram's total observation count, so the emitted .bucket value at upper_bound:+Inf must equal the corresponding .count metric for the same series.

Replay-metamorphic (mutate captured request bodies/headers and assert normalized output is unchanged):

* openmetrics-line-endings: toggle LF <-> CRLF in OpenMetrics request capture bodies; equivalent record separators in Prometheus exposition.
* openmetrics-sample-whitespace: widen/collapse the whitespace separating sample name/labels from value; Prometheus exposition allows any run of spaces/tabs.
* http-response-header-casing: flip case of HTTP response header names in request captures; HTTP/1.1 field names are case-insensitive per RFC 7230 §3.2.

Implementation:

* New body mutators (toggle_line_endings, expand_sample_whitespace) in datadog_checks_dev/datadog_checks/dev/replay/pbt/openmetrics.py with semantics-preservation property tests covering both sample and non-sample bodies plus concrete round-trips.
* New cache-level mutators in pbt/cache.py: mutate_request_capture_line_endings, mutate_request_capture_sample_whitespace, and mutate_request_capture_header_casing (with a _flip_header_case helper that safely declines on case-insensitive name collisions).
* Histogram-output helpers in ddev/tests/cli/env/test_replay_pbt.py with full unit-test coverage of accept, reject, group-by-other-tags, skip-when-missing, and skip-when-no-eligible-series paths.
* Five new pytest entry points wired into the existing _assert_mutated_cache_matches_original_output and compare-check invocation patterns.
* Properties registered with families in pbt/properties.py: 19 -> 23 total properties.
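The two histogram invariants can be sketched roughly as follows. This is a minimal illustration, assuming samples arrive as (name, tags, value) tuples with an upper_bound:<value> tag on bucket metrics; the sample shape and every function name here are hypothetical and do not reproduce the helpers in test_replay_pbt.py.

```python
import math
from collections import defaultdict


def split_upper_bound(tags):
    """Return (upper_bound as float or None, remaining tags as a sorted tuple)."""
    bound = None
    others = []
    for tag in tags:
        if tag.startswith("upper_bound:"):
            bound = float(tag.split(":", 1)[1])  # float() accepts "+Inf"
        else:
            others.append(tag)
    return bound, tuple(sorted(others))


def bucket_series(samples):
    """Group .bucket samples by (name, tags excluding upper_bound)."""
    series = defaultdict(list)
    for name, tags, value in samples:
        if not name.endswith(".bucket"):
            continue
        bound, others = split_upper_bound(tags)
        if bound is not None:
            series[(name, others)].append((bound, value))
    return series


def monotonicity_violations(samples):
    """List series whose cumulative value drops as upper_bound grows."""
    violations = []
    for key, points in bucket_series(samples).items():
        points.sort()
        for (low, low_value), (high, high_value) in zip(points, points[1:]):
            if high_value < low_value:
                violations.append((key, low, high))
    return violations


def inf_count_mismatches(samples):
    """List series whose +Inf bucket differs from the matching .count metric."""
    counts = {
        (name, tuple(sorted(tags))): value
        for name, tags, value in samples
        if name.endswith(".count")
    }
    mismatches = []
    for (name, others), points in bucket_series(samples).items():
        count_key = (name[: -len(".bucket")] + ".count", others)
        inf_values = [value for bound, value in points if math.isinf(bound)]
        # Skip when either side is missing, mirroring the skip-when-missing path.
        if inf_values and count_key in counts and inf_values[0] != counts[count_key]:
            mismatches.append((name, others))
    return mismatches


good = [
    ("req.duration.bucket", ["upper_bound:0.5", "path:/a"], 3),
    ("req.duration.bucket", ["upper_bound:1.0", "path:/a"], 5),
    ("req.duration.bucket", ["upper_bound:+Inf", "path:/a"], 7),
    ("req.duration.count", ["path:/a"], 7),
]
bad = [
    ("req.duration.bucket", ["upper_bound:0.5"], 4),
    ("req.duration.bucket", ["upper_bound:1.0"], 2),
    ("req.duration.bucket", ["upper_bound:+Inf"], 6),
    ("req.duration.count", [], 9),
]
```

With the made-up data above, `good` passes both checks, while `bad` trips the monotonicity check between the 0.5 and 1.0 buckets and the +Inf check (6 versus a count of 9).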
…erties

Mirror the five properties introduced in bc18f2b into the report renderer's parallel tables:

- PROPERTY_DEFINITIONS: human labels for openmetrics-line-endings, openmetrics-sample-whitespace, http-response-header-casing, histogram-bucket-monotonicity, histogram-inf-equals-count.
- PROPERTY_VALIDATION_FAMILIES: classify the three OpenMetrics/HTTP mutations as replay-metamorphic and the two histogram invariants as replay-regression so the appendix family table counts them correctly.
- TEST_DEFINITIONS: labels for the five new pytest entry points so the collapsed 'same N checks' hint and the Evidence fallback render with crafted names instead of auto-titlecased snake_case.
- classify(): add substring matches so failures of the new tests land in 'openmetrics-input-invariance' (mutations) or 'invalid-metric-values' (histograms) rather than the catch-all 'other-failed' bucket.
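For readers unfamiliar with why the OpenMetrics/HTTP mutations count as replay-metamorphic, here is a minimal sketch of the idea: mutate the input in a way that should be semantically neutral, then assert the parsed result is unchanged. The functions below are hypothetical stand-ins written for this illustration, not the PR's toggle_line_endings or _flip_header_case, and `parse_samples` is a toy parser rather than the real normalizer.

```python
from typing import Optional


def swap_line_endings(body: str) -> str:
    """Switch LF to CRLF, or CRLF back to LF (illustrative stand-in)."""
    if "\r\n" in body:
        return body.replace("\r\n", "\n")
    return body.replace("\n", "\r\n")


def flip_header_names(headers: dict) -> Optional[dict]:
    """Swap the case of every header name; decline (None) on collisions."""
    lowered = [name.lower() for name in headers]
    if len(set(lowered)) != len(lowered):
        # Two names differ only by case, so flipping could merge them.
        return None
    return {name.swapcase(): value for name, value in headers.items()}


def parse_samples(body: str) -> list:
    """Toy parser: one (name-with-labels, value) pair per sample line."""
    samples = []
    for line in body.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        name, value = line.rsplit(None, 1)
        samples.append((name, float(value)))
    return samples


body = '# HELP up Target is up\nreq_total{code="200"} 5\nup 1\n'
mutated = swap_line_endings(body)
flipped = flip_header_names({"Content-Type": "text/plain"})
declined = flip_header_names({"X-A": "1", "x-a": "2"})
```

The metamorphic assertion is then simply that `parse_samples(mutated)` equals `parse_samples(body)`; a check whose output changes under such a mutation is depending on something the format does not guarantee.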
New ddev env compare-agent command + datadog_checks_dev replay/agent subpackage.

Drives the real Datadog Agent binary against two images, recording HTTP/subprocess/etc. via an in-Agent monkeypatch shim and capturing three probe outputs per run:

- agent integration freeze -> freeze.diff.json (IR-53148 oracle)
- agent diagnose inventory -> inventory.diff.json
- agent check --check-rate -> check.diff.json (behavioural)

The shim is a self-contained ddev_shim package mounted into the Agent container's embedded3 site-packages; sitecustomize.py + a .pth file activate it at every interpreter start. Adapter modules are copied live from datadog_checks_dev/datadog_checks/dev/replay/adapters/ with their internal imports rewritten so the no-Agent and in-Agent code paths cannot drift.

Verified locally: openmetrics Python check end-to-end (record/replay deterministic, 2 HTTP scrapes captured); IR-53148 negative control reproduces the five missing manifestless integrations across datadog/agent:7.78.0 -> 7.78.1.
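To make the freeze probe concrete, the sketch below diffs two package listings and reports what disappeared, appeared, or changed version. It assumes pip-style name==version lines and an invented result shape; the actual freeze output format and the layout of freeze.diff.json are not shown in this PR page, and the package names are made up.

```python
def parse_freeze(text):
    """Parse 'name==version' lines into a dict, ignoring anything else."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line:
            name, version = line.split("==", 1)
            packages[name] = version
    return packages


def diff_freeze(before, after):
    """Compare two freeze listings (hypothetical result shape)."""
    old, new = parse_freeze(before), parse_freeze(after)
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "changed": {
            name: [old[name], new[name]]
            for name in sorted(set(old) & set(new))
            if old[name] != new[name]
        },
    }


# Made-up listings for two image versions.
record_image = "datadog-alpha==1.0.0\ndatadog-beta==2.1.0\ndatadog-gamma==0.3.0\n"
replay_image = "datadog-alpha==1.0.1\ndatadog-gamma==0.3.0\n"
freeze_diff = diff_freeze(record_image, replay_image)
```

A non-empty "removed" list is the kind of signal the negative control above relies on: a package present in the older image and absent from the newer one.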
- replay-pbt-matrix.py: new --runner / --record-image / --replay-image flags. Each matrix row carries 'runner', a runner-prefixed artifact_slug, and a runner-segmented cache_key so agent and no-agent caches do not cross-pollinate. JMX integrations are filtered out of the agent runner (the in-Agent shim has no insertion point in JMXFetch).
- zz-test-worker-poc.yaml: new 'runner' / 'record_image' / 'replay_image' workflow_dispatch inputs. New steps 'Pull Agent images (agent runner only)' and 'Run compare-agent (agent runner only)'; existing replay-pbt cache + run steps gated to runner != agent.

Selected count: 227 sibling targets for the agent runner. Splittable across batches with the existing dispatch_batches=true machinery.
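A rough sketch of the matrix-row shape described above: each row carries the runner, a runner-prefixed slug, and a runner-segmented cache key, and JMX integrations are skipped for the agent runner. The key and slug formats, the 'no-agent' runner value, and the contents of the JMX set are assumptions for illustration, not what replay-pbt-matrix.py actually emits.

```python
import re

# Placeholder set; the real list of JMX integrations is not given in the PR text.
JMX_INTEGRATIONS = {"kafka", "cassandra"}


def slugify(text):
    """Reduce a string to characters that are safe in an artifact name."""
    return re.sub(r"[^A-Za-z0-9_.-]+", "-", text).strip("-")


def build_matrix(targets, runner):
    """Build one row per (integration, environment) target for a runner."""
    rows = []
    for integration, environment in targets:
        if runner == "agent" and integration in JMX_INTEGRATIONS:
            continue  # no insertion point for the shim in JMXFetch
        rows.append(
            {
                "integration": integration,
                "environment": environment,
                "runner": runner,
                "artifact_slug": slugify(f"{runner}-{integration}-{environment}"),
                "cache_key": f"replay/{runner}/{integration}/{environment}",
            }
        )
    return rows


targets = [("openmetrics", "py3.12"), ("kafka", "py3.12-3.6")]
agent_rows = build_matrix(targets, "agent")
no_agent_rows = build_matrix(targets, "no-agent")
```

Because the runner is part of every cache key, the two runners can never restore each other's caches under this scheme.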
…dispatcher

The dispatch-batches step constructs baseInputs from a fixed list of env-passthrough fields. When I added the runner/record_image/replay_image workflow inputs in the previous commit, I forgot to also extend the dispatcher's env + baseInputs to forward them to the spawned batch runs. Without this, batches dispatched with runner=agent inherit the empty default and fall through to the no-agent code path.

Fixes the gate observed on run group replay-pbt-26528436414 where the 'Run compare-agent' step was skipped on all 8 batches.
The compare-agent runner now supports the same --replay-cache option as compare-check, with 'latest'/'auto' semantics scanning .ddev/replay/<integration>/<environment>/*. When a cache is provided, both Agent images run in replay mode against the same seeded fixture instead of recording fresh:

- no dd_environment startup needed,
- both Agents see identical inputs (pure behavioural diff),
- a single seed run (via compare-check) can feed both runners.

The dispatcher cache_key drops the runner segment so agent and no-agent share the cache namespace. Worker workflow gains a Restore + Seed pair for the agent runner that mirrors the no-agent flow; the Save step preserves freshly seeded caches.

Validated locally: openmetrics:cachetest with --replay-cache against an existing fixture, record image 7.76.2, replay image 7.77.0. freeze.diff shows real cross-version package delta; check.diff equal because both Agents replayed the same fixture; fixture_source=cache in run_summary.json confirming the record run was skipped.
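One way to picture the 'latest'/'auto' resolution is sketched below. The semantics are assumed, since the commit message does not spell them out: here 'latest' requires a cache and fails if none exists, 'auto' returns None so the caller can fall back to recording, and "newest" means the last directory name in sorted order. The function name is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_replay_cache(root: Path, integration: str, environment: str, option: str) -> Optional[Path]:
    """Resolve a --replay-cache style value to a directory (assumed semantics)."""
    if option not in ("latest", "auto"):
        return Path(option)  # treat anything else as an explicit path
    base = root / ".ddev" / "replay" / integration / environment
    candidates = sorted(p for p in base.iterdir() if p.is_dir()) if base.is_dir() else []
    if candidates:
        return candidates[-1]  # assumption: names sort chronologically
    if option == "latest":
        raise FileNotFoundError(f"no replay cache under {base}")
    return None  # 'auto': the caller may record instead


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    empty_auto = resolve_replay_cache(root, "openmetrics", "cachetest", "auto")
    base = root / ".ddev" / "replay" / "openmetrics" / "cachetest"
    (base / "20240101").mkdir(parents=True)
    (base / "20240201").mkdir()
    latest_name = resolve_replay_cache(root, "openmetrics", "cachetest", "latest").name
    explicit = resolve_replay_cache(root, "openmetrics", "cachetest", "some/dir")
```

Returning None for 'auto' keeps the record-mode fallback in the caller's hands, which matches the intent that a missing cache should not block a run.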
The previous worker change made the 'Seed cache if missing' step exit 1 on compare-check failure (set -euo pipefail + ddev env compare-check). GitHub Actions' default success() gating then skipped 'Run compare-agent' entirely, removing the record-mode fallback that allowed agent jobs to succeed when no cache existed.

Fix:

- Seed step: continue-on-error: true; explicit branching on compare-check exit code that writes seeded=false and exits 1 (still surfaces the failure in the UI but does not block downstream).
- Compare-agent step: if condition explicitly accepts seed conclusion of 'success' or 'failure'. The script's use_cache logic already handles both cases.

This restores the 89% pass rate baseline observed before cache-reuse was introduced, while keeping cache-reuse for cold-warmed targets.
Summary
This PR adds a replay-PBT CI proof of concept for cached integration replay testing.
Highlights:
- zz-test-worker-poc.yaml workflow as a parallel Replay PBT POC runner.

CI POC behavior
The manual POC workflow supports:
- changed, all-cached, and all-declared modes
- smoke vs all property sets

Security review notes
Before opening this draft, I reviewed the changed workflows for common workflow security issues:
- pull_request_target usage.
- contents: read permissions.
- github-script, dynamic script execution, eval, or curl-to-shell pattern added.
- git fetch / matrix generation.
- artifact_slug.

Local validation
- replay-pbt-matrix.py compiles.
- all-declared without sharding fails loudly when over max_targets.
- all-declared with shard_count=2 emits 192 targets for shard 0.
- 17 passed, 10 skipped.
- 13 passed, 6 skipped.

Notes
This is intentionally a draft POC. The full manual replay-PBT workflow should be exercised first with a tiny capped run before any full seeded or all-property run.
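The sharding behaviour listed under Local validation (fail loudly when the selection exceeds max_targets, otherwise emit only the requested shard) could look roughly like the sketch below. The round-robin assignment, the function name, and the error type are assumptions; the PR text does not say how replay-pbt-matrix.py splits targets.

```python
def select_shard(targets, shard_index=0, shard_count=1, max_targets=None):
    """Pick one shard of the sorted targets and enforce an optional cap."""
    if not 0 <= shard_index < shard_count:
        raise ValueError(f"shard_index {shard_index} out of range for {shard_count} shards")
    # Assumed strategy: round-robin over the sorted list for stable shards.
    selected = sorted(targets)[shard_index::shard_count]
    if max_targets is not None and len(selected) > max_targets:
        raise ValueError(
            f"{len(selected)} targets selected but max_targets is {max_targets}; "
            "raise the cap or increase shard_count"
        )
    return selected


all_targets = ["a", "b", "c", "d", "e"]

# Without sharding the cap is exceeded, so the call fails instead of truncating.
try:
    select_shard(all_targets, max_targets=3)
    unsharded_error = None
except ValueError as error:
    unsharded_error = str(error)

# With two shards, shard 0 holds every other target and fits under the cap.
shard_zero = select_shard(all_targets, shard_index=0, shard_count=2, max_targets=3)
```

Raising rather than silently truncating is the point of the "fails loudly" note: a capped run should never look like a complete one.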