Adapt sync-cli to support synthetics PLs replication for DDR #526

Open
melkouri wants to merge 6 commits into main from malak.elkouri/SYNTH-26118/update-PL-creation-for-ddr

Conversation

@melkouri
Contributor

@melkouri melkouri commented Apr 17, 2026

What does this PR do?

  1. Adds DDR (Disaster Recovery) support to the synthetics_private_locations resource, enabling PL replication from a source org (R1) to a destination org (R2).

Related dogweb PRs:

RFC: https://docs.google.com/document/d/1pxwa2vqa5I_NkhWlhXlcuFsDgSbBVQNdeXFEQ1vzQSM/edit?tab=t.0#heading=h.tmwe8zpx1a8q

  2. Fixes the reset command to delete only resources managed by the sync-cli, not pre-existing resources in R2.

  3. Migrates synthetics_tests off skip_resource_mapping=True to use the map_existing_resources infrastructure. This avoids duplicating tests across repeated sync runs: a run that finds a destination test already stamped with a known source_public_id adopts it into state instead of POSTing a duplicate.
    This guards against orphan tests when a prior sync POST succeeded server-side but the client never received the response (504-after-success under destination API load).

Description of the Change

New CLI option: --datadog-host-override

  • Added to constants.py (DD_DATADOG_HOST_OVERRIDE env var), options.py (CLI flag), and configuration.py (threaded to the Configuration dataclass).
  • Optional CNAME override passed to the DDR create endpoint for DNS failover.
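
A minimal sketch of how the override could be resolved and threaded into configuration. Only the DD_DATADOG_HOST_OVERRIDE env var name and the datadog_host_override field come from this PR; the trimmed-down Configuration class, the resolve_host_override helper, and the "CLI flag wins over env var" precedence are assumptions for illustration, not the actual sync-cli code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Env var name from the PR description; everything else here is illustrative.
DD_DATADOG_HOST_OVERRIDE = "DD_DATADOG_HOST_OVERRIDE"


@dataclass
class Configuration:
    # Hypothetical stand-in for the real Configuration dataclass.
    datadog_host_override: Optional[str] = None


def resolve_host_override(
    cli_value: Optional[str], env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Prefer the CLI flag, fall back to the env var, treat blanks as unset."""
    value = cli_value or env.get(DD_DATADOG_HOST_OVERRIDE) or ""
    return value.strip() or None


def build_configuration(
    cli_value: Optional[str], env: Mapping[str, str] = os.environ
) -> Configuration:
    # The override stays None when neither the flag nor the env var is set,
    # so callers can simply omit it from the DDR create request.
    return Configuration(datadog_host_override=resolve_host_override(cli_value, env))
```

Because the option is optional, keeping None as the "unset" value lets the create path skip the field entirely rather than send an empty string.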

Updated synthetics_private_locations.py:

  1. excluded_attributes : added ddr_metadata, pl_id, and public_key_test. ddr_metadata is returned on the destination for DDR PLs and must not be diffed against the source. pl_id and public_key_test are returned by the source-side GET (with include_pl_info=true) and are only used at create time, so they must not show up as diffs.

  2. import_resource() : calls GET /api/v1/synthetics/private-locations/{id}?include_pl_info=true so the source state captures pl_id and public_key_test at import time. No extra fetch at sync time.

  3. create_resource() : when a source PL is being created at the destination:

    • Reads pl_id + public_key_test from the source state (already captured at import).
    • Strips null metadata from the request body (DDR endpoint rejects it).
    • Injects ddr_metadata.disaster_recovery with source_pl_id and source_name. The dogweb schema is strict (additionalProperties: false) and accepts only these two fields.
    • Sets test_encryption_public_key to the JSON-stringified public_key_test object ({pem, fingerprint, id}).
    • Optionally sets datadog_host_override.
    • Parses the DDR response: returns resp["private_location"].
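
The create_resource() steps above can be sketched as a pure function that turns a source-state PL into a DDR create body. The field names ddr_metadata, disaster_recovery, source_pl_id, source_name, test_encryption_public_key, datadog_host_override, pl_id, and public_key_test come from this description. The function names, reading source_name from the PL's "name" key, removing pl_id and public_key_test from the outgoing body, and placing datadog_host_override at the top level are assumptions.

```python
import copy
import json
from typing import Any, Dict, Optional


def build_ddr_create_body(
    source_pl: Dict[str, Any], datadog_host_override: Optional[str] = None
) -> Dict[str, Any]:
    """Hypothetical helper: build the DDR create request body from source state."""
    body = copy.deepcopy(source_pl)  # never mutate the stored source state

    # Both values were captured at import time via include_pl_info=true.
    # Assumption: they are not sent back as top-level fields.
    pl_id = body.pop("pl_id")
    public_key_test = body.pop("public_key_test")

    # The DDR endpoint rejects a null metadata field, so drop it.
    if body.get("metadata") is None:
        body.pop("metadata", None)

    # The schema is strict: only these two fields are accepted.
    body["ddr_metadata"] = {
        "disaster_recovery": {
            "source_pl_id": pl_id,
            "source_name": body["name"],  # assumption: name lives under "name"
        }
    }

    # JSON-stringified {pem, fingerprint, id} object.
    body["test_encryption_public_key"] = json.dumps(public_key_test)

    if datadog_host_override:
        body["datadog_host_override"] = datadog_host_override
    return body


def parse_ddr_create_response(resp: Dict[str, Any]) -> Dict[str, Any]:
    # The DDR endpoint wraps the created PL under "private_location".
    return resp["private_location"]
```

Keeping the body construction separate from the HTTP call makes the schema constraints easy to unit test without cassettes.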

Updated synthetics_tests.py:

  1. resource_config: replaced skip_resource_mapping=True with resource_mapping_key="metadata.disaster_recovery.source_public_id". That stamp is already injected on every source test by pre_resource_action_hook, so it doubles as the dedup key.

  2. map_existing_resources() override: fetches destination tests via GET /api/v1/synthetics/tests?include_metadata=true and indexes them by source_public_id. Custom override avoids the destination-versions side-effect that the existing get_resources() has.

  3. create_resource() dedup short-circuit: before issuing a new POST, checks if a destination test already carries this source_public_id. If yes, adopts it into destination state and calls update_resource() instead of POSTing. Prevents new duplicates when re-syncing after a prior orphan-creating run.
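
As a rough illustration of the dedup flow described above, the sketch below indexes destination tests by their source_public_id stamp and then either adopts an existing test or creates a new one. The helper names, the injected post and update callables, and the dict used for destination state are hypothetical; only the metadata.disaster_recovery.source_public_id key and the adopt-then-update behavior come from this PR. It assumes every source test already carries the stamp.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

Test = Dict[str, Any]


def source_public_id_of(test: Test) -> Optional[str]:
    """Read metadata.disaster_recovery.source_public_id, tolerating gaps."""
    metadata = test.get("metadata") or {}
    disaster_recovery = metadata.get("disaster_recovery") or {}
    return disaster_recovery.get("source_public_id")


def index_by_source_public_id(destination_tests: Iterable[Test]) -> Dict[str, Test]:
    """Build the dedup index; unstamped (manually created) tests are skipped."""
    index: Dict[str, Test] = {}
    for test in destination_tests:
        key = source_public_id_of(test)
        if key is not None:
            index.setdefault(key, test)  # keep the first match if duplicates exist
    return index


def create_or_adopt(
    source_test: Test,
    existing: Dict[str, Test],
    destination_state: Dict[str, Test],
    post: Callable[[Test], Test],
    update: Callable[[Test, Test], Test],
) -> Tuple[str, Test]:
    """Adopt a stamped destination test if one exists, otherwise POST a new one."""
    key = source_public_id_of(source_test)
    match = existing.get(key) if key is not None else None
    if match is not None:
        destination_state[key] = match  # adopt into state before updating
        return "adopted", update(match, source_test)
    created = post(source_test)
    destination_state[key] = created
    return "created", created
```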

Test fixture update: moved synthetics_tests from OPT_OUT_RESOURCES to MAPPING_RESOURCES in tests/unit/test_map_existing_resources.py.

Operational guidance

When running against staging (app.datad0g.com), throttle concurrency to avoid McNulty pool exhaustion (512 status codes; see Confluence):

--max-workers 20 --http-client-retry-timeout 180

The default --max-workers 100 saturates dogweb's gunicorn pool on staging, producing 512s and 504s. The 504s are particularly harmful on POST because dogweb may have written the resource server-side before timing out, and the client's retry then duplicates it. Throttling keeps the pool below saturation, which avoids the 504s behind this in-POST duplication path. Prod has more headroom and can usually run at the default.

Known limitations

In-POST retry duplication: if a single POST attempt 504s and the client retries, dogweb may have already created the resource, so each retry can create another server-side row. The throttling guidance above is the operational mitigation; a proper fix (skipping 504 retries on POST in custom_client.py) is out of scope for this PR.
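
For reference, the out-of-scope fix could look roughly like the retry predicate below. This is a hypothetical sketch and does not reflect how custom_client.py is structured; the should_retry name and the set of retryable status codes are assumptions. The only behavior taken from this PR is the idea of not retrying a 504 on POST.

```python
# Assumed set of statuses a client would normally retry; not taken from custom_client.py.
RETRYABLE_STATUSES = frozenset({429, 502, 503, 504})

# Methods for which a timed-out request may already have created a resource.
NON_IDEMPOTENT_METHODS = frozenset({"POST"})


def should_retry(method: str, status: int) -> bool:
    """Decide whether a failed request is safe to retry."""
    if status not in RETRYABLE_STATUSES:
        return False
    # A 504 on POST may mean the server wrote the resource before timing out,
    # so a retry risks creating a duplicate row.
    if status == 504 and method.upper() in NON_IDEMPOTENT_METHODS:
        return False
    return True
```

With a predicate like this, a 504 on POST surfaces as a failure for that resource, and the next sync run can adopt the server-side copy through the source_public_id dedup path instead of duplicating it.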

Verification Process

Tested end-to-end on staging (app.datad0g.com) between two orgs with --max-workers 20 --http-client-retry-timeout 180:

  1. Created PLs and tests in source org (R1).
  2. Ran import: confirmed source state captured pl_id and public_key_test via include_pl_info=true, and that every R1 test landed in resources/source/synthetics_tests.json.
  3. Ran sync: confirmed PLs created via the DDR endpoint with the right schema, tests created in R2 with metadata.disaster_recovery stamps, and destination state matches R2 counts after throttled runs.
  4. Verified end-to-end DDR replication: manually unpaused one of the replicated tests in R2 and confirmed synthetics_private_locations_check_assignment populated and the test began running on the replicated PL.
  5. Ran a second sync after dropping a destination state entry, and confirmed the dedup short-circuit adopts the existing R2 test instead of creating a duplicate.
  6. Ran reset: confirmed only sync-cli-managed resources were deleted; pre-existing (manually-created) R2 resources were untouched.

PLs created in R2:
image

with assigned tests:
image

When the test is running, it's assigned to the PL:
image

Release Notes

  • Added DDR (Disaster Recovery) support for synthetics_private_locations.
  • New --datadog-host-override CLI option for optional CNAME override during PL replication.
  • Migrated synthetics_tests to map_existing_resources with cross-run dedup against metadata.disaster_recovery.source_public_id. Re-running sync after a degraded run no longer accumulates duplicate tests in the destination org.
  • Fixed reset to only delete resources managed by sync-cli.

melkouri and others added 2 commits May 27, 2026 13:25
Resolves conflict in synthetics_private_locations.py: keep DDR
include_pl_info=true query param while adopting main's new
state.set_source() API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@datadog-prod-us1-5

datadog-prod-us1-5 Bot commented May 27, 2026

Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

Run Tests | test (macos-latest)   View in Datadog   GitHub Actions

See error Can't overwrite existing cassette in your current record mode ('none').

Run Tests | test (ubuntu-latest)   View in Datadog   GitHub Actions

See error Can't overwrite existing cassette in current record mode ('none'). No match for the request found.

Commit SHA: a8dc5f4

…un dedup

Previously synthetics_tests had skip_resource_mapping=True, so the sync
had no way to detect a destination test that already existed when state
didn't know about it. Under degraded destination APIs (504/5xx storms
during McNulty pool exhaustion on staging), repeated sync runs were
accumulating duplicate tests in R2: a prior sync's POST succeeded
server-side, the client exhausted retries and never persisted the
destination id, and the next sync's create branch POSTed again.

This migration:
- Replaces skip_resource_mapping with
  resource_mapping_key="metadata.disaster_recovery.source_public_id",
  the stamp pre_resource_action_hook already writes on every source
  test before create/update.
- Adds a map_existing_resources() override that LISTs destination tests
  with include_metadata=true and indexes them by source_public_id,
  without the destination-versions side-effect that get_resources has.
- Adds a short-circuit at the top of create_resource: if R2 already has
  a test stamped with this source_public_id, adopt it into state and
  call update_resource instead of POSTing a new copy.

This does not address in-POST retry duplications (those happen mid-POST,
after the existing_resources_map is built and frozen). Operational
mitigation for staging-degraded conditions:
  --max-workers 20 --http-client-retry-timeout 180

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@melkouri melkouri marked this pull request as ready for review June 3, 2026 12:24
@melkouri melkouri requested a review from a team as a code owner June 3, 2026 12:24
@melkouri
Contributor Author

melkouri commented Jun 3, 2026

TODO: Add a guideline on how to sync Synthetics Private Locations for customers.
