Adapt sync-cli to support synthetics PLs replication for DDR#526
Open
melkouri wants to merge 6 commits into
Resolves conflict in synthetics_private_locations.py: keep DDR include_pl_info=true query param while adopting main's new state.set_source() API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…un dedup

Previously synthetics_tests had skip_resource_mapping=True, so the sync had no way to detect a destination test that already existed when state didn't know about it. Under degraded destination APIs (504/5xx storms during McNulty pool exhaustion on staging), repeated sync runs were accumulating duplicate tests in R2: a prior sync's POST succeeded server-side, the client exhausted retries and never persisted the destination id, and the next sync's create branch POSTed again.

This migration:
- Replaces skip_resource_mapping with resource_mapping_key="metadata.disaster_recovery.source_public_id", the stamp pre_resource_action_hook already writes on every source test before create/update.
- Adds a map_existing_resources() override that LISTs destination tests with include_metadata=true and indexes them by source_public_id, without the destination-versions side-effect that get_resources has.
- Adds a short-circuit at the top of create_resource: if R2 already has a test stamped with this source_public_id, adopt it into state and call update_resource instead of POSTing a new copy.

This does not address in-POST retry duplications (those happen mid-POST, after the existing_resources_map is built and frozen).

Operational mitigation for staging-degraded conditions: --max-workers 20 --http-client-retry-timeout 180

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TODO: Add a guideline on how to sync Synthetics Private Locations for customers.
What does this PR do?
`synthetics_private_locations` resource, enabling PL replication from a source org (R1) to a destination org (R2).
Related dogweb PRs:
- `include_pl_info`
RFC: https://docs.google.com/document/d/1pxwa2vqa5I_NkhWlhXlcuFsDgSbBVQNdeXFEQ1vzQSM/edit?tab=t.0#heading=h.tmwe8zpx1a8q
Fixes the `reset` command to delete only resources managed by sync-cli, not pre-existing resources in R2.
Migrates `synthetics_tests` off `skip_resource_mapping=True` to use the `map_existing_resources` infrastructure. This avoids duplicating tests when sync commands are run multiple times: a sync run that finds a destination test already stamped with a known `source_public_id` adopts it into state instead of POSTing a duplicate. This guards against orphan tests when a prior sync's POST succeeded server-side but the client never received the response (504-after-success under destination API load).
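The adopt-instead-of-duplicate behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the sync-cli implementation: `get_nested`, `index_by_source_id`, `create_or_adopt`, and the plain-dict `state` are hypothetical names invented for illustration, and the POST is replaced by a local copy. Only the mapping key `metadata.disaster_recovery.source_public_id` comes from this PR.

```python
from typing import Any, Dict, Iterable, Optional

# Mapping key named in this PR; everything else below is a hypothetical stand-in.
MAPPING_KEY = "metadata.disaster_recovery.source_public_id"


def get_nested(obj: Dict[str, Any], dotted_key: str) -> Optional[Any]:
    """Walk a dotted path; return None if any segment is missing or not a dict."""
    current: Any = obj
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def index_by_source_id(destination_tests: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Index destination tests by the source id stamped in their metadata."""
    index: Dict[str, Dict[str, Any]] = {}
    for test in destination_tests:
        source_id = get_nested(test, MAPPING_KEY)
        if source_id is not None:  # unstamped (manually created) tests are skipped
            index[source_id] = test
    return index


def create_or_adopt(
    source_id: str,
    source_test: Dict[str, Any],
    existing: Dict[str, Dict[str, Any]],
    state: Dict[str, Dict[str, Any]],
) -> str:
    """Adopt an already-stamped destination test, otherwise 'create' a new one."""
    match = existing.get(source_id)
    if match is not None:
        state[source_id] = match  # adopt into state; a real sync would then update it
        return "adopted"
    state[source_id] = dict(source_test)  # stand-in for the POST to the destination
    return "created"
```

Because the index is built once before any create runs, this kind of check only covers duplicates left by earlier runs, not retries inside a single POST.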
Description of the Change
New CLI option: `--datadog-host-override`
`constants.py` (`DD_DATADOG_HOST_OVERRIDE` env var), `options.py` (CLI flag), and `configuration.py` (threaded to the `Configuration` dataclass).
Updated `synthetics_private_locations.py`:
- `excluded_attributes`: added `ddr_metadata`, `pl_id`, and `public_key_test`. `ddr_metadata` is returned on the destination for DDR PLs and must not be diffed against the source. `pl_id` and `public_key_test` are returned by the source-side GET (with `include_pl_info=true`) and are only used at create time, so they must not show up as diffs.
- `import_resource()`: calls `GET /api/v1/synthetics/private-locations/{id}?include_pl_info=true` so the source state captures `pl_id` and `public_key_test` at import time. No extra fetch at sync time.
- `create_resource()`: when a source PL is being created at the destination:
  - `pl_id` + `public_key_test` from the source state (already captured at import).
  - `null` metadata from the request body (DDR endpoint rejects it).
  - `ddr_metadata.disaster_recovery` with `source_pl_id` and `source_name`. The dogweb schema is strict (`additionalProperties: false`) and accepts only these two fields.
  - `test_encryption_public_key` to the JSON-stringified `public_key_test` object (`{pem, fingerprint, id}`).
  - `datadog_host_override`.
  - `resp["private_location"]`
Updated `synthetics_tests.py`:
- `resource_config`: replaced `skip_resource_mapping=True` with `resource_mapping_key="metadata.disaster_recovery.source_public_id"`. That stamp is already injected on every source test by `pre_resource_action_hook`, so it doubles as the dedup key.
- `map_existing_resources()` override: fetches destination tests via `GET /api/v1/synthetics/tests?include_metadata=true` and indexes them by `source_public_id`. Custom override avoids the destination-versions side-effect that the existing `get_resources()` has.
- `create_resource()` dedup short-circuit: before issuing a new POST, checks if a destination test already carries this `source_public_id`. If yes, adopts it into destination state and calls `update_resource()` instead of POSTing. Prevents new duplicates when re-syncing after a prior orphan-creating run.
Test fixture update: moved `synthetics_tests` from `OPT_OUT_RESOURCES` to `MAPPING_RESOURCES` in `tests/unit/test_map_existing_resources.py`.
Operational guidance
When running against staging (`app.datad0g.com`), throttle concurrency to avoid McNulty pool exhaustion (512 status codes; see Confluence), for example with `--max-workers 20 --http-client-retry-timeout 180`.
The default `--max-workers 100` saturates dogweb's gunicorn pool on staging, producing 512s and 504s. The 504s are particularly bad on POST because dogweb may have written the resource server-side before timing out, and the client's retry then duplicates it. Throttling keeps the pool from saturating and mitigates this in-POST duplication path. Prod has more headroom and can usually run at the default.
Known limitations
In-POST retry duplication: if a single POST attempt returns a 504 and the client retries, dogweb may have already created the resource, so each retry can create another server-side row. The throttling guidance above is the operational mitigation; a proper fix (skipping 504 retries on POST in `custom_client.py`) is out of scope for this PR.
Verification Process
Tested end-to-end on staging (`app.datad0g.com`) between two orgs with `--max-workers 20 --http-client-retry-timeout 180`:
- `import`: confirmed source state captured `pl_id` and `public_key_test` via `include_pl_info=true`, and that every R1 test landed in `resources/source/synthetics_tests.json`.
- `sync`: confirmed PLs created via the DDR endpoint with the right schema, tests created in R2 with `metadata.disaster_recovery` stamps, and destination state matches R2 counts after throttled runs.
- `synthetics_private_locations_check_assignment` populated and the test began running on the replicated PL.
- `sync` after dropping a destination state entry, and confirmed the dedup short-circuit adopts the existing R2 test instead of creating a duplicate.
- `reset`: confirmed only sync-cli-managed resources were deleted; pre-existing (manually-created) R2 resources were untouched.
PLs created in R2:

with assigned tests:

When the test is running, it's assigned to the PL

Release Notes
- `synthetics_private_locations`.
- `--datadog-host-override` CLI option for optional CNAME override during PL replication.
- `synthetics_tests` to `map_existing_resources` with cross-run dedup against `metadata.disaster_recovery.source_public_id`. Re-running sync after a degraded run no longer accumulates duplicate tests in the destination org.
- `reset` to only delete resources managed by sync-cli.
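For reviewers who want a concrete picture of the create-time body described for `create_resource()` in `synthetics_private_locations.py`, here is a rough sketch. It is illustrative only: `build_ddr_create_body` is a hypothetical helper, the field names come from the description above, and two details are assumptions rather than facts from this PR: that `source_name` is taken from the source PL's `name`, and that `pl_id` and `public_key_test` are dropped from the outgoing body.

```python
import json
from typing import Any, Dict


def build_ddr_create_body(source_pl: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical sketch of a DDR private-location create body."""
    # Copy without the create-time-only fields (assumption: they are not sent as-is).
    body = {k: v for k, v in source_pl.items() if k not in ("pl_id", "public_key_test")}

    # The description says the DDR endpoint rejects null metadata, so drop it.
    if body.get("metadata") is None:
        body.pop("metadata", None)

    # Strict schema: only these two fields under disaster_recovery.
    body["ddr_metadata"] = {
        "disaster_recovery": {
            "source_pl_id": source_pl["pl_id"],
            "source_name": source_pl["name"],  # assumption: mapped from the PL name
        }
    }

    # JSON-stringified {pem, fingerprint, id} object.
    body["test_encryption_public_key"] = json.dumps(source_pl["public_key_test"])
    return body
```

The helper builds a new dict, so the source state captured at import is left unchanged.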