Skip to content

Detect active deployments before provisioning#7251

Draft
spboyer wants to merge 11 commits intomainfrom
fix/deployment-active-conflict
Draft

Detect active deployments before provisioning#7251
spboyer wants to merge 11 commits intomainfrom
fix/deployment-active-conflict

Conversation

@spboyer
Copy link
Member

@spboyer spboyer commented Mar 23, 2026

Summary

Fixes #7248

Before starting a deployment, azd now checks for active deployments on the target scope. If another deployment is in progress, it warns the user and waits for it to complete — avoiding the DeploymentActive ARM error that wastes ~5 minutes of the user's time.

Telemetry Context

  • 199 DeploymentActive failures in March (~270/month projected)
  • Average wait before failure: 5.3 minutes (P90: 12.2 min)
  • 78% from provision, 19% from up

Changes

Pre-deployment active check (bicep_provider.go)

Added waitForActiveDeployments() between preflight validation and deployment submission:

  • Lists deployments filtered for active provisioning states
  • If found: warns with deployment names, polls at 30s intervals
  • Timeout: 30 minutes (matches typical long deployments)
  • Only ignores ErrDeploymentsNotFound (scope doesn't exist yet); other errors propagate

Active state classification (deployments.go)

IsActiveDeploymentState() classifies 11 provisioning states as active, including transitional states (Canceling, Deleting, DeletingResources, UpdatingDenyAssignments) that can still block new deployments.

Scope interface (scope.go)

Added ListActiveDeployments() to both ResourceGroupScope and SubscriptionScope.

Error suggestion (error_suggestions.yaml)

Added DeploymentActive rule with user-friendly message and ARM troubleshooting link.

Test Coverage (8 tests, 24 subtests)

Test Coverage
TestIsActiveDeploymentState 17 subtests covering all provisioning states
TestWaitForActiveDeployments_NoActive Happy path
TestWaitForActiveDeployments_InitialListError_NotFound RG doesn't exist yet
TestWaitForActiveDeployments_InitialListError_Other Auth/throttle errors propagate
TestWaitForActiveDeployments_ActiveThenClear Polling until clear
TestWaitForActiveDeployments_CancelledContext Context cancellation
TestWaitForActiveDeployments_PollError Error during polling
TestWaitForActiveDeployments_Timeout 30min timeout

Related

Copilot AI review requested due to automatic review settings March 23, 2026 15:17
@spboyer spboyer added the bug Something isn't working label Mar 23, 2026
@spboyer spboyer self-assigned this Mar 23, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a pre-deployment check that detects in-progress ARM deployments at the target scope and waits for them to complete, avoiding DeploymentActive failures during provisioning.

Changes:

  • Introduces waitForActiveDeployments() in the Bicep provisioning flow and polls until deployments clear or a timeout is reached.
  • Adds IsActiveDeploymentState() plus new tests to classify which provisioning states are considered “active”.
  • Extends infra.Scope with ListActiveDeployments() and adds a DeploymentActive error suggestion rule.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
cli/azd/resources/error_suggestions.yaml Adds a user-facing suggestion for DeploymentActive ARM errors.
cli/azd/pkg/infra/scope.go Extends scope interface + implements ListActiveDeployments() for RG and subscription scopes.
cli/azd/pkg/infra/provisioning/bicep/bicep_provider.go Adds wait loop before deployment submission, with polling/timeout defaults.
cli/azd/pkg/infra/provisioning/bicep/bicep_provider_test.go Updates mocked scope to satisfy the new Scope interface.
cli/azd/pkg/infra/provisioning/bicep/active_deployment_check_test.go Adds tests covering wait-loop behavior, errors, cancellation, and timeout.
cli/azd/pkg/azapi/deployments.go Adds IsActiveDeploymentState() helper.
cli/azd/pkg/azapi/deployment_state_test.go Adds unit tests for active/inactive state classification.

@spboyer spboyer force-pushed the fix/deployment-active-conflict branch from f456f95 to 8829c13 Compare March 23, 2026 22:24
@spboyer
Copy link
Member Author

spboyer commented Mar 23, 2026

Telemetry Context: DeploymentActive + Retry Behavior

This PR addresses DeploymentActive (199 errors/month). Additional context from the deep dive:

Retry behavior makes this especially valuable

Of machines that hit InvalidTemplateDeployment errors (which includes DeploymentActive in the chain):

  • 66% retry without changing anything — for DeploymentActive, this means they re-submit and hit the same active deployment again
  • Average 3.6 retries per machine before they either succeed or give up
  • The detect-and-wait pattern in this PR would break that retry loop immediately

Time savings

  • DeploymentActive users currently wait an average deployment duration before failing, then retry
  • With this PR: one wait period (with progress feedback) instead of N failed attempts × deployment time each

This is a clean win — the fix is architecturally simple (poll + wait) and eliminates a category of failure that can never be solved by retrying.

@spboyer spboyer force-pushed the fix/deployment-active-conflict branch from 8829c13 to 38516a8 Compare March 24, 2026 13:34
@kristenwomack
Copy link
Collaborator

@copilot - will you check to ensure we have metrics coverage so we can see how often this error and fix occur after merging this change?

Copy link
Contributor

Copilot AI commented Mar 24, 2026

@kristenwomack I've opened a new pull request, #7288, to work on those changes. Once the pull request is ready, I'll request review from you.

@spboyer spboyer marked this pull request as draft March 24, 2026 16:39
@spboyer
Copy link
Member Author

spboyer commented Mar 24, 2026

Converting to draft — the active deployment check integration point in Deploy() causes nil-pointer panics in existing test mocks. The test infrastructure creates Deployment objects with nil inner scopes, and scopeForTemplate requires provider state that isn't available in all test paths.

Need to either:

  1. Restructure as a standalone pre-deployment middleware (avoids touching the Deploy code path)
  2. Add the check at the provision action level (internal/cmd/provision.go) before it calls into the bicep provider

The standalone tests for waitForActiveDeployments, IsActiveDeploymentState, and the error suggestion rule all pass. Only the integration with the existing Deploy flow is problematic.

@spboyer spboyer force-pushed the fix/deployment-active-conflict branch from 3732718 to f1f8bfc Compare March 24, 2026 19:22
@spboyer
Copy link
Member Author

spboyer commented Mar 25, 2026

/azp run

@azure-pipelines
Copy link

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

@spboyer spboyer force-pushed the fix/deployment-active-conflict branch from f1f8bfc to 74a9edb Compare March 25, 2026 00:33
spboyer added a commit that referenced this pull request Mar 25, 2026
Add lessons learned from recent PR reviews (#7290, #7251, #7250,
#7247, #7236, #7235, #7202, #7039) as agent instructions to prevent
recurring review findings.

New sections:
- Error handling: ErrorWithSuggestion completeness, telemetry service
  attribution, scope-agnostic messages
- Architecture boundaries: pkg/project target-agnostic, extension docs
- Output formatting: shell-safe paths, consistent JSON contracts
- Path safety: traversal validation, quoted paths in messages
- Testing best practices: test actual rules, extract shared helpers,
  correct env vars, TypeScript patterns, efficient dir checks
- CI/GitHub Actions: permissions, PATH handling, artifact downloads,
  prefer ADO for secrets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer added a commit that referenced this pull request Mar 25, 2026
Add lessons learned from team and Copilot reviews across PRs #7290,
#7251, #7250, #7247, #7236, #7235, #7202, #7039 as agent instructions
to prevent recurring review findings.

New/expanded sections:
- Error handling: ErrorWithSuggestion field completeness, telemetry
  service attribution, scope-agnostic messages, link/suggestion parity,
  stale data in polling loops
- Architecture boundaries: pkg/project target-agnostic, extension docs
  separation, env var verification against source code
- Output formatting: shell-safe quoted paths, consistent JSON types
- Path safety: traversal validation, quoted paths in messages
- Code organization: extract shared logic across scopes
- Documentation standards: help text consistency, no dead references,
  PR description accuracy
- Testing best practices: test YAML rules e2e, extract shared helpers,
  correct env vars (AZD_FORCE_TTY, NO_COLOR), TypeScript patterns,
  reasonable timeouts, cross-platform paths, test new JSON fields
- CI / GitHub Actions: permissions blocks, PATH handling, cross-workflow
  artifacts, prefer ADO for secrets, no placeholder steps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer
Copy link
Member Author

spboyer commented Mar 25, 2026

/azp run azure-dev - cli

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

spboyer added a commit that referenced this pull request Mar 25, 2026
Add lessons learned from team and Copilot reviews across PRs #7290,
#7251, #7250, #7247, #7236, #7235, #7202, #7039 as agent instructions
to prevent recurring review findings.

New/expanded sections:
- Error handling: ErrorWithSuggestion field completeness, telemetry
  service attribution, scope-agnostic messages, link/suggestion parity,
  stale data in polling loops
- Architecture boundaries: pkg/project target-agnostic, extension docs
  separation, env var verification against source code
- Output formatting: shell-safe quoted paths, consistent JSON types
- Path safety: traversal validation, quoted paths in messages
- Code organization: extract shared logic across scopes
- Documentation standards: help text consistency, no dead references,
  PR description accuracy
- Testing best practices: test YAML rules e2e, extract shared helpers,
  correct env vars (AZD_FORCE_TTY, NO_COLOR), TypeScript patterns,
  reasonable timeouts, cross-platform paths, test new JSON fields
- CI / GitHub Actions: permissions blocks, PATH handling, cross-workflow
  artifacts, prefer ADO for secrets, no placeholder steps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer
Copy link
Member Author

spboyer commented Mar 25, 2026

/azp run azure-dev - cli

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

spboyer added a commit that referenced this pull request Mar 25, 2026
Add lessons learned from team and Copilot reviews across PRs #7290,
#7251, #7250, #7247, #7236, #7235, #7202, #7039 as agent instructions
to prevent recurring review findings.

New/expanded sections:
- Error handling: ErrorWithSuggestion field completeness, telemetry
  service attribution, scope-agnostic messages, link/suggestion parity,
  stale data in polling loops
- Architecture boundaries: pkg/project target-agnostic, extension docs
  separation, env var verification against source code
- Output formatting: shell-safe quoted paths, consistent JSON types
- Path safety: traversal validation, quoted paths in messages
- Code organization: extract shared logic across scopes
- Documentation standards: help text consistency, no dead references,
  PR description accuracy
- Testing best practices: test YAML rules e2e, extract shared helpers,
  correct env vars (AZD_FORCE_TTY, NO_COLOR), TypeScript patterns,
  reasonable timeouts, cross-platform paths, test new JSON fields
- CI / GitHub Actions: permissions blocks, PATH handling, cross-workflow
  artifacts, prefer ADO for secrets, no placeholder steps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the fix/deployment-active-conflict branch from 74a9edb to f506378 Compare March 25, 2026 15:02
spboyer added a commit that referenced this pull request Mar 25, 2026
Add lessons learned from team and Copilot reviews across PRs #7290,
#7251, #7250, #7247, #7236, #7235, #7202, #7039 as agent instructions
to prevent recurring review findings.

New/expanded sections:
- Error handling: ErrorWithSuggestion field completeness, telemetry
  service attribution, scope-agnostic messages, link/suggestion parity,
  stale data in polling loops
- Architecture boundaries: pkg/project target-agnostic, extension docs
  separation, env var verification against source code
- Output formatting: shell-safe quoted paths, consistent JSON types
- Path safety: traversal validation, quoted paths in messages
- Code organization: extract shared logic across scopes
- Documentation standards: help text consistency, no dead references,
  PR description accuracy
- Testing best practices: test YAML rules e2e, extract shared helpers,
  correct env vars (AZD_FORCE_TTY, NO_COLOR), TypeScript patterns,
  reasonable timeouts, cross-platform paths, test new JSON fields
- CI / GitHub Actions: permissions blocks, PATH handling, cross-workflow
  artifacts, prefer ADO for secrets, no placeholder steps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the fix/deployment-active-conflict branch from f506378 to 752630f Compare March 25, 2026 16:48
@spboyer
Copy link
Member Author

spboyer commented Mar 25, 2026

/azp run azure-dev - cli

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@spboyer
Copy link
Member Author

spboyer commented Mar 25, 2026

/azp run azure-dev - cli

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

spboyer added a commit that referenced this pull request Mar 25, 2026
Add lessons learned from team and Copilot reviews across PRs #7290,
#7251, #7250, #7247, #7236, #7235, #7202, #7039 as agent instructions
to prevent recurring review findings.

New/expanded sections:
- Error handling: ErrorWithSuggestion field completeness, telemetry
  service attribution, scope-agnostic messages, link/suggestion parity,
  stale data in polling loops
- Architecture boundaries: pkg/project target-agnostic, extension docs
  separation, env var verification against source code
- Output formatting: shell-safe quoted paths, consistent JSON types
- Path safety: traversal validation, quoted paths in messages
- Code organization: extract shared logic across scopes
- Documentation standards: help text consistency, no dead references,
  PR description accuracy
- Testing best practices: test YAML rules e2e, extract shared helpers,
  correct env vars (AZD_FORCE_TTY, NO_COLOR), TypeScript patterns,
  reasonable timeouts, cross-platform paths, test new JSON fields
- CI / GitHub Actions: permissions blocks, PATH handling, cross-workflow
  artifacts, prefer ADO for secrets, no placeholder steps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the fix/deployment-active-conflict branch from 752630f to 8216b2b Compare March 25, 2026 23:21
JeffreyCA pushed a commit that referenced this pull request Mar 26, 2026
Add lessons learned from team and Copilot reviews across PRs #7290,
#7251, #7250, #7247, #7236, #7235, #7202, #7039 as agent instructions
to prevent recurring review findings.

New/expanded sections:
- Error handling: ErrorWithSuggestion field completeness, telemetry
  service attribution, scope-agnostic messages, link/suggestion parity,
  stale data in polling loops
- Architecture boundaries: pkg/project target-agnostic, extension docs
  separation, env var verification against source code
- Output formatting: shell-safe quoted paths, consistent JSON types
- Path safety: traversal validation, quoted paths in messages
- Code organization: extract shared logic across scopes
- Documentation standards: help text consistency, no dead references,
  PR description accuracy
- Testing best practices: test YAML rules e2e, extract shared helpers,
  correct env vars (AZD_FORCE_TTY, NO_COLOR), TypeScript patterns,
  reasonable timeouts, cross-platform paths, test new JSON fields
- CI / GitHub Actions: permissions blocks, PATH handling, cross-workflow
  artifacts, prefer ADO for secrets, no placeholder steps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rajeshkamal5050 pushed a commit that referenced this pull request Mar 26, 2026
* docs: add PR review patterns to AGENTS.md

Add lessons learned from team and Copilot reviews across PRs #7290,
#7251, #7250, #7247, #7236, #7235, #7202, #7039 as agent instructions
to prevent recurring review findings.

New/expanded sections:
- Error handling: ErrorWithSuggestion field completeness, telemetry
  service attribution, scope-agnostic messages, link/suggestion parity,
  stale data in polling loops
- Architecture boundaries: pkg/project target-agnostic, extension docs
  separation, env var verification against source code
- Output formatting: shell-safe quoted paths, consistent JSON types
- Path safety: traversal validation, quoted paths in messages
- Code organization: extract shared logic across scopes
- Documentation standards: help text consistency, no dead references,
  PR description accuracy
- Testing best practices: test YAML rules e2e, extract shared helpers,
  correct env vars (AZD_FORCE_TTY, NO_COLOR), TypeScript patterns,
  reasonable timeouts, cross-platform paths, test new JSON fields
- CI / GitHub Actions: permissions blocks, PATH handling, cross-workflow
  artifacts, prefer ADO for secrets, no placeholder steps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix Copilot instructions for code review and strengthen guidance on Go patterns (#7320)

* Fix Copilot instructions for code review and strengthen guidance on Go patterns

* Update wording

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: JeffreyCA <jeffreychen@microsoft.com>
spboyer and others added 10 commits March 25, 2026 20:41
Before starting a Bicep deployment, check the target scope for
in-progress ARM deployments and wait for them to complete. This avoids
the DeploymentActive error that ARM returns after ~5 minutes when a
concurrent deployment is already running on the same resource group.

Changes:
- Add IsActiveDeploymentState() helper in azapi to classify provisioning
  states as active or terminal.
- Add ListActiveDeployments() to the infra.Scope interface and both
  ResourceGroupScope / SubscriptionScope implementations.
- Add waitForActiveDeployments() in the Bicep provider, called after
  preflight validation and before deployment submission. It polls until
  active deployments clear or a 30-minute timeout is reached.
- Add a DeploymentActive error suggestion rule to error_suggestions.yaml.
- Add unit tests for state classification, polling, timeout, error
  handling, and context cancellation.

Fixes #7248

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…per, refresh timeout names

- Fix 'range 200' compile error (not valid in all Go versions)
- Make DeploymentActive YAML rule scope-agnostic
- Extract filterActiveDeployments helper to deduplicate scope logic
- Refresh deployment names from latest poll on timeout message

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move ListActiveDeployments to a standalone function instead of adding
it to the exported Scope interface. Adding methods to exported
interfaces is a breaking change for any external implementation
(including test mocks in CI).

The standalone infra.ListActiveDeployments(ctx, scope) function calls
scope.ListDeployments and filters for active states, achieving the
same result without widening the interface contract.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The deployment object returned by generateDeploymentObject embeds a
Scope that can be nil in test environments (e.g. mockedScope returns
an empty SubscriptionDeployment). Using scopeForTemplate resolves
the scope from the provider's configuration, avoiding nil panics
in existing tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the fix/deployment-active-conflict branch from 8216b2b to 216a054 Compare March 26, 2026 03:41
Copy link
Member Author

@spboyer spboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

1. ErrDeploymentsNotFound not handled in poll loop

cli/azd/pkg/infra/provisioning/bicep/bicep_provider.go:693-696 — Medium

The initial ListActiveDeployments call (line 631) correctly handles ErrDeploymentsNotFound by returning nil. The polling loop does not. If the resource group is deleted externally while azd is waiting for active deployments to drain, the next poll returns ErrDeploymentsNotFound and surfaces it as a hard error. The deployment should proceed — the resource group is gone, so there are no conflicts.

Fix: Add the same ErrDeploymentsNotFound guard inside the ticker.C case:

case <-ticker.C:
    active, err = infra.ListActiveDeployments(ctx, scope)
    if err != nil {
        if errors.Is(err, infra.ErrDeploymentsNotFound) {
            return nil
        }
        return fmt.Errorf("checking active deployments: %w", err)
    }

2. Active-deployment check only queries one deployment backend

cli/azd/pkg/infra/scope.go:29 — Medium

ListActiveDeployments delegates to scope.ListDeployments(), which comes from whichever single azapi.DeploymentService was selected in DI. In standard mode it only sees ARM deployments; in deployment-stacks mode it only sees stacks. An active deployment of the other kind at the same scope is invisible, so azd up can still hit DeploymentActive despite the guard passing.

Worth noting in the PR description as a known limitation, or adding a combined query for conflict detection.


3. Race window between wait completion and deploy request

cli/azd/pkg/infra/provisioning/bicep/bicep_provider.go:816 — Low

The wait loop exits when zero active deployments are found, then proceeds to deployModule. A concurrent deployment starting in that gap still produces DeploymentActive. This is inherent to the check-then-act pattern and cannot be fully eliminated without retry-on-conflict logic. Consider noting this as a known limitation, and optionally adding a retry path if the deploy call itself returns DeploymentActive.

If the resource group is deleted externally while waiting for active
deployments to drain, the poll now returns nil instead of surfacing
a hard error. This matches the initial check behavior.

Known limitations documented:
- Only queries the active deployment backend (standard or stacks)
- Race window between wait completion and deploy request is inherent

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handle DeploymentActive conflict -- detect and wait for in-progress deployments

4 participants