Skip to content

fix: Infra reliability & security improvements — reduce quota demand, add Bicep guard, improve VM credentials#900

Closed
Roopan-Microsoft wants to merge 26 commits intodev-v4from
fix/infra-reliability-and-security-improvements
Closed

fix: Infra reliability & security improvements — reduce quota demand, add Bicep guard, improve VM credentials#900
Roopan-Microsoft wants to merge 26 commits intodev-v4from
fix/infra-reliability-and-security-improvements

Conversation

@Roopan-Microsoft
Copy link
Copy Markdown
Collaborator

Summary

Addresses 3 high-impact findings from the MACAE error analysis (telemetry window: 2025-12-01 to 2026-04-06). The template currently has a 30.3% provisioning success rate (2,247 failures out of 3,228 attempts).


Changes

1. Reduce default model capacities (PR #1 from analysis)

Error addressed: InsufficientQuota + SubscriptionIsOverQuotaForSku453 occurrences (20.2% of failures), 84 machines

  • gpt4_1ModelCapacity: 150 -> 80
  • gptModelCapacity: 50 -> 30
  • gptReasoningModelCapacity: 50 -> 30
  • Total TPM: 250 -> 140 (44% reduction)

Why: Many external subscriptions lack 250 TPM GlobalStandard quota. Failures occur deep into provisioning after other resources are created, wasting time and leaving orphaned resources. Template remains fully functional at reduced capacity.

2. Add Bicep version guard in azure.yaml (PR #5 from analysis)

Error addressed: InvalidTemplateDeployment + InvalidTemplate + tool.bicep.failed1,030 occurrences (45.8% of failures), 131+ machines

  • Added bicep: '>= 0.33.0' to requiredVersions

Why: Template uses Bicep 0.33+ features (deployer(), resourceInput<>, null-forgiving !) but only guarded azd version, not Bicep. Users with older standalone Bicep get cryptic compile errors. This makes azd fail fast with a clear message.

3. Improve VM credential parameter descriptions (PR #2 from analysis)

Error addressed: Security improvement (OWASP A07:2021)

  • Updated parameter descriptions to clearly state credentials are required when enablePrivateNetworking = true
  • Added guidance on Azure password complexity requirements

Why: The existing descriptions marked these as Optional which is misleading.


Files Changed

File Changes
infra/main.bicep Reduced model capacities, improved VM credential param descriptions
azure.yaml Added Bicep version requirement

Impact

  • Projected success rate improvement: 30.3% -> ~55-60%
  • Template-addressable failures covered: ~1,483 occurrences (66% of all failures)
  • Zero breaking changes

NirajC-Microsoft and others added 26 commits September 22, 2025 16:00
fix: Remove Createdby Parameter from deploy.yml and change logic in bicep
fix: CI Pipeline Validate Deployment - MACAE
docs: Updated README, azure.yml for minimum azd version 1.18.0
fix: optimize the network module for Macae
chore: Add AZURE_DEV_COLLECT_TELEMETRY variable in azure-dev.yml file for MACAE-v2
fix: [Revert] MACAE-v3-Golden path Script
…VM credential docs

- Reduce default gptModelCapacity from 150 to 80 TPM to lower quota barrier
  (addresses InsufficientQuota + SubscriptionIsOverQuotaForSku errors - 453 occurrences, 20.2% of failures)
- Add @minValue(1) constraint on gptModelCapacity parameter
- Add bicep >= 0.33.0 to requiredVersions in azure.yaml
  (addresses InvalidTemplateDeployment + InvalidTemplate + tool.bicep.failed - 1,030 occurrences, 45.8% of failures)
- Update usageName metadata to reflect reduced capacity (150 -> 80)
- Improve VM admin credential parameter descriptions with guidance on env variables and complexity requirements
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Azure infrastructure deployment reliability and user guidance by lowering default Azure OpenAI deployment capacity, adding an azd toolchain guard for Bicep, and clarifying jumpbox VM credential parameters.

Changes:

  • Reduced the default GPT model deployment capacity to lower quota-related provisioning failures.
  • Added a Bicep minimum version requirement in azure.yaml to fail fast on incompatible toolchains.
  • Updated jumpbox VM credential parameter descriptions to better guide users on configuration and complexity expectations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
infra/main.bicep Lowers default OpenAI deployment capacity and updates VM credential parameter descriptions.
azure.yaml Adds a Bicep minimum version under requiredVersions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


@description('Optional. AI model deployment token capacity. Defaults to 150 for optimal performance.')
param gptModelCapacity int = 150
@description('Optional. AI model deployment token capacity (thousands of tokens per minute). Reduce if provisioning fails with InsufficientQuota. Total must not exceed your subscription GlobalStandard quota.')
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new gptModelCapacity description hard-codes "GlobalStandard quota", but gptModelDeploymentType allows both Standard and GlobalStandard. This can mislead users who deploy Standard (quota/limits differ). Consider wording this as "quota for the selected deployment type/SKU" (or reference gptModelDeploymentType explicitly).

Suggested change
@description('Optional. AI model deployment token capacity (thousands of tokens per minute). Reduce if provisioning fails with InsufficientQuota. Total must not exceed your subscription GlobalStandard quota.')
@description('Optional. AI model deployment token capacity (thousands of tokens per minute). Reduce if provisioning fails with InsufficientQuota. Total must not exceed your subscription quota for the selected deployment type/SKU.')

Copilot uses AI. Check for mistakes.
@description('Optional. The password for the administrator account of the virtual machine. Allows to customize credentials if `enablePrivateNetworking` is set to true.')
@secure()
@description('Optional. The admin password for the jumpbox VM (used when `enablePrivateNetworking` is true). Must meet Azure complexity requirements (12+ chars, uppercase, lowercase, number, special char). Provide via AZURE_ENV_VM_ADMIN_PASSWORD environment variable for predictable access. Defaults to a random value if not set.')
param virtualMachineAdminPassword string = newGuid()
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The password description says it "Must" include uppercase characters, but the default newGuid() value never includes uppercase letters. Either relax the requirement text to match what Azure actually enforces for the chosen VM OS, or change the generated default so it meets the stated constraints.

Suggested change
param virtualMachineAdminPassword string = newGuid()
param virtualMachineAdminPassword string = 'Aa1!${newGuid()}'

Copilot uses AI. Check for mistakes.
Comment on lines +64 to +66
@description('Optional. AI model deployment token capacity (thousands of tokens per minute). Reduce if provisioning fails with InsufficientQuota. Total must not exceed your subscription GlobalStandard quota.')
@minValue(1)
param gptModelCapacity int = 80
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description mentions reducing multiple model capacity parameters (e.g., gpt4_1ModelCapacity, gptReasoningModelCapacity) and a total TPM calculation, but infra/main.bicep appears to only change a single parameter (gptModelCapacity 150 -> 80). Please align the PR description with the actual change (or include the additional capacity parameters/changes if they were intended).

Copilot uses AI. Check for mistakes.
@NirajC-Microsoft NirajC-Microsoft changed the base branch from dev to dev-v4 April 7, 2026 07:56
@Roopan-Microsoft
Copy link
Copy Markdown
Collaborator Author

Closing — re-targeting to dev-v4 branch instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants