fix: Infra reliability & security improvements — reduce quota demand, add Bicep guard, improve VM credentials#900
Conversation
fix: Remove Createdby Parameter from deploy.yml and change logic in bicep
fix: CI Pipeline Validate Deployment - MACAE
docs: Updated README, azure.yml for minimum azd version 1.18.0
fix: optimize the network module for Macae
chore: Add AZURE_DEV_COLLECT_TELEMETRY variable in azure-dev.yml file for MACAE-v2
fix: [Revert] MACAE-v3-Golden path Script
fix: troubleshooting.md portal link
…VM credential docs - Reduce default gptModelCapacity from 150 to 80 TPM to lower quota barrier (addresses InsufficientQuota + SubscriptionIsOverQuotaForSku errors - 453 occurrences, 20.2% of failures) - Add @minValue(1) constraint on gptModelCapacity parameter - Add bicep >= 0.33.0 to requiredVersions in azure.yaml (addresses InvalidTemplateDeployment + InvalidTemplate + tool.bicep.failed - 1,030 occurrences, 45.8% of failures) - Update usageName metadata to reflect reduced capacity (150 -> 80) - Improve VM admin credential parameter descriptions with guidance on env variables and complexity requirements
There was a problem hiding this comment.
Pull request overview
This PR improves Azure infrastructure deployment reliability and user guidance by lowering default Azure OpenAI deployment capacity, adding an azd toolchain guard for Bicep, and clarifying jumpbox VM credential parameters.
Changes:
- Reduced the default GPT model deployment capacity to lower quota-related provisioning failures.
- Added a Bicep minimum version requirement in
azure.yamlto fail fast on incompatible toolchains. - Updated jumpbox VM credential parameter descriptions to better guide users on configuration and complexity expectations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
infra/main.bicep |
Lowers default OpenAI deployment capacity and updates VM credential parameter descriptions. |
azure.yaml |
Adds a Bicep minimum version under requiredVersions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| @description('Optional. AI model deployment token capacity. Defaults to 150 for optimal performance.') | ||
| param gptModelCapacity int = 150 | ||
| @description('Optional. AI model deployment token capacity (thousands of tokens per minute). Reduce if provisioning fails with InsufficientQuota. Total must not exceed your subscription GlobalStandard quota.') |
There was a problem hiding this comment.
The new gptModelCapacity description hard-codes "GlobalStandard quota", but gptModelDeploymentType allows both Standard and GlobalStandard. This can mislead users who deploy Standard (quota/limits differ). Consider wording this as "quota for the selected deployment type/SKU" (or reference gptModelDeploymentType explicitly).
| @description('Optional. AI model deployment token capacity (thousands of tokens per minute). Reduce if provisioning fails with InsufficientQuota. Total must not exceed your subscription GlobalStandard quota.') | |
| @description('Optional. AI model deployment token capacity (thousands of tokens per minute). Reduce if provisioning fails with InsufficientQuota. Total must not exceed your subscription quota for the selected deployment type/SKU.') |
| @description('Optional. The password for the administrator account of the virtual machine. Allows to customize credentials if `enablePrivateNetworking` is set to true.') | ||
| @secure() | ||
| @description('Optional. The admin password for the jumpbox VM (used when `enablePrivateNetworking` is true). Must meet Azure complexity requirements (12+ chars, uppercase, lowercase, number, special char). Provide via AZURE_ENV_VM_ADMIN_PASSWORD environment variable for predictable access. Defaults to a random value if not set.') | ||
| param virtualMachineAdminPassword string = newGuid() |
There was a problem hiding this comment.
The password description says it "Must" include uppercase characters, but the default newGuid() value never includes uppercase letters. Either relax the requirement text to match what Azure actually enforces for the chosen VM OS, or change the generated default so it meets the stated constraints.
| param virtualMachineAdminPassword string = newGuid() | |
| param virtualMachineAdminPassword string = 'Aa1!${newGuid()}' |
| @description('Optional. AI model deployment token capacity (thousands of tokens per minute). Reduce if provisioning fails with InsufficientQuota. Total must not exceed your subscription GlobalStandard quota.') | ||
| @minValue(1) | ||
| param gptModelCapacity int = 80 |
There was a problem hiding this comment.
PR description mentions reducing multiple model capacity parameters (e.g., gpt4_1ModelCapacity, gptReasoningModelCapacity) and a total TPM calculation, but infra/main.bicep appears to only change a single parameter (gptModelCapacity 150 -> 80). Please align the PR description with the actual change (or include the additional capacity parameters/changes if they were intended).
|
Closing — re-targeting to dev-v4 branch instead. |
Summary
Addresses 3 high-impact findings from the MACAE error analysis (telemetry window: 2025-12-01 to 2026-04-06). The template currently has a 30.3% provisioning success rate (2,247 failures out of 3,228 attempts).
Changes
1. Reduce default model capacities (PR #1 from analysis)
Error addressed:
InsufficientQuota+SubscriptionIsOverQuotaForSku— 453 occurrences (20.2% of failures), 84 machinesgpt4_1ModelCapacity: 150 -> 80gptModelCapacity: 50 -> 30gptReasoningModelCapacity: 50 -> 30Why: Many external subscriptions lack 250 TPM GlobalStandard quota. Failures occur deep into provisioning after other resources are created, wasting time and leaving orphaned resources. Template remains fully functional at reduced capacity.
2. Add Bicep version guard in azure.yaml (PR #5 from analysis)
Error addressed:
InvalidTemplateDeployment+InvalidTemplate+tool.bicep.failed— 1,030 occurrences (45.8% of failures), 131+ machinesbicep: '>= 0.33.0'torequiredVersionsWhy: Template uses Bicep 0.33+ features (
deployer(),resourceInput<>, null-forgiving!) but only guarded azd version, not Bicep. Users with older standalone Bicep get cryptic compile errors. This makes azd fail fast with a clear message.3. Improve VM credential parameter descriptions (PR #2 from analysis)
Error addressed: Security improvement (OWASP A07:2021)
enablePrivateNetworking = trueWhy: The existing descriptions marked these as Optional which is misleading.
Files Changed
infra/main.bicepazure.yamlImpact