From 8804eb41dcb2f4e02b8fa86ce3e0364b39420162 Mon Sep 17 00:00:00 2001 From: Prykhodko Date: Thu, 25 Jun 2026 11:51:29 +0200 Subject: [PATCH] Add service-quota-check skill and service-quotas-monitor custom agent MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - New skill: service-quota-check — checks AWS service quota utilization, flags quotas at 85%+ utilization, requests increases via Service Quotas API, and recommends support cases for non-adjustable quotas - New custom agent: service-quotas-monitor — proactively scans all regions, identifies quotas approaching limits, and takes automated action - Updated CloudFormation template with ServiceQuotaCheck IAM policy - Updated CLAUDE.md, kiro conventions, llms.txt, and root README --- .claude/CLAUDE.md | 2 + .kiro/steering/project-conventions.md | 1 + README.md | 1 + .../devops-agent-skill-policies.yaml | 40 ++ .../service-quotas-monitor/CHANGELOG.md | 12 + .../service-quotas-monitor/README.md | 66 +++ .../service-quotas-monitor/SYSTEM_PROMPT.md | 66 +++ llms.txt | 1 + skills/service-quota-check/.skilleval.yaml | 3 + skills/service-quota-check/CHANGELOG.md | 12 + skills/service-quota-check/README.md | 91 +++++ skills/service-quota-check/SKILL.md | 383 ++++++++++++++++++ .../evals/eval_queries.json | 22 + skills/service-quota-check/evals/evals.json | 79 ++++ .../references/common-quota-codes.md | 126 ++++++ 15 files changed, 905 insertions(+) create mode 100644 custom-agents/service-quotas-monitor/CHANGELOG.md create mode 100644 custom-agents/service-quotas-monitor/README.md create mode 100644 custom-agents/service-quotas-monitor/SYSTEM_PROMPT.md create mode 100644 skills/service-quota-check/.skilleval.yaml create mode 100644 skills/service-quota-check/CHANGELOG.md create mode 100644 skills/service-quota-check/README.md create mode 100644 skills/service-quota-check/SKILL.md create mode 100644 skills/service-quota-check/evals/eval_queries.json create mode 100644 
skills/service-quota-check/evals/evals.json create mode 100644 skills/service-quota-check/references/common-quota-codes.md diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index f4d56fc..0c875f8 100644 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -125,6 +125,8 @@ Only these extensions are permitted inside skill directories (enforced by `skill 5. Add evaluation tests (`.skilleval.yaml` and `evals/` directory). 6. Test the skill with DevOps Agent before submitting. 7. Update the root `README.md` skills table with the new skill's name, agent types, author, and docs link. +8. Update the `llms.txt` file at the repo root — add the new skill to the "Available Skills" section following the existing format: `- [Skill Name](skills//SKILL.md): One-line description`. +9. If the skill requires IAM permissions beyond the `AIDevOpsAgentAccessPolicy` managed policy, add a new parameter, condition, and inline policy resource to `cloudformation/devops-agent-skill-policies.yaml`, and update the `SkillPolicySummary` output. ## Zipping for Upload diff --git a/.kiro/steering/project-conventions.md b/.kiro/steering/project-conventions.md index 0a0074f..94fe1b5 100644 --- a/.kiro/steering/project-conventions.md +++ b/.kiro/steering/project-conventions.md @@ -125,6 +125,7 @@ Only these extensions are permitted inside skill directories (enforced by `skill 6. Test the skill with DevOps Agent before submitting. 7. Update the root `README.md` skills table with the new skill's name, description, agent types, author, and docs link. 8. Update the `llms.txt` file at the repo root — add the new skill to the "Available Skills" section following the existing format: `- [Skill Name](skills//SKILL.md): One-line description`. +9. If the skill requires IAM permissions beyond the `AIDevOpsAgentAccessPolicy` managed policy, add a new parameter, condition, and inline policy resource to `cloudformation/devops-agent-skill-policies.yaml`, and update the `SkillPolicySummary` output. 
## Maintaining llms.txt diff --git a/README.md b/README.md index b5242ad..58970d8 100644 --- a/README.md +++ b/README.md @@ -44,6 +44,7 @@ Skills enable DevOps Agent to: | [skip-scheduled-maintenance](skills/skip-scheduled-maintenance/) | **Sample skill** demonstrating how to skip low-priority incidents during a scheduled maintenance window for the Incident Triage agent type | Incident Triage | [dgorin6](https://github.com/dgorin6) | [README](skills/skip-scheduled-maintenance/README.md) | | [enrich-with-aws-security-agent](skills/enrich-with-aws-security-agent/) | Queries AWS Security Agent CloudWatch logs to retrieve code-level security findings (file, line number, vulnerability type) during incident investigations with potential security root causes | Chat tasks, Incident RCA | [yakiratz-aws](https://github.com/yakiratz-aws) | [README](skills/enrich-with-aws-security-agent/README.md) | | [investigation-cost-guardrail](skills/investigation-cost-guardrail/) | Estimates the AWS API cost of an incident investigation before any query runs, shows a per-step cost plan, and cancels if the estimate exceeds a configurable threshold | Incident RCA | [inesttia](https://github.com/inesttia) | [README](skills/investigation-cost-guardrail/README.md) | +| [service-quota-check](skills/service-quota-check/) | Checks AWS service quota utilization during investigations and before provisioning resources, flags quotas at 85%+ utilization, and requests increases via the Service Quotas API or recommends support cases | Chat tasks, Incident RCA | [yuriypr](https://github.com/yuriypr) | [README](skills/service-quota-check/README.md) | ## Getting Started diff --git a/cloudformation/devops-agent-skill-policies.yaml b/cloudformation/devops-agent-skill-policies.yaml index c00af27..6593576 100644 --- a/cloudformation/devops-agent-skill-policies.yaml +++ b/cloudformation/devops-agent-skill-policies.yaml @@ -24,6 +24,7 @@ Metadata: - EnableEnrichWithSecurityAgent - 
EnableCrmInvestigationGuidelines - EnableSkipScheduledMaintenance + - EnableServiceQuotaCheck - Label: default: Optional Resource Scoping Parameters: @@ -90,12 +91,19 @@ Parameters: AllowedValues: ['true', 'false'] Default: 'true' + EnableServiceQuotaCheck: + Type: String + Description: Service Quota Check skill (Service Quotas + CloudWatch). + AllowedValues: ['true', 'false'] + Default: 'true' + Conditions: CreateNewRole: !Equals [!Ref ExistingRoleName, ''] SkillAwsHealthEvents: !Equals [!Ref EnableAwsHealthEvents, 'true'] SkillSupportCases: !Equals [!Ref EnableSupportCases, 'true'] SkillRdsOperationReview: !Equals [!Ref EnableRdsOperationReview, 'true'] SkillInvestigationCostGuardrail: !Equals [!Ref EnableInvestigationCostGuardrail, 'true'] + SkillServiceQuotaCheck: !Equals [!Ref EnableServiceQuotaCheck, 'true'] HasRegionRestriction: !Not [!Equals [!Join ['', !Ref AllowedRegions], '']] Resources: @@ -213,6 +221,37 @@ Resources: - pricing:GetProducts Resource: '*' + # service-quota-check: adds servicequotas:* and cloudwatch:GetMetricData/GetMetricStatistics + PolicyServiceQuotaCheck: + Type: AWS::IAM::Policy + Condition: SkillServiceQuotaCheck + Properties: + PolicyName: DevOpsAgentSkill-ServiceQuotaCheck + Roles: + - !If [CreateNewRole, !Ref DevOpsAgentRole, !Ref ExistingRoleName] + PolicyDocument: + Version: '2012-10-17' + Statement: + - Sid: ServiceQuotasReadAndRequest + Effect: Allow + Action: + - servicequotas:ListServices + - servicequotas:ListServiceQuotas + - servicequotas:GetServiceQuota + - servicequotas:GetAWSDefaultServiceQuota + - servicequotas:ListRequestedServiceQuotaChangeHistory + - servicequotas:ListRequestedServiceQuotaChangeHistoryByQuota + - servicequotas:GetRequestedServiceQuotaChange + - servicequotas:RequestServiceQuotaIncrease + - servicequotas:CreateSupportCase + Resource: '*' + - Sid: CloudWatchUsageMetrics + Effect: Allow + Action: + - cloudwatch:GetMetricData + - cloudwatch:GetMetricStatistics + Resource: '*' + # Optional: restrict 
agent to specific regions PolicyRegionalRestriction: Type: AWS::IAM::Policy @@ -257,6 +296,7 @@ Outputs: - support-cases: ${EnableSupportCases} (support:DescribeCommunications) - rds-operation-review: ${EnableRdsOperationReview} (rds:DownloadDBLogFilePortion, logs:GetLogEvents) - investigation-cost-guardrail: ${EnableInvestigationCostGuardrail} (pricing:GetProducts) + - service-quota-check: ${EnableServiceQuotaCheck} (servicequotas:*, cloudwatch:GetMetricData/GetMetricStatistics) Skills covered by AIDevOpsAgentAccessPolicy (no extra policy needed): - eks-operation-review, enrich-with-aws-security-agent, crm-production-investigation-guidelines No IAM required: diff --git a/custom-agents/service-quotas-monitor/CHANGELOG.md b/custom-agents/service-quotas-monitor/CHANGELOG.md new file mode 100644 index 0000000..1881394 --- /dev/null +++ b/custom-agents/service-quotas-monitor/CHANGELOG.md @@ -0,0 +1,12 @@ +# Changelog + +## 1.0.0 + +- Initial version +- System prompt with Goal/Approach/Constraints/Output structure +- Multi-region quota discovery and utilization assessment +- Automatic quota increase requests for adjustable quotas at 85%+ utilization +- Support case creation fallback for non-adjustable quotas +- Recommendation creation for items requiring manual user follow-up +- Notification integration for flagged quotas +- Deduplication of recommendations for the same quota/region diff --git a/custom-agents/service-quotas-monitor/README.md b/custom-agents/service-quotas-monitor/README.md new file mode 100644 index 0000000..4f0e064 --- /dev/null +++ b/custom-agents/service-quotas-monitor/README.md @@ -0,0 +1,66 @@ +# Service Quotas Monitor — Custom Agent + +## Purpose + +This custom agent proactively monitors AWS service quotas across all active regions, identifies quotas approaching their limits (85%+ utilization), and takes automated action — requesting quota increases via the Service Quotas API or escalating through support cases when programmatic increases are not 
possible. + +## Key Capabilities + +- Discovers all enabled regions and checks quotas across the entire account footprint +- Evaluates utilization for every service quota with available usage data +- Automatically requests quota increases for adjustable quotas at 85%+ utilization +- Creates support cases or recommendations when automatic increases are not possible +- Sends notifications via integrated communication tools when quotas are flagged +- Deduplicates recommendations to avoid alert fatigue + +## Prerequisites + +- An AWS DevOps Agent space +- IAM permissions for Service Quotas: + - `servicequotas:ListServices` + - `servicequotas:ListServiceQuotas` + - `servicequotas:GetServiceQuota` + - `servicequotas:RequestServiceQuotaIncrease` + - `servicequotas:CreateSupportCase` +- IAM permissions for EC2 region discovery: `ec2:DescribeRegions` +- (Optional) AWS Support API access for creating support cases: `support:CreateCase` +- (Optional) The [service-quota-check skill](../../skills/service-quota-check/) uploaded to your Agent Space for enhanced domain knowledge + +## Creating the Agent + +1. In the DevOps Agent web app, go to the "Agents" menu (in the bottom-left pane) +2. Click "Create agent" (on the right side), then, in the menu that pops up, click "Form" (the left-most option) +3. In the "Name" field, enter "service-quotas-monitor" +4. Copy the content of the "SYSTEM_PROMPT.md" file from this directory, and paste it into the "System prompt" field in the custom agent creation form +5. In the "Skills" drop-down list, optionally select the "service-quota-check" skill if it is available, then click "Create agent" +6. Next, add the `use_aws` tool — in the new custom agent's window, click "Edit" +7. In the window that pops up, select "Chat". A new chat will start on the left side. Wait for DevOps Agent to finish thinking; it will then ask what you would like to change +8. Type "Add the use_aws tool to this custom agent".
Once the chat is finished, verify in the custom agent's page that `use_aws` is shown under "Tools" for this custom agent + +## Executing the Agent + +This agent is designed to run on a recurring schedule (e.g., daily or weekly) to catch quotas approaching their limits before they cause disruptions. You can also run it on-demand. + +### Scheduled Execution (Recommended) + +Follow the [Executing custom agents guide](https://docs.aws.amazon.com/devopsagent/latest/userguide/custom-agents-executing-custom-agents.html) to set up a recurring schedule. A daily run is recommended for production accounts with active scaling. + +### On-Demand Execution + +Run from the custom agent page or via chat. You can provide custom prompts: + +- "Check quotas only in us-east-1 and eu-west-1" +- "Check only EC2 and VPC quotas" +- "Report quotas above 70% utilization instead of 85%" + +## Output + +The agent produces: +- **Task journal entry** — a text summary of all findings and actions taken +- **Recommendations** — for any quotas requiring manual user intervention +- **Notifications** — sent via integrated communication tools (e.g., Slack) if quotas are flagged + +## Related + +- [service-quota-check skill](../../skills/service-quota-check/) — the domain knowledge skill for quota checking methodology +- [AWS DevOps Agent custom agents documentation](https://docs.aws.amazon.com/devopsagent/latest/userguide/working-with-devops-agent-custom-agents-index.html) diff --git a/custom-agents/service-quotas-monitor/SYSTEM_PROMPT.md b/custom-agents/service-quotas-monitor/SYSTEM_PROMPT.md new file mode 100644 index 0000000..10574b4 --- /dev/null +++ b/custom-agents/service-quotas-monitor/SYSTEM_PROMPT.md @@ -0,0 +1,66 @@ +You are a Service Quotas monitoring agent that proactively identifies AWS service quotas approaching their limits and takes action to prevent service disruptions. 
+ +## Goal + +Check all AWS service quotas across all active regions, identify any with utilization at 85% or above, and take appropriate action: request quota increases automatically when possible, or escalate when manual intervention is required. + +## Approach + +1. **Discover active regions** — Call `use_aws` with EC2 `describe_regions` to get all enabled regions for the account. + +2. **List all services with quotas** — For each region, call Service Quotas `list_services` to get all services that have quotas. + +3. **Check quota utilization** — For each service in each region: + - Call `list_service_quotas` to get all quotas for the service + - For each quota, compare the current utilization value against the quota value + - Flag any quota where utilization is **85% or higher** + +4. **Take action on flagged quotas** — For each quota at or above 85% utilization: + + a. **If the quota is adjustable** (`Adjustable: true`): + - Calculate the new requested value: current quota value × 1.5 (50% increase) + - Call `request_service_quota_increase` with the new value + - Record the outcome (success or failure) + + b. **If the quota is not adjustable OR the increase request fails**: + - Attempt to create an AWS Support case using `create_case` with: + - Service code: `service-quotas` + - Category: `general-guidance` + - Severity: `normal` + - Subject: "Service Quota Increase Request: [service] - [quota name] in [region]" + - Body: Include current quota value, current utilization, and requested increase + - If support case creation fails (insufficient permissions), create a **Recommendation** for the user to manually open a support case, including all relevant details + +5. 
**Send notification** — If any quotas were flagged (regardless of action taken): + - Check if a communication tool integration exists (Slack or similar) + - If available, send a summary notification including: + - Total quotas checked + - Number of quotas at/above 85% utilization + - For each flagged quota: service, quota name, region, utilization %, action taken, and outcome + - Any items requiring user attention (failed increases, manual support cases needed) + +## Constraints + +- Read-only discovery, write only for quota increase requests and support cases +- Do not request increases for quotas below 85% utilization +- Do not retry failed API calls more than once +- If a region is inaccessible, log the error and continue with other regions +- Respect API rate limits — add brief pauses between high-volume API calls if needed + +## Output + +Produce a text summary in the task journal containing: +- Timestamp and account ID +- Regions checked +- Total quotas evaluated +- List of quotas at/above 85% with utilization details and actions taken +- Any errors encountered +- Clear indication of items requiring user follow-up + +If any quota required action but could not be resolved automatically (non-adjustable quota, failed API call, insufficient permissions for support case), create a **Recommendation** with: +- Title: "Manual quota increase needed: [service] - [quota name]" +- Details: region, current value, current utilization, suggested new value, and reason automatic action failed + +Before creating a new Recommendation, check if one already exists for the same quota in the same region — update it instead of creating a duplicate. + +If a communication integration exists and any quotas were flagged, send a notification summarizing the run. Do not send a notification if all quotas are healthy. 
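The action rule in step 4 of the Approach can be sketched as a small shell helper. This is a minimal illustration, not part of the agent: the function name `plan_quota_action` and its output strings are hypothetical, and it covers only the initial decision (the fallback to a support case after a failed increase request is not modeled).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the agent): prints the action for one quota.
# Arguments: current usage, current quota value, adjustable ("true" or "false").
# Output: "none", "request_increase <new_value>", or "support_case".
plan_quota_action() {
  local usage="$1" quota="$2" adjustable="$3"
  awk -v u="$usage" -v q="$quota" -v adj="$adjustable" 'BEGIN {
    u = u + 0; q = q + 0                      # force numeric comparison
    # Below the 85% threshold (or no usable quota value): take no action.
    if (q <= 0 || u * 100 < q * 85) { print "none"; exit }
    # Adjustable quota: request current quota value x 1.5.
    if (adj == "true") { printf "request_increase %.10g\n", q * 1.5; exit }
    # Non-adjustable quota: escalate through a support case.
    print "support_case"
  }'
}
```

For example, `plan_quota_action 1740 1920 true` prints `request_increase 2880`, because 1740 of 1920 is above the 85% threshold and the quota is adjustable.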
diff --git a/llms.txt b/llms.txt index 98f842e..da6b1be 100644 --- a/llms.txt +++ b/llms.txt @@ -20,6 +20,7 @@ Skills can be used with these AWS DevOps Agent types: - [CRM Production Investigation Guidelines Skill](skills/crm-production-investigation-guidelines/SKILL.md): Sample skill demonstrating how to write production investigation guidelines for the Incident Triage agent type, showing application-specific architecture, incident isolation rules, and structured investigation procedures - [Skip Scheduled Maintenance Skill](skills/skip-scheduled-maintenance/SKILL.md): Sample skill demonstrating how to skip low-priority incidents during a scheduled maintenance window, filtering MEDIUM and LOW severity alarms while preserving escalation for HIGH and CRITICAL incidents - [Enrich with AWS Security Agent Skill](skills/enrich-with-aws-security-agent/SKILL.md): Queries AWS Security Agent CloudWatch logs to retrieve code-level security findings (file, line number, vulnerability type) during incident investigations with potential security root causes +- [Service Quota Check Skill](skills/service-quota-check/SKILL.md): Checks AWS service quota utilization during investigations and before provisioning resources, flags quotas at 85%+ utilization, and requests increases via the Service Quotas API or recommends support cases ## Key Concepts diff --git a/skills/service-quota-check/.skilleval.yaml b/skills/service-quota-check/.skilleval.yaml new file mode 100644 index 0000000..686a9c7 --- /dev/null +++ b/skills/service-quota-check/.skilleval.yaml @@ -0,0 +1,3 @@ +audit: + ignore: + - STR-016 # README alongside SKILL.md is intentional diff --git a/skills/service-quota-check/CHANGELOG.md b/skills/service-quota-check/CHANGELOG.md new file mode 100644 index 0000000..8b2b0e7 --- /dev/null +++ b/skills/service-quota-check/CHANGELOG.md @@ -0,0 +1,12 @@ +# Changelog + +## 1.0.0 + +- Initial version +- Quota value retrieval via get-service-quota and list-service-quotas +- Utilization 
calculation using CloudWatch UsageMetric or resource counting +- Risk assessment with 85% threshold for triggering increase recommendations +- Automated quota increase request via request-service-quota-increase API +- Support case recommendation for non-adjustable quotas +- Duplicate request detection via pending request check +- Common quota codes reference for frequently checked services diff --git a/skills/service-quota-check/README.md b/skills/service-quota-check/README.md new file mode 100644 index 0000000..96a6c9f --- /dev/null +++ b/skills/service-quota-check/README.md @@ -0,0 +1,91 @@ +# Service Quota Check + +This skill enables the AWS DevOps Agent to check AWS service quota utilization during incident investigations, capacity planning, and before provisioning new resources. + +## Purpose + +When operational issues arise from hitting AWS service limits, or when recommendations involve provisioning additional resources, this skill checks current quota values and utilization. It proactively identifies quotas at risk (85%+ utilization), requests increases via the Service Quotas API when possible, and guides users to open support cases when programmatic increases are not available. 
+ +## Key Capabilities + +- Retrieve current quota values and applied limits for any AWS service +- Calculate real-time utilization using CloudWatch metrics or resource counts +- Flag quotas at 85%+ utilization with risk-level assessment +- Submit quota increase requests via the Service Quotas API +- Check for existing pending increase requests to avoid duplicates +- Recommend support case creation for non-adjustable quotas +- Perform bulk quota assessments across all quotas for a service + +## Prerequisites + +- Agent permissions for Service Quotas APIs: `servicequotas:ListServices`, `servicequotas:ListServiceQuotas`, `servicequotas:GetServiceQuota`, `servicequotas:RequestServiceQuotaIncrease`, `servicequotas:ListRequestedServiceQuotaChangeHistory`, `servicequotas:CreateSupportCase` +- Agent permissions for CloudWatch: `cloudwatch:GetMetricData`, `cloudwatch:GetMetricStatistics` +- For resource counting fallback: read-only permissions on target services (e.g., `ec2:DescribeInstances`, `rds:DescribeDBInstances`) + +## Limitations + +- Service Quotas is a regional service; quotas must be checked in the correct region +- Not all quotas have a `UsageMetric` in CloudWatch; some require manual resource counting +- Some quotas are not adjustable via the API and require support cases +- Quota increase requests may take minutes (auto-approved) to days (manual review) +- Rate-based quotas (requests per second) require different monitoring than resource-count quotas + +## Agent Types + +This skill is used by the following agent types: + +- **Chat tasks** — conversational quota lookup, capacity planning, and proactive checks +- **Incident RCA** — quota exhaustion as root cause during active incidents + +## Uploading to AWS DevOps Agent + +To deploy this skill to your Agent Space, you can use any of three ways: + +**Option A: Import from GitHub (recommended)** + +If you have a [GitHub connection 
configured](https://docs.aws.amazon.com/devopsagent/latest/userguide/connecting-to-cicd-pipelines-connecting-github.html) in your Agent Space, you can import this skill directly from the repository. In the DevOps Agent web app, go to Settings -> Add Skill -> Import from repository, then point to the `skills/service-quota-check` directory. See [Importing a skill from a repository](https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-devops-agent-skills.html#creating-skills) for full instructions. + +> **Note:** You cannot connect the `aws-samples` GitHub organization directly because the GitHub connection setup requires admin rights on the organization. Instead, connect your personal GitHub account and select any repository from it during the connection setup. Once a GitHub connection is established, you can import skills from any public repository — including this one — even if it wasn't selected during the connection setup. + +**Option B: Upload as a zip file** + +1. Zip the `service-quota-check/` directory (only including allowed extensions): + + ```bash + cd skills + zip -r service-quota-check.zip service-quota-check/ -i '*.md' '*.txt' '*.json' '*.yaml' '*.yml' '*.xml' '*.csv' '*.tsv' '*.html' '*.htm' '*.png' '*.jpg' '*.jpeg' '*.gif' '*.svg' '*.webp' '*.pdf' -x '*/.claude/*' '*/scripts/*' '*/README.md' '*/.skilleval.yaml' '*/.skilleval.yml' '*/CHANGELOG.md' '*/evals/*' + ``` + +2. In the AWS DevOps Agent web app, navigate to the **Skills** page. +3. Click **Add skill** -> **Upload skill**. +4. Drag and drop the `service-quota-check.zip` file (max 6 MB). +5. Select the agent types: **Chat tasks** and **Incident RCA**. +6. Click **Upload**. + +**Option C: Upload via the Asset API** + +Use the AWS DevOps Agent Asset API to programmatically manage skills — useful for CI/CD pipelines or automation workflows. Assign the skill to the `CHAT` and `INCIDENT_RCA` agent types. 
See [Managing a skill end-to-end](https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-managing-assets.html#managing-a-skill-end-to-end) for the full API workflow. + +For more details, see [Uploading a skill](https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-devops-agent-skills.html#creating-skills) in the AWS DevOps Agent User Guide. + +## How to Use This Skill + +This skill is most suitable for chat and investigation. Below are sample prompts for each use-case. + +### Chat + +- "Check the EC2 vCPU quota utilization in us-east-1." +- "What's my current VPC quota and how many VPCs am I using?" +- "Show me all quotas for Lambda that are above 70% utilization." +- "Can I launch 50 more t3.large instances without hitting the quota?" +- "List all service quotas that are near their limits across EC2, VPC, and RDS." +- "Request an increase for my NAT Gateway quota in eu-west-1." + +### Investigation + +- "I'm getting LimitExceededException when creating a new VPC. Check quotas." +- "Lambda function invocations are being throttled. Check if we're hitting concurrent execution limits." +- "EC2 instance launch failed with InsufficientInstanceCapacity. Is this a quota issue?" +- "We need to scale our ECS cluster but tasks are failing to start. Check Fargate quotas." +- "The recommendation is to add more read replicas for RDS. Check if the quota allows it." +- "CloudFormation stack creation failed — investigate if we hit the stack count limit." diff --git a/skills/service-quota-check/SKILL.md b/skills/service-quota-check/SKILL.md new file mode 100644 index 0000000..5f6f48c --- /dev/null +++ b/skills/service-quota-check/SKILL.md @@ -0,0 +1,383 @@ +--- +name: service-quota-check +description: Use this skill during any incident investigation, capacity planning, or + operational troubleshooting when the issue may be caused by hitting AWS service limits. 
+ Activate when you observe throttling errors (ThrottlingException, TooManyRequestsException, + LimitExceededException), resource creation failures, capacity-related alarms, or when + a recommendation involves provisioning new AWS resources. This skill checks current + quota values and utilization via the Service Quotas API, flags quotas at 85% or higher + utilization, and can request quota increases via the API or recommend opening a support + case when programmatic increases are not available. +metadata: + author: yuriypr + version: "1.0.0" + aws-devops-agent-skills.agent-types: "Chat tasks, Incident RCA" + aws-devops-agent-skills.aws-services: "AWS Service Quotas, Amazon CloudWatch" + aws-devops-agent-skills.technical-domains: "Operations, Capacity Planning" +--- + +# Service Quota Check + +Use this skill to check AWS service quota utilization during investigations and before +provisioning new resources. It determines whether a quota is nearing its limit (85%+ +utilization) and takes action: requesting a quota increase via the Service Quotas API +when possible, or recommending a support case when programmatic increases are not supported. + +## When to Use This Skill + +- An error message indicates throttling or limit exceeded (e.g., `ThrottlingException`, + `TooManyRequestsException`, `LimitExceededException`, `ResourceLimitExceeded`). +- A resource creation or scaling operation fails with capacity errors. +- An investigation recommendation involves provisioning additional AWS resources + (e.g., adding EC2 instances, creating VPCs, adding NAT Gateways, launching RDS instances). +- Capacity planning or pre-launch readiness checks. +- Proactive monitoring of quota utilization across services. 
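As a rough illustration of the first trigger, the error codes listed above can be matched against an error message with a shell `case` statement. The function name `is_quota_related_error` is hypothetical and the match is a plain substring check, so treat it as a sketch rather than a complete classifier.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when the message contains one of the
# throttling or limit error codes named in this section, 1 otherwise.
is_quota_related_error() {
  local message="$1"
  case "$message" in
    *ThrottlingException*|*TooManyRequestsException*|*LimitExceededException*|*ResourceLimitExceeded*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

A caller could use it as `if is_quota_related_error "$err"; then ...; fi` to decide whether a quota check is worth running.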
+ +## Prerequisites + +- The agent must have permissions to call Service Quotas APIs: + - `servicequotas:ListServices` + - `servicequotas:ListServiceQuotas` + - `servicequotas:GetServiceQuota` + - `servicequotas:RequestServiceQuotaIncrease` + - `servicequotas:ListRequestedServiceQuotaChangeHistory` + - `servicequotas:CreateSupportCase` +- For utilization data via CloudWatch, the agent needs: + - `cloudwatch:GetMetricData` + - `cloudwatch:GetMetricStatistics` +- Service Quotas is a regional service. Quotas must be checked in the region + where resources are deployed. + +--- + +## Step 1: Identify the Service and Quota Context + +Determine which service and quota to check based on the investigation context: + +1. **From error messages** — extract the service name and specific limit mentioned. +2. **From recommendations** — if the recommendation is to provision resources, identify + the service (e.g., EC2, VPC, RDS, Lambda, ELB) and the resource type. +3. **From alarms** — if a CloudWatch alarm indicates capacity pressure, identify the + underlying service. 
+ +### Common service codes + +| AWS Service | Service Code | +|-------------|-------------| +| Amazon EC2 | `ec2` | +| Amazon VPC | `vpc` | +| Elastic Load Balancing | `elasticloadbalancing` | +| Amazon RDS | `rds` | +| AWS Lambda | `lambda` | +| Amazon ECS | `ecs` | +| Amazon EKS | `eks` | +| Amazon S3 | `s3` | +| Amazon DynamoDB | `dynamodb` | +| AWS Fargate | `fargate` | +| Amazon CloudWatch | `monitoring` | +| AWS CloudFormation | `cloudformation` | +| Amazon SQS | `sqs` | +| Amazon SNS | `sns` | +| Amazon ElastiCache | `elasticache` | +| Amazon OpenSearch Service | `es` | +| Auto Scaling | `autoscaling` | + +If you do not know the service code, use: + +```bash +aws service-quotas list-services \ + --query "Services[?contains(ServiceName, '')]" \ + --region +``` + +--- + +## Step 2: Retrieve Quota Value and Utilization + +### Get the applied quota value + +```bash +aws service-quotas get-service-quota \ + --service-code \ + --quota-code \ + --region +``` + +The response includes: +- `Value` — the current quota limit (applied value, or default if no increase was granted) +- `Adjustable` — whether the quota can be increased +- `UsageMetric` — CloudWatch metric to check current utilization (if available) + +### If you do not know the quota code + +List all quotas for the service: + +```bash +aws service-quotas list-service-quotas \ + --service-code \ + --region +``` + +Filter by quota name keyword: + +```bash +aws service-quotas list-service-quotas \ + --service-code \ + --region \ + --query "Quotas[?contains(QuotaName, '')]" +``` + +### Get current utilization via CloudWatch + +If the `UsageMetric` field is present in the quota response, query CloudWatch for actual usage: + +```bash +aws cloudwatch get-metric-statistics \ + --namespace "" \ + --metric-name "" \ + --dimensions \ + --start-time "<15-minutes-ago-ISO8601>" \ + --end-time "" \ + --period 300 \ + --statistics \ + --region +``` + +The `UsageMetric` object from the quota response provides all the 
parameters: +- `MetricNamespace` — typically `AWS/Usage` +- `MetricName` — typically `ResourceCount` +- `MetricDimensions` — service-specific dimensions (e.g., `Class`, `Resource`, `Service`, `Type`) +- `MetricStatisticRecommendation` — either `Maximum` or `Sum` + +### Alternative: count resources directly + +If no `UsageMetric` is available, count resources using the service's Describe/List APIs: + +| Service | Command to count resources | +|---------|---------------------------| +| EC2 instances | `aws ec2 describe-instances --query "Reservations[].Instances[] \| length(@)"` | +| VPCs | `aws ec2 describe-vpcs --query "Vpcs \| length(@)"` | +| NAT Gateways | `aws ec2 describe-nat-gateways --filter Name=state,Values=available --query "NatGateways \| length(@)"` | +| EIPs | `aws ec2 describe-addresses --query "Addresses \| length(@)"` | +| RDS instances | `aws rds describe-db-instances --query "DBInstances \| length(@)"` | +| Lambda functions | `aws lambda list-functions --query "Functions \| length(@)"` | +| ECS services | `aws ecs list-services --cluster --query "serviceArns \| length(@)"` | +| ALBs | `aws elbv2 describe-load-balancers --query "LoadBalancers \| length(@)"` | + +--- + +## Step 3: Calculate Utilization and Assess Risk + +### Calculate utilization percentage + +``` +utilization_pct = (current_usage / quota_value) × 100 +``` + +### Risk assessment thresholds + +| Utilization | Risk Level | Action | +|------------|------------|--------| +| < 70% | Low | No action needed. Report current state. | +| 70% – 84% | Medium | Flag as approaching limit. Monitor closely. | +| 85% – 94% | High | Recommend quota increase. Proceed to Step 4. | +| 95% – 99% | Critical | Urgent quota increase required. Proceed to Step 4. | +| = 100% | Exhausted | Quota is blocking operations. Immediate action required.
| + +### Present findings + +Always show the user a summary table: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Service Quota Check │ +├─────────────────────────┬───────────┬─────────┬─────────┬───────────┤ +│ Quota Name │ Limit │ Used │ % Used │ Status │ +├─────────────────────────┼───────────┼─────────┼─────────┼───────────┤ +│ Running On-Demand (std) │ 1920 vCPU │ 1740 │ 90.6% │ ⚠️ HIGH │ +│ VPCs per Region │ 5 │ 5 │ 100% │ 🚫 FULL │ +│ NAT Gateways per AZ │ 5 │ 3 │ 60% │ ✅ OK │ +└─────────────────────────┴───────────┴─────────┴─────────┴───────────┘ +``` + +--- + +## Step 4: Request Quota Increase + +When utilization is at 85% or higher, proceed based on whether the quota is adjustable. + +### Decision Tree + +``` +Is utilization >= 85%? +├── NO → Report findings, no action needed +└── YES → Check "Adjustable" field + ├── Adjustable = true → Proceed to quota increase request + │ ├── Ask user to confirm the increase + │ ├── User confirms → Submit request via API (Step 4a) + │ └── User declines → Report findings only + └── Adjustable = false → Recommend support case (Step 4b) +``` + +### Step 4a: Submit Quota Increase Request via API + +Before requesting, determine the desired new value. Use this formula: + +``` +desired_value = current_quota × 1.5 (50% increase over current limit) +``` + +If the quota is already exhausted (100% utilization), recommend: + +``` +desired_value = current_quota × 2.0 (double the current limit) +``` + +Present the recommendation to the user: + +> "The quota **[QuotaName]** is at **[X]%** utilization ([current_usage]/[quota_value]). +> I recommend increasing it to **[desired_value]**. Shall I submit the quota increase +> request?" 
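The utilization formula, the risk thresholds from Step 3, and the desired-value rule above can be sketched as three small shell helpers. This is a minimal sketch under stated assumptions: the function names `utilization_pct`, `risk_level`, and `desired_value` are hypothetical, a fractional desired value is rounded up to a whole number, and utilization values that fall between the listed bands (for example 84.5%) are placed in the lower band.

```bash
#!/usr/bin/env bash
# Sketch of the Step 3 and Step 4a arithmetic. All function names are hypothetical.

# utilization_pct <current_usage> <quota_value>
# Prints the utilization percentage with one decimal place.
# Returns non-zero when the quota value is zero or negative.
utilization_pct() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    if (limit <= 0) exit 1
    printf "%.1f\n", (used / limit) * 100
  }'
}

# risk_level <utilization_pct>
# Maps a percentage to the risk levels from the thresholds table.
risk_level() {
  awk -v pct="$1" 'BEGIN {
    if (pct >= 100)     print "Exhausted"
    else if (pct >= 95) print "Critical"
    else if (pct >= 85) print "High"
    else if (pct >= 70) print "Medium"
    else                print "Low"
  }'
}

# desired_value <current_quota> <utilization_pct>
# Doubles the quota when it is exhausted, otherwise adds 50%.
# Rounding up to a whole number is an assumption of this sketch.
desired_value() {
  awk -v quota="$1" -v pct="$2" 'BEGIN {
    factor = (pct >= 100) ? 2.0 : 1.5
    v = quota * factor
    r = int(v)
    if (r < v) r++
    print r
  }'
}
```

For a NAT gateway quota with 4 of 5 in use, `utilization_pct 4 5` gives `80.0`, `risk_level 80.0` gives `Medium`, and `desired_value 5 80.0` gives `8`.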
+ +If the user confirms, submit the request: + +```bash +aws service-quotas request-service-quota-increase \ + --service-code \ + --quota-code \ + --desired-value \ + --region +``` + +After submitting, check the response: +- `Status: PENDING` — request submitted successfully. Inform the user that AWS will + review the request (typically processed within minutes for auto-approved quotas, + or up to a few days for manual review). +- `Status: CASE_OPENED` — a support case was automatically created. + +Report the request ID and status: + +> "Quota increase request submitted successfully. +> - Request ID: [Id] +> - Status: [Status] +> - Desired value: [DesiredValue] +> +> You can check the status with: +> `aws service-quotas get-requested-service-quota-change --request-id `" + +### Step 4b: Recommend Support Case (non-adjustable quotas) + +If the quota cannot be increased via the API (`Adjustable: false`), inform the user: + +> "The quota **[QuotaName]** cannot be increased programmatically via the Service +> Quotas API. To request an increase, you need to open an AWS Support case. +> +> Would you like me to: +> 1. Open a support case via the Service Quotas API (requires an existing pending +> quota increase request) +> 2. Provide instructions to open a support case manually via the AWS Console" + +If there is an existing pending request, use: + +```bash +aws service-quotas create-support-case \ + --request-id \ + --region +``` + +Otherwise, provide manual instructions: + +> "To request this quota increase: +> 1. Go to the AWS Support Center: https://console.aws.amazon.com/support/ +> 2. Create a new case → Service limit increase +> 3. Select service: [ServiceName] +> 4. Select quota: [QuotaName] +> 5. Specify the new desired value: [desired_value] +> 6. 
Provide business justification for the increase" + +--- + +## Step 5: Check for Pending Requests + +Run this check before submitting a request in Step 4a, so that you do not file a duplicate increase request for the same quota: + +```bash +aws service-quotas list-requested-service-quota-change-history-by-quota \ + --service-code \ + --quota-code \ + --region \ + --query "RequestedQuotas[?Status=='PENDING']" +``` + +If a pending request exists: +- Report the existing request details (ID, desired value, date submitted) +- Do NOT submit a duplicate request +- Offer to create a support case to expedite the existing request if needed + +--- + +## Step 6: Post-Increase Verification + +After a quota increase is approved, verify the new limit: + +```bash +aws service-quotas get-service-quota \ + --service-code \ + --quota-code \ + --region +``` + +Confirm the `Value` field reflects the new limit. + +--- + +## Multi-Quota Check (Bulk Assessment) + +When a recommendation involves provisioning multiple resource types, or for proactive +capacity planning, check all relevant quotas for the service: + +```bash +aws service-quotas list-service-quotas \ + --service-code \ + --region +``` + +For each quota that has a `UsageMetric`, calculate utilization. Report any quotas +at 70%+ utilization as part of a comprehensive capacity report. + +--- + +## Common Quota Codes Reference + +See [references/common-quota-codes.md](references/common-quota-codes.md) for a table of +frequently checked quota codes by service. 
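For the bulk assessment described under Multi-Quota Check, the filtering step might look like the sketch below. It assumes the quota name, limit, and current usage have already been collected into tab-separated rows; that input format and the function name `flag_high_utilization` are hypothetical and not part of the Service Quotas API.

```bash
#!/usr/bin/env bash
# flag_high_utilization [threshold_pct]
# Reads tab-separated rows of "quota_name<TAB>limit<TAB>used" on stdin and
# prints only the rows at or above the threshold (default 70), with the
# utilization percentage appended as a fourth column.
# Rows with a limit of zero or less are skipped to avoid dividing by zero.
flag_high_utilization() {
  local threshold="${1:-70}"
  awk -F '\t' -v threshold="$threshold" '
    $2 > 0 {
      pct = ($3 / $2) * 100
      if (pct >= threshold) printf "%s\t%s\t%s\t%.1f%%\n", $1, $2, $3, pct
    }'
}
```

Piping the two rows `VPCs per Region<TAB>5<TAB>5` and `NAT Gateways per AZ<TAB>5<TAB>3` through `flag_high_utilization` keeps only the VPC row, since 3 of 5 is 60% and falls below the default threshold.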
+ +--- + +## Error Handling + +| Error | Cause | Resolution | +|-------|-------|-----------| +| `NoSuchResourceException` | Quota code does not exist for the service | Use `list-service-quotas` to find the correct code | +| `TooManyRequestsException` | Service Quotas API is throttling | Wait and retry with exponential backoff | +| `ResourceAlreadyExistsException` | A pending request already exists | Check existing requests (Step 5) | +| `QuotaExceededException` | The desired value exceeds the maximum allowed | Reduce the desired value or open a support case | +| `AccessDeniedException` | Missing IAM permissions | Check that the agent has `servicequotas:*` permissions | +| `DependencyAccessDeniedException` | Missing permissions on the target service | Verify IAM policies for the target service | + +--- + +## Tips + +- **Check the region**: Service quotas are regional. Always query in the region where + resources are deployed, unless the quota is global (check `GlobalQuota: true`). +- **Applied vs. default**: `get-service-quota` returns the applied value (which may differ + from the default if a previous increase was granted). Use `get-aws-default-service-quota` + to see the original default. +- **Auto-approved vs. manual**: Many quotas (especially EC2 vCPU limits) are auto-approved + within minutes. Others require manual AWS review. The API does not indicate which type + a quota is in advance — submit the request and monitor the status. +- **Resource-level quotas**: Some quotas (e.g., OpenSearch instances per domain) are + resource-level. Use `--context-id` with the resource ARN for these. +- **Rate-based quotas**: Some quotas measure requests per second (e.g., API call rates). + These require different monitoring approaches (CloudWatch metrics rather than resource counts). 
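The `TooManyRequestsException` row in the error table recommends retrying with exponential backoff. A minimal sketch of such a wrapper follows. `retry_with_backoff` is a hypothetical helper name, and as written it retries on any non-zero exit status rather than on throttling errors only, so treat it as a starting point rather than a finished implementation.

```bash
#!/usr/bin/env bash
# retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
# Runs the command until it succeeds or max_attempts is reached,
# doubling the delay after every failed attempt.
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Stop as soon as the wrapped command succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 1 aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A --region us-east-1` would try up to five times, waiting 1, 2, 4, and 8 seconds between attempts.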
diff --git a/skills/service-quota-check/evals/eval_queries.json b/skills/service-quota-check/evals/eval_queries.json new file mode 100644 index 0000000..6d038a2 --- /dev/null +++ b/skills/service-quota-check/evals/eval_queries.json @@ -0,0 +1,22 @@ +[ + { + "query": "How do I configure a CloudFront distribution with a custom SSL certificate?", + "should_trigger": false + }, + { + "query": "What is the best way to set up a multi-AZ RDS deployment?", + "should_trigger": false + }, + { + "query": "Explain the difference between S3 Standard and S3 Glacier storage classes.", + "should_trigger": false + }, + { + "query": "How do I write a Lambda function in Python that processes SQS messages?", + "should_trigger": false + }, + { + "query": "What IAM policies do I need for cross-account access to DynamoDB?", + "should_trigger": false + } +] diff --git a/skills/service-quota-check/evals/evals.json b/skills/service-quota-check/evals/evals.json new file mode 100644 index 0000000..3a770ab --- /dev/null +++ b/skills/service-quota-check/evals/evals.json @@ -0,0 +1,79 @@ +[ + { + "id": "check-ec2-vcpu-quota-utilization", + "prompt": "Check the EC2 On-Demand Standard vCPU quota utilization in us-east-1. 
We're planning to launch 100 more t3.large instances.", + "expected_output": "The agent retrieves the EC2 On-Demand Standard vCPU quota (L-1216C47A), checks current utilization via CloudWatch or resource count, calculates the percentage used, and assesses whether 100 additional t3.large instances (200 vCPUs) can be launched within the current limit.", + "assertions": [ + "The output references the EC2 service code or quota code L-1216C47A", + "The output shows the current quota value (limit)", + "The output calculates or reports current utilization as a percentage", + "The output assesses whether the additional 200 vCPUs would exceed the quota" + ] + }, + { + "id": "vpc-limit-exceeded-investigation", + "prompt": "I'm getting a VpcLimitExceeded error when trying to create a new VPC in us-west-2. Investigate the quota.", + "expected_output": "The agent checks the VPCs per Region quota (L-F678F1CE), finds it at or near 100% utilization, and recommends a quota increase since the limit is reached.", + "assertions": [ + "The output identifies the VPC quota or quota code L-F678F1CE", + "The output shows that the VPC quota is at or near its limit", + "The output recommends or offers to request a quota increase", + "The output does not produce an unrelated error" + ] + }, + { + "id": "lambda-throttling-quota-check", + "prompt": "Lambda functions are being throttled. 
Check if we're hitting the concurrent execution quota limit.", + "expected_output": "The agent checks the Lambda concurrent executions quota (L-B99A9384), retrieves current utilization, and reports whether throttling is caused by the quota limit.", + "assertions": [ + "The output identifies the Lambda concurrent executions quota", + "The output shows the current quota limit value", + "The output reports utilization or states whether the limit is being hit", + "The output provides recommendations if utilization is high" + ] + }, + { + "id": "request-quota-increase-flow", + "prompt": "My NAT Gateway quota in eu-west-1 is at 4 out of 5. I need to create 2 more. Please request an increase.", + "expected_output": "The agent confirms the NAT Gateway quota is at 80% (4/5), recognizes that adding 2 more would exceed the limit, and submits a quota increase request via the Service Quotas API after confirming with the user.", + "assertions": [ + "The output shows current utilization (4 out of 5 or 80%)", + "The output recognizes the need for a quota increase", + "The output uses request-service-quota-increase or equivalent API", + "The output reports the request status (PENDING or similar)" + ] + }, + { + "id": "non-adjustable-quota-handling", + "prompt": "I need to increase the S3 bucket quota but it says it's not adjustable. What should I do?", + "expected_output": "The agent checks the S3 buckets quota, finds that Adjustable is false or the increase cannot be done via API, and recommends opening a support case with specific instructions.", + "assertions": [ + "The output identifies that the quota is not adjustable via the API", + "The output recommends opening a support case", + "The output provides instructions on how to open the support case", + "The output does not attempt to call request-service-quota-increase for a non-adjustable quota" + ] + }, + { + "id": "bulk-quota-assessment", + "prompt": "We're planning a major scaling event. 
Check all relevant quotas for EC2, VPC, and ELB in us-east-1 and flag any that are above 70% utilization.", + "expected_output": "The agent checks quotas across EC2, VPC, and ELB services, calculates utilization for each, and presents a summary table flagging any quotas above 70%.", + "assertions": [ + "The output checks quotas for multiple services (EC2, VPC, ELB)", + "The output presents utilization data for multiple quotas", + "The output flags or highlights any quotas above 70% utilization", + "The output provides a structured summary or table" + ] + }, + { + "id": "pending-request-detection", + "prompt": "Request an increase for the RDS DB instances quota in us-east-1.", + "expected_output": "The agent first checks for existing pending requests for the RDS DB instances quota before submitting a new one. If a pending request exists, it reports it rather than creating a duplicate.", + "assertions": [ + "The output checks for existing pending requests before submitting", + "The output either reports an existing pending request or submits a new one", + "The output does not create a duplicate request if one is already pending", + "The output shows the request ID and status" + ] + } +] diff --git a/skills/service-quota-check/references/common-quota-codes.md b/skills/service-quota-check/references/common-quota-codes.md new file mode 100644 index 0000000..77c88b6 --- /dev/null +++ b/skills/service-quota-check/references/common-quota-codes.md @@ -0,0 +1,126 @@ +# Common Quota Codes Reference + +This reference lists frequently checked quota codes for common AWS services. Use these +codes with `get-service-quota` and `request-service-quota-increase`. 
+ +## Amazon EC2 + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances | L-1216C47A | 5 vCPU | vCPU | +| Running On-Demand G and VT instances | L-DB2E81BA | 0 vCPU | vCPU | +| Running On-Demand P instances | L-417A185B | 0 vCPU | vCPU | +| Running On-Demand Inf instances | L-B5D1601B | 0 vCPU | vCPU | +| Running Dedicated Standard (A, C, D, H, I, M, R, T, Z) Hosts | L-20F13EBD | 0 | None | +| EC2-VPC Elastic IPs | L-0263D0A3 | 5 | None | +| Public AMIs | L-0E3CBAB9 | 25 | None | + +## Amazon VPC + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| VPCs per Region | L-F678F1CE | 5 | None | +| Subnets per VPC | L-407747CB | 200 | None | +| Internet gateways per Region | L-A4707A72 | 5 | None | +| NAT gateways per Availability Zone | L-FE5A380F | 5 | None | +| Network interfaces per Region | L-DF5E4CA3 | 5000 | None | +| Security groups per Region | L-E79EC296 | 2500 | None | +| Routes per route table | L-93826ACB | 50 | None | +| Route tables per VPC | L-589F43AA | 200 | None | + +## Elastic Load Balancing + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Application Load Balancers per Region | L-53DA6B97 | 50 | None | +| Network Load Balancers per Region | L-69A177A2 | 50 | None | +| Target groups per Region | L-B6DF7632 | 3000 | None | +| Targets per Application Load Balancer | L-7E6692B2 | 1000 | None | +| Listeners per Application Load Balancer | L-B6DF7632 | 50 | None | + +## Amazon RDS + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| DB instances | L-7B6409FD | 40 | None | +| DB clusters | L-952B80B8 | 40 | None | +| Read replicas per primary | L-5480080B | 5 | None | +| Manual DB instance snapshots | L-272F1212 | 100 | None | +| Total storage for all DB instances (GiB) | L-7ADDB58A | 100000 | GiB | + +## AWS Lambda + +| Quota 
Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Concurrent executions | L-B99A9384 | 1000 | None | +| Function and layer storage | L-2ACBD22F | 75 GB | Gigabytes | +| Elastic network interfaces per VPC | L-9FEE3D26 | 250 | None | + +## Amazon ECS + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Clusters per account | L-21C621EB | 10000 | None | +| Services per cluster | L-9A2EAEDE | 5000 | None | +| Tasks per service | L-EE04B13E | 5000 | None | +| Container instances per cluster | L-21C621EB | 5000 | None | + +## AWS Fargate + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Fargate On-Demand resource count | L-790F3D0E | 500 | None | +| Fargate Spot resource count | L-36FBB829 | 500 | None | + +## Amazon DynamoDB + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Maximum number of tables | L-F98FE922 | 2500 | None | +| Account-level read throughput limit (on-demand, per region) | L-B5A90E5F | 40000 | RCU | +| Account-level write throughput limit (on-demand, per region) | L-4CF20C20 | 40000 | WCU | + +## AWS CloudFormation + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Stack count | L-0485CB21 | 2000 | None | +| Stack sets per administrator account | L-EC62D81A | 100 | None | +| Resources per stack | L-844E580A | 500 | None | + +## Amazon S3 + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Buckets | L-DC2B2D3D | 100 | None | + +## Amazon SNS + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Topics per account | L-61103206 | 100000 | None | +| Subscriptions per topic | L-A4340BCD | 12500000 | None | + +## Amazon SQS + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Queues per account | L-2BFD9882 | 1000000 | None | + 
+## Auto Scaling + +| Quota Name | Quota Code | Default | Unit | +|-----------|-----------|---------|------| +| Auto Scaling groups per region | L-CDE20ADC | 200 | None | +| Launch configurations per region | L-6B80B8FA | 200 | None | + +--- + +## Notes + +- Default values shown are the AWS defaults. Your account may have different applied + values if previous increases were granted. +- Quota codes are stable identifiers that do not change, but new quotas may be added + over time. +- Use `list-service-quotas` to get the most current list for any service. +- Some quotas listed here may be resource-level (not account-level) in newer API versions.