diff --git a/docs/superpowers/specs/2026-06-08-ephemeral-stack-cleanup-eni-aware-design.md b/docs/superpowers/specs/2026-06-08-ephemeral-stack-cleanup-eni-aware-design.md
new file mode 100644
index 00000000..0fc52591
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-08-ephemeral-stack-cleanup-eni-aware-design.md
@@ -0,0 +1,132 @@
+# Ephemeral Stack Cleanup — AgentCore ENI-Aware Redesign
+
+**Date:** 2026-06-08
+**Branch / PR:** `feat/cleanup-ephemeral-stacks` / [PR #109](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/pull/109)
+**Related issues:** [#72](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/issues/72) (scheduled ephemeral cleanup — *not yet `approved`*), [#111](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/issues/111) (document AgentCore ENI cleanup workflow), [#278](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/issues/278) (shellcheck/shell-test tooling gap)
+**Target file:** `scripts/cleanup-ephemeral-stacks.sh`
+
+## Problem
+
+The current `cleanup-ephemeral-stacks.sh` (PR #109) follows a **"pre-clean ENIs → fire `delete-stack`"** model. For stacks that contain a Bedrock AgentCore Runtime, this model is structurally broken.
+
+### Root cause (validated against a live stuck stack)
+
+A real ephemeral stack (`scoschre`, account `465528542731`, `us-east-1`) entered `DELETE_FAILED` with:
+
+```
+The following resource(s) failed to delete:
+  [AgentVpcRuntimeSG…, AgentVpcPrivateSubnet1…, AgentVpcPrivateSubnet2…]
+subnet 'subnet-…' has dependencies and cannot be deleted
+security group 'sg-…' has a dependent object
+```
+
+The dependency was **two `agentic_ai`-type ENIs** left in the stack's private subnets / runtime SG:
+
+| Fact | Evidence |
+|------|----------|
+| ENIs are Hyperplane-managed | Attachment IDs are `ela-attach-*`; `Attachment.InstanceOwnerId = amazon-aws` |
+| They **cannot** be force-detached | `detach-network-interface --force` → `OperationNotPermitted: You are not allowed to manage 'ela-attach' attachments` |
+| They **cannot** be force-deleted while attached | `delete-network-interface` → `InvalidParameterValue: ... currently in use` |
+| `--dry-run` is **not** a reliable probe | Both ops returned `DryRunOperation: Request would have succeeded` — dry-run validates IAM only, **not** managed-attachment/resource state |
+| They are reclaimed **asynchronously by AWS** | The stack's `Runtime` resource reached `DELETE_COMPLETE` at ~19:37; the ENIs persisted **>1 hour**, then AWS reclaimed them on its own (`InvalidNetworkInterfaceID.NotFound`), after which `delete-stack` succeeded with zero manual ENI action |
+
+**Conclusion:** the existing ENI force-detach/delete block (current lines ~146–197) is incapable of clearing these ENIs under any IAM principal. It only adds `sleep 15` delays and false confidence, then races AWS's async reclamation with an immediate `delete-stack` → `DELETE_FAILED`.
+
+This is an **architectural** problem (per systematic-debugging Phase 4.5), not a patchable bug: the fix is to stop trying to force ENI cleanup and instead **observe** reclamation read-only and let repeated passes retry.
+
+## Goals
+
+1. Reliably delete aged, unprotected ABCA ephemeral stacks **without** attempting impossible ENI manipulation.
+2. Be **idempotent and cron-safe**: a stack stuck on ENI-reclamation lag is *expected*, and a later pass finishes it automatically.
+3. Provide a precise operator signal when a stack is waiting on AWS reclamation (satisfies #111).
+4. Support an interactive `--wait` mode.
+
+## Non-goals
+
+- Force-detaching or force-deleting Hyperplane (`ela-attach`) ENIs — proven impossible.
+- Deleting the live, in-use AgentCore runtime's ENIs (they live in a *different* VPC; never in scope).
+- Synchronously guaranteeing a single run fully tears down every stack (ENI lag can exceed 1 hour).
+
+## Design
+
+### Run modes (hybrid)
+
+- **Default (cron-safe, fire-and-forget):** issue `delete-stack` for every eligible stack, do not block. Print a summary. This is the primary unattended path.
+- **`--wait` (interactive):** after issuing deletes, poll each stack to a terminal state and report `DELETE_COMPLETE` vs. `DELETE_FAILED` (with the blocking reason).
+- **`--dry-run`:** unchanged — report intended actions, mutate nothing.
+
+### Per-stack flow (after the existing age/safety filters, which are unchanged)
+
+The age/safety filters stay exactly as they are: prefix match → `describe-stacks` succeeds → `Description == "ABCA Development Stack"` → not termination-protected → not `*IN_PROGRESS*` → parseable creation time → older than `MAX_AGE_HOURS`.
+
+After a stack passes those filters, branch on **stack status**:
+
+1. **Fresh aged stack** (`CREATE_COMPLETE`, `UPDATE_COMPLETE`, `ROLLBACK_COMPLETE`, `UPDATE_ROLLBACK_COMPLETE`):
+   - The Runtime resource still exists and is deleted *during* `delete-stack`. There are no orphan ENIs to check yet.
+   - → Issue `delete-stack` **unconditionally**.
+
+2. **`DELETE_FAILED` stack** (retry path — stack stuck only on ENI-reclamation lag):
+   - Run a **read-only** check for blocking `agentic_ai` ENIs in the stack's subnets and runtime SG (gather subnet/SG physical IDs via `list-stack-resources`, then `describe-network-interfaces --filters subnet-id=…` / `group-id=…`).
+   - **If blocking ENIs are present:** SKIP this pass. Log `": pending reclamation (N AgentCore ENIs not yet released by AWS)"`. Count as `Pending`.
+   - **If none remain:** re-issue `delete-stack` (it will now succeed).
+
+`DELETE_FAILED` is included in the `list-stacks --stack-status-filter`, so stuck stacks are naturally re-evaluated on every pass. The `*IN_PROGRESS*` skip is deliberately narrow: it catches `DELETE_IN_PROGRESS` (don't disturb a stack mid-teardown) but **not** `DELETE_FAILED` (the terminal stuck state we *do* retry). This distinction is load-bearing and is pinned by a regression test.
+
+### What is removed vs. added
+
+- **Removed:** the ENI force-detach / `sleep 15` / force-delete block. It is proven impossible for `ela-attach` ENIs and these stacks contain no other ENI type.
+- **Added (strictly read-only):** a small diagnostic that *observes* blocking `agentic_ai` ENIs in a stack's subnets/SG. It is used **only** as the `DELETE_FAILED` retry gate and to produce the operator signal. It never mutates ENIs.
+
+### Exit semantics
+
+- **Exit 0** if all attempted `delete-stack` calls were issued without API error. A stack left `DELETE_FAILED` / `Pending` awaiting AWS reclamation is **expected**, not a failure — the next pass handles it. This keeps cron quiet.
+- **Exit 1** only on real errors: credential/auth failure (`sts:GetCallerIdentity`), or unexpected CloudFormation/EC2 API errors.
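The status branch described in the per-stack flow could be captured in a pure function along the lines of the `classify_stack` helper this spec proposes for testing. What follows is a minimal sketch under stated assumptions, not the script's actual implementation: the action names (`delete`, `retry`, `pending`, `skip`) and the argument order (status first, blocking-ENI count second) are hypothetical and chosen only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of classify_stack: maps (stack status, blocking ENI count)
# to an action name. The action names are illustrative assumptions.
classify_stack() {
  local status="$1"
  local eni_count="${2:-0}"   # assumed to be a non-negative integer

  case "$status" in
    # Any in-flight transition (including DELETE_IN_PROGRESS) is left alone.
    *IN_PROGRESS*)
      echo "skip" ;;
    # Retry path: gate on the read-only ENI observation.
    DELETE_FAILED)
      if (( eni_count > 0 )); then
        echo "pending"
      else
        echo "retry"
      fi ;;
    # Fresh aged stack: the Runtime still exists, so there are no orphan ENIs to check.
    CREATE_COMPLETE|UPDATE_COMPLETE|ROLLBACK_COMPLETE|UPDATE_ROLLBACK_COMPLETE)
      echo "delete" ;;
    # Any other status is outside the flow described above; do nothing.
    *)
      echo "skip" ;;
  esac
}
```

With this shape, `classify_stack DELETE_FAILED 2` prints `pending` and `classify_stack DELETE_FAILED 0` prints `retry`. Because the function makes no AWS calls, the `DELETE_FAILED` versus `*IN_PROGRESS*` distinction can be asserted directly in a regression test.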
+
+### Summary output
+
+```
+=== Summary ===
+  Deleted:
+  Skipped:
+  Pending:
+  Failed:
+```
+
+## Operator guidance (docs — folds in #111)
+
+Add to `docs/guides/DEPLOYMENT_GUIDE.md` an "AgentCore ENI reclamation" subsection:
+
+- **Why** a stack with an AgentCore Runtime can sit in `DELETE_FAILED`: Hyperplane `agentic_ai` ENIs are released asynchronously by AWS after the Runtime backend tears down (observed lag: >1 hour).
+- **These ENIs cannot be force-detached or force-deleted** — do not try; `ela-attach` attachments reject manual management.
+- **Recovery:** wait for reclamation, then re-run the cleanup script (or `aws cloudformation delete-stack`). Check reclamation with:
+  ```
+  aws ec2 describe-network-interfaces \
+    --filters Name=subnet-id,Values= Name=interface-type,Values=agentic_ai \
+    --query 'NetworkInterfaces[].NetworkInterfaceId'
+  ```
+  An empty result means the stack will now delete cleanly.
+- **Escape hatch** for an indefinitely stuck stack: `aws cloudformation delete-stack --stack-name --retain-resources ` to drop the stack shell, then clean the VPC once ENIs clear.
+
+Regenerate Starlight mirrors (`cd docs && node scripts/sync-starlight.mjs`) and commit them alongside.
+
+## Testing
+
+The repo currently has **no shell-test harness** (no `bats`, no `*.bats`), and shellcheck is not yet wired in (tracked by #278). To pin the load-bearing behavior without over-investing in net-new tooling:
+
+- **Minimum:** a small `bats`-style or plain-`bash` assertion test that the status-classification logic selects a `DELETE_FAILED` ABCA stack (with no blocking ENIs) for retry, and skips it (counts `Pending`) when blocking ENIs are present. Refactor the classification into a pure, testable function (`classify_stack` taking status + ENI-count → action) so it can be unit-tested without AWS calls.
+- **Lint:** run `shellcheck` on the script (manually for this PR; #278 wires it into CI).
+- **Manual integration evidence (already captured):** the `scoschre` stack was unstuck by exactly this gated-retry sequence — blocking-ENI query returned `0`, `delete-stack` then succeeded, VPC removed, live `mainRuntime` untouched.
+- **Acceptance validation (planned):** deploy a fresh ephemeral stack, let its first `delete-stack` reach `DELETE_FAILED` on AgentCore ENIs, then confirm a subsequent cleanup-script pass reports `Pending` while ENIs linger and completes the deletion on the first pass after AWS reclaims them — end-to-end exercise of both status branches.
+
+## Risks & mitigations
+
+| Risk | Mitigation |
+|------|------------|
+| A future "simplification" collapses the `*IN_PROGRESS*` skip into `DELETE_*`, silently killing the retry path | Regression test asserts `DELETE_FAILED` → retry selected |
+| `list-stack-resources` on a partially-deleted stack returns stale subnet/SG IDs | Gate query tolerates `NotFound` per resource; treat unresolvable subnet/SG as "no blocking ENI" and allow retry |
+| Misclassifying a *live* runtime's ENIs as orphans | Gate queries **only** the target stack's own subnet/SG physical IDs; live runtime is in a separate VPC (verified) |
+| Cron noise | Exit 0 on `Pending`; only real API/auth errors are non-zero |
+
+## Governance
+
+Implements #72 (not yet `approved`). PR #109 is already open against this branch, so work continues under the existing artifact; the missing `approved` label on #72 should be flagged to an admin but does not block refinement of the existing PR.
diff --git a/scripts/cleanup-ephemeral-stacks.sh b/scripts/cleanup-ephemeral-stacks.sh
new file mode 100755
index 00000000..5a26a4d7
--- /dev/null
+++ b/scripts/cleanup-ephemeral-stacks.sh
@@ -0,0 +1,225 @@
+#!/usr/bin/env bash
+# cleanup-ephemeral-stacks.sh — Delete ephemeral CloudFormation stacks older than MAX_AGE_HOURS.
+#
+# Targets stacks deployed by this CDK app that do NOT have termination protection.
+# Handles stuck ENI cleanup (AgentCore/Lambda Hyperplane ENIs) before deletion.
+#
+# Usage:
+#   AWS_PROFILE=abca ./scripts/cleanup-ephemeral-stacks.sh [--dry-run] [--max-age-hours N] [--prefix PREFIX]
+#
+# Options:
+#   --dry-run            Show what would be deleted without acting
+#   --max-age-hours N    Delete stacks older than N hours (default: 48, or $MAX_AGE_HOURS if set)
+#   --prefix PREFIX      Only target stacks matching this prefix (default: all ABCA stacks)
+#
+# Safety:
+#   - Never touches stacks with termination protection enabled
+#   - Only targets stacks with description matching "ABCA Development Stack"
+#   - Skips stacks in any *_IN_PROGRESS state (e.g. UPDATE_IN_PROGRESS, DELETE_IN_PROGRESS)
+
+set -euo pipefail
+
+MAX_AGE_HOURS=${MAX_AGE_HOURS:-48}
+DRY_RUN=false
+PREFIX=""
+REGION="${AWS_DEFAULT_REGION:-us-east-1}"
+
+while [[ $# -gt 0 ]]; do
+  case $1 in
+    --dry-run) DRY_RUN=true; shift ;;
+    --max-age-hours) MAX_AGE_HOURS="$2"; shift 2 ;;
+    --prefix) PREFIX="$2"; shift 2 ;;
+    *) echo "Unknown option: $1" >&2; exit 1 ;;
+  esac
+done
+
+# Validate numeric input — guards the age arithmetic against injection/garbage.
+if ! [[ "$MAX_AGE_HOURS" =~ ^[0-9]+$ ]]; then
+  echo "Error: --max-age-hours must be a non-negative integer (got: '$MAX_AGE_HOURS')" >&2
+  exit 1
+fi
+
+MAX_AGE_SECONDS=$((MAX_AGE_HOURS * 3600))
+NOW=$(date +%s)
+
+# Surface the blast radius before touching anything. Confirms the operator is
+# pointed at the account/identity they think they are (defense in depth).
+CALLER_IDENTITY=$(aws sts get-caller-identity \
+  --region "$REGION" \
+  --query '[Account,Arn]' --output text 2>/dev/null) || {
+  echo "Error: unable to resolve AWS identity (sts:GetCallerIdentity failed). Check credentials." >&2
+  exit 1
+}
+ACCOUNT_ID=$(echo "$CALLER_IDENTITY" | cut -f1)
+CALLER_ARN=$(echo "$CALLER_IDENTITY" | cut -f2)
+
+echo "=== Ephemeral Stack Cleanup ==="
+echo " Account: $ACCOUNT_ID"
+echo " Identity: $CALLER_ARN"
+echo " Region: $REGION"
+echo " Max age: ${MAX_AGE_HOURS}h"
+echo " Dry run: $DRY_RUN"
+echo " Prefix filter: ${PREFIX:-}"
+echo ""
+
+# List all stacks (excluding deleted ones)
+STACKS=$(aws cloudformation list-stacks \
+  --region "$REGION" \
+  --stack-status-filter \
+    CREATE_COMPLETE UPDATE_COMPLETE ROLLBACK_COMPLETE \
+    UPDATE_ROLLBACK_COMPLETE DELETE_FAILED \
+  --query 'StackSummaries[*].[StackName,CreationTime]' \
+  --output text 2>/dev/null)
+
+if [[ -z "$STACKS" ]]; then
+  echo "No stacks found."
+  exit 0
+fi
+
+DELETED=0
+SKIPPED=0
+FAILED=0
+
+while IFS=$'\t' read -r STACK_NAME CREATION_TIME; do
+  # Apply prefix filter
+  if [[ -n "$PREFIX" && "$STACK_NAME" != "$PREFIX"* ]]; then
+    continue
+  fi
+
+  # Get stack details (description, termination protection, status)
+  STACK_INFO=$(aws cloudformation describe-stacks \
+    --region "$REGION" \
+    --stack-name "$STACK_NAME" \
+    --query 'Stacks[0].[Description,EnableTerminationProtection,StackStatus]' \
+    --output text 2>/dev/null) || continue
+
+  DESCRIPTION=$(echo "$STACK_INFO" | cut -f1)
+  TERMINATION_PROTECTED=$(echo "$STACK_INFO" | cut -f2)
+  STATUS=$(echo "$STACK_INFO" | cut -f3)
+
+  # Only target stacks from this CDK app
+  if [[ "$DESCRIPTION" != "ABCA Development Stack" ]]; then
+    continue
+  fi
+
+  # Never touch termination-protected stacks
+  if [[ "$TERMINATION_PROTECTED" == "True" ]]; then
+    echo " SKIP (protected): $STACK_NAME"
+    ((SKIPPED++)) || true
+    continue
+  fi
+
+  # Skip stacks in active transitions
+  if [[ "$STATUS" == *"IN_PROGRESS"* ]]; then
+    echo " SKIP (in progress): $STACK_NAME ($STATUS)"
+    ((SKIPPED++)) || true
+    continue
+  fi
+
+  # Check age. Parse the CreationTime to epoch seconds (GNU date, then BSD date).
+  # FAIL CLOSED: if both parsers fail we cannot trust the age, so SKIP rather than
+  # risk deleting a stack we can't prove is old enough.
+  CREATED_EPOCH=$(date -d "$CREATION_TIME" +%s 2>/dev/null || date -j -f "%Y-%m-%dT%H:%M:%S" "${CREATION_TIME%%.*}" +%s 2>/dev/null || echo "")
+  if ! [[ "$CREATED_EPOCH" =~ ^[0-9]+$ ]]; then
+    echo " SKIP (unparseable creation time '$CREATION_TIME'): $STACK_NAME"
+    ((SKIPPED++)) || true
+    continue
+  fi
+  AGE_SECONDS=$((NOW - CREATED_EPOCH))
+
+  if [[ $AGE_SECONDS -lt $MAX_AGE_SECONDS ]]; then
+    AGE_HOURS=$((AGE_SECONDS / 3600))
+    echo " SKIP (too young: ${AGE_HOURS}h): $STACK_NAME"
+    ((SKIPPED++)) || true
+    continue
+  fi
+
+  AGE_HOURS=$((AGE_SECONDS / 3600))
+  echo " TARGET: $STACK_NAME (age: ${AGE_HOURS}h, status: $STATUS)"
+
+  if [[ "$DRY_RUN" == "true" ]]; then
+    echo " [dry-run] Would delete $STACK_NAME"
+    ((DELETED++)) || true
+    continue
+  fi
+
+  # --- ENI cleanup (handles stuck VPC deletion) ---
+  # Find security groups owned by this stack
+  SG_IDS=$(aws cloudformation list-stack-resources \
+    --region "$REGION" \
+    --stack-name "$STACK_NAME" \
+    --query "StackResourceSummaries[?ResourceType=='AWS::EC2::SecurityGroup'].PhysicalResourceId" \
+    --output text 2>/dev/null) || true
+
+  if [[ -n "$SG_IDS" && "$SG_IDS" != "None" ]]; then
+    for SG_ID in $SG_IDS; do
+      # Find ENIs attached to this security group.
+      # shellcheck disable=SC2016 # backticks are JMESPath literal syntax for --query, must NOT expand
+      ENIS=$(aws ec2 describe-network-interfaces \
+        --region "$REGION" \
+        --filters "Name=group-id,Values=$SG_ID" \
+        --query 'NetworkInterfaces[?Status==`in-use`].[NetworkInterfaceId,Attachment.AttachmentId]' \
+        --output text 2>/dev/null) || true
+
+      if [[ -n "$ENIS" && "$ENIS" != "None" ]]; then
+        echo " Cleaning up ENIs in security group $SG_ID..."
+        while IFS=$'\t' read -r ENI_ID ATTACHMENT_ID; do
+          if [[ -n "$ENI_ID" && "$ENI_ID" != "None" ]]; then
+            echo " Force-detaching $ENI_ID ($ATTACHMENT_ID)"
+            aws ec2 detach-network-interface \
+              --region "$REGION" \
+              --attachment-id "$ATTACHMENT_ID" \
+              --force 2>/dev/null || true
+          fi
+        done <<< "$ENIS"
+
+        # Wait briefly for detachment
+        echo " Waiting 15s for ENI detachment..."
+        sleep 15
+
+        # Delete the ENIs
+        AVAILABLE_ENIS=$(aws ec2 describe-network-interfaces \
+          --region "$REGION" \
+          --filters "Name=group-id,Values=$SG_ID" "Name=status,Values=available" \
+          --query 'NetworkInterfaces[*].NetworkInterfaceId' \
+          --output text 2>/dev/null) || true
+
+        for ENI_ID in $AVAILABLE_ENIS; do
+          if [[ -n "$ENI_ID" && "$ENI_ID" != "None" ]]; then
+            echo " Deleting $ENI_ID"
+            aws ec2 delete-network-interface \
+              --region "$REGION" \
+              --network-interface-id "$ENI_ID" 2>/dev/null || true
+          fi
+        done
+      fi
+    done
+  fi
+
+  # --- Delete the stack ---
+  # Only count a deletion we actually initiated. Tolerate a single failure
+  # (e.g. AccessDenied, transient throttling) without aborting the whole run —
+  # set -e would otherwise kill the loop mid-pass and orphan later stacks.
+  echo " Deleting stack $STACK_NAME..."
+  if aws cloudformation delete-stack \
+    --region "$REGION" \
+    --stack-name "$STACK_NAME" 2>/dev/null; then
+    ((DELETED++)) || true
+  else
+    echo " ERROR: delete-stack failed for $STACK_NAME (continuing)" >&2
+    ((FAILED++)) || true
+  fi
+
+done <<< "$STACKS"
+
+echo ""
+echo "=== Summary ==="
+echo " Deleted: $DELETED"
+echo " Skipped: $SKIPPED"
+echo " Failed: $FAILED"
+
+if [[ "$DELETED" -gt 0 && "$DRY_RUN" == "false" ]]; then
+  echo ""
+  echo "Note: Stack deletion is asynchronous. Monitor with:"
+  echo " aws cloudformation list-stacks --stack-status-filter DELETE_IN_PROGRESS --region $REGION"
+fi