Skip to content

fix(cloud-security): run AWS integration checks on our server (scheduled false-fails)#3133

Merged
tofikwest merged 5 commits into
mainfrom
tofik/aws-checks-on-server
Jun 12, 2026
Merged

fix(cloud-security): run AWS integration checks on our server (scheduled false-fails)#3133
tofikwest merged 5 commits into
mainfrom
tofik/aws-checks-on-server

Conversation

@tofikwest

@tofikwest tofikwest commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Problem

AWS integration checks (S3 Block Public Access, encryption, EC2, RDS, KMS, CloudTrail, IAM) false-fail on the scheduled / auto-run paths but pass on the manual "Run" — repeatedly (CS reports over the last ~2 weeks). Customer-visible error:

AccessDenied: ... is not authorized to perform: s3:GetBucketPublicAccessBlock ...
because no VPC endpoint policy allows the s3:GetBucketPublicAccessBlock action

Root cause

It is not a code bug — it's the same check code with the same credentials. The only difference is where it executes, and therefore which network the AWS API call exits through:

Executes in S3 call exits via Result
Manual "Run" our server (ECS), in comp-production-vpc our S3 endpoint (allows s3:Get*)
Scheduled / auto-run Trigger.dev's runtime (their VPC) Trigger.dev's S3 endpoint (blocks cross-account)

AWS evaluates allow/deny based on the VPC the request exits from — which, for the scheduled path, lives in Trigger.dev's AWS account and is not ours to change. (Confirmed Trigger.dev is Cloud, not self-hosted: our infra only whitelists Trigger.dev's static IPs; there's no Trigger VPC in our account. Their PrivateLink/static-IP features don't cover public-S3 egress.)

This is AWS-specific: GCP/Azure/dynamic/legacy checks make plain HTTPS calls to public APIs with no VPC endpoint involved, so they work identically in either environment.

Fix (AWS only)

When providerSlug === 'aws', the scheduled task (run-task-integration-checks) and the auto-run task (run-connection-checks) delegate execution to our server via a new internal, service-token-only endpoint, then persist the returned result with the existing shared logic. Trigger.dev still owns scheduling, retries, and the long-run tolerance — only where the AWS API calls run changed.

  • New ConnectionCheckRunnerService.runChecks() — runs a connection's checks on our server and returns the raw result (no persistence).
  • New POST /v1/integrations/internal/run-connection-checks/:connectionId (HybridAuthGuard + ServiceTokenOnlyGuard + PermissionGuard) — excluded from public docs/MCP.
  • New run-checks-on-server.ts — the thin fetch helper the AWS branch calls.
  • The two Trigger tasks gain a single providerSlug === 'aws' ? runChecksOnServer(...) : runAllChecks(...) branch. The else branch is the original code, unchanged (the diff is just re-indentation).

Explicitly NOT touched

  • GCP, Azure, dynamic, and legacy integrations — byte-for-byte unchanged; keep executing in Trigger.dev.
  • The manual "Run" path (already on our server), the orchestrator, and the shared persistence / task-status / email logic.
  • Remediation ("Find & Apply") is a separate flow that already runs in the API (ECS) — not affected.

Verification

  • ✅ 16 tests pass (new service/controller/helper specs + existing trigger specs).
  • ✅ Changed files are type-clean (remaining repo typecheck errors are pre-existing on main, in untouched files).
  • openapi.json unchanged — the internal endpoint is excluded from public docs.

Rollout

⚠️ This changes where AWS scheduled checks execute for all orgs. The AWS-on-our-server path is the same one the manual "Run" and the Cloud Tests scan already use in prod, but it isn't end-to-end tested locally. Deploy to the staging Trigger.dev env first, trigger integration-checks-schedule manually, confirm AWS checks pass and the API handles the load, then prod.

Optional tidy-up (left out to keep the diff minimal): for AWS the task still makes two now-unused pre-flight ECS calls (ensure-valid-credentials + resolve-session) — harmless, removable in a follow-up.

🤖 Generated with Claude Code


Summary by cubic

Run AWS integration checks on our server for scheduled and auto-run paths to stop false failures caused by Trigger.dev VPC endpoint policies. This aligns scheduled runs with manual runs, removes AccessDenied errors, and avoids 429s during fan-out.

  • Bug Fixes

    • Root cause: scheduled/auto runs executed in Trigger.dev’s VPC, whose S3 endpoint blocked cross-account reads (e.g., s3:GetBucketPublicAccessBlock).
    • Fix (AWS only): Trigger tasks call a new internal service-token endpoint on our API to execute checks in our VPC; results are persisted by existing logic.
    • New: ConnectionCheckRunnerService, POST /v1/integrations/internal/run-connection-checks/:connectionId (internal-only, throttle-exempt), and a thin run-checks-on-server helper; excluded /v1/integrations/internal from public OpenAPI.
    • Reliability: skipped Trigger-side credential/session preflight for AWS, added a 10‑minute AbortController timeout on the internal fetch, validated credentials by auth type on the server-run path (oauth2/api_key/basic/custom), and handled per-check transport errors individually on AWS so one blip doesn’t abort sibling checks; non‑AWS errors still throw as before.
    • Unchanged: GCP, Azure, dynamic/legacy providers, and manual “Run”.
  • Migration

    • Deploy to staging Trigger.dev first; manually trigger the schedule and verify AWS checks pass, then roll out to production.
    • Ensure SERVICE_TOKEN_TRIGGER is set for the Trigger runtime.

Written for commit 91ca27b. Summary will update on new commits.

Review in cubic

…led false-fails)

AWS integration checks (S3, EC2, RDS, KMS, CloudTrail, IAM) false-failed on the
scheduled and auto-run paths while the manual "Run" passed. Same code, same
credentials — the only difference is where the check executes:

- Manual runs on our server (ECS), inside our VPC, whose S3 endpoint allows our
  cross-account audit reads.
- Scheduled/auto-run execute in the Trigger.dev runtime, whose VPC S3 endpoint
  policy blocks cross-account reads ("no VPC endpoint policy allows ...").

AWS allows/denies based on the VPC the request exits from, which lives in
Trigger.dev's account and isn't ours to change.

Fix (AWS only): when providerSlug === 'aws', the scheduled and auto-run Trigger
tasks delegate execution to our server via a new service-token endpoint
(POST /v1/integrations/internal/run-connection-checks/:connectionId) and persist
the returned result with the existing shared logic. Trigger.dev still handles
scheduling and retries; only where the AWS API calls run changed.

GCP, Azure, dynamic and legacy integrations are untouched — they make plain
HTTPS calls (no VPC endpoint) and keep executing in Trigger.dev exactly as
before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 12, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
comp-framework-editor Ready Ready Preview, Comment Jun 12, 2026 8:51pm
2 Skipped Deployments
Project Deployment Actions Updated (UTC)
app Skipped Skipped Jun 12, 2026 8:51pm
portal Skipped Skipped Jun 12, 2026 8:51pm

Request Review

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 10 files

Confidence score: 3/5

  • The highest risk is in apps/api/src/trigger/integration-platform/run-task-integration-checks.ts and apps/api/src/trigger/integration-platform/run-connection-checks.ts: the AWS path still performs Trigger-side credential/session preflight calls before server-side checks, so transient preflight failures can incorrectly fail scheduled AWS runs. This weakens the intended server-run-only flow and can create user-visible false negatives—skip those preflight calls when providerSlug === 'aws' before merging.
  • In apps/api/src/trigger/integration-platform/run-checks-on-server.ts, the internal fetch has no timeout, so a network stall can hang AWS scheduled checks instead of failing fast and retrying. Add an abort timeout to bound request duration and let retry/error handling recover predictably.

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread apps/api/src/trigger/integration-platform/run-task-integration-checks.ts Outdated
Comment thread apps/api/src/trigger/integration-platform/run-checks-on-server.ts Outdated
Comment thread apps/api/src/trigger/integration-platform/run-connection-checks.ts
Cubic (and an independent adversarial review) flagged three issues on the AWS
server-run path:

1. The AWS branch still ran the Trigger-side credential/session preflight
   (requestValidCredentials + injectAwsResolvedSession) before delegating to the
   server. Those calls are unused on the server path (it decrypts creds and
   assumes the role itself), so a transient preflight failure would falsely fail
   an AWS run. Skip the preflight entirely for AWS and drop the now-dead
   injectAwsResolvedSession call (both Trigger tasks).

2. The internal fetch had no timeout, so a hung connection could block the task
   until maxDuration. Add a generous AbortController timeout (10m — well below
   the 15m maxDuration but high enough never to abort a legitimately long run)
   so a stalled socket surfaces as an error and the task retries.

3. (Review) The scheduled per-check loop was inside one outer try/catch; because
   runChecksOnServer throws on transport failure and several AWS checks share a
   task, a blip on one check aborted its siblings and skipped lastSyncAt/status.
   Catch the throw per-check, record that check as failed, and continue —
   matching runAllChecks' per-check resilience (hasExecutionErrors keeps
   integrationLastRunAt unwritten so the next tick retries).

Non-AWS (GCP/Azure/dynamic/legacy) paths remain unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel vercel Bot temporarily deployed to Preview – portal June 12, 2026 20:44 Inactive
@vercel vercel Bot temporarily deployed to Preview – app June 12, 2026 20:44 Inactive
@tofikwest

Copy link
Copy Markdown
Contributor Author

@cubic-dev-ai review it

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

@cubic-dev-ai review it

@tofikwest I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 10 files

Confidence score: 3/5

  • In apps/api/src/trigger/integration-platform/run-task-integration-checks.ts, the shared catch path appears to treat non-AWS exceptions like AWS transport blips, which can mask real runAllChecks failures and let unexpected integration-check errors be silently downgraded; before merging, re-throw non-AWS errors so only true AWS transient failures follow the retry/degraded path.

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

The per-check catch added for AWS server-run resilience wrapped both branches,
so a (rare) runAllChecks throw on a non-AWS provider would be downgraded to a
per-check failure instead of propagating. Re-throw when providerSlug !== 'aws'
so non-AWS behavior is unchanged from before this PR; only AWS transport blips
get the degrade-and-continue treatment. (cubic review)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tofikwest

Copy link
Copy Markdown
Contributor Author

@cubic-dev-ai ultrareview it

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

@cubic-dev-ai ultrareview it

@tofikwest Starting ultrareview. I'll post findings when complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultrareview completed in 28m 1s

2 issues found across 10 files

Confidence score: 3/5

  • In apps/api/src/integration-platform/controllers/internal-checks.controller.ts, the new internal AWS check-run endpoint still goes through global rate limiting, so bursty scheduled runs can hit 429s and repeat the same check failures after merge — exempt this internal route from the global limiter (or add a dedicated higher-limit policy) before merging.
  • In apps/api/src/integration-platform/services/connection-check-runner.service.ts, the server-run path skips auth-type credential validation used by the existing run flow, which can let malformed credentials execute and produce inconsistent check outcomes — align server-run validation with the existing path before merging.

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

… + validate creds by auth type

cubic Ultrareview findings:

1. The internal run-connection-checks endpoint went through the global
   ThrottlerGuard, so the 6 AM AWS fan-out could hit 429s and re-fail checks.
   Add @SkipThrottle() (matching the Trigger-called resolve-session endpoint).

2. ConnectionCheckRunnerService only checked for missing credentials; the in-app
   run paths also validate by auth type. Align it (oauth2 / api_key / basic /
   custom) so a server-run rejects malformed credentials up front with a clear
   error instead of executing the check on bad input.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tofikwest

Copy link
Copy Markdown
Contributor Author

@cubic-dev-ai review it

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

@cubic-dev-ai review it

@tofikwest I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 10 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

Re-trigger cubic

@tofikwest tofikwest merged commit a7399d8 into main Jun 12, 2026
7 checks passed
@tofikwest tofikwest deleted the tofik/aws-checks-on-server branch June 12, 2026 21:33
@claudfuen

Copy link
Copy Markdown
Contributor

🎉 This PR is included in version 3.82.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants