fix(cloud-security): run AWS integration checks on our server (scheduled false-fails)#3133
Conversation
…led false-fails)
AWS integration checks (S3, EC2, RDS, KMS, CloudTrail, IAM) false-failed on the
scheduled and auto-run paths while the manual "Run" passed. Same code, same
credentials — the only difference is where the check executes:
- Manual runs on our server (ECS), inside our VPC, whose S3 endpoint allows our
cross-account audit reads.
- Scheduled/auto-run execute in the Trigger.dev runtime, whose VPC S3 endpoint
policy blocks cross-account reads ("no VPC endpoint policy allows ...").
AWS allows/denies based on the VPC the request exits from, which lives in
Trigger.dev's account and isn't ours to change.
Fix (AWS only): when providerSlug === 'aws', the scheduled and auto-run Trigger
tasks delegate execution to our server via a new service-token endpoint
(POST /v1/integrations/internal/run-connection-checks/:connectionId) and persist
the returned result with the existing shared logic. Trigger.dev still handles
scheduling and retries; only where the AWS API calls run changed.
GCP, Azure, dynamic and legacy integrations are untouched — they make plain
HTTPS calls (no VPC endpoint) and keep executing in Trigger.dev exactly as
before.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
3 issues found across 10 files
Confidence score: 3/5
- The highest risk is in
apps/api/src/trigger/integration-platform/run-task-integration-checks.tsandapps/api/src/trigger/integration-platform/run-connection-checks.ts: the AWS path still performs Trigger-side credential/session preflight calls before server-side checks, so transient preflight failures can incorrectly fail scheduled AWS runs. This weakens the intended server-run-only flow and can create user-visible false negatives—skip those preflight calls whenproviderSlug === 'aws'before merging. - In
apps/api/src/trigger/integration-platform/run-checks-on-server.ts, the internal fetch has no timeout, so a network stall can hang AWS scheduled checks instead of failing fast and retrying. Add an abort timeout to bound request duration and let retry/error handling recover predictably.
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
Cubic (and an independent adversarial review) flagged three issues on the AWS server-run path: 1. The AWS branch still ran the Trigger-side credential/session preflight (requestValidCredentials + injectAwsResolvedSession) before delegating to the server. Those calls are unused on the server path (it decrypts creds and assumes the role itself), so a transient preflight failure would falsely fail an AWS run. Skip the preflight entirely for AWS and drop the now-dead injectAwsResolvedSession call (both Trigger tasks). 2. The internal fetch had no timeout, so a hung connection could block the task until maxDuration. Add a generous AbortController timeout (10m — well below the 15m maxDuration but high enough never to abort a legitimately long run) so a stalled socket surfaces as an error and the task retries. 3. (Review) The scheduled per-check loop was inside one outer try/catch; because runChecksOnServer throws on transport failure and several AWS checks share a task, a blip on one check aborted its siblings and skipped lastSyncAt/status. Catch the throw per-check, record that check as failed, and continue — matching runAllChecks' per-check resilience (hasExecutionErrors keeps integrationLastRunAt unwritten so the next tick retries). Non-AWS (GCP/Azure/dynamic/legacy) paths remain unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@cubic-dev-ai review it |
@tofikwest I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
1 issue found across 10 files
Confidence score: 3/5
- In
apps/api/src/trigger/integration-platform/run-task-integration-checks.ts, the shared catch path appears to treat non-AWS exceptions like AWS transport blips, which can mask realrunAllChecksfailures and let unexpected integration-check errors be silently downgraded; before merging, re-throw non-AWS errors so only true AWS transient failures follow the retry/degraded path.
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
The per-check catch added for AWS server-run resilience wrapped both branches, so a (rare) runAllChecks throw on a non-AWS provider would be downgraded to a per-check failure instead of propagating. Re-throw when providerSlug !== 'aws' so non-AWS behavior is unchanged from before this PR; only AWS transport blips get the degrade-and-continue treatment. (cubic review) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@cubic-dev-ai ultrareview it |
@tofikwest Starting ultrareview. I'll post findings when complete. |
There was a problem hiding this comment.
Ultrareview completed in 28m 1s
2 issues found across 10 files
Confidence score: 3/5
- In
apps/api/src/integration-platform/controllers/internal-checks.controller.ts, the new internal AWS check-run endpoint still goes through global rate limiting, so bursty scheduled runs can hit 429s and repeat the same check failures after merge — exempt this internal route from the global limiter (or add a dedicated higher-limit policy) before merging. - In
apps/api/src/integration-platform/services/connection-check-runner.service.ts, the server-run path skips auth-type credential validation used by the existing run flow, which can let malformed credentials execute and produce inconsistent check outcomes — align server-run validation with the existing path before merging.
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
… + validate creds by auth type cubic Ultrareview findings: 1. The internal run-connection-checks endpoint went through the global ThrottlerGuard, so the 6 AM AWS fan-out could hit 429s and re-fail checks. Add @SkipThrottle() (matching the Trigger-called resolve-session endpoint). 2. ConnectionCheckRunnerService only checked for missing credentials; the in-app run paths also validate by auth type. Align it (oauth2 / api_key / basic / custom) so a server-run rejects malformed credentials up front with a clear error instead of executing the check on bad input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@cubic-dev-ai review it |
@tofikwest I have started the AI code review. It will take a few minutes to complete. |
|
🎉 This PR is included in version 3.82.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Problem
AWS integration checks (S3 Block Public Access, encryption, EC2, RDS, KMS, CloudTrail, IAM) false-fail on the scheduled / auto-run paths but pass on the manual "Run" — repeatedly (CS reports over the last ~2 weeks). Customer-visible error:
Root cause
It is not a code bug — it's the same check code with the same credentials. The only difference is where it executes, and therefore which network the AWS API call exits through:
comp-production-vpcs3:Get*)AWS evaluates allow/deny based on the VPC the request exits from — which, for the scheduled path, lives in Trigger.dev's AWS account and is not ours to change. (Confirmed Trigger.dev is Cloud, not self-hosted: our infra only whitelists Trigger.dev's static IPs; there's no Trigger VPC in our account. Their PrivateLink/static-IP features don't cover public-S3 egress.)
This is AWS-specific: GCP/Azure/dynamic/legacy checks make plain HTTPS calls to public APIs with no VPC endpoint involved, so they work identically in either environment.
Fix (AWS only)
When
providerSlug === 'aws', the scheduled task (run-task-integration-checks) and the auto-run task (run-connection-checks) delegate execution to our server via a new internal, service-token-only endpoint, then persist the returned result with the existing shared logic. Trigger.dev still owns scheduling, retries, and the long-run tolerance — only where the AWS API calls run changed.ConnectionCheckRunnerService.runChecks()— runs a connection's checks on our server and returns the raw result (no persistence).POST /v1/integrations/internal/run-connection-checks/:connectionId(HybridAuthGuard + ServiceTokenOnlyGuard + PermissionGuard) — excluded from public docs/MCP.run-checks-on-server.ts— the thin fetch helper the AWS branch calls.providerSlug === 'aws' ? runChecksOnServer(...) : runAllChecks(...)branch. Theelsebranch is the original code, unchanged (the diff is just re-indentation).Explicitly NOT touched
Verification
main, in untouched files).openapi.jsonunchanged — the internal endpoint is excluded from public docs.Rollout
integration-checks-schedulemanually, confirm AWS checks pass and the API handles the load, then prod.Optional tidy-up (left out to keep the diff minimal): for AWS the task still makes two now-unused pre-flight ECS calls (
ensure-valid-credentials+resolve-session) — harmless, removable in a follow-up.🤖 Generated with Claude Code
Summary by cubic
Run AWS integration checks on our server for scheduled and auto-run paths to stop false failures caused by Trigger.dev VPC endpoint policies. This aligns scheduled runs with manual runs, removes AccessDenied errors, and avoids 429s during fan-out.
Bug Fixes
ConnectionCheckRunnerService,POST /v1/integrations/internal/run-connection-checks/:connectionId(internal-only, throttle-exempt), and a thinrun-checks-on-serverhelper; excluded/v1/integrations/internalfrom public OpenAPI.Migration
SERVICE_TOKEN_TRIGGERis set for the Trigger runtime.Written for commit 91ca27b. Summary will update on new commits.