Skip to content

fix: enforce policy-based access control on artifact downloads#7009

Open
ycombinator wants to merge 12 commits into
elastic:mainfrom
ycombinator:fix/artifact-access-control
Open

fix: enforce policy-based access control on artifact downloads#7009
ycombinator wants to merge 12 commits into
elastic:mainfrom
ycombinator:fix/artifact-access-control

Conversation

@ycombinator
Copy link
Copy Markdown
Contributor

@ycombinator ycombinator commented May 11, 2026

What is the problem this PR solves?

The artifact download endpoint (/api/fleet/artifacts/{id}/{sha256}) only validates the agent's API key but never checks whether the requested artifact belongs to the agent's assigned policy. This means an agent enrolled under one policy can download artifacts belonging to a different policy if it knows the artifact ID and SHA256 hash. For example, an agent enrolled under a policy with no integrations can retrieve Elastic Defend trust lists, exception lists, and other security artifacts from another policy.

How does this PR solve the problem?

Implements the authorizeArtifact() function (previously a no-op that returned nil) to enforce policy-based access control:

  1. Adds a GetPolicy(ctx, policyID) method to the policy.Monitor interface that returns the cached policy for a given ID (reloads from ES on cache miss).
  2. In authorizeArtifact, fetches the agent's policy via the monitor using agent.AgentPolicyID and verifies that the requested artifact (identifier + decoded_sha256) appears in the policy's inputs[].artifact_manifest.artifacts.
  3. Returns 403 Forbidden (ErrUnauthorizedArtifact) if the artifact is not listed in the agent's assigned policy.

How to test this PR locally

  1. Set up Fleet Server with Elasticsearch and Kibana
  2. Create two agent policies: Victim-Policy with Elastic Defend integration (add a trusted application), and Attacker-Policy with no integrations
  3. Create an enrollment token for Attacker-Policy and enroll an agent
  4. Attempt to download an artifact belonging to Victim-Policy using the attacker agent's API key — should now receive 403 Forbidden instead of the artifact contents
  5. Verify that an agent enrolled under Victim-Policy can still download its own artifacts normally (200 OK)

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

@ycombinator ycombinator requested a review from a team as a code owner May 11, 2026 22:44
@ycombinator ycombinator requested review from macdewee and samuelvl May 11, 2026 22:44
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 11, 2026

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@ycombinator ycombinator requested review from michel-laterman and samuelvl and removed request for macdewee and samuelvl May 11, 2026 22:52
@ycombinator ycombinator added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-active-all Automated backport with mergify to all the active branches labels May 11, 2026
@ycombinator ycombinator requested review from cmacknz and kruskall May 11, 2026 23:09
Comment thread internal/pkg/api/handleArtifacts.go Outdated
Comment thread internal/pkg/api/handleArtifacts.go Outdated
@github-actions

This comment has been minimized.

@ycombinator ycombinator enabled auto-merge (squash) May 12, 2026 19:50
Copy link
Copy Markdown
Contributor

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment thread internal/pkg/api/handleArtifacts.go Outdated
if !ok {
return nil, nil
}
data, err := json.Marshal(raw)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marshalling to then unmarshal feels a bit weird, there's no way to just unmarshal this? Maybe an intermediate type that has artifact_manifest as json.RawMessage instead of an any?

Copy link
Copy Markdown
Contributor Author

@ycombinator ycombinator May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the marshal-then-unmarshal; agree, it was awkward. Just went with directly accessing map fields, with type assertions for safety: 535b14f

@ycombinator ycombinator force-pushed the fix/artifact-access-control branch from 2e1103d to 535b14f Compare May 15, 2026 13:42
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just have nitpicks


func (tester *ClientAPITester) TestArtifact() {
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nitpick) we should use tester.T().Context() instead of background


func (tester *ClientAPITester20230601) TestArtifact() {
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nitpick) we should use tester.T().Context() instead of background

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

ycombinator and others added 12 commits May 27, 2026 17:45
The artifact download endpoint (/api/fleet/artifacts/{id}/{sha256})
previously only validated the agent's API key but never checked whether
the requested artifact belonged to the agent's assigned policy. This
allowed an agent enrolled under one policy to download artifacts from
a different policy if it knew the artifact ID and SHA256 hash.

Add authorizeArtifact implementation that fetches the agent's policy
from the in-memory policy monitor cache and verifies the requested
artifact appears in the policy's artifact_manifest before serving it.
Returns 403 Forbidden if the artifact is not in the agent's policy.

Resolves: elastic/security#8396

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Defines model.ArtifactManifest and model.ManifestEntry structs so
policyHasArtifact no longer navigates untyped map[string]any chains.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
schema.go is code-generated and gets overwritten by mage generate.
Moving ArtifactManifest and ManifestEntry to ext.go keeps them stable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ArtifactManifest and ManifestEntry are not ES document types and only
exist to support parsing within handleArtifacts.go, so they belong there
rather than in the model package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agents enrolled under dummy-policy cannot download Elastic Defend
artifacts because that policy has no artifact_manifest. Enroll the test
agent under security-policy (which has the Elastic Defend integration)
instead.

Add FleetPolicyHasArtifact scaffold helper that polls .fleet-policies
until the policy document references the artifact, ensuring fleet-server's
policy monitor cache is up-to-date before the download attempt. Also
retry the download on 403 to tolerate any remaining cache propagation lag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…checks

PolicyData.Inputs is already decoded as []map[string]any, so marshaling
back to JSON just to unmarshal again is unnecessary. Use type assertions
directly on the decoded map in both policyHasArtifact (production) and
policyInputHasArtifact (e2e scaffold), removing the artifactManifest and
manifestEntry types along with parseArtifactManifest.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ntext

The ES global checkpoint monitor can hold up to 4 minutes before the
policy cache refreshes. On slow CI the setup steps (FleetHasArtifacts +
FleetPolicyHasArtifact) could exhaust the 3-minute budget, leaving the
retry loop to fail with a misleading "context deadline exceeded" from the
HTTP call. Raise the budget to 5 minutes and add an explicit ctx.Err()
check at the top of the retry loop so expiry surfaces a clear message.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ycombinator ycombinator force-pushed the fix/artifact-access-control branch from cefb353 to ea82e65 Compare May 28, 2026 00:45
@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

TL;DR

The Buildkite E2E step failed due to the entire testing/e2e package timing out, not from a direct assertion failure in the captured log. Immediate next action is to re-run E2E once; if it reproduces, isolate the hanging test with TEST_RUN and capture full go test output.

Remediation

  • Re-run the failing E2E job once (this looks like a timeout/flaky-style failure rather than a deterministic compile/type error).
  • If it reproduces, bisect with TEST_RUN to identify the exact hanging test case in testing/e2e.
  • Ensure the detailed test output file (build/test-e2e-<os>.out, written by Mage) is uploaded in CI for future failures so we can pinpoint the exact test name quickly.
Investigation details

Root Cause

The failure is a test timeout at the suite/package level:

  • E2E is launched via .buildkite/scripts/e2e_test.sh at mage test:e2e (.buildkite/scripts/e2e_test.sh#L16).
  • Mage runs E2E with go test -timeout 30m (magefile.go#L2162).
  • Build log shows the package failed after 1980.095s (~33m), which is consistent with timeout + teardown overhead.

Because the captured log is truncated to goroutine dump tail + package FAIL, it does not include the specific test name that hung.

Evidence

  • Build: https://buildkite.com/elastic/fleet-server/builds/14911
  • Job/step: E2E Test (.buildkite/scripts/e2e_test.sh)
  • Key log excerpt (/tmp/gh-aw/buildkite-logs/fleet-server-e2e-test.txt):
    • FAIL github.com/elastic/fleet-server/testing/e2e 1980.095s (line 125)
    • goroutine dump tail (lines 1-124)
    • Error: exit status 1 (line 133)
  • Timeout config in source:
    • magefile.go:2162go test ... -timeout 30m ... ./...

Verification

  • Not run locally (full E2E is heavyweight and requires dockerized env + build artifacts).

Follow-up

I could not deduplicate against prior detective comments on this PR because PR read endpoints were blocked by integrity policy in this runtime. If this exact timeout diagnosis was already posted, please treat this as confirmation.

Related flaky-test references (not an exact match to this log due missing test name):

Note

🔒 Integrity filter blocked 3 items

The following items were blocked because they don't meet the GitHub integrity level.

  • fix: enforce policy-based access control on artifact downloads #7009 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #7009 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #7009 search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

What is this? | From workflow: PR Buildkite Detective

Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not.

m.mut.Unlock()

if !ok {
return nil, fmt.Errorf("policy %s not found", policyID)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return ErrUnauthorizedArtifact here? As it is now, it would return 500 (default in error.go) but I think 403 is more appropriate. Returning 404 would leak whether the policy ID exists.

}
}

func policyInputHasArtifact(input map[string]any, identifier, decodedSha256 string) bool {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this can be re-used from policyHasArtifact in handleArtifacts.go if we define it as:

func policyHasArtifact(pd *model.PolicyData, id, sha2 string) bool {
    for _, input := range pd.Inputs {
        if inputHasArtifact(input, id, sha2) {
            return true
        }
    }
    return false
}

func inputHasArtifact(input map[string]any, id, sha2 string) bool {
...
}

if !ok {
return false
}
sha, _ := entry["decoded_sha256"].(string)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can read decoded_sha256 from https://github.com/elastic/fleet-server/blob/main/internal/pkg/dl/constants.go#L68

It can be worth to also add artifact_manifest and artifacts to this constants.go file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants