Skip to content

Proposal: Untrusted Annotation for Tool Responses #136

@jpagnucco

Description

@jpagnucco

Background
Attackers can reliably manipulate agent behavior by planting malicious instructions inside User Generated Content (UGC) and server responses, which the agent blindly ingests when executing tools, potentially causing prompt injections. While agents should defend against prompt injection from both site owners and UGC, a threat model that limits cross-origin actions for site owners (e.g. human-in-the-loop checks for cross-origin navigations or external tool calls) means the highest risk originates from unchecked third-party content such as user reviews or comments.

Proposed Solution
Giving agents information about trust boundaries can allow for the use of indicators such as spotlighting to highlight untrustworthy content to the model.

Developers should add a boolean flag (contains_untrusted_content: true) to their tool definition if the tool may handle such content. This acts as a signal to the client that the payload requires heightened security handling. We could also use the openWorldHint to denote this instead, as proposed for standard MCP. This is related to the proposal in Issue #53.

Enforcement
When the WebMCP client (e.g. browser/agent) sees the contains_untrusted_content: true or openWorldHint annotation, it can strictly parse and sanitize the payload before passing it to the model.

Alternatives Considered
Inline Tagging: Relying solely on explicit boundary tags (<untrusted>...</untrusted>) without a global warning flag. This is too fragile, as attackers can easily manipulate or fake the tags within the payload to escape the boundary if the client isn't explicitly warned beforehand to parse and sanitize them.

Context: Standard MCP Handling of Untrusted Content
MCP acts purely as a transport layer and thus it pushes most of the security burden onto the host applications. This proposal introduces a standardized trust flag (that is already available in some capacity in standard MCP) to prevent clients from having to rely on fragmented heuristics to guess when a payload requires strict security isolation. There have been additional proposals to standardize other annotations in MCP but they do not seem to be adopted yet. See also SEP.

cc: @johannhof

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions