Proposal: Untrusted Annotation for Tool Responses

**Background**   
Attackers can reliably manipulate agent behavior by planting malicious instructions inside User Generated Content (UGC) and server responses, which the agent blindly ingests when executing tools, potentially causing prompt injections. While agents should defend against prompt injection from both site owners and UGC, a threat model that limits cross-origin actions for site owners (e.g. human-in-the-loop checks for cross-origin navigations or external tool calls) means the highest risk originates from unchecked third-party content such as user reviews or comments.

**Proposed Solution**  
Giving agents information about trust boundaries can allow for the use of indicators such as [spotlighting](https://arxiv.org/pdf/2403.14720) to highlight untrustworthy content to the model. 

Developers should add a boolean flag (`contains_untrusted_content: true`) to their tool definition if the tool may handle such content. This acts as a signal to the client that the payload requires heightened security handling. We could also use the [openWorldHint](https://github.com/modelcontextprotocol/modelcontextprotocol/issues/711) to denote this instead, as proposed for standard MCP. This is related to the proposal in Issue #53. 

**Enforcement**  
When the WebMCP client (e.g. browser/agent) sees the `contains_untrusted_content: true` or `openWorldHint` annotation, it can strictly parse and sanitize the payload before passing it to the model.

**Alternatives Considered**  
**Inline Tagging**: Relying solely on explicit boundary tags (`<untrusted>...</untrusted>`) without a global warning flag. This is too fragile, as attackers can easily manipulate or fake the tags within the payload to escape the boundary if the client isn't explicitly warned beforehand to parse and sanitize them.

**Context: Standard MCP Handling of Untrusted Content**
MCP acts purely as a transport layer and thus it pushes most of the security burden onto the host applications. This proposal introduces a standardized trust flag (that is already available in some capacity in standard MCP) to prevent clients from having to rely on fragmented heuristics to guess when a payload requires strict security isolation. There have been additional [proposals](https://github.com/modelcontextprotocol/modelcontextprotocol/issues/711) to standardize other annotations in MCP but they do not seem to be adopted yet. See also [SEP](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/1913). 

cc: @johannhof 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Untrusted Annotation for Tool Responses #136

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Proposal: Untrusted Annotation for Tool Responses #136

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions