-
Notifications
You must be signed in to change notification settings - Fork 121
Description
Background
Attackers can reliably manipulate agent behavior by planting malicious instructions inside User Generated Content (UGC) and server responses, which the agent blindly ingests when executing tools, potentially causing prompt injections. While agents should defend against prompt injection from both site owners and UGC, a threat model that limits cross-origin actions for site owners (e.g. human-in-the-loop checks for cross-origin navigations or external tool calls) means the highest risk originates from unchecked third-party content such as user reviews or comments.
Proposed Solution
Giving agents information about trust boundaries can allow for the use of indicators such as spotlighting to highlight untrustworthy content to the model.
Developers should add a boolean flag (contains_untrusted_content: true) to their tool definition if the tool may handle such content. This acts as a signal to the client that the payload requires heightened security handling. We could also use the openWorldHint to denote this instead, as proposed for standard MCP. This is related to the proposal in Issue #53.
Enforcement
When the WebMCP client (e.g. browser/agent) sees the contains_untrusted_content: true or openWorldHint annotation, it can strictly parse and sanitize the payload before passing it to the model.
Alternatives Considered
Inline Tagging: Relying solely on explicit boundary tags (<untrusted>...</untrusted>) without a global warning flag. This is too fragile, as attackers can easily manipulate or fake the tags within the payload to escape the boundary if the client isn't explicitly warned beforehand to parse and sanitize them.
Context: Standard MCP Handling of Untrusted Content
MCP acts purely as a transport layer and thus it pushes most of the security burden onto the host applications. This proposal introduces a standardized trust flag (that is already available in some capacity in standard MCP) to prevent clients from having to rely on fragmented heuristics to guess when a payload requires strict security isolation. There have been additional proposals to standardize other annotations in MCP but they do not seem to be adopted yet. See also SEP.
cc: @johannhof