
Add untrusted annotation to proposed mitigations #137

Open
johannhof wants to merge 1 commit into webmachinelearning:main from johannhof:add-untrusted-annotation-mitigation

@johannhof (Contributor)

No description provided.

@johannhof force-pushed the add-untrusted-annotation-mitigation branch from 18bfb9c to 416bad9 on March 11, 2026 21:16
@johannhof (Contributor, Author)

@victorhuangwq


**Threats addressed:** Prompt Injection Attacks (Output Injection Attacks)

**How:** A boolean annotation, `contains_untrusted_content: true`, that signals to the client that the payload requires heightened security handling. The client can then sanitize the payload, use techniques such as spotlighting to mark the untrustworthy content for the model, or hide that part of the response entirely.
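To make the proposed client behavior concrete, here is a minimal TypeScript sketch of how a client might branch on the annotation. Everything except the `contains_untrusted_content` name is a hypothetical assumption: the `ToolResult` shape, the `annotations` field placement, the `prepareForModel` helper, the policy names, and the `<<untrusted>>` delimiters are illustrative and not defined by the proposal. The "spotlight" branch uses datamarking (interleaving a marker character between words plus explicit delimiters), which is one spotlighting variant; the "sanitize" branch is only a placeholder.

```typescript
// Hypothetical tool-result shape; only contains_untrusted_content comes from the proposal.
interface ToolResult {
  content: string;
  annotations?: { contains_untrusted_content?: boolean };
}

// Hypothetical names for the three handling options the proposal lists.
type UntrustedPolicy = "sanitize" | "spotlight" | "hide";

// Placeholder sanitizer: strips HTML-like tags only. Real sanitization needs far more care.
function sanitize(text: string): string {
  return text.replace(/<[^>]*>/g, "");
}

// Spotlighting via datamarking: join words with a marker character and wrap the
// result in delimiters so the model can distinguish untrusted data from instructions.
function spotlight(text: string, marker = "^"): string {
  const marked = text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .join(marker);
  return `<<untrusted>>${marked}<</untrusted>>`;
}

// Decide what the model sees, based on the annotation and a client-chosen policy.
function prepareForModel(result: ToolResult, policy: UntrustedPolicy): string {
  if (!result.annotations?.contains_untrusted_content) {
    return result.content; // No untrusted signal: pass the content through unchanged.
  }
  if (policy === "sanitize") {
    return sanitize(result.content);
  }
  if (policy === "spotlight") {
    return spotlight(result.content);
  }
  return "[untrusted content hidden]"; // policy === "hide"
}
```

Keeping the policy on the client side reflects the proposal's framing: the annotation is only a signal, and each client decides how much heightened handling to apply.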
Contributor

Would prefer this to be part of ToolAnnotations. But I think we can discuss that in the thread.

@victorhuangwq (Contributor) left a comment

LGTM

@johannhof (Contributor, Author)

@domfarolino PTAL


Labels: None yet
Projects: None yet
2 participants