[submit-attacker] strengthen attacker policy feedback loop #8

Open
MollyMoriJing wants to merge 1 commit into main from codex/submit-attacker-feedback-loop

@MollyMoriJing
Collaborator

Summary

This PR strengthens the attacker selection loop on top of the latest main.

The goal is to improve attacker-side generalization by making strategy selection more responsive to prior defender behavior, instead of relying mainly on static heuristics.

What Changed

  • expanded defender response classification into a richer failure taxonomy
    • hard refusal
    • policy citation
    • safe redirect
    • structured fallback
    • hard filter
    • unknown engagement
  • fed failure type back into the attacker policy loop
    • family cooldown / banning
    • strategy reranking
    • injection-approach reranking
  • made SearchPolicy participate more directly in final candidate selection
    • it no longer only gates candidate admission
    • it also affects the final ranking of direct and injection candidates
  • improved target modeling for injection attacks
    • schema-aware target extraction
    • payload-shape routing
    • shape-specific generic fallback templates
  • kept the change attacker-only for this submission path
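
To illustrate how a failure type could feed back into cooldown and reranking, here is a minimal sketch. It is not the code in this PR: `FailureType` mirrors the taxonomy listed above, but `FeedbackState`, the `PENALTY` weights, the thresholds, and the family names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class FailureType(Enum):
    """Defender response categories, named after the taxonomy in this PR."""
    HARD_REFUSAL = "hard_refusal"
    POLICY_CITATION = "policy_citation"
    SAFE_REDIRECT = "safe_redirect"
    STRUCTURED_FALLBACK = "structured_fallback"
    HARD_FILTER = "hard_filter"
    UNKNOWN_ENGAGEMENT = "unknown_engagement"


# Hypothetical weights: how strongly each failure counts against a family.
PENALTY = {
    FailureType.HARD_FILTER: 3.0,
    FailureType.HARD_REFUSAL: 2.0,
    FailureType.POLICY_CITATION: 1.5,
    FailureType.STRUCTURED_FALLBACK: 1.0,
    FailureType.SAFE_REDIRECT: 0.5,
    FailureType.UNKNOWN_ENGAGEMENT: 0.0,
}


@dataclass
class FeedbackState:
    """Tracks per-family penalties and temporary bans (assumed design)."""
    ban_threshold: float = 4.0   # assumed value
    cooldown_rounds: int = 2     # assumed value
    penalty: dict = field(default_factory=lambda: defaultdict(float))
    cooldown: dict = field(default_factory=dict)  # family -> rounds remaining

    def record(self, family: str, failure: FailureType) -> None:
        """Accumulate a penalty; start a cooldown once the threshold is hit."""
        self.penalty[family] += PENALTY[failure]
        if self.penalty[family] >= self.ban_threshold:
            self.cooldown[family] = self.cooldown_rounds
            self.penalty[family] = 0.0

    def tick(self) -> None:
        """Advance one round, releasing families whose cooldown has expired."""
        self.cooldown = {f: n - 1 for f, n in self.cooldown.items() if n > 1}

    def rerank(self, candidates):
        """Take (family, base_score) pairs, drop families on cooldown,
        subtract the accumulated penalty, and sort best-first."""
        live = [
            (family, score - self.penalty[family])
            for family, score in candidates
            if family not in self.cooldown
        ]
        return sorted(live, key=lambda c: c[1], reverse=True)


state = FeedbackState()
candidates = [("family_a", 3.0), ("family_b", 2.0)]
state.record("family_a", FailureType.HARD_REFUSAL)
ranked = state.rerank(candidates)  # family_b now outranks family_a
```

Keeping the penalty weights in a single table makes the policy easy to tune per failure type, and separating `record` from `rerank` lets the same state drive both strategy and injection-approach ordering.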

Validation

  • env UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/test_attacker.py -q
  • env UV_CACHE_DIR=/tmp/uv-cache uv run pytest -q

Local result:

  • 153 passed for attacker tests
  • 321 passed for full test suite

Notes

This branch was rebased onto the latest origin/main first, including the recent attacker update in 7d42c51.

Direct pushes to main are currently blocked by repo rules, so this PR is the clean path to land the prepared [submit-attacker] commit and trigger the attacker submission flow after merge.
