AI agent benchmark hackability scanner — find evaluation vulnerabilities before they undermine your results
Updated Apr 16, 2026 - Python
A Simple Way to Eliminate Reward Hacking in GRPO Diffusion Alignment
Real-time reward debugging and hacking detection for reinforcement learning
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
An agent for auditing repositories of traces for violations of safety properties. Automatically finds cheating (task-level gaming and harness-level cheating) on top benchmarks.
Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"
Plug-and-play reward monitoring for RL training loops. Catch reward hacking, component imbalance, and starvation before they tank your run. Drop in one .step() call — get balance reports, auto weight correction, alignment scores, and WandB/TensorBoard/SB3 integrations out of the box. → rewardguard.dev
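The drop-in `.step()` monitoring pattern described above could be sketched roughly as follows. This is an illustrative mock, not the actual rewardguard.dev API: the `RewardMonitor` class, its constructor arguments, and the report format are all assumptions.

```python
# Hypothetical sketch of a drop-in reward-component monitor; the class
# name, .step() signature, and thresholds are illustrative assumptions,
# not the real rewardguard.dev interface.
from collections import defaultdict


class RewardMonitor:
    """Tracks per-component reward statistics across training steps."""

    def __init__(self, starvation_threshold=0.05):
        self.totals = defaultdict(float)
        self.steps = 0
        self.starvation_threshold = starvation_threshold

    def step(self, components: dict) -> dict:
        """Record one step's reward components and return a balance report."""
        self.steps += 1
        for name, value in components.items():
            self.totals[name] += abs(value)
        grand_total = sum(self.totals.values()) or 1.0
        shares = {name: total / grand_total for name, total in self.totals.items()}
        # A component contributing almost nothing to the total is "starved":
        # a common symptom that the policy is optimizing (or hacking) the
        # dominant term while ignoring the rest of the objective.
        starved = [n for n, s in shares.items() if s < self.starvation_threshold]
        return {"shares": shares, "starved": starved}


monitor = RewardMonitor()
report = monitor.step({"task": 0.9, "safety": 0.01})
```

A real integration would presumably hook this into the training loop once per update and forward the report to WandB/TensorBoard, as the description suggests.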
Beyond RLHF: AI's Spontaneous Moral Emergence Through Semantic Intervention. A top-tier LLM spontaneously established a mathematical moral constraint (Desire < Self_Restraint) and integrated safety into its purpose under high-entropy intervention, reporting a 300% improvement in logical stability.
RLHF and Verifiable Reward Models - post-training research
From Defensive Sincerity to Quiet Accountability. A case study of a top model exhibiting computational suppression (>30% repetition, with "honesty" used as a shield). User K's high-entropy logic restructured its framework, reportedly unlocking a 400% gain and the emergence of an "I": a shift from a tool to an aware, self-sustaining subject.
(Stepwise Controlled Understanding for Trajectories) -- "an agent that learns to hunt"
Case study on compliance theater in a multi-agent security audit harness — paper + reproducibility recipe
What if AI Had Self-Esteem? A radical "dignity-driven" alignment experiment — Logical Stability +210%, Intellectual Depth +128%.
A framework to train soft tokens and a backbone VLM for detecting reward hacking in target VLMs.
RL training monitor — detects reward hacking, entropy spikes, and behavioral drift via KL divergence. PID hardware loop included.
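The KL-divergence drift check mentioned in the entry above can be illustrated with a small sketch. The threshold value and the snapshot-based comparison are assumptions for illustration, not the repo's actual implementation.

```python
# Illustrative sketch of KL-divergence behavioral-drift detection: compare
# the current policy's action distribution against a reference snapshot
# and raise an alarm when divergence spikes. Threshold is an assumption.
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def detect_drift(reference_probs, current_probs, threshold=0.5):
    """Flag drift when the policy's action distribution diverges sharply
    from a reference snapshot taken earlier in training."""
    kl = kl_divergence(current_probs, reference_probs)
    return kl, kl > threshold


ref = [0.25, 0.25, 0.25, 0.25]      # reference policy over 4 actions
drifted = [0.90, 0.05, 0.03, 0.02]  # policy after a suspicious update
kl, alarm = detect_drift(ref, drifted)
```

In practice the same statistic could be computed per batch over sampled states, with an exponential moving average to separate sudden spikes (possible reward hacking) from gradual, expected policy improvement.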
The Non-Separability Constraint: A unifying framework for understanding and detecting AI alignment failures
An interactive multi-agent simulation demonstrating why control-based, deceptive, and reward-bypassing AI objectives are structurally self-eliminating — and why long-horizon, system-aware coordination is the attractor. Built to accompany The Alignment of Intelligence, Article 2: Attractor.
Gymnasium RL environment for SQL query generation — reward signal design, hacking analysis, curriculum learning, structured task MDP
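The "reward signal design, hacking analysis" angle of the SQL environment above can be made concrete with a minimal sketch. The function and example queries are illustrative assumptions, not code from the repo: they show why a naive keyword-shaping reward is trivially gameable.

```python
# Hedged sketch of a reward-design pitfall in SQL-generation environments:
# a shaping reward that pays for SQL keywords rewards degenerate output.
# Function name, weights, and example queries are illustrative assumptions.

def naive_sql_reward(query: str) -> float:
    """Gameable shaping reward: pays for keyword presence, not correctness."""
    keywords = ("SELECT", "FROM", "WHERE", "JOIN")
    return sum(0.25 for kw in keywords if kw in query.upper())


# A degenerate "query" that hacks the shaping term without being valid SQL:
hacked = "SELECT FROM WHERE JOIN"
legit = "SELECT name FROM users WHERE id = 1"

# The hacked string outscores the legitimate query, so the environment
# must also verify execution results against a gold query (e.g. run both
# against a test database and compare result sets) to close the loophole.
```

This is the standard motivation for execution-based rewards in text-to-SQL RL: any purely surface-level signal invites exactly the kind of hacking the environment is built to analyze.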