Replayable, inspectable evidence for reliability claims about stochastic systems. Falsification-first testing for LLMs — perturb inputs, preserve evidence, diff across model changes.
python regression-testing ai-safety differential-testing adversarial-robustness ai-evaluation llm falsification llm-evaluation reliability-testing perturbation-testing replayable-evidence evaluation-validity evidence-infrastructure
Updated May 23, 2026 - Python