This directory contains benchmark scripts to measure and validate FairProp Inspector's performance.
`accuracy_comparison.py` compares FairProp Inspector against a rule-based baseline:
- Regex-based rules (baseline)
- FairProp Inspector (ModernBERT)
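For reference, the regex baseline amounts to matching listing text against a list of violation patterns. A minimal sketch, assuming the rule set looks roughly like this (the patterns below are illustrative only, not the actual rules in the benchmark code):

```python
import re

# Illustrative patterns only -- the real rule set lives in the benchmark code.
VIOLATION_PATTERNS = [
    (re.compile(r"\bno (kids|children)\b", re.I), "familial_status"),
    (re.compile(r"\b(christian|muslim|jewish)s? (only|preferred)\b", re.I), "religion"),
]

def regex_classify(text: str) -> str:
    """Return NON_COMPLIANT if any rule matches, COMPLIANT otherwise."""
    for pattern, _category in VIOLATION_PATTERNS:
        if pattern.search(text):
            return "NON_COMPLIANT"
    return "COMPLIANT"
```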
Run:

```bash
python benchmarks/accuracy_comparison.py
```

Output:
- Console: Accuracy breakdown by category
- File: `benchmarks/results/accuracy_report.json`
Expected Results:
- Regex Rules: ~65% accuracy
- FairProp Inspector: ~94% accuracy
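The comparison itself is a straightforward scoring loop. A minimal sketch of computing overall and per-category accuracy, assuming a `classify(text) -> label` callable per method and a hypothetical test-set path (the actual interfaces and paths live in the script):

```python
import json
from collections import defaultdict

def evaluate(classify, test_cases):
    """Score a classify(text) -> label callable; return overall and per-category accuracy."""
    per_category = defaultdict(lambda: {"correct": 0, "total": 0})
    correct = 0
    for case in test_cases:
        hit = classify(case["text"]) == case["expected"]
        correct += hit
        per_category[case["category"]]["total"] += 1
        per_category[case["category"]]["correct"] += hit
    return {
        "overall_accuracy": correct / len(test_cases),
        "by_category": {c: s["correct"] / s["total"] for c, s in per_category.items()},
    }

# Hypothetical usage -- `model_classify` stands in for whatever FairProp Inspector exposes,
# and the data path is an assumption, not the script's actual layout:
# test_cases = json.load(open("benchmarks/data/test_cases.json"))
# with open("benchmarks/results/accuracy_report.json", "w") as f:
#     json.dump({"regex": evaluate(regex_classify, test_cases),
#                "fairprop": evaluate(model_classify, test_cases)}, f, indent=2)
```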
`latency_benchmark.py` measures inference latency:
- Single inference (P50, P95, P99)
- Batch processing throughput
Run:

```bash
python benchmarks/latency_benchmark.py
```

Output:
- Console: Latency statistics
- File: `benchmarks/results/latency_report.json`
Expected Results:
- P95 latency: <20ms (CPU)
- Throughput: ~50-100 texts/sec
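The measurement is standard wall-clock timing over repeated single inferences plus a bulk pass for throughput. A minimal sketch, assuming the same hypothetical `classify` callable as above (the real script may time or batch differently):

```python
import json
import statistics
import time

def measure(classify, texts, runs=200):
    """Return P50/P95/P99 single-inference latency (ms) and sequential throughput (texts/sec)."""
    samples_ms = []
    for i in range(runs):
        start = time.perf_counter()
        classify(texts[i % len(texts)])
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points; cuts[k] ~ (k+1)th percentile

    batch_start = time.perf_counter()
    for text in texts:
        classify(text)
    throughput = len(texts) / (time.perf_counter() - batch_start)

    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_texts_per_sec": throughput,
    }

# Hypothetical usage, matching the accuracy sketch above:
# with open("benchmarks/results/latency_report.json", "w") as f:
#     json.dump(measure(model_classify, [c["text"] for c in test_cases]), f, indent=2)
```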
Standard test set with 20 cases covering:
- Violations: Familial status, age, religion, economic
- Compliant: Neutral descriptions, accessibility features
- Severity: High, medium, low
Format:

```json
{
  "text": "No kids under 12 allowed",
  "expected": "NON_COMPLIANT",
  "category": "familial_status",
  "severity": "high"
}
```

| Metric | Target | Current |
|---|---|---|
| Accuracy | >90% | ~94% |
| P95 Latency | <20ms | ~18ms |
| Throughput | >50 texts/sec | ~60 texts/sec |
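A quick way to validate a run against the targets above is to read the two reports back and assert on them. A minimal sketch; the field names match the sketches earlier in this README and are assumptions, not a documented report schema:

```python
import json

with open("benchmarks/results/accuracy_report.json") as f:
    accuracy = json.load(f)
with open("benchmarks/results/latency_report.json") as f:
    latency = json.load(f)

# Field names are assumed, not guaranteed by the actual reports.
assert accuracy["overall_accuracy"] > 0.90, "accuracy target missed"
assert latency["p95_ms"] < 20, "P95 latency target missed"
assert latency["throughput_texts_per_sec"] > 50, "throughput target missed"
print("All benchmark targets met")
```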
```bash
# Run all benchmarks
python benchmarks/accuracy_comparison.py
python benchmarks/latency_benchmark.py

# View results
cat benchmarks/results/accuracy_report.json
cat benchmarks/results/latency_report.json
```

Accuracy:
- >95%: Excellent
- 90-95%: Good
- 85-90%: Acceptable
- <85%: Needs improvement

Latency:
- <10ms: Excellent (edge device ready)
- 10-20ms: Good (production ready)
- 20-50ms: Acceptable
- >50ms: Needs optimization
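If you want a report to carry a human-readable rating, the bands above translate directly into a small helper like this (a sketch, not part of the benchmark scripts):

```python
def rate_accuracy(accuracy: float) -> str:
    """Map an accuracy fraction to the bands listed above."""
    if accuracy > 0.95:
        return "Excellent"
    if accuracy >= 0.90:
        return "Good"
    if accuracy >= 0.85:
        return "Acceptable"
    return "Needs improvement"

def rate_latency(p95_ms: float) -> str:
    """Map a P95 latency in milliseconds to the bands listed above."""
    if p95_ms < 10:
        return "Excellent (edge device ready)"
    if p95_ms <= 20:
        return "Good (production ready)"
    if p95_ms <= 50:
        return "Acceptable"
    return "Needs optimization"
```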
To add new benchmarks:
- Create script in `benchmarks/`
- Save results to `benchmarks/results/`
- Update this README
- Add to CI (`.github/workflows/ci.yaml`)
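A new benchmark can follow the same shape as the existing ones: run a measurement, print it, and write a JSON report under `benchmarks/results/`. A minimal skeleton (the file name and result fields below are illustrative):

```python
# benchmarks/my_new_benchmark.py -- illustrative skeleton, not an existing script
import json
from pathlib import Path

def run_benchmark() -> dict:
    """Run the measurement and return a JSON-serializable result dict."""
    return {"metric": "example", "value": 0.0}

if __name__ == "__main__":
    results = run_benchmark()
    out = Path("benchmarks/results/my_new_benchmark_report.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2))
    print(json.dumps(results, indent=2))
```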