statistical-evaluation

Here are 2 public repositories matching this topic...

jsp2195 / frontier-evals-harness

frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without infrastructure.

reproducible-research model-evaluation llm-evaluation llm-benchmarking statistical-evaluation evaluation-harness

Updated Feb 19, 2026
Python
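
The "paired statistical comparisons with confidence intervals" mentioned in the description above can be illustrated with a percentile bootstrap over per-item score differences. This is a minimal sketch, not code from frontier-evals-harness: the function name `paired_bootstrap_ci`, its parameters, and the choice of a percentile bootstrap are all assumptions made for illustration. The repository may use a different method.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Hypothetical helper: mean paired difference with a percentile bootstrap CI.

    scores_a and scores_b are per-item scores for two models on the SAME items,
    in the same order. Returns (mean_difference, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")

    # Pairing: work with per-item differences so item difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    # A fixed seed keeps the interval reproducible across runs.
    rng = random.Random(seed)

    # Resample items with replacement and record the mean difference each time.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    # Percentile interval: cut alpha/2 from each tail of the sorted means.
    low_index = int((alpha / 2) * n_resamples)
    high_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), resampled_means[low_index], resampled_means[high_index]


if __name__ == "__main__":
    # Made-up per-item correctness (1 = correct, 0 = wrong) for two models.
    model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
    model_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
    estimate, low, high = paired_bootstrap_ci(model_a, model_b)
    print(f"mean difference = {estimate:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Resampling the differences rather than each model's scores separately is what makes the comparison paired: both models are always evaluated on the same resampled items.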

robin-ck / data_mining_bug_classification

Classification models for detecting fake reviews and predicting software bugs. Includes implementations of decision trees, bagging, random forests, logistic regression, and Naive Bayes, with statistical evaluation using McNemar's test.

machine-learning text-mining random-forest naive-bayes feature-selection classification logistic-regression tf-idf decision-tree bagging opinion-spam bug-prediction mcnemar-test statistical-evaluation

Updated Jun 28, 2025
Jupyter Notebook
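
McNemar's test, named in the description above, compares two classifiers evaluated on the same test items by looking only at the items where they disagree. The following is a minimal sketch of the exact (binomial) form of the test using only the standard library. It is not taken from the repository, and the function names `discordant_counts` and `mcnemar_exact` are hypothetical.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Hypothetical helper: count the items on which exactly one classifier is correct.

    Returns (b, c): b = A correct and B wrong, c = A wrong and B correct.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    b = c = 0
    for truth, a, other in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == truth), (other == truth)
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c.

    Under the null hypothesis, each discordant item is equally likely to favor
    either classifier, so min(b, c) follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # No disagreements: no evidence of a difference.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up labels and predictions, for illustration only.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    pred_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
    pred_b = [1, 0, 0, 0, 0, 1, 1, 1, 0, 1]
    b, c = discordant_counts(y_true, pred_a, pred_b)
    print(f"b = {b}, c = {c}, p = {mcnemar_exact(b, c):.4f}")
```

The exact form is generally preferred when the number of discordant pairs is small. With many discordant pairs, the chi-squared approximation gives similar results.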