[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Community benchmark database for running LLMs on Apple Silicon Macs
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
An agent evaluation framework with native multi-turn feedback iteration.
CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.
Testing how well LLMs can solve jigsaw puzzles
Open-source multi-agent AI debate arena: pit Claude, GPT, Gemini, Ollama & HuggingFace models against each other with frozen-context fairness, evidence-first judging, 20+ personas, code review, and PDF/Markdown reports. CLI + Web UI.
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Comprehensive benchmark of OpenRouter free-tier LLMs for practical applications. Evaluates models for coding, Thai language, and general use.
Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
LiveSecBench (a dynamic safety evaluation benchmark for large language models) is a professional, dynamic, multi-dimensional benchmark for LLM safety. Through a scientific, systematic, and continuously evolving evaluation framework, it aims to objectively assess and measure the safety performance of large models, push LLM technology toward safer, more reliable, and more responsible development, and provide a key safety yardstick for industrial deployment and academic research.
Research question: can an LLM's "office intelligence" be measured? This project is an attempt: 100 scenarios, 10 criteria, a Russian corporate context.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
RetardBench is an open, no-censorship benchmark that ranks large language models purely on how retarded they are.
Benchmark LLMs' spatial reasoning with head-to-head Bananagrams
Benchmarking LLM decision-making in structured, adversarial environments using game-based evaluation.
Claude Code skill that pits Claude, ChatGPT, and Gemini against each other, then lets them cross-judge each other blind