[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Community benchmark database for running LLMs on Apple Silicon Macs
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
An agent evaluation framework with native multi-turn feedback iteration.
CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.
Testing how well LLMs can solve jigsaw puzzles
Open-source multi-agent AI debate arena: pit Claude, GPT, Gemini, Ollama & HuggingFace models against each other with frozen-context fairness, evidence-first judging, 20+ personas, code review, and PDF/Markdown reports. CLI + Web UI.
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Comprehensive benchmark of OpenRouter free-tier LLMs for practical applications. Evaluates models for coding, Thai language, and general use.
Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
LiveSecBench (a dynamic safety evaluation benchmark for large language models) is a professional, dynamic, multi-dimensional benchmark for LLM safety. Through a scientific, systematic, and continuously evolving evaluation framework, it aims to objectively assess and measure the safety performance of large models, push LLM technology toward safer, more reliable, and more responsible development, and provide a key safety yardstick for industrial deployment and academic research.
Research question: can an LLM's "office intelligence" be measured? This project is an attempt: 100 scenarios, 10 criteria, a Russian corporate context.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
RetardBench is an open, no-censorship benchmark that ranks large language models purely on how retarded they are.
Benchmark LLMs' spatial reasoning with head-to-head Bananagrams
Benchmarking LLM decision-making in structured, adversarial environments using game-based evaluation.
Claude Code skill that pits Claude, ChatGPT, and Gemini against each other, then lets them cross-judge each other blind