Benchmarks
-
The Five Stages to Keeping Benchmarks Useful as Models Evolve
A practical maturity model for taking benchmarks from proof-of-concept to versioned, continuously evolving evaluation that keeps up with models, prompts, and agent workflows.
HumanSignal Team
2026-01-07
-
Building a Quality Estimation Benchmark: The impact of relying on AI judges
What happens when you let AI judge AI? A pioneering benchmark for quality estimation in machine translation.
Sheree Zhang
2025-12-22
-
Evaluating the GPT-5 Series on Custom Benchmarks
GPT-5 is out now, but how good is it, really? In this post, we'll show you how we created our own custom benchmark to evaluate GPT-5.
Sheree Zhang
2025-08-08
-
How to Build AI Benchmarks that Evolve with your Models
Designing effective LLM benchmarks means going beyond static tests. This guide walks through scoring methods, strategy evolution, and how to evaluate models as they scale.
Micaela Kaplan
2025-07-21
-
Why Benchmarks Matter for Evaluating LLMs (and Why Most Miss the Mark)
Custom AI benchmarks play a crucial role in the success and scalability of AI systems by providing a standardized approach to running AI evaluations.
Sheree Zhang
2025-07-08
-
Everybody Is (Unintentionally) Cheating
AI benchmarks are breaking under pressure. This blog explores four ways to rebuild trust: governance, transparency, better metrics, and centralized oversight.
Nikolai Liubimov
2025-05-13
-