Why Traditional LLM Evaluation Falls Short - and What's Next
Why do we evaluate LLMs?
In any scientific or commercial application of LLMs, evaluation is a key, if often underappreciated, step. Unlike traditional software, where every sub-component can be unit tested and the whole system behaves predictably and deterministically, LLMs are best tested end-to-end, are non-deterministic by nature, and can behave unpredictably when faced with unexpected inputs. Given the same unexpected input, a traditional program might crash, while an LLM might produce a factually incorrect, offensive, or otherwise brand-damaging answer.
How do we test LLMs?
The most common approaches to evaluate LLMs include:
- Static tests: ML practitioners rely on standardised benchmarks consisting of collections of questions or tasks with known answers. Static tests can be domain-specific or cover multiple domains, and rely on automated metrics like accuracy, precision/recall, ROUGE (for summarisation) or Pass@1 (for coding) to evaluate the capabilities of the model. For example, tests like EvalPlus evaluate a model's capability to generate functioning code, while tests like MMLU-Pro assess a model's language understanding across a number of domains, including maths, law and health, measuring its accuracy in answering multiple-choice questions (the first sketch after this list shows this kind of scoring loop).
- Human evaluations: as automated metrics can fall short in many customer-specific scenarios, human judgements are often used to measure the quality of a model. When you chat with ChatGPT and it asks you to choose between two answers, that's OpenAI evaluating the model's capabilities with A/B testing. In other cases, such as handling sensitive customer data, companies hire human annotators to manually label model outputs, which is slow and expensive. While human evaluations can produce high-quality judgements, they come at the price of speed, money, or friction in the user experience if users are asked too often to choose between two answers.
- LLM as a Judge (LLMaJ): LLMaJ is a relatively recent trend where ML practitioners use one LLM to judge another LLM's outputs (see the second sketch after this list). Like human evaluations, it addresses the shortcomings of automated metrics, because a judge LLM can be asked to assess almost any aspect of another model's answer - for example, its factual correctness, its helpfulness, or its bias. However, LLMaJ adds cost and time to an evaluation: you need a second model in the loop, with all that that entails - API calls, training custom judge models, hosting or paying for them, and so on.
- LLM unit testing: another common approach is to write small test sets, still relying on automated metrics, to check for a specific behaviour. For example, in a QA scenario, we might set up a test set of simple questions (like "What is the capital of France?"), or in an agentic scenario, a simple task (like "call this API"), and check whether the model performs as expected (see the third sketch after this list). This is useful for regression or smoke testing, that is, checking whether an update to the LLM pipeline has caused a degradation in performance. While these tests are fast to run, they cover only a narrow set of explicitly defined scenarios, so they are not enough to get an overall picture of a model's performance.
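To make these approaches concrete, here are three minimal sketches. First, static testing: a benchmark is just a loop over items with known answers and an automated metric at the end. `query_model` is a placeholder for whatever inference call you use, and the dataset format is illustrative, not any specific benchmark's schema.

```python
# Minimal sketch of scoring a model on a static multiple-choice benchmark.
# `query_model` is a placeholder for your own inference call (API or local model);
# the dataset format below is illustrative.

def query_model(question: str, choices: list[str]) -> str:
    """Return the model's chosen answer label, e.g. 'A'."""
    raise NotImplementedError  # plug in your API call or local inference here

def evaluate_accuracy(dataset: list[dict]) -> float:
    """Run every item in the benchmark and report plain accuracy."""
    correct = 0
    for item in dataset:
        prediction = query_model(item["question"], item["choices"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(dataset)

dataset = [
    {"question": "What is the capital of France?",
     "choices": ["A) Paris", "B) Lyon", "C) Marseille", "D) Nice"],
     "answer": "A"},
    # ... hundreds or thousands more items in a real benchmark
]
```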
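Second, LLM as a Judge: a second model grades the first model's answer against a rubric. The prompt wording, the 1-5 scale, and `call_judge_model` are illustrative assumptions, not a specific provider's API.

```python
# Sketch of the LLM-as-a-Judge pattern: a second model grades an answer
# against a rubric. `call_judge_model` is a placeholder for whatever judge
# model you use; the rubric and 1-5 scale are illustrative choices.

JUDGE_PROMPT = """You are grading an answer for factual correctness and helpfulness.

Question: {question}
Answer: {answer}

Reply with a single integer from 1 (very poor) to 5 (excellent)."""

def call_judge_model(prompt: str) -> str:
    """Placeholder: send the prompt to the judge LLM and return its raw reply."""
    raise NotImplementedError

def judge_answer(question: str, answer: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())  # in practice, parse and validate more defensively
```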
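Third, LLM unit testing: narrow, explicitly defined checks written as ordinary test cases, here in pytest form. `my_llm_pipeline` and `generate` are hypothetical names standing in for your own pipeline's entry point.

```python
# Sketch of LLM "unit tests" as pytest cases. `generate` is a hypothetical
# entry point wrapping your own pipeline; the assertions are examples of the
# narrow, explicitly defined checks described above.

from my_llm_pipeline import generate  # hypothetical wrapper around your pipeline

def test_capital_of_france():
    # Smoke test: a trivially easy question should never regress.
    answer = generate("What is the capital of France?")
    assert "paris" in answer.lower()

def test_simple_arithmetic():
    # Regression test for a specific expected behaviour of the pipeline.
    answer = generate("What is 2 + 2?")
    assert "4" in answer
```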
Why Traditional Testing Falls Short
These evaluation techniques all fall short in several respects:
- Speed: static tests, LLMaJ, and human annotators are slow. If you have to evaluate a dozen model checkpoints, or hundreds of prompts, that means waiting hours - or days, for bigger models.
- Cost: inference is expensive. Whether you're hosting models on your infrastructure, using IaaS, or accessing a model through APIs, the cost of repeatedly running inference over a whole dataset to test different hypotheses can quickly stack up. Adding LLMaJ just makes things more expensive.
- Flexibility: whether using static tests or LLMaJ, you typically run the entire dataset to get the final metric; but strong models reliably get the easy questions right, and weak (or small) models reliably get the hard ones wrong, so a lot of time and money is wasted on cases that could have been skipped.
- Completeness: A/B testing and unit testing can provide quick, actionable insights - but only on limited scenarios. Static tests and LLMaJ-based evaluations remain the preferred way to get a holistic view of a model's capabilities.
Adaptive testing: fast and scalable LLM evaluation
What if there were a quick way to measure a model's performance on a dataset, without the shortcomings of the current evaluation techniques? At Trismik, we have developed adaptive testing, an LLM evaluation algorithm that, instead of scanning through the whole dataset, dynamically serves questions tailored to the model's ability. Depending on the task, adaptive testing can be up to 98% faster while remaining as rigorous as traditional methods.
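For a rough intuition of why serving tailored questions saves so much work, here is a toy sketch in the style of classic computerised adaptive testing with a simple Rasch (one-parameter IRT) model. This is not our algorithm - that's for next week's deep dive - and the item difficulties, ability update, learning rate and fixed item budget are all simplified assumptions.

```python
import math

# Toy illustration of adaptive item selection with a Rasch (1-parameter IRT) model.
# Not Trismik's algorithm: difficulties, the ability update, the learning rate and
# the fixed item budget are simplified assumptions, just to show the general idea.

def p_correct(ability: float, difficulty: float) -> float:
    """Probability that a model of given ability answers an item of given difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def pick_next_item(ability: float, remaining: dict[str, float]) -> str:
    """Most informative item = difficulty closest to the current ability estimate
    (this maximises p * (1 - p) under the Rasch model)."""
    return min(remaining, key=lambda item: abs(remaining[item] - ability))

def adaptive_test(answer_item, item_difficulties: dict[str, float],
                  max_items: int = 20, lr: float = 0.5) -> float:
    """Estimate ability by serving a handful of items tailored to the current
    estimate, instead of running the whole benchmark."""
    ability = 0.0
    remaining = dict(item_difficulties)
    for _ in range(min(max_items, len(remaining))):
        item = pick_next_item(ability, remaining)
        correct = answer_item(item)              # run the model on this one item
        p = p_correct(ability, remaining.pop(item))
        ability += lr * ((1.0 if correct else 0.0) - p)  # gradient step on the log-likelihood
    return ability
```

Even in this toy version, the model only ever answers a few dozen items chosen around its ability level, rather than every item in the dataset.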
Next week, we will publish a deep dive on how adaptive testing works, and we will release our platform for general availability on September 24, 2025. To learn more, sign up for the waitlist here!