Trismik Blog

Adaptive Testing for LLMs: Does It Really Work?

September 15, 2025 · 6 min read

The standard approach to evaluating large language models (LLMs) is simple but inefficient: run models through massive static benchmarks, average the scores, and compare results. The problem is that these benchmarks often require models to process thousands of items, many of which offer little useful information about a model's actual capabilities.

Computerized Adaptive Testing (CAT) has been quietly transforming educational assessments for decades [1, 2, 3]. Rather than using a one-size-fits-all test, CAT adapts question difficulty in real time based on the test-taker’s performance. The concept is intuitive: start with a medium-difficulty question. If the answer is correct, try something harder. If it’s wrong, step back. In this way, the test adapts continually to pinpoint the test-taker’s ability efficiently.

Why Traditional LLM Evaluation Falls Short - and What's Next

September 9, 2025 · 5 min read

Marco Basaldella

Why do we evaluate LLMs?

In any scientific or commercial application of LLMs, evaluation is a key, if often under appreciated, step. Unlike traditional software, where every sub-component can be unit tested and the whole system behaves predictably and deterministically, LLMs are best tested end-to-end, are non-deterministic by nature, and can behave unpredictably when faced with unexpected inputs. For example, when faced with an unexpected input, a traditional program might crash, while an LLM might produce a factually incorrect, offensive, or otherwise brand-damaging answer.

Why do we evaluate LLMs?​

Why do we evaluate LLMs?