Marco Basaldella - One post

Why Traditional LLM Evaluation Falls Short - and What's Next

September 9, 2025 · 5 min read

Why do we evaluate LLMs?

In any scientific or commercial application of LLMs, evaluation is a key, if often underappreciated, step. Unlike traditional software, where every sub-component can be unit tested and the whole system behaves predictably and deterministically, LLMs are best tested end-to-end, are non-deterministic by nature, and can behave unpredictably. The failure modes differ too: given an unexpected input, a traditional program might crash, while an LLM might produce a factually incorrect, offensive, or otherwise brand-damaging answer.
