Systematic testing of what an AI model can and cannot reliably do across the range of tasks and inputs it will encounter in production. Capability evaluations measure accuracy, reasoning quality, language abilities, handling of edge cases, and performance degradation under distribution shift. Distinct from safety evaluation — which assesses what the model should not do — capability evaluation focuses on whether the model is competent enough for its intended use. Published benchmarks rarely reflect real-world task distributions, making task-specific capability evaluation essential before deployment.
Why this matters for your team
Before deploying any AI vendor's model, run capability evaluations on your actual use cases and data — not just the vendor's published benchmarks. Benchmarks rarely match your domain. A 30-minute test on 50 real examples will tell you more than any spec sheet.
Before deploying a contract analysis AI, for instance, a legal team runs capability evaluations on 50 real contract samples, measuring the model's accuracy at identifying and flagging clauses. The evaluation reveals that performance drops 30% on contracts longer than 20 pages.
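A minimal sketch of what such an evaluation harness might look like, assuming a JSON file of labeled contract samples and a hypothetical identify_clauses() wrapper around the model under test. The field names, file name, and the 20-page cutoff are illustrative assumptions, not any particular vendor's API.

```python
import json
from collections import defaultdict

def identify_clauses(contract_text: str) -> set[str]:
    """Hypothetical wrapper: call the model or vendor API under test here."""
    raise NotImplementedError("swap in your model call")

def run_capability_eval(samples_path: str, length_cutoff_pages: int = 20) -> None:
    """Score clause-identification accuracy on real labeled contracts,
    split by length to surface degradation on long documents."""
    buckets = defaultdict(lambda: {"correct": 0, "total": 0})

    with open(samples_path) as f:
        # Assumed format: [{"text": ..., "pages": ..., "expected_clauses": [...]}, ...]
        samples = json.load(f)

    for sample in samples:
        predicted = identify_clauses(sample["text"])
        expected = set(sample["expected_clauses"])
        bucket = "long" if sample["pages"] > length_cutoff_pages else "short"
        buckets[bucket]["correct"] += int(predicted == expected)
        buckets[bucket]["total"] += 1

    for name, b in buckets.items():
        if b["total"]:
            print(f"{name}: {b['correct'] / b['total']:.0%} exact-match accuracy "
                  f"on {b['total']} contracts")

if __name__ == "__main__":
    run_capability_eval("contract_samples.json")
```

Reporting results per length bucket (or per document type, language, or any other production-relevant slice) is what exposes the kind of degradation an aggregate benchmark score hides.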