Measuring Calibration: Why We Publish Error Rates

A claim prediction is useless if you do not know how often the model is wrong. Publishing error rates forces a transition from guessing to actual risk management.

By David H. Silver
Head of AI
10 June 2026

TL;DR: Calibration means that when a model predicts a 70% chance of escalation, escalation happens in roughly seven out of ten such cases. Demanding honest error rates and conformal prediction ranges is the only way to trust machine learning for reserving.

A claims executive logs into a new software platform and sees an $800,000 projected settlement for a severe bodily injury claim. The interface is clean. The number is bold. The user has absolutely no way to know if that figure is a statistical certainty or a wild guess. Presenting a single dollar amount without a measurable error bound is not machine learning. It is a liability. A point estimate assumes a deterministic world. Litigation is stochastic. If you do not know how often a model is wrong, and by how much, you cannot use it to set reserves.

Consider the physical reality of a modern claims file. It contains pleadings, unstructured medical records, demand letters, and internal correspondence, often running thousands of pages deep. When software vendors attempt to process this volume, they typically feed the entire file into a single neural network and ask it for a dollar amount. This guarantees failure. Generative AI is an exceptional tool for reading and structuring unstructured information. We use it to process those thousands of pages and map the facts. It does not predict the future. Asking a large language model to guess a settlement value results in hallucinated confidence. LLMs are trained to output plausible text sequences, not calibrated probabilities. Prediction requires an entirely different architecture. We use separate mathematical and geometric machine-learning models, trained strictly on massive datasets of resolved cases with known outcomes, to calculate the actual risk. The generative layer does the reading. The geometric layer does the math.

The Geometry of Calibration

Calibration is a strict mathematical property, and it is the metric that matters most for a forecasting platform. If a model predicts a 30 percent chance of litigation escalation across one thousand claims, close to three hundred of those claims should escalate, give or take ordinary sampling noise. Uncalibrated models fail this test. They often suffer from severe overconfidence, clustering their predictions near zero or one hundred percent. They act certain when they are statistically blind. When facing the extreme volatility of social inflation and third-party litigation funding, an overconfident model will quickly destroy a balance sheet. It will recommend inadequate reserves on day one, ignore the mounting signals of a severe outcome, and fail to detect the trajectory of a nuclear verdict until the defense is already out of options. An uncalibrated prediction is worse than no prediction at all.
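One common way to check the property described above is a reliability table: group predictions into probability bins and compare the average predicted probability in each bin with the share of claims that actually escalated. The sketch below is illustrative only. The function names, the bin count, and the claim data are assumptions made for the example, not the platform's implementation.

```python
from collections import defaultdict


def reliability_table(probs, outcomes, n_bins=10):
    """Bin predictions by probability and compare predicted vs. observed rates."""
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        n = len(pairs)
        rows.append({
            "bin": idx,
            "n": n,
            "mean_pred": sum(p for p, _ in pairs) / n,
            "observed": sum(y for _, y in pairs) / n,
        })
    return rows


def expected_calibration_error(rows):
    """Size-weighted average gap between predicted and observed rates."""
    total = sum(r["n"] for r in rows)
    return sum(r["n"] / total * abs(r["mean_pred"] - r["observed"]) for r in rows)


# Hypothetical data: 1,000 claims scored at 0.30, of which 300 escalated.
calibrated = reliability_table([0.30] * 1000, [1] * 300 + [0] * 700)

# Hypothetical overconfident model: 500 claims scored at 0.95, only 350 escalated.
overconfident = reliability_table([0.95] * 500, [1] * 350 + [0] * 150)

ece_calibrated = expected_calibration_error(calibrated)        # gap near zero
ece_overconfident = expected_calibration_error(overconfident)  # gap near 0.25
```

A gap near zero in every bin is what calibration looks like in practice. A large gap in the high-probability bins is the signature of the overconfidence described above.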

We abandon the single point guess in favor of conformal prediction. Conformal prediction is a rigorous statistical framework that produces a valid range of outcomes rather than a fragile estimate. When we output a settlement range, we are calculating an interval calibrated against the geometry of the historical data, with a coverage guarantee that holds on average as long as new claims resemble the resolved cases used for calibration. If the model operates at a 90 percent confidence level, the final settlement will fall within our stated range in at least about 90 of every 100 cases over the long run. This is not a heuristic. It is a boundary with a provable coverage rate. As the underlying facts of a case become more complex or ambiguous, the range widens. The model is forced to be honest about its own uncertainty. It maps the limits of its knowledge and presents those limits clearly to the user.
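For readers who want to see the mechanics, here is a minimal sketch of split conformal prediction, the simplest textbook variant. It assumes some point-prediction model already exists and that a held-out set of resolved claims is available for calibration. The function names and the dollar figures are hypothetical, and this basic version uses absolute residuals, so every claim gets the same width. Ranges that widen for harder cases need an adaptive score, for example residuals scaled by a per-claim difficulty estimate.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration cases to support this level
    return sorted(scores)[k - 1]


def split_conformal_interval(point_pred, cal_preds, cal_actuals, alpha=0.10):
    """Interval around point_pred sized by past misses on held-out resolved claims."""
    scores = [abs(y - p) for p, y in zip(cal_preds, cal_actuals)]
    q = conformal_quantile(scores, alpha)
    return point_pred - q, point_pred + q


# Hypothetical calibration set (in $ thousands): 19 resolved claims where the
# model predicted 100 and missed by 1, 2, ..., 19.
cal_preds = [100] * 19
cal_actuals = [100 + miss for miss in range(1, 20)]

# 90 percent range for a new claim with a point prediction of 800.
low, high = split_conformal_interval(800, cal_preds, cal_actuals, alpha=0.10)
```

The coverage guarantee is marginal: it holds on average over claims, and only if new claims are exchangeable with the calibration set. It is not a promise about any single file.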

This honesty is entirely absent from vendor marketing across the insurance technology sector. The industry standard is to hide behind aggregate accuracy metrics that fail to map to financial reality. A vendor might claim a low mean absolute error by training a model to simply predict the historical median for every case. That model looks highly accurate on paper because most claims settle close to the median. In practice, it completely misses the severe outliers. It fails precisely when you need it most. Aggregate accuracy is a vanity metric designed to sell software. Claims executives need to know the error rate at the edges of the distribution, where the actual financial danger lives. A model must prove its reliability on the claims that deviate from the mean.
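The median trick is easy to reproduce. The sketch below uses an invented portfolio (95 routine claims and 5 severe ones, in $ thousands) and hypothetical helper names to compare the headline mean absolute error of a constant median prediction with its error on the largest claims.

```python
from statistics import median


def mae(preds, actuals):
    """Mean absolute error over the whole portfolio."""
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(actuals)


def tail_mae(preds, actuals, top_k):
    """Mean absolute error restricted to the top_k largest actual outcomes."""
    paired = sorted(zip(actuals, preds), reverse=True)[:top_k]
    return sum(abs(p - a) for a, p in paired) / top_k


# Hypothetical settlements in $ thousands: 95 routine claims, 5 severe outliers.
settlements = [40, 45, 50, 55, 60] * 19 + [900, 1000, 1100, 1200, 1500]

# The "model": predict the historical median for every claim.
constant_preds = [median(settlements)] * len(settlements)

overall_error = mae(constant_preds, settlements)
severe_error = tail_mae(constant_preds, settlements, top_k=5)
```

On this invented data the overall error is 60.2 while the error on the five severe claims is 1,090. The headline number says almost nothing about the claims that drive reserve risk.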

Why We Publish the Misses

We publish our error rates and calibration metrics because you cannot manage risk without them. Claims professionals must know exactly when and where the forecasting platform struggles. If a specific jurisdiction has a highly volatile jury pool, our geometric models will detect that variance in the historical data. The conformal settlement range will expand. The escalation probability will carry a wider margin of error. This is not a failure of the system. This is the system working exactly as designed. The widening range is a highly specific, mathematically derived signal. It tells the claims team that the local environment is fundamentally unstable, prompting them to allocate defense spend differently or pursue an aggressive early settlement before the variance compounds.
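One standard way to get ranges that expand in volatile venues is to calibrate the conformal quantile separately for each group, sometimes called Mondrian or group-conditional conformal prediction. The sketch below illustrates that general technique under stated assumptions. The jurisdiction labels, the data, and the function names are hypothetical, and this is not a description of how the platform computes its ranges.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration cases in this group
    return sorted(scores)[k - 1]


def half_widths_by_group(cal_rows, alpha=0.10):
    """cal_rows holds (group, predicted, actual) for resolved claims.

    Returns the interval half-width to use for new claims in each group.
    """
    by_group = defaultdict(list)
    for group, predicted, actual in cal_rows:
        by_group[group].append(abs(actual - predicted))
    return {g: conformal_quantile(s, alpha) for g, s in by_group.items()}


# Hypothetical resolved claims in $ thousands, 19 per jurisdiction.
# Misses in county_b are ten times larger than in county_a.
stable = [("county_a", 100, 100 + m) for m in range(1, 20)]
volatile = [("county_b", 100, 100 + 10 * m) for m in range(1, 20)]

widths = half_widths_by_group(stable + volatile, alpha=0.10)
```

With these invented numbers the volatile jurisdiction gets a half-width of 180 against 18 for the stable one. The wider range is the signal: past outcomes there were harder to predict.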

Structuring the Defense

Transparency extends to the variables driving the math. A calibrated range is useless if it exists in an opaque black box. Our architecture ensures that every output is directly traceable to the source documents. The generative layer reads the file and extracts the structured facts. The predictive layer calculates the reserve delta versus your current reserve, along with the escalation probability, based strictly on those extracted facts. When a claims professional reviews the reserve delta, they are not looking at an arbitrary mandate. They are looking at a structured argument. The platform highlights the specific medical codes, the plaintiff attorney history, and the exact comparable resolved cases that justify the settlement range. If a plaintiff attorney has a known history of pushing similar claims to trial, the model surfaces the comparable resolved cases that prove it. You see the drivers. You see the math. You see the error bounds.

Setting realistic reserves on day one requires strict mathematical discipline. You cannot negotiate from data if you do not trust the boundaries of your own model. By separating text processing from mathematical prediction, we eliminate the false certainty inherent in generative text. By enforcing strict calibration, we ensure that a 70 percent probability means the event occurs about 70 percent of the time. You stop relying on gut instinct, you stop reacting to plaintiff demands, and you start treating claims management as a measurable discipline. You allocate your defense spend based on measurable risk.

Uncertainty is a permanent feature of litigation; measuring it is the only way to survive it.
